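tl;dr: A simplified walkthrough of a paper in which the author proposes two interesting theorems and, building on them, finds a way to extract hidden sense vectors from an embedding matrix. A guide to reproducing the results is included. The notebook is available on GitHub.
Introduction
In this article I want to talk about a surprising thing discovered by the researcher Sanjeev Arora in the paper "Linear Algebraic Structure of Word Senses, with Applications to Polysemy". It is one of a series of papers in which he tries to give a theoretical grounding to the properties of word embeddings. In the same work, Arora hypothesizes that simple embeddings such as word2vec or GloVe actually contain several senses of a word, and offers a way to recover them. Throughout the article I will try to stick to the original examples.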
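More formally, let $\upsilon_{tie}$ denote an embedding vector of the word "tie", which may mean a knot or a necktie, or be the verb "to tie up". Arora proposes writing this vector as the following linear combination:
$$\upsilon_{tie} \approx \alpha_1 \upsilon_{tie_1} + \alpha_2 \upsilon_{tie_2} + \alpha_3 \upsilon_{tie_3} + \dots$$
where $\upsilon_{tie_n}$ is one of the possible senses of the word "tie" and $\alpha_n$ is its coefficient. Let us try to work out how this comes about.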
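To make the linear-combination idea concrete, here is a minimal NumPy sketch. The sense vectors and the coefficients are made-up toy values (assumptions for illustration, not numbers from the paper); the only point is that a single embedding can be modelled as a weighted sum of several sense vectors.

```python
import numpy as np

# Hypothetical 4-dimensional "sense" vectors for the word "tie" (toy values).
sense_vectors = np.array([
    [1.0, 0.0, 0.0, 0.0],   # tie_1: say, the clothing sense
    [0.0, 1.0, 0.0, 0.0],   # tie_2: say, the drawn-game sense
    [0.0, 0.0, 1.0, 0.0],   # tie_3: say, the verb "to tie up"
])

# Hypothetical mixing coefficients alpha_1..alpha_3.
alphas = np.array([0.5, 0.3, 0.2])

# The single word embedding is modelled as the weighted sum of its senses.
tie_embedding = alphas @ sense_vectors
print(tie_embedding)
```

The rest of the article is about the reverse problem: only the mixed vector is given, and the sense vectors and coefficients have to be recovered.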
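Theory
Written by a non-mathematician; please report any errors, especially in the mathematical terminology.
A brief overview of Arora's theory
Because Arora's initial work is considerably more complex than this one, I have not yet prepared a full review of it. We will, however, take a brief look at what it involves.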
So, Arora puts forward the idea that any text is produced by a generative model. At each step of its operation the model produces one word. The model consists of a context vector and embedding vectors.
, .. - , . . , . : " " , " ". , "": , .
, . , , , .
: , . ,
— , — , — partition function. , , .
. , , : , , , . Y, X .
. - , - .
, , . , , "". :
, ", , , ". , , , ", , , " , " " .
, . , , . , ( , ). , , . .
Theorem 1
, . ,
, . . . . , , . , , ( ). , , out-of-vocabulary , , .
, . , SIF . , , , . , SIF k, , , TF-IDF.
, , 1, . , - , , .
. , - , , . :
- . .
- , , SIF 20 , . , n — .
- SIF .
- SIF
, .. . 1/3 , A 2\3 . . .
Number of paragraphs | 250k | 500k | 750k | 1 million |
---|---|---|---|---|
Cosine similarity | 0.94 | 0.95 | 0.96 | 0.96 |
Theorem 2
, . - , . , , .. , , tie_1 tie_2, tie_1 — , tie2 — .
, , $\upsilon_{w_{s1}}$ $\upsilon_{w_{s2}}$. , , , ,
and . , , .
, , , , ? , . . , . , , . , , , , , . , , , (inner product) . , , - (, , , ), , ! .
. ? . , , ,
— .
, k (sparsity parameter), m — .. , . k-SVD. , . , , ( , ). , , - , , , . .
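k-SVD alternates between a sparse-coding step and a dictionary-update step. The sketch below illustrates only the sparse-coding step, using a plain greedy orthogonal matching pursuit with sparsity parameter `k`. It is an illustrative stand-in written for this article, not the code of any particular k-SVD package; `omp_encode` and the toy data are hypothetical names and values.

```python
import numpy as np

def omp_encode(x, dictionary, k):
    """Approximate x as a combination of at most k rows (atoms) of dictionary.

    dictionary has shape (n_atoms, dim); atoms are assumed to be unit-norm.
    Returns a coefficient vector of length n_atoms with at most k nonzeros.
    """
    x = np.asarray(x, dtype=float)
    residual = x.copy()
    support = []                       # indices of the atoms chosen so far
    solution = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the part of x not yet explained.
        correlations = dictionary @ residual
        if support:
            correlations[support] = 0.0    # never pick the same atom twice
        best = int(np.argmax(np.abs(correlations)))
        if abs(correlations[best]) < 1e-12:
            break                          # nothing left to explain
        support.append(best)
        # Re-fit the coefficients of all chosen atoms jointly (least squares).
        chosen = dictionary[support].T     # shape (dim, len(support))
        solution, *_ = np.linalg.lstsq(chosen, x, rcond=None)
        residual = x - chosen @ solution
    coefs = np.zeros(dictionary.shape[0])
    if support:
        coefs[support] = solution
    return coefs

# Toy check: six orthonormal atoms, a vector built from atoms 0 and 3.
toy_dictionary = np.eye(6)
toy_x = 2.0 * toy_dictionary[0] - 1.0 * toy_dictionary[3]
toy_coefs = omp_encode(toy_x, toy_dictionary, k=2)
print(np.flatnonzero(toy_coefs))
```

With a real, overcomplete dictionary the recovery is only approximate, which is why the sparsity parameter `k` matters: it caps how many atoms a single word may use.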
, , .
import numpy as np
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')
# Convert the GloVe text file to word2vec format so that gensim can load it.
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)
# KeyedVectors already exposes the vocabulary and the vectors directly.
embeddings = model
index2word = embeddings.index2word  # list of words, row-aligned with the matrix
embedds = embeddings.vectors        # embedding matrix, one row per word
print(embedds.shape)
(400000, 300)
400000 .
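The conversion step above is needed because the word2vec text format starts with a header line (vocabulary size and dimension) that GloVe files lack. As a rough illustration of what the raw GloVe text looks like, here is a small parser that reads it directly. It assumes the usual layout of one word followed by space-separated floats per line; `load_glove_text` is a hypothetical helper, and the two-word sample is made up.

```python
import io
import numpy as np

def load_glove_text(handle):
    """Parse GloVe-style text: each line is a word followed by its floats."""
    words, rows = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        rows.append([float(value) for value in parts[1:]])
    return words, np.asarray(rows, dtype=np.float32)

# Toy stand-in for a real file such as glove.6B.300d.txt.
sample = io.StringIO("tie 0.1 0.2 0.3\nspring 0.4 0.5 0.6\n")
words, vectors = load_glove_text(sample)
print(words, vectors.shape)
```

For the full 400000-word file the gensim route used in the article is more convenient, since it also gives the similarity queries used later.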
2. k-svd
. ksvd.
!pip install ksvd
from ksvd import ApproximateKSVD
Requirement already satisfied: ksvd in /home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3)
Requirement already satisfied: numpy in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5)
Requirement already satisfied: scikit-learn in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1)
, 2000 5.
: 10000 . , , , , .
%%time
# %%time has to be the first line of the cell to time the whole fit;
# a bare %time on its own line only times an empty statement.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)
#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
# .files lists the stored array names; indexing the result of .keys() is not portable.
dictionary = dictionary[dictionary.files[0]]
#print(gamma.shape)
print(dictionary.shape)
(2000, 300)
#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)
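To see how the two matrices relate, it may help to check them against the original embeddings: `gamma` has one row of coefficients per word, `dictionary` has one row per atom, and their product should approximate the embedding matrix. The sketch below runs this check on small random stand-ins. `describe_sparse_code` is a hypothetical helper and the toy matrices are assumptions, not the article's data.

```python
import numpy as np

def describe_sparse_code(X, gamma, dictionary):
    """Return (relative reconstruction error, max number of atoms used by a row)."""
    reconstruction = gamma @ dictionary
    rel_error = np.linalg.norm(X - reconstruction) / np.linalg.norm(X)
    max_atoms_per_word = int(np.count_nonzero(gamma, axis=1).max())
    return rel_error, max_atoms_per_word

# Toy stand-ins: 8 atoms of dimension 4, and 3 "words" using at most 3 atoms each.
rng = np.random.default_rng(0)
toy_dictionary = rng.normal(size=(8, 4))
toy_gamma = np.zeros((3, 8))
toy_gamma[0, [1, 5]] = [0.7, 0.3]
toy_gamma[1, [2]] = [1.0]
toy_gamma[2, [0, 3, 7]] = [0.2, 0.5, 0.3]
toy_X = toy_gamma @ toy_dictionary

rel_error, max_atoms = describe_sparse_code(toy_X, toy_gamma, toy_dictionary)
print(rel_error, max_atoms)
```

On the real matrices the error will not be zero, and the second number should not exceed the `transform_n_nonzero_coefs` value passed to the model.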
3.
, . .
embeddings.similar_by_vector(dictionary[1354,:])
[('slave', 0.8417330980300903),
('slaves', 0.7482961416244507),
('plantation', 0.6208109259605408),
('slavery', 0.5356900095939636),
('enslaved', 0.4814416170120239),
('indentured', 0.46423888206481934),
('fugitive', 0.4226764440536499),
('laborers', 0.41914862394332886),
('servitude', 0.41276970505714417),
('plantations', 0.4113745093345642)]
embeddings.similar_by_vector(dictionary[1350,:])
[('transplant', 0.7767853736877441),
('marrow', 0.699995219707489),
('transplants', 0.6998592615127563),
('kidney', 0.6526087522506714),
('transplantation', 0.6381147503852844),
('tissue', 0.6344675421714783),
('liver', 0.6085026860237122),
('blood', 0.5676015615463257),
('heart', 0.5653558969497681),
('cells', 0.5476219058036804)]
embeddings.similar_by_vector(dictionary[1546,:])
[('commons', 0.7160810828208923),
('house', 0.6588335037231445),
('parliament', 0.5054076910018921),
('capitol', 0.5014163851737976),
('senate', 0.4895153343677521),
('hill', 0.48859673738479614),
('inn', 0.4566132128238678),
('congressional', 0.4341348707675934),
('congress', 0.42997264862060547),
('parliamentary', 0.4264637529850006)]
embeddings.similar_by_vector(dictionary[1850,:])
[('okano', 0.2669774889945984),
('erythrocytes', 0.25755012035369873),
('windir', 0.25621023774147034),
('reapportionment', 0.2507009208202362),
('qurayza', 0.2459488958120346),
('taschen', 0.24417680501937866),
('pfaffenbach', 0.2437630295753479),
('boldt', 0.2394050508737564),
('frucht', 0.23922981321811676),
('rulebook', 0.23821482062339783)]
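`similar_by_vector` ranks vocabulary words by cosine similarity to the given vector. For readers who want to see the mechanics, here is a NumPy sketch of that ranking. The function name `nearest_words`, the three-word vocabulary and the 2-dimensional vectors are all toy assumptions.

```python
import numpy as np

def nearest_words(query, vectors, vocab, topn=10):
    """Return the topn (word, cosine similarity) pairs closest to query."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_vectors @ unit_query          # cosine similarity to every word
    order = np.argsort(-sims)[:topn]          # indices of the largest similarities
    return [(vocab[i], float(sims[i])) for i in order]

toy_vocab = ["slave", "plantation", "kidney"]
toy_vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
toy_atom = np.array([1.0, 0.1])               # stands in for a dictionary row
toy_result = nearest_words(toy_atom, toy_vectors, toy_vocab, topn=2)
print(toy_result)
```

An atom whose best similarities are all low, like atom 1850 above, does not point at any coherent group of words.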
! , . . , , . "tie" "spring" .
itie = index2word.index('tie')
ispring = index2word.index('spring')
tie_emb = embedds[itie]
string_emb = embedds[ispring]
# Rank all atoms by cosine distance to the "tie" embedding (smaller means closer).
simlist = []
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, tie_emb), i))
simlist = sorted(simlist, key=lambda x: x[0])
# Keep the 15 closest atoms and print the nearest words of each.
six_atoms_ind = [ins[1] for ins in simlist[:15]]
for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))
Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring
Atom #58: game play match matches games played playing tournament players stadium
Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2
Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held
Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define
Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals
Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions
Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans
Atom #281: black wearing man pair white who girl young woman big
Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes
Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled
Atom #98: first team start final second next time before test after
Atom #1455: after later before when then came last took again but
Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed
Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore
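The loop above ranks atoms by cosine distance to the word vector. Another way to look at the same question, assuming `gamma` holds one sparse coefficient row per word as returned by `aksvd.transform`, is to read off the atoms that have nonzero coefficients for that word. The sketch below does this for a made-up row; `atoms_for_word` is a hypothetical helper.

```python
import numpy as np

def atoms_for_word(gamma_row, topn=5):
    """Return (atom index, coefficient) pairs, largest absolute coefficient first."""
    nonzero = np.flatnonzero(gamma_row)
    order = np.argsort(-np.abs(gamma_row[nonzero]))
    return [(int(i), float(gamma_row[i])) for i in nonzero[order][:topn]]

# Toy coefficient row over 10 atoms with three nonzero entries.
toy_row = np.zeros(10)
toy_row[[2, 7, 4]] = [0.1, 0.6, -0.3]
toy_atoms = atoms_for_word(toy_row)
print(toy_atoms)
```

With the real data one would pass `gamma[itie]`, provided `gamma` was computed for the same rows as `embedds`.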
# Same ranking for the "spring" embedding (stored in string_emb).
simlist = []
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, string_emb), i))
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:15]]
for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))
Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre
Atom #1335: last ago year months years since month weeks week has
Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing
Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture
Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral
Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june
Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony
Atom #98: first team start final second next time before test after
Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday
Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic
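The atom ranking used for "tie" and "spring" can also be done without a Python loop. The following sketch computes the cosine distance from one word vector to every atom in a single matrix product and returns the closest atom indices. `nearest_atoms` is a hypothetical helper and the 2-dimensional data is a toy assumption.

```python
import numpy as np

def nearest_atoms(word_vec, dictionary, topn=15):
    """Indices of the topn atoms with the smallest cosine distance to word_vec."""
    atom_norms = np.linalg.norm(dictionary, axis=1)
    cos_sim = (dictionary @ word_vec) / (atom_norms * np.linalg.norm(word_vec))
    cos_dist = 1.0 - cos_sim                  # same definition scipy's cosine uses
    return np.argsort(cos_dist)[:topn].tolist()

toy_dictionary = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_word = np.array([0.9, 0.1])
toy_nearest = nearest_atoms(toy_word, toy_dictionary, topn=2)
print(toy_nearest)
```

For 2000 atoms the speed difference is small, but the vectorized form makes it easy to rank atoms for many words at once.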
! , , , .
, , . , , .
. fastText, RusVectores. 300.
fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')
embeddings = fasttext_model.wv
index2word = embeddings.index2word
embedds = embeddings.vectors
embedds.shape
(164996, 300)
%%time
# %%time has to be the first line of the cell to time the whole fit.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
# Fit on the first 10000 rows only to keep the training time manageable.
embedding_trans = embeddings.vectors[:10000]
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)
dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
# .files lists the stored array names; indexing the result of .keys() is not portable.
dictionary = dictionary[dictionary.files[0]]
embeddings.similar_by_vector(dictionary[1024,:], 20)
[('', 0.6854609251022339),
('', 0.6593252420425415),
('', 0.6360634565353394),
('', 0.5998549461364746),
('', 0.5971367955207825),
('', 0.5862340927124023),
('', 0.5788886547088623),
('', 0.5788123607635498),
('', 0.5623885989189148),
('', 0.5610565543174744),
('', 0.5551878809928894),
('', 0.551397442817688),
('', 0.5356274247169495),
('', 0.531707227230072),
('', 0.5174376368522644),
('', 0.5131562948226929),
('', 0.5120065212249756),
('', 0.5077806115150452),
('', 0.5074601173400879),
('', 0.5068254470825195)]
embeddings.similar_by_vector(dictionary[1582,:], 20)
[('', 0.45191124081611633),
('', 0.4515378475189209),
('', 0.4478364586830139),
('', 0.4280813932418823),
('', 0.41220104694366455),
('', 0.40772825479507446),
('', 0.4047147035598755),
('', 0.4030646085739136),
('', 0.39368513226509094),
('', 0.39012178778648376),
('', 0.3866344690322876),
('', 0.37968817353248596),
('', 0.3728911876678467),
('', 0.3663109242916107),
('', 0.3640827238559723),
('', 0.3474290072917938),
('', 0.3473641574382782),
('', 0.3468908369541168),
('', 0.34586742520332336),
('', 0.34555742144584656)]
embeddings.similar_by_vector(dictionary[500,:], 20)
[('', 0.6874514222145081),
('-', 0.5172050595283508),
('', 0.46720415353775024),
('', 0.44713956117630005),
('', 0.4144558310508728),
('', 0.40545403957366943),
('', 0.4030636250972748),
('-', 0.4016447067260742),
('', 0.38331469893455505),
('', 0.37292781472206116),
('', 0.3625457286834717),
('', 0.35121074318885803),
('', 0.3504621088504791),
('', 0.34097471833229065),
('', 0.33320850133895874),
('', 0.3277249336242676),
('', 0.3266661763191223),
('', 0.31865227222442627),
('::', 0.30150306224823),
('', 0.2975207567214966)]
itie = index2word.index('')
ispring = index2word.index('')
tie_emb = embedds[itie]
string_emb = embedds[ispring]
# Rank all atoms by cosine distance to the second word's embedding (string_emb).
simlist = []
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, string_emb), i))
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]
for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))
Atom #185:
Atom #1217: -
Atom #1213:
Atom #1978:
Atom #1796:
Atom #839:
Atom #989:
Atom #414:
Atom #1140: -
Atom #878:
# Same ranking for the first word's embedding (tie_emb).
simlist = []
for i, vector in enumerate(dictionary):
    simlist.append((cosine(vector, tie_emb), i))
simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]
for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx, :])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))
Atom #883: -
Atom #40:
Atom #215:
Atom #688:
Atom #386:
Atom #676:
Atom #414:
Atom #127:
Atom #592:
Atom #703: - -
#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
#np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)
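Since fitting the dictionary is slow, caching it on disk is worthwhile. The sketch below shows a save and load round trip with `np.savez_compressed`, using a temporary directory and a tiny made-up array instead of the article's paths. An array passed positionally is stored under an automatic name, which is why the loading code in this article picks the first stored entry.

```python
import os
import tempfile
import numpy as np

toy_dictionary = np.arange(12, dtype=np.float32).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dictionary.npz")
    # Positional arrays are stored under automatic names: arr_0, arr_1, ...
    np.savez_compressed(path, toy_dictionary)
    with np.load(path) as archive:
        names = archive.files          # list of the stored array names
        restored = archive[names[0]]   # read the first (and only) array

print(names, restored.shape)
```

Passing the array as a keyword instead, for example `np.savez_compressed(path, dictionary=toy_dictionary)`, stores it under that name and makes the loading side more explicit.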
.
, (Word sense indection), , 1. — , . , , . , , , . , .
UPD: knagaev .