An analysis of a paper on extracting meanings from embeddings

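TL;DR: a simplified walkthrough of a paper in which the author proposes two interesting theorems and, building on them, finds a way to extract hidden sense vectors from an embedding matrix. It includes a guide to reproducing the results. The notebook is available on GitHub.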



Introduction



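In this article I want to talk about a surprising thing that the researcher Sanjeev Arora found in the paper "Linear Algebraic Structure of Word Senses, with Applications to Polysemy". It is one of a series of papers in which he tries to give a theoretical justification for the properties of word embeddings. In this work Arora hypothesizes that simple embeddings such as word2vec or GloVe actually contain several senses of a word, and he proposes a way to recover them. Throughout the article I will try to stick to the original examples.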



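More formally, let $\upsilon_{tie}$ denote some embedding vector of the word "tie", which can mean a knot or a necktie, or be the verb "to tie up". Arora proposes writing this vector as the following linear combination:

$$\upsilon_{tie} \approx \alpha_1 \upsilon_{tie_1} + \alpha_2 \upsilon_{tie_2} + \alpha_3 \upsilon_{tie_3} + \dots$$

where $\upsilon_{tie_n}$ is one of the possible senses of the word "tie" and $\alpha_n$ is its coefficient. Let's try to figure out how this comes about.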



Theory



Disclaimer

Written by a non-mathematician, so please report any errors, especially in the mathematical terminology.



Arora's theory in brief



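Arora's foundational work, on which this paper builds, is considerably more complex, so I have not yet prepared a full review of it. Still, we will take a brief look at what it amounts to.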



So, Arora starts from the idea that any text is produced by a generative model. At each step $t$ the model emits a word $w_t$. The model consists of a context vector $c_t$ and word embedding vectors $\upsilon_w$.




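Formally, the probability that word $w$ is produced at step $t$ is

$$P(w \mid c_t) = \frac{1}{Z_c} \exp \langle c_t, \upsilon_w \rangle$$

where $c_t$ is the context vector at step $t$, $\upsilon_w$ is the embedding vector of word $w$, and $Z_c = \sum_w \exp \langle c, \upsilon_w \rangle$ is the partition function.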
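To make the formula concrete, here is a minimal sketch of the log-linear model in NumPy. The toy embeddings and context vector are random placeholders (an assumption for illustration, not data from the paper), and `word_probabilities` is a hypothetical helper name.

```python
import numpy as np

def word_probabilities(context, embeddings):
    """Return P(w | c) = exp(<c, v_w>) / Z_c for every word w."""
    scores = embeddings @ context        # inner products <c, v_w>, one per word
    scores = scores - scores.max()       # shift for numerical stability; ratios are unchanged
    weights = np.exp(scores)
    return weights / weights.sum()       # division by the partition function Z_c

# Toy data (hypothetical): 4 words with 3-dimensional embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 3))
toy_context = rng.normal(size=3)

probs = word_probabilities(toy_context, toy_embeddings)
print(probs, probs.sum())
```

The word whose embedding has the largest inner product with the context vector gets the highest probability, which is the sense in which the context "selects" words.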






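Theorem 1

Let $s$ be a window of $n$ words. Then there exists a linear transformation $A$ such that

$$\upsilon_w \approx A \, \mathbb{E} \left[ \frac{1}{n} \sum_{w_i \in s} \upsilon_{w_i} \;\middle|\; w \in s \right]$$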



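In other words, take a word $w$ and the set $S$ of all windows that contain it. Averaging the window vectors $\upsilon_s$ over $s \in S$ gives a vector $u$, and $u$ is then mapped onto $\upsilon_w$ by the matrix $A$. Among other things, this suggests a way to obtain embeddings for out-of-vocabulary words.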



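The window vectors are computed as SIF embeddings. The SIF embedding $\upsilon_{SIF}$ of a text of $k$ words is the average of the embeddings of its words $w$, each weighted by its TF-IDF:

$$\upsilon_{SIF} = \frac{1}{k} \sum_{n=1}^{k} \upsilon_n \cdot tf\_idf(w_n)$$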
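The weighted average in this formula can be sketched in a few lines. The vectors and weights below are made-up stand-ins for real embeddings and TF-IDF values, and `weighted_text_embedding` is a hypothetical helper, not a function from any SIF library.

```python
import numpy as np

def weighted_text_embedding(word_vectors, weights):
    """Compute (1/k) * sum_n v_n * weight(w_n) for a text of k words."""
    word_vectors = np.asarray(word_vectors, dtype=float)   # shape (k, d)
    weights = np.asarray(weights, dtype=float)             # shape (k,)
    return (word_vectors * weights[:, None]).mean(axis=0)

# Toy text of k = 3 words with 2-dimensional embeddings.
toy_vectors = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
toy_weights = np.array([0.5, 2.0, 1.0])   # stand-ins for tf_idf(w_n)

v_text = weighted_text_embedding(toy_vectors, toy_weights)
print(v_text)
```

Words with a larger weight pull the text vector toward their own embedding, which is the point of weighting rare, informative words more heavily.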






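This can be checked experimentally: for a word $w$ with a known embedding $\upsilon_w$, we try to reconstruct that embedding from the contexts in which the word occurs. The procedure is:

  1. Sample paragraphs from the corpus and fix a vocabulary $V$.
  2. For every $w \in V$ and every occurrence of it, compute the SIF embedding of the window of 20 words around $w$. Each $w \in V$ thus gets a sequence of vectors $(\nu_w^1, \nu_w^2, \dots, \nu_w^n)$, where $n$ is the number of occurrences of $w$.
  3. Compute $u_w$ as the average of the SIF embeddings for each $w \in V$: $u_w = \frac{1}{n} \sum_{t=1}^{n} \nu_w^t$.
  4. Solve the regression $\arg\min_A \sum_{w \in V} \| A u_w - \upsilon_w \|_2^2$.
  5. Take $\upsilon_w = A u_w$ as the induced embedding of $w$.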


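To evaluate this, the vocabulary is split: 1/3 of the words are held out for testing and $A$ is learned on the remaining 2/3. The cosine similarity between the induced and the original embeddings, for different numbers of sampled paragraphs, is shown below.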



#paragraphs      250k   500k   750k   1 million
cos similarity   0.94   0.95   0.96   0.96
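The regression step and the held-out evaluation can be sketched on synthetic data. Everything here is an assumption for illustration: the averaged context vectors `U` are random, the "original" embeddings `V` are generated from a hidden matrix `A_true` plus noise, and ordinary least squares stands in for whatever solver the paper's authors used.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_words = 5, 300

A_true = rng.normal(size=(d, d))                  # hidden linear map (synthetic)
U = rng.normal(size=(n_words, d))                 # rows play the role of averaged window vectors u_w
V = U @ A_true.T + 0.01 * rng.normal(size=(n_words, d))   # rows play the role of embeddings v_w

# Learn A on 2/3 of the words, test on the remaining 1/3.
n_train = 2 * n_words // 3
U_train, V_train = U[:n_train], V[:n_train]
U_test, V_test = U[n_train:], V[n_train:]

# min_A sum_w ||A u_w - v_w||^2 is the least-squares problem U X = V with X = A^T.
A_hat = np.linalg.lstsq(U_train, V_train, rcond=None)[0].T

# Induce embeddings for held-out words and compare them with the originals.
V_induced = U_test @ A_hat.T
cos = np.sum(V_induced * V_test, axis=1) / (
    np.linalg.norm(V_induced, axis=1) * np.linalg.norm(V_test, axis=1)
)
print(cos.mean())
```

On real data the relationship is only approximate, so the average cosine similarity will be lower than on this nearly noise-free toy problem.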


Theorem 2



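Suppose a word $w$ has two senses, $s_1$ and $s_2$, and let $\upsilon_w$ be its embedding trained on the original corpus. Now imagine that every occurrence of the word in the corpus is relabeled with its sense, so that, for example, "tie" becomes the two pseudo-words tie_1 and tie_2, one per sense.

Training on this relabeled corpus would give separate embeddings $\upsilon_{s_1}$ and $\upsilon_{s_2}$ for the two senses. The theorem states that the original embedding is then approximately their weighted combination:

$$\upsilon_w \approx \frac{f_1}{f_1 + f_2} \upsilon_{s_1} + \frac{f_2}{f_1 + f_2} \upsilon_{s_2} = \alpha \upsilon_{s_1} + \beta \upsilon_{s_2}$$

where $f_1$ and $f_2$ are the numbers of occurrences of $s_1$ and $s_2$ in the corpus.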
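A small numeric example of the frequency-weighted combination may help. The two sense vectors and the counts are invented for illustration, and `merged_embedding` is a hypothetical helper name.

```python
import numpy as np

def merged_embedding(v_s1, v_s2, f1, f2):
    """Combine two sense vectors with weights proportional to sense frequencies."""
    alpha = f1 / (f1 + f2)
    beta = f2 / (f1 + f2)
    v_w = alpha * np.asarray(v_s1, dtype=float) + beta * np.asarray(v_s2, dtype=float)
    return v_w, alpha, beta

# Hypothetical senses: the first occurs 300 times, the second 100 times.
v_w, alpha, beta = merged_embedding([1.0, 0.0], [0.0, 1.0], f1=300, f2=100)
print(alpha, beta, v_w)
```

The more frequent sense dominates the mixed vector, which is why rare senses are hard to see in the embedding of the word itself.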






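How do we find these sense vectors in practice? Given word vectors of dimension $d$ and two integers $k$ and $m$ with $k < m$, we look for a set of vectors $A_1, A_2, \dots, A_m$ such that

$$\upsilon_w = \sum_{j=1}^{m} \alpha_{w,j} A_j + \mu_w$$

where at most $k$ of the coefficients $\alpha_{w,j}$ are nonzero and $\mu_w$ is a noise vector. This amounts to minimizing

$$\sum_w \left\| \upsilon_w - \sum_{j=1}^{m} \alpha_{w,j} A_j \right\|_2^2$$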



Here $k$ is the sparsity parameter and $m$ is the number of vectors $A_j$, which the paper calls atoms. The problem is solved with the k-SVD algorithm.
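To show what "at most $k$ nonzero coefficients" means, here is a minimal sketch of the sparse coding half of the problem: orthogonal matching pursuit against a fixed dictionary. It is not k-SVD itself, which also updates the atoms; the random dictionary and the helper name `omp` are assumptions for illustration.

```python
import numpy as np

def omp(atoms, x, k):
    """Orthogonal matching pursuit: code x with at most k rows of `atoms`."""
    x = np.asarray(x, dtype=float)
    support = []                        # indices of the atoms chosen so far
    solution = np.zeros(0)
    residual = x.copy()
    for _ in range(k):
        scores = np.abs(atoms @ residual)
        if support:
            scores[support] = -np.inf   # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        chosen = atoms[support].T       # shape (d, len(support))
        # Refit all chosen coefficients jointly by least squares.
        solution = np.linalg.lstsq(chosen, x, rcond=None)[0]
        residual = x - chosen @ solution
        if np.linalg.norm(residual) < 1e-10:
            break
    alpha = np.zeros(atoms.shape[0])
    if support:
        alpha[support] = solution
    return alpha

# Toy dictionary: m = 8 unit-norm atoms in d = 6 dimensions.
rng = np.random.default_rng(0)
m, d, k = 8, 6, 2
D = rng.normal(size=(m, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

x = 2.0 * D[3] - 1.0 * D[5]             # a vector built from two atoms
alpha = omp(D, x, k)
single = omp(D, 2.0 * D[3], 1)          # a vector built from one atom
print(np.flatnonzero(alpha), np.flatnonzero(single))
```

k-SVD alternates a coding step of this kind with an update of the atoms themselves, so that the dictionary adapts to the data.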








import numpy as np

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')


1. Gensim

We will work with GloVe embeddings, in the 300-dimensional version.



tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)


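# In gensim 3.x, `.wv` on a KeyedVectors object simply returns the object itself.
embeddings = model.wv

# Note: gensim 4.x renamed `index2word` to `index_to_key`.
index2word = embeddings.index2word
embedds = embeddings.vectors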


print(embedds.shape)


(400000, 300)


The vocabulary contains 400000 words.



2. k-svd

For the decomposition we use the ksvd package.



!pip install ksvd
from ksvd import ApproximateKSVD


Requirement already satisfied: ksvd in /home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3)
Requirement already satisfied: numpy in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5)
Requirement already satisfied: scikit-learn in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1)


We use 2000 atoms and a sparsity of 5.

Note: the precomputed dictionary loaded below was fitted on 10000 embeddings.



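# Note: a bare `%time` line only times an empty statement, which is why the
# readings below are in microseconds; `%%time` as the first line of the cell
# would time the whole fit.
%time
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors
# Learn the dictionary of atoms, then sparse-code every embedding over it.
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)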


CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.54 µs
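The coefficient matrix `gamma` gives another way to see which atoms a word is built from: look at the nonzero entries of its row. The sketch below assumes `gamma` has one row per word and one column per atom; the toy matrix and the helper `atoms_for_word` are hypothetical.

```python
import numpy as np

def atoms_for_word(gamma, word_index):
    """Return (atom index, coefficient) pairs for one word, strongest first."""
    row = np.asarray(gamma[word_index], dtype=float)
    nonzero = np.flatnonzero(row)                  # atoms actually used by this word
    order = np.argsort(-np.abs(row[nonzero]))      # sort by absolute coefficient
    return [(int(j), float(row[j])) for j in nonzero[order]]

# Toy coefficient matrix: 2 words, 5 atoms, at most 2 nonzeros per word.
toy_gamma = np.array([[0.0, 0.7, 0.0, -0.2, 0.0],
                      [0.3, 0.0, 0.0, 0.0, 0.9]])

print(atoms_for_word(toy_gamma, 0))
print(atoms_for_word(toy_gamma, 1))
```

With the real matrix one would pass something like `index2word.index('tie')` as the word index and then inspect each returned atom with `similar_by_vector`.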


#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
# `files` is a plain list of stored array names; `keys()` may return a view that cannot be indexed.
dictionary = dictionary[dictionary.files[0]]


#print(gamma.shape)
print(dictionary.shape)


(2000, 300)


#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)
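For reference, here is a small round trip through `np.savez_compressed` and `np.load`. It shows why the loading code has to look up the first stored name: an array passed positionally is saved under the automatic key `arr_0`. The toy array and the temporary path are placeholders.

```python
import os
import tempfile

import numpy as np

toy_dictionary = np.arange(6, dtype=float).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dictionary.npz")
    # A positional array gets the automatic name "arr_0" inside the archive.
    np.savez_compressed(path, toy_dictionary)
    with np.load(path) as archive:
        keys = list(archive.files)       # names of the stored arrays
        restored = archive[keys[0]]      # read the array before the archive is closed

print(keys, restored.shape)
```

Saving with a keyword, for example `np.savez_compressed(path, dictionary=toy_dictionary)`, would let the array be read back by an explicit name instead.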


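3. Looking at the atoms

Let's see what the resulting atoms look like by printing the nearest words for a few of them.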



embeddings.similar_by_vector(dictionary[1354,:])


[('slave', 0.8417330980300903),
 ('slaves', 0.7482961416244507),
 ('plantation', 0.6208109259605408),
 ('slavery', 0.5356900095939636),
 ('enslaved', 0.4814416170120239),
 ('indentured', 0.46423888206481934),
 ('fugitive', 0.4226764440536499),
 ('laborers', 0.41914862394332886),
 ('servitude', 0.41276970505714417),
 ('plantations', 0.4113745093345642)]


embeddings.similar_by_vector(dictionary[1350,:])


[('transplant', 0.7767853736877441),
 ('marrow', 0.699995219707489),
 ('transplants', 0.6998592615127563),
 ('kidney', 0.6526087522506714),
 ('transplantation', 0.6381147503852844),
 ('tissue', 0.6344675421714783),
 ('liver', 0.6085026860237122),
 ('blood', 0.5676015615463257),
 ('heart', 0.5653558969497681),
 ('cells', 0.5476219058036804)]


embeddings.similar_by_vector(dictionary[1546,:])


[('commons', 0.7160810828208923),
 ('house', 0.6588335037231445),
 ('parliament', 0.5054076910018921),
 ('capitol', 0.5014163851737976),
 ('senate', 0.4895153343677521),
 ('hill', 0.48859673738479614),
 ('inn', 0.4566132128238678),
 ('congressional', 0.4341348707675934),
 ('congress', 0.42997264862060547),
 ('parliamentary', 0.4264637529850006)]


embeddings.similar_by_vector(dictionary[1850,:])


[('okano', 0.2669774889945984),
 ('erythrocytes', 0.25755012035369873),
 ('windir', 0.25621023774147034),
 ('reapportionment', 0.2507009208202362),
 ('qurayza', 0.2459488958120346),
 ('taschen', 0.24417680501937866),
 ('pfaffenbach', 0.2437630295753479),
 ('boldt', 0.2394050508737564),
 ('frucht', 0.23922981321811676),
 ('rulebook', 0.23821482062339783)]


Now let's look at which atoms are closest to the words "tie" and "spring".



itie = index2word.index('tie')
ispring = index2word.index('spring')

tie_emb = embedds[itie]
string_emb = embedds[ispring]


# Rank all atoms by cosine distance to the "tie" embedding (smaller means closer).
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, tie_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
nearest_atoms_ind = [ins[1] for ins in simlist[:15]]  # indices of the 15 closest atoms

# Describe each atom by the words whose embeddings are most similar to it.
for atoms_idx in nearest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring
Atom #58: game play match matches games played playing tournament players stadium
Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2
Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held
Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define
Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals
Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions
Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans
Atom #281: black wearing man pair white who girl young woman big
Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes
Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled
Atom #98: first team start final second next time before test after
Atom #1455: after later before when then came last took again but
Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed
Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore


# Same ranking for "spring"; `string_emb` holds its embedding (name kept from the cell above).
simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
nearest_atoms_ind = [ins[1] for ins in simlist[:15]]  # indices of the 15 closest atoms

for atoms_idx in nearest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre
Atom #1335: last ago year months years since month weeks week has
Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing
Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture
Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral
Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june
Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony
Atom #98: first team start final second next time before test after
Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday
Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic





Now let's try another model: fastText from RusVectores, with an embedding dimension of 300.



fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')


embeddings = fasttext_model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors


embedds.shape


(164996, 300)


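# Note: as before, a bare `%time` line times nothing; `%%time` would time the cell.
%time
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
# Only the first 10000 embeddings are used to fit the dictionary here.
embedding_trans = embeddings.vectors[:10000]
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)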


CPU times: user 1 µs, sys: 2 µs, total: 3 µs
Wall time: 6.2 µs


dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
# `files` is a plain list of stored array names; `keys()` may return a view that cannot be indexed.
dictionary = dictionary[dictionary.files[0]]


embeddings.similar_by_vector(dictionary[1024,:], 20)


[('', 0.6854609251022339),
 ('', 0.6593252420425415),
 ('', 0.6360634565353394),
 ('', 0.5998549461364746),
 ('', 0.5971367955207825),
 ('', 0.5862340927124023),
 ('', 0.5788886547088623),
 ('', 0.5788123607635498),
 ('', 0.5623885989189148),
 ('', 0.5610565543174744),
 ('', 0.5551878809928894),
 ('', 0.551397442817688),
 ('', 0.5356274247169495),
 ('', 0.531707227230072),
 ('', 0.5174376368522644),
 ('', 0.5131562948226929),
 ('', 0.5120065212249756),
 ('', 0.5077806115150452),
 ('', 0.5074601173400879),
 ('', 0.5068254470825195)]


embeddings.similar_by_vector(dictionary[1582,:], 20)


[('', 0.45191124081611633),
 ('', 0.4515378475189209),
 ('', 0.4478364586830139),
 ('', 0.4280813932418823),
 ('', 0.41220104694366455),
 ('', 0.40772825479507446),
 ('', 0.4047147035598755),
 ('', 0.4030646085739136),
 ('', 0.39368513226509094),
 ('', 0.39012178778648376),
 ('', 0.3866344690322876),
 ('', 0.37968817353248596),
 ('', 0.3728911876678467),
 ('', 0.3663109242916107),
 ('', 0.3640827238559723),
 ('', 0.3474290072917938),
 ('', 0.3473641574382782),
 ('', 0.3468908369541168),
 ('', 0.34586742520332336),
 ('', 0.34555742144584656)]


embeddings.similar_by_vector(dictionary[500,:], 20)


[('', 0.6874514222145081),
 ('-', 0.5172050595283508),
 ('', 0.46720415353775024),
 ('', 0.44713956117630005),
 ('', 0.4144558310508728),
 ('', 0.40545403957366943),
 ('', 0.4030636250972748),
 ('-', 0.4016447067260742),
 ('', 0.38331469893455505),
 ('', 0.37292781472206116),
 ('', 0.3625457286834717),
 ('', 0.35121074318885803),
 ('', 0.3504621088504791),
 ('', 0.34097471833229065),
 ('', 0.33320850133895874),
 ('', 0.3277249336242676),
 ('', 0.3266661763191223),
 ('', 0.31865227222442627),
 ('::', 0.30150306224823),
 ('', 0.2975207567214966)]


itie = index2word.index('')
ispring = index2word.index('')

tie_emb = embedds[itie]
string_emb = embedds[ispring]


simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, string_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #185:          
Atom #1217:         - 
Atom #1213:          
Atom #1978:          
Atom #1796:          
Atom #839:          
Atom #989:          
Atom #414:          
Atom #1140:       -   
Atom #878:          


simlist = []

for i, vector in enumerate(dictionary):
    simlist.append( (cosine(vector, tie_emb), i) )

simlist = sorted(simlist, key=lambda x: x[0])
six_atoms_ind = [ins[1] for ins in simlist[:10]]

for atoms_idx in six_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))


Atom #883:          -
Atom #40:          
Atom #215:          
Atom #688:          
Atom #386:          
Atom #676:          
Atom #414:          
Atom #127:          
Atom #592:          
Atom #703:    - -     


#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
#np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)


The method discussed here relates to the task of word sense induction (WSI).



UPD: knagaev .



