Analysis of an article on how to extract meaning from embeddings

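tl;dr: a simplified analysis of a paper in which the author proposes two interesting theorems and, building on them, finds a way to extract hidden sense vectors from an embedding matrix. Includes a guide to reproducing the results. The notebook is available on GitHub.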

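Introduction

In this article I want to talk about a surprising result that the researcher Sanjeev Arora describes in the paper "Linear Algebraic Structure of Word Senses, with Applications to Polysemy". It is one of a series of papers in which he tries to give a theoretical justification for the properties of word embeddings. In this work, Arora hypothesizes that simple embeddings such as word2vec or GloVe actually encode several senses of a word at once, and he proposes a way to recover them. Throughout the article I will try to stick to the original examples.

More formally, let $υ_{tie}$ denote an embedding vector of the word "tie", which may mean a knot or a necktie, or may be the verb "to tie up". Arora proposes writing this vector as the following linear combination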

υ_{tie} \approx α_{1} υ_{tie\,1} + α_{2} υ_{tie\,2} + α_{3} υ_{tie\,3} + \ldots

where $υ_{tie\,n}$ is one of the possible senses of the word "tie" and the $α$ are coefficients. Let's try to figure out how this comes about.

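Theory

Disclaimer

Written by a non-mathematician; please report any errors, especially in the mathematical terminology.

Arora's theory in brief

Since the work that Arora builds on is much more complex than this one, I have not prepared a full review of it. However, we will take a brief look at what it is about.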

So, Arora starts from the idea that any text is produced by a generative model. At each step $t$ of its operation, the model emits a word $w$. The model consists of a context vector $c$ and embedding vectors $υ_{w}$.

, .. - , . . , . : " " , " ". , "": , .

, . , , , .

: , . , $t$ $w$

P(w \mid c_{t}) = \frac{1}{Z_{c}} \exp \langle c_{t}, υ_{w} \rangle

where $c_{t}$ is the context vector at step $t$, $υ_{w}$ is the embedding vector of the word $w$, and $Z_{c} = \sum_{w} \exp \langle c, υ_{w} \rangle$ is the partition function.
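
To make the log-linear formula concrete, here is a minimal sketch that evaluates $P(w \mid c_{t})$ for a toy vocabulary with NumPy. The function name `word_probabilities` and the toy vectors are illustrative assumptions; they are not taken from Arora's paper or from the original notebook.

```python
import numpy as np

def word_probabilities(context, embeddings):
    """Return P(w | c) for every word w, given a context vector c.

    embeddings: array of shape (vocab_size, dim), one row per word vector.
    """
    scores = embeddings @ context      # inner products <c, v_w>
    scores = scores - scores.max()     # stabilise exp(); the shift cancels in the ratio
    weights = np.exp(scores)
    return weights / weights.sum()     # dividing by the partition function Z_c

# Toy data (hypothetical): 4 words in a 3-dimensional space.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])
toy_context = np.array([0.0, 0.0, 2.0])  # a context pointing towards word 2

probs = word_probabilities(toy_context, toy_embeddings)
print(probs)
```

The word whose vector has the largest inner product with the context gets the highest probability, and the probabilities sum to one by construction.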

. , , : , , , . Y, X .

. - , - .

, , . , , "". :

, ", , , ". , , , ", , , " , " " .

, . , , . , ( , ). , , . .

Theorem 1

, $s$ $n$ . $A$ ,

υ_{w} \approx A \, \mathbb{E} \left[ \frac{1}{n} \sum_{w_{i} \in s} υ_{w_{i}} \mid w \in s \right]

, . . $w$ . $S$ . , $υ_{s}$ $s \in S$ , $u$ . , , $u$ $υ_{w}$ $A$ ( ). , , out-of-vocabulary , , .

, . , SIF . , , , . , SIF $υ_{S I F}$ k, , $w$ , TF-IDF.

υ_{SIF} = \frac{1}{k} \sum_{n = 1}^{k} υ_{n} \cdot \mathrm{tf\_idf}(w_{n})
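
The weighted average above can be sketched in a few lines of NumPy. The helper `weighted_window_embedding`, the toy vectors and the weight values are hypothetical; in practice the weights would come from the TF-IDF statistics computed on your own corpus.

```python
import numpy as np

def weighted_window_embedding(vectors, weights):
    """Average the word vectors of a window, each scaled by its weight.

    vectors: (k, dim) array of word vectors in the window.
    weights: (k,) array of per-word weights (tf-idf in the formula above).
    """
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Scale every row by its weight, sum over the window, divide by k.
    return (vectors * weights[:, None]).sum(axis=0) / len(vectors)

toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_tfidf = np.array([0.5, 2.0, 1.0])  # made-up weights for illustration

v_window = weighted_window_embedding(toy_vectors, toy_tfidf)
print(v_window)
```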

, , 1, $c$ . , - , , .

. , - $w$ , $υ_{w}$ , . :

. $V$ .
$w^{'} \in V$ , , SIF 20 $w^{'}$ , . $w^{'} \in V$ $(ν_{w^{'}}^{1}, ν_{w^{'}}^{2},, . . . ν_{w^{'}}^{n},)$ , n — $w^{'}$ .
$u_{w^{'}}$ SIF $w^{'} \in V$ $u_{w^{'}} = \frac{1}{n} \sum_{t = 1}^{n} ν_{w^{'}}^{t}$ .
$\arg\min_{A} \sum_{w^{'} \in V} \| A u_{w^{'}} - υ_{w^{'}} \|_{2}^{2}$
SIF $υ_{w} = A u_{w}$
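
The regression step amounts to an ordinary least-squares problem. Below is a minimal sketch on synthetic data, assuming the averaged context vectors $u_{w^{'}}$ and the word vectors $υ_{w^{'}}$ are stacked row-wise into arrays. The names `u`, `v`, `A_true` and `A_hat` are mine, and the data is random rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 5, 200

# Synthetic stand-ins: rows of u are averaged context embeddings,
# rows of v are the corresponding word vectors.
A_true = rng.normal(size=(dim, dim))
u = rng.normal(size=(n_words, dim))
v = u @ A_true.T + 0.01 * rng.normal(size=(n_words, dim))

# lstsq solves u @ X ~ v in the least-squares sense, so X is A transposed.
X, *_ = np.linalg.lstsq(u, v, rcond=None)
A_hat = X.T

# Induce a vector for a new word from its averaged context embedding.
u_new = rng.normal(size=dim)
v_new = A_hat @ u_new

print(np.abs(A_hat - A_true).max())
```

With enough sampled words the recovered matrix is close to the one that generated the data, which is the situation the induction procedure relies on.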

, .. . 1/3 , A 2\3 . . .

#paragraphs	250k	500k	750k	1 million
cos similarity	0.94	0.95	0.96	0.96

Theorem 2

, $w$ $s_{1}$ $s_{2}$ . $υ_{w}$ - , . , , .. , , tie_1 tie_2, tie_1 — , tie2 — .

In that case $‖ υ_{w} - \bar{υ}_{w} ‖ \to 0$, where $\bar{υ}_{w}$ is

\bar{υ}_{w} = \frac{f_{1}}{f_{1} + f_{2}} υ_{s_{1}} + \frac{f_{2}}{f_{1} + f_{2}} υ_{s_{2}} = α υ_{s_{1}} + β υ_{s_{2}}

where $f_{1}$ and $f_{2}$ are the numbers of occurrences of the senses $s_{1}$ and $s_{2}$ in the corpus.
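
A small numeric illustration of this frequency-weighted mixture, with made-up two-dimensional sense vectors and made-up counts (the function `blended_sense_vector` is a hypothetical helper):

```python
import numpy as np

def blended_sense_vector(v_s1, v_s2, f1, f2):
    """Mix two sense vectors in proportion to how often each sense occurs."""
    alpha = f1 / (f1 + f2)
    beta = f2 / (f1 + f2)
    return alpha * np.asarray(v_s1, dtype=float) + beta * np.asarray(v_s2, dtype=float)

# Hypothetical sense vectors for two senses of "tie".
v_tie_1 = np.array([1.0, 0.0])
v_tie_2 = np.array([0.0, 1.0])

# Sense 1 is assumed to occur three times as often as sense 2.
v_tie = blended_sense_vector(v_tie_1, v_tie_2, f1=300, f2=100)
print(v_tie)
```

The more frequent sense dominates the mixed vector, which is why rare senses are harder to spot in the raw embedding.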

, , , , ? , $a l p h a$ . . , $c$ . , , . , , , , , . , , , (inner product) . , , - (, , , ), $υ_{t i e 1}$ , ! .

. ? $ℜ^{d}$ $k, n$ . $k < n$ , $A_{1}, A_{2}, . . ., A_{m}$ , ,

υ_{w} = \sum_{j = 1}^{m} α_{w, j} A_{j} + μ_{w}

where at most $k$ of the coefficients $α$ are nonzero and $μ_{w}$ is a noise vector.

\sum_{w} ‖ υ_{w} - \sum_{j = 1}^{m} α_{w, j} A_{j} ‖_{2}^{2}

, k (sparsity parameter), m — .. , . k-SVD. , . , $A$ , ( , $A$ ). , , - $A_{i}$ , , , $m$ . .
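
k-SVD alternates between a sparse coding step and a dictionary update. As a rough illustration of the sparse coding step only, here is a minimal orthogonal matching pursuit sketch in NumPy. It is not the implementation used by the `ksvd` package; the function name `omp` and the toy dictionary are assumptions made for this example.

```python
import numpy as np

def omp(dictionary, x, k):
    """Greedy sparse code of x using at most k atoms (rows of dictionary)."""
    residual = x.astype(float).copy()
    support = []
    sol = np.zeros(0)
    coefs = np.zeros(dictionary.shape[0])
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        j = int(np.argmax(np.abs(dictionary @ residual)))
        if j in support:
            break
        support.append(j)
        # Refit the coefficients of all chosen atoms by least squares.
        sub = dictionary[support].T                 # shape (dim, len(support))
        sol, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ sol
    if support:
        coefs[support] = sol
    return coefs

# Toy overcomplete dictionary: 5 unit-norm atoms in 4 dimensions.
toy_atoms = np.vstack([np.eye(4), np.array([[1.0, 1.0, 0.0, 0.0]]) / np.sqrt(2)])
x = 2.0 * toy_atoms[0] - 1.0 * toy_atoms[3]

code = omp(toy_atoms, x, k=2)
error = float(np.sum((x - code @ toy_atoms) ** 2))  # one term of the objective above
print(code, error)
```

Here `k` plays the role of the sparsity parameter: each vector may use at most `k` atoms, and the squared reconstruction error is the quantity that the objective sums over all words.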

, , .

import numpy as np

from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from scipy.spatial.distance import cosine
import warnings
warnings.filterwarnings('ignore')

1. Gensim

GloVe.

, 300- .

tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec("/home/astromis/Embeddings/glove.6B.300d.txt", tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)

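embeddings = model  # KeyedVectors already holds the vectors; the .wv alias is deprecated

index2word = embeddings.index2word  # gensim 3.x name; gensim 4 renamed it to index_to_key
embedds = embeddings.vectors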

print(embedds.shape)

(400000, 300)

400000 .

2. k-svd

. ksvd.

!pip install ksvd
from ksvd import ApproximateKSVD

Requirement already satisfied: ksvd in /home/astromis/anaconda3/lib/python3.6/site-packages (0.0.3)
Requirement already satisfied: numpy in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (1.14.5)
Requirement already satisfied: scikit-learn in /home/astromis/anaconda3/lib/python3.6/site-packages (from ksvd) (0.19.1)

, 2000 5.

: 10000 . , , , , .

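%%time
# %%time (cell magic) times the whole cell; a bare %time line times only an empty statement.
# Learn a dictionary of 2000 atoms, allowing at most 5 nonzero coefficients per embedding.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)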

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.54 µs

#gamma = np.load('./data/mats/.npz')
# dictionary_glove6b_300d.np.npz - whole matrix file
dictionary = np.load('./data/mats/dictionary_glove6b_300d_10000.np.npz')
dictionary = dictionary[dictionary.files[0]]  # .files lists the array names stored in the archive

#print(gamma.shape)
print(dictionary.shape)

(2000, 300)

#np.savez_compressed('gamma_glove6b_300d.npz', gamma)
#np.savez_compressed('dictionary_glove6b_300d.npz', dictionary)

, . .

embeddings.similar_by_vector(dictionary[1354,:])

[('slave', 0.8417330980300903),
 ('slaves', 0.7482961416244507),
 ('plantation', 0.6208109259605408),
 ('slavery', 0.5356900095939636),
 ('enslaved', 0.4814416170120239),
 ('indentured', 0.46423888206481934),
 ('fugitive', 0.4226764440536499),
 ('laborers', 0.41914862394332886),
 ('servitude', 0.41276970505714417),
 ('plantations', 0.4113745093345642)]

embeddings.similar_by_vector(dictionary[1350,:])

[('transplant', 0.7767853736877441),
 ('marrow', 0.699995219707489),
 ('transplants', 0.6998592615127563),
 ('kidney', 0.6526087522506714),
 ('transplantation', 0.6381147503852844),
 ('tissue', 0.6344675421714783),
 ('liver', 0.6085026860237122),
 ('blood', 0.5676015615463257),
 ('heart', 0.5653558969497681),
 ('cells', 0.5476219058036804)]

embeddings.similar_by_vector(dictionary[1546,:])

[('commons', 0.7160810828208923),
 ('house', 0.6588335037231445),
 ('parliament', 0.5054076910018921),
 ('capitol', 0.5014163851737976),
 ('senate', 0.4895153343677521),
 ('hill', 0.48859673738479614),
 ('inn', 0.4566132128238678),
 ('congressional', 0.4341348707675934),
 ('congress', 0.42997264862060547),
 ('parliamentary', 0.4264637529850006)]

embeddings.similar_by_vector(dictionary[1850,:])

[('okano', 0.2669774889945984),
 ('erythrocytes', 0.25755012035369873),
 ('windir', 0.25621023774147034),
 ('reapportionment', 0.2507009208202362),
 ('qurayza', 0.2459488958120346),
 ('taschen', 0.24417680501937866),
 ('pfaffenbach', 0.2437630295753479),
 ('boldt', 0.2394050508737564),
 ('frucht', 0.23922981321811676),
 ('rulebook', 0.23821482062339783)]

! , . . , , . "tie" "spring" .

itie = index2word.index('tie')
ispring = index2word.index('spring')

tie_emb = embedds[itie]
string_emb = embedds[ispring]

# Rank all atoms by cosine distance to the "tie" embedding
# (scipy's cosine() returns 1 - cosine similarity, so smaller means closer).
simlist = [(cosine(vector, tie_emb), i) for i, vector in enumerate(dictionary)]
simlist = sorted(simlist, key=lambda x: x[0])
nearest_atoms_ind = [ins[1] for ins in simlist[:15]]

# Describe each of the 15 nearest atoms by the words closest to it.
for atoms_idx in nearest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))

Atom #162: win victory winning victories wins won 2-1 scored 3-1 scoring
Atom #58: game play match matches games played playing tournament players stadium
Atom #237: 0-0 1-1 2-2 3-3 draw 0-1 4-4 goalless 1-0 1-2
Atom #622: wrapped wrap wrapping holding placed attached tied hold plastic held
Atom #1899: struggles tying tied inextricably fortunes struggling tie intertwined redefine define
Atom #1941: semifinals quarterfinals semifinal quarterfinal finals semis semi-finals berth champions quarter-finals
Atom #1074: qualifier quarterfinals semifinal semifinals semi finals quarterfinal champion semis champions
Atom #1914: wearing wore jacket pants dress wear worn trousers shirt jeans
Atom #281: black wearing man pair white who girl young woman big
Atom #1683: overtime extra seconds ot apiece 20-17 turnovers 3-2 halftime overtimes
Atom #369: snap picked snapped pick grabbed picks knocked picking bounced pulled
Atom #98: first team start final second next time before test after
Atom #1455: after later before when then came last took again but
Atom #1203: competitions qualifying tournaments finals qualification matches qualifiers champions competition competed
Atom #1602: hat hats mask trick wearing wears sunglasses trademark wig wore
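
The ranking loop above can also be written in vectorized form. This is a sketch under the assumption that the atoms are the rows of a NumPy array; `nearest_atoms` is a hypothetical helper, not part of gensim or ksvd, and it ranks by cosine similarity directly rather than by scipy's cosine distance.

```python
import numpy as np

def nearest_atoms(dictionary, word_vector, top_n=15):
    """Indices of the top_n atoms with the highest cosine similarity to word_vector."""
    atom_norms = np.linalg.norm(dictionary, axis=1)
    similarities = dictionary @ word_vector / (atom_norms * np.linalg.norm(word_vector))
    # Sort by descending similarity and keep the first top_n indices.
    return np.argsort(-similarities)[:top_n]

# Toy atoms and a toy word vector, made up for illustration.
toy_dictionary = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_word = np.array([2.0, 0.1])

top_two = nearest_atoms(toy_dictionary, toy_word, top_n=2)
print(top_two)
```

On a dictionary of 2000 atoms this replaces 2000 separate calls to `cosine` with a single matrix-vector product.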

# Same ranking for the "spring" embedding (stored in string_emb above).
simlist = [(cosine(vector, string_emb), i) for i, vector in enumerate(dictionary)]
simlist = sorted(simlist, key=lambda x: x[0])
nearest_atoms_ind = [ins[1] for ins in simlist[:15]]

# Describe each of the 15 nearest atoms by the words closest to it.
for atoms_idx in nearest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))

Atom #528: autumn spring summer winter season rainy seasons fall seasonal during
Atom #1070: start begin beginning starting starts begins next coming day started
Atom #931: holiday christmas holidays easter thanksgiving eve celebrate celebrations weekend festivities
Atom #1455: after later before when then came last took again but
Atom #754: but so not because even only that it this they
Atom #688: yankees yankee mets sox baseball braves steinbrenner dodgers orioles torre
Atom #1335: last ago year months years since month weeks week has
Atom #252: upcoming scheduled preparations postponed slated forthcoming planned delayed preparation preparing
Atom #619: cold cool warm temperatures dry cooling wet temperature heat moisture
Atom #1775: garden gardens flower flowers vegetable ornamental gardeners gardening nursery floral
Atom #21: dec. nov. oct. feb. jan. aug. 27 28 29 june
Atom #84: celebrations celebration marking festivities occasion ceremonies celebrate celebrated celebrating ceremony
Atom #98: first team start final second next time before test after
Atom #606: vacation lunch hour spend dinner hours time ramadan brief workday
Atom #384: golden moon hemisphere mars twilight millennium dark dome venus magic

! , , , .

, , . , , .

. fastText, RusVectores. 300.

fasttext_model = KeyedVectors.load('/home/astromis/Embeddings/fasttext/model.model')

embeddings = fasttext_model.wv

index2word = embeddings.index2word
embedds = embeddings.vectors

embedds.shape

(164996, 300)

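%%time
# %%time (cell magic) times the whole cell; a bare %time line times only an empty statement.
# Fit on the first 10000 vectors only; 2000 atoms, at most 5 nonzero coefficients each.
aksvd = ApproximateKSVD(n_components=2000, transform_n_nonzero_coefs=5)
embedding_trans = embeddings.vectors[:10000]
dictionary = aksvd.fit(embedding_trans).components_
gamma = aksvd.transform(embedding_trans)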

CPU times: user 1 µs, sys: 2 µs, total: 3 µs
Wall time: 6.2 µs

dictionary = np.load('./data/mats/dictionary_rus_fasttext_300d.npz')
dictionary = dictionary[dictionary.files[0]]  # .files lists the array names stored in the archive

embeddings.similar_by_vector(dictionary[1024,:], 20)

[('', 0.6854609251022339),
 ('', 0.6593252420425415),
 ('', 0.6360634565353394),
 ('', 0.5998549461364746),
 ('', 0.5971367955207825),
 ('', 0.5862340927124023),
 ('', 0.5788886547088623),
 ('', 0.5788123607635498),
 ('', 0.5623885989189148),
 ('', 0.5610565543174744),
 ('', 0.5551878809928894),
 ('', 0.551397442817688),
 ('', 0.5356274247169495),
 ('', 0.531707227230072),
 ('', 0.5174376368522644),
 ('', 0.5131562948226929),
 ('', 0.5120065212249756),
 ('', 0.5077806115150452),
 ('', 0.5074601173400879),
 ('', 0.5068254470825195)]

embeddings.similar_by_vector(dictionary[1582,:], 20)

[('', 0.45191124081611633),
 ('', 0.4515378475189209),
 ('', 0.4478364586830139),
 ('', 0.4280813932418823),
 ('', 0.41220104694366455),
 ('', 0.40772825479507446),
 ('', 0.4047147035598755),
 ('', 0.4030646085739136),
 ('', 0.39368513226509094),
 ('', 0.39012178778648376),
 ('', 0.3866344690322876),
 ('', 0.37968817353248596),
 ('', 0.3728911876678467),
 ('', 0.3663109242916107),
 ('', 0.3640827238559723),
 ('', 0.3474290072917938),
 ('', 0.3473641574382782),
 ('', 0.3468908369541168),
 ('', 0.34586742520332336),
 ('', 0.34555742144584656)]

embeddings.similar_by_vector(dictionary[500,:], 20)

[('', 0.6874514222145081),
 ('-', 0.5172050595283508),
 ('', 0.46720415353775024),
 ('', 0.44713956117630005),
 ('', 0.4144558310508728),
 ('', 0.40545403957366943),
 ('', 0.4030636250972748),
 ('-', 0.4016447067260742),
 ('', 0.38331469893455505),
 ('', 0.37292781472206116),
 ('', 0.3625457286834717),
 ('', 0.35121074318885803),
 ('', 0.3504621088504791),
 ('', 0.34097471833229065),
 ('', 0.33320850133895874),
 ('', 0.3277249336242676),
 ('', 0.3266661763191223),
 ('', 0.31865227222442627),
 ('::', 0.30150306224823),
 ('', 0.2975207567214966)]

itie = index2word.index('')
ispring = index2word.index('')

tie_emb = embedds[itie]
string_emb = embedds[ispring]

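# Rank atoms by cosine distance to the second word's embedding (string_emb);
# scipy's cosine() returns 1 - cosine similarity, so smaller means closer.
simlist = [(cosine(vector, string_emb), i) for i, vector in enumerate(dictionary)]
simlist = sorted(simlist, key=lambda x: x[0])
nearest_atoms_ind = [ins[1] for ins in simlist[:10]]

# Describe each of the 10 nearest atoms by the words closest to it.
for atoms_idx in nearest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))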

Atom #185:          
Atom #1217:         - 
Atom #1213:          
Atom #1978:          
Atom #1796:          
Atom #839:          
Atom #989:          
Atom #414:          
Atom #1140:       -   
Atom #878:

# Same ranking for the first word's embedding (tie_emb).
simlist = [(cosine(vector, tie_emb), i) for i, vector in enumerate(dictionary)]
simlist = sorted(simlist, key=lambda x: x[0])
nearest_atoms_ind = [ins[1] for ins in simlist[:10]]

# Describe each of the 10 nearest atoms by the words closest to it.
for atoms_idx in nearest_atoms_ind:
    nearest_words = embeddings.similar_by_vector(dictionary[atoms_idx,:])
    nearest_words = [word[0] for word in nearest_words]
    print("Atom #{}: {}".format(atoms_idx, ' '.join(nearest_words)))

Atom #883:          -
Atom #40:          
Atom #215:          
Atom #688:          
Atom #386:          
Atom #676:          
Atom #414:          
Atom #127:          
Atom #592:          
Atom #703:    - -

#np.savez_compressed('./data/mats/gamma_rus_fasttext_300d.npz', gamma)
#np.savez_compressed('./data/mats/dictionary_rus_fasttext_300d.npz', dictionary)

, (Word sense induction), , 1. — , . , , . , , , . , .

UPD: knagaev .
