Talk:Word2vec
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
This article links to one or more target anchors that no longer exist. Please help fix the broken anchors. You can remove this template after fixing the problems.
Wiki Education Foundation-supported course assignment
This article is or was the subject of a Wiki Education Foundation-supported course assignment. Further details are available on the course page. Student editor(s): Akozlowski, JonathanSchoots, Shevajia. Peer reviewers: JonathanSchoots, Shevajia.
Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 05:04, 18 January 2022 (UTC)
The Math is wrong
Some misunderstandings of the algorithm are evident.
Word2vec learns _two_ sets of weights - call them $W$ and $W'$. The first one, $W$, encodes the properties of each word that apply when it is the subject (the central word in the context window) - this is the actual "word embedding". The other set of weights, $W'$, is stored in the "hidden layer" in the neural net used to train $W$, and encodes the dual of those properties - these vectors represent the words appearing in the context window. $W$ and $W'$ have the same dimensions (one vector per vocabulary word), and are jointly optimised.
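To make the two-matrix structure concrete, here is a minimal numpy sketch of one full-softmax skip-gram update (names like `W_prime` and `train_step` and the sizes are illustrative, not taken from the reference C implementation); the point is that both matrices receive gradients on every step:

```python
import numpy as np

# Illustrative sizes; real models use the corpus vocabulary and 100-1000 dimensions.
vocab_size, dim = 10_000, 300
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(vocab_size, dim))        # subject ("input") embeddings
W_prime = rng.normal(scale=0.01, size=(vocab_size, dim))  # context ("output") embeddings

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_step(W, W_prime, s, c, lr=0.025):
    """One (subject s, context c) training pair, full-softmax skip-gram."""
    scores = W_prime @ W[s]               # one score per vocabulary word: W'_w . W_s
    err = softmax(scores)                 # predicted P(w | s) over the vocabulary...
    err[c] -= 1.0                         # ...minus the one-hot target: d(-log P(c|s))/d(scores)
    grad_s = W_prime.T @ err              # gradient for the subject vector W_s
    W_prime -= lr * np.outer(err, W[s])   # both matrices receive updates...
    W[s] -= lr * grad_s                   # ...and are optimised jointly

train_step(W, W_prime, s=42, c=7)         # e.g. one co-occurrence of words 42 and 7
```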
To estimate $\log P(c|s)$, you must take the dot product $W'_c \cdot W_s$ - and not $W_c \cdot W_s$ as stated in the article. To see this, notice that the second expression will always predict that a subject word $s$ should appear next to itself.
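Written out with the notation above (and $V$ the vocabulary), the proposed correction reads $P(c|s) = \frac{\exp(W'_c \cdot W_s)}{\sum_{w \in V} \exp(W'_w \cdot W_s)}$, so that $\log P(c|s) = W'_c \cdot W_s - \log \sum_{w \in V} \exp(W'_w \cdot W_s)$.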
As an example of why this works, assume that some direction in the column space of $W$ learns to encode some specific property of words (eg: "I am a verb"). Then, that same direction in $W'$ will learn to encode the dual property ("I appear near verbs"). So the predicted probability that two words should appear nearby, $\log P(c|s) = W'_c \cdot W_s$, is increased when $s$ has a property (in $W$) and $c$ has its dual (in $W'$).
For the softmax variant of word2vec, $W$ represents the word embeddings (subject-word-to-vector), while $W'$ learns the "estimator" embeddings (surrounding-words-to-vector). From the user's point of view, $W'$ is just some hidden layer in the neural net used to train $W$ - you ship $W$ as your trained word embeddings and discard $W'$. You _might_ sometimes retain $W'$ - for example, a language model's input layer could use $W'$ to substitute for out-of-vocabulary inputs by estimating the unknown input word's vector embedding using the dual vectors of the words around it (this is simply the average of those dual vectors - although if your training subsampled more distant context words, you'll want to use a weighted average).
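As a sketch of that out-of-vocabulary trick (purely illustrative; `W_prime`, `context_ids` and `weights` are made-up names, and the weighting should mirror whatever subsampling the training run used):

```python
import numpy as np

def estimate_oov_vector(W_prime, context_ids, weights=None):
    """Estimate an embedding for an unknown word from the dual (context) vectors
    of the in-vocabulary words around it: a plain or weighted average."""
    duals = W_prime[context_ids]                 # dual vectors of the surrounding words
    if weights is None:
        return duals.mean(axis=0)                # unweighted average
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * duals).sum(axis=0) / weights.sum()
```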
For the discriminative variant, the interpretation of $W$ and $W'$ is a little muddier: the two weight vectors are treated completely symmetrically by the training algorithm, so for any specific property, you can't know which set of weights will learn to code for the property and which will code for its dual. But it turns out that this doesn't matter: both matrices learn all of the same semantic information (just encoded differently), and whatever language model is built on top of them should be able to disentangle the dual embeddings as easily as the primaries. It's also harder to trust that $W'_c \cdot W_s$ represents a good estimate of a log probability (there was no softmax, so the vectors weren't optimised for normalised probabilities) - meaning that the out-of-vocab trick isn't as mathematically justified.
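For comparison, a minimal sketch of the discriminative (negative-sampling) objective for one training pair, in the same notation (`negatives` would be word indices drawn from the noise distribution; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(W, W_prime, s, c, negatives):
    """Loss for one observed (subject s, context c) pair: push up the score
    W'_c . W_s and push down the scores against the sampled noise words.
    No normaliser over the vocabulary is computed, so the raw dot products
    are not calibrated log probabilities."""
    pos = np.log(sigmoid(W_prime[c] @ W[s]))
    neg = np.log(sigmoid(-(W_prime[negatives] @ W[s]))).sum()
    return -(pos + neg)
```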
Note that most of the above comments apply to the skipgram model; I haven't examined CBOW in detail.
Anyway, I added this here in Talk (rather than fixing the main page) because I don't have time to do a polished, professional rewrite. If you feel up to it, the core fixes would be to mention the role of the hidden weights ($W'$), fix the dot product, and fix the softmax - the normaliser (denominator) of the softmax should be summed over $W'$ (the decoder weights), rather than over $W$.
174.164.223.51 (talk) 23:20, 10 January 2024 (UTC)
(just adding to my earlier comment). 174.165.217.42 (talk) 09:41, 9 June 2024 (UTC)
Introduction could use some work
The introduction to this article could use some work to comply with Wikipedia guidelines: https://en.wiki.x.io/wiki/Wikipedia:Manual_of_Style/Lead_section#Introductory_text
Specifically, a great deal of domain knowledge is needed to make sense of the existing introduction. Simplifying it would reduce the burden on the reader and help more readers understand what this article is about.
Second that; the intro is NOT written at a level appropriate for a general encyclopedia — Preceding unsigned comment added by 194.144.243.122 (talk) 12:49, 26 June 2019 (UTC)
I took a shot at writing a clearer introduction. Since the previous text was not wrong, I transformed it to become the new first and second sections. Jason.Rafe.Miller (talk) 16:15, 31 July 2020 (UTC)
Extensions not relevant
There are numerous extensions to word2vec, and the two mentioned in the corresponding section are nowhere near the most relevant, especially not IWE. Given that the page only links out to fastText and GloVe, and the discussion of BioVectors doesn't even explain how they're useful, this section seems to need an overhaul. — Preceding unsigned comment added by 98.109.81.250 (talk) 23:38, 23 December 2018 (UTC)
Iterations
Could we also talk about iterations? I experimented with their effect on the stability of similarity scores; the number of iterations is also a hyperparameter. — Preceding unsigned comment added by 37.165.197.250 (talk) 04:47, 10 September 2019 (UTC)
Wiki Education assignment: Public Writing
This article was the subject of a Wiki Education Foundation-supported course assignment, between 7 September 2022 and 8 December 2022. Further details are available on the course page. Student editor(s): Singerep (article contribs).
— Assignment last updated by Singerep (talk) 03:07, 3 October 2022 (UTC)
Semantics of the term vector space
I am confused about the terms “produces” and “generates” when it comes to the algorithm producing a vector space. I am just looking for clarity on semantics. It seems like the algorithm finds a numerical vector space to embed the word vectors into, rather than the word vectors alone forming a vector space. Technically speaking, I have been looking for a reference that explains the vector space operations (vector addition and scalar multiplication) more clearly, but I have the feeling that the set of word vectors should be thought of as a set (not a vector space) that can be embedded into a vector space, rather than as a vector space in itself. To be clear, I don't know if I am thinking about this correctly, just looking for clarification. Addison314159 (talk) 17:56, 1 November 2022 (UTC)
Controversies?
There have been a number of controversies about the real-life usage of word2vec and the gender bias it incorporates, such as the "doctor - man + woman = nurse" or "computer programmer - man + woman = homemaker" examples, and I think this page should reflect some of these, even if this is a more general problem related to AI bias. This topic is discussed in depth in the book "The Alignment Problem" by Brian Christian. 80.135.157.222 (talk) 14:28, 8 January 2023 (UTC)