Links/Papers: Word2Vec + History + Stuff
Today's topic: Word2Vec and its predecessors.
This post belongs to the ‘notes.to_self’ category; it contains less commentary and more links.
Background: UoH Deep learning 2017 course covers NLP applications, starting with word embeddings, then continuing with CNNs and RNNs/LSTMs.
Key references appear to be [1], [2], [3].
- skip-grams:
- Shallow neural model. Given a word, learn a representation that is good at predicting the context of the word.
- CBOW (continuous bag-of-words):
- Given the context, predict the target word.
CBOW and skip-grams are mirror images of each other.
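A minimal sketch of the difference on a toy sentence of my own (the window size and everything else here is illustrative, not taken from the papers):

```python
# How skip-gram and CBOW turn the same text into (input, target) training pairs.
# Toy sentence and window size are illustrative only.

def skipgram_pairs(tokens, window=2):
    """Skip-gram: the center word is used to predict each context word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))      # (input word, word to predict)
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: the bag of context words is used to predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))                # (input context, word to predict)
    return pairs

sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence))  # ('the', 'quick'), ('the', 'brown'), ...
print(cbow_pairs(sentence))      # (['quick', 'brown'], 'the'), ...
```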
- problem with learning skip-grams by naively maximizing the softmax probability: each update requires operations on the scale of the whole vocabulary
- old way: hierarchical softmax.
- Mikolov et al. 2013 [1] idea: negative sampling.
- idea: learn model parameters that discriminate well between samples from the data and negative examples drawn from a noise distribution (a small numpy sketch follows this list)
- predecessors:
- “Noise Contrastive Estimation” by Gutmann and Hyvärinen [5]
- Mnih and Teh [6] adapt NCE for natural language.
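To make the negative-sampling idea concrete, here is a small numpy sketch of the per-pair loss, following the derivation in [4]; the vocabulary size, embedding dimension, fake counts, and the unigram^0.75 noise distribution are illustrative assumptions, not code from any of the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                        # vocab size, embedding dim, negatives per pair
W_in = 0.01 * rng.standard_normal((V, d))    # "input" (word) vectors
W_out = 0.01 * rng.standard_normal((V, d))   # "output" (context) vectors

counts = rng.integers(1, 100, size=V)        # pretend unigram counts
noise_p = counts ** 0.75
noise_p = noise_p / noise_p.sum()            # noise distribution ~ unigram^(3/4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(word, context):
    """-log sigma(v_c . v_w) - sum over k negatives of log sigma(-v_neg . v_w)."""
    v_w = W_in[word]
    pos = np.log(sigmoid(W_out[context] @ v_w))        # score the true (word, context) pair
    negs = rng.choice(V, size=k, p=noise_p)            # sample k "noise" context words
    neg = np.log(sigmoid(-(W_out[negs] @ v_w))).sum()  # push their scores down
    return -(pos + neg)   # minimized per pair, instead of a full-vocabulary softmax

print(neg_sampling_loss(word=3, context=17))
```

If I read [1] correctly, small values of k are reported to work well (roughly 5–20 for small datasets, 2–5 for large ones).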
- Word2Vec tutorials:
- Goldberg and Levy [4] provide a detailed derivation of negative sampling.
- https://www.tensorflow.org/tutorials/word2vec
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
- Rong [7] explains the CBOW and skip-gram architectures and the backpropagation update equations.
- Related: GloVe
- https://nlp.stanford.edu/projects/glove/
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.”
- I don’t think I’ve understood GloVe well enough to give a proper short description of what it does, but it is sort of derived from skip-gram (and maybe this means I have not yet understood word2vec sufficiently well either). See section 3.1 of the paper; my reading of the cost function is sketched below.
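For my own reference, here is how I read the weighted least-squares cost in the GloVe paper, as a hedged numpy sketch (the co-occurrence matrix X and all sizes here are made-up placeholders; x_max=100 and alpha=0.75 are the weighting-function values given in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 200, 25
X = rng.poisson(1.0, size=(V, V)).astype(float)   # pretend word-word co-occurrence counts
w = 0.01 * rng.standard_normal((V, d))            # word vectors
w_tilde = 0.01 * rng.standard_normal((V, d))      # context word vectors
b, b_tilde = np.zeros(V), np.zeros(V)             # per-word biases

def weighting(x, x_max=100.0, alpha=0.75):
    """f(X_ij): caps the influence of very frequent co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost():
    i, j = np.nonzero(X)                                   # sum only over observed pairs
    pred = (w[i] * w_tilde[j]).sum(axis=1) + b[i] + b_tilde[j]
    err = pred - np.log(X[i, j])                           # fit the log co-occurrence count
    return float((weighting(X[i, j]) * err ** 2).sum())    # weighted least squares

print(glove_cost())
```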
- See also:
- neural language models
- latent semantic analysis: a latent-space model for language built from an SVD of a term-document matrix (small sketch below)
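A tiny illustrative sketch of LSA (toy corpus and rank chosen just for the example): build a term-document count matrix and keep the top components of its SVD as the latent word space.

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are animals"]
vocab = sorted({w for doc in docs for w in doc.split()})

# Term-document count matrix: rows are words, columns are documents.
A = np.array([[doc.split().count(w) for doc in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep the top-k latent dimensions
word_vectors = U[:, :k] * s[:k]        # each row: a word in the latent space

for word, vec in zip(vocab, word_vectors):
    print(word, vec.round(2))
```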
References
[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. “Distributed representations of words and phrases and their compositionality.” NIPS 2013. https://arxiv.org/abs/1310.4546
[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient estimation of word representations in vector space.” https://arxiv.org/abs/1301.3781
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” http://www.aclweb.org/anthology/N13-1090
[4] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722
[5] Michael U. Gutmann and Aapo Hyvärinen. 2012. “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.” JMLR 13:1. https://dl.acm.org/citation.cfm?id=2188396
[6] Andriy Mnih and Yee Whye Teh. 2012. “A Fast and Simple Algorithm for Training Neural Probabilistic Language Models.” https://arxiv.org/abs/1206.6426
[7] Xin Rong. 2014 (latest arXiv revision 2016). “word2vec Parameter Learning Explained.” https://arxiv.org/abs/1411.2738