Another topic for today: Word2Vec and its predecessors.

This post belongs to the ‘notes.to_self’ category; it contains less commentary and more links.

Background: the UoH Deep Learning 2017 course covers NLP applications, starting with word embeddings and then continuing with CNNs and RNNs/LSTMs.

What is Word2Vec: Wikipedia.

  • Key references appear to be [1], [2], [3].

  • skip-gram:
    • Shallow neural model. Given a word, learn a representation that is good at predicting the words in that word’s context.
  • CBOW (continuous bag-of-words):
    • Given the context, predict the target word.
  • CBOW and skip-gram are mirror images of each other (see the pair-generation sketch below).
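
As a concrete contrast between the two architectures, here is a toy sketch of how training pairs are generated from one tokenized sentence with a context window of 2. This is my own illustration, not code from the papers; the sentence and window size are arbitrary.

```python
# Toy illustration of skip-gram vs. CBOW training pairs (window size 2).
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

skipgram_pairs = []   # (input = center word, target = one context word)
cbow_pairs = []       # (input = all context words, target = center word)

for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    # skip-gram: the center word predicts each context word separately
    skipgram_pairs.extend((center, c) for c in context)
    # CBOW: the (averaged) context predicts the center word
    cbow_pairs.append((context, center))

print(skipgram_pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
print(cbow_pairs[1])       # (['the', 'brown', 'fox'], 'quick')
```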

  • problem with learning skip-gram by naively maximizing the softmax probability: the normalization term requires a sum over the whole vocabulary for every update
    • old way: hierarchical softmax.
    • Mikolov et al. 2013 [1] idea: negative sampling (see the sketch after this list).
      • idea: learn model parameters that discriminate well between samples from the data and negative examples drawn from a noise distribution
      • predecessors:
        • “Noise Contrastive Estimation” by Gutmann and Hyvärinen [5]
        • Mnih and Teh [6] adapt NCE for natural language.
  • Word2Vec tutorials:
    • Goldberg and Levy [4] provide a detailed derivation of negative sampling.
    • https://www.tensorflow.org/tutorials/word2vec
    • http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
    • Rong [7] explains the CBOW and skip-gram architectures and the backpropagation update equations.
  • Related: GloVe
    • https://nlp.stanford.edu/projects/glove/
    • Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. “GloVe: Global Vectors for Word Representation.” pdf
      • I don’t think I’ve understood GloVe well enough to give a proper short description of what it does, but it is loosely derived from the skip-gram objective (and maybe this means I have not yet understood word2vec sufficiently well either). See section 3.1 of the paper, and the objective written out after this list.
  • See also:
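
To make the softmax-vs-negative-sampling contrast concrete: a full softmax needs the normalizing sum over all V output vectors for every training pair, while negative sampling touches only the one positive context word plus k sampled noise words. Below is my own rough numpy sketch of one SGD step on the skip-gram negative-sampling objective (roughly following [1] and the derivation in [4]); the dimensions, learning rate, and the uniform noise distribution are placeholders (the paper uses the unigram distribution raised to the 3/4 power).

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, k = 10_000, 100, 5                     # vocab size, embedding dim, negatives per pair
W_in = 0.01 * rng.standard_normal((V, d))    # "input" (center-word) vectors
W_out = 0.01 * rng.standard_normal((V, d))   # "output" (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, noise_dist, lr=0.025):
    """One SGD step on  log s(v_ctx . v_c) + sum_k log s(-v_neg . v_c)  (maximized)."""
    negatives = rng.choice(V, size=k, p=noise_dist)    # draw k noise words
    v_c = W_in[center]                                 # (d,)
    out_idx = np.concatenate(([context], negatives))   # 1 positive + k negatives
    labels = np.zeros(k + 1); labels[0] = 1.0          # 1 for the true context word
    U = W_out[out_idx]                                 # (k+1, d)
    scores = sigmoid(U @ v_c)                          # (k+1,)
    # Negated objective, i.e. binary cross-entropy over k+1 tiny classifiers.
    loss = -np.log(scores[0]) - np.sum(np.log(1.0 - scores[1:]))
    # Gradients touch only k+1 rows of W_out and one row of W_in --
    # nothing scales with the vocabulary size.
    g = scores - labels                                # (k+1,)
    W_out[out_idx] -= lr * np.outer(g, v_c)
    W_in[center]   -= lr * (g @ U)
    return loss

# Placeholder noise distribution; see [1] for the unigram^(3/4) version.
uniform = np.full(V, 1.0 / V)
print(sgns_step(center=42, context=137, noise_dist=uniform))
```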

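For my own later reference, the GloVe objective as I read it from the paper: a weighted least-squares fit of word vectors w_i, context vectors w̃_j, and biases b_i, b̃_j to the log co-occurrence counts X_ij (section 3.1 then relates this to the skip-gram objective). This is a transcription of the paper’s formula, not a derivation:

```latex
% GloVe: weighted least squares on log co-occurrence counts
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```
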
References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. “Distributed representations of words and phrases and their compositionality.” NIPS 2013. https://arxiv.org/abs/1310.4546

[2] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient estimation of word representations in vector space.” https://arxiv.org/abs/1301.3781

[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. “Linguistic Regularities in Continuous Space Word Representations.” NAACL-HLT 2013. http://www.aclweb.org/anthology/N13-1090

[4] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

[5] Michael U. Gutmann and Aapo Hyvärinen. 2012. “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.” JMLR 13:1. https://dl.acm.org/citation.cfm?id=2188396

[6] Andriy Mnih and Yee Whye Teh. 2012. “A Fast and Simple Algorithm for Training Neural Probabilistic Language Models.” ICML 2012. https://arxiv.org/abs/1206.6426

[7] Xin Rong. 2014 (latest arXiv revision 2016). “word2vec Parameter Learning Explained.” https://arxiv.org/abs/1411.2738