class: center, middle background-image:url(images/data-background-light.jpg) # Lexical Resources ## Master TAL, Nancy, 2019-2020 .footnote[.bold[[Christophe Cerisara](mailto:cerisara@loria.fr) CNRS / LORIA]] --- .center[ ## Topic of the day ] ### Word semantic representations We have seen: - How to extract lexicons from texts - How to capture local syntactic information (n-grams) - How to augment a lexicon with multi-word expressions - How to capture (weak) information about morphology Today's focus: lexical semantics --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers ] --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers - Discrete representations: cat=1, table=2, dog=3 ] --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers - Discrete representations: cat=1, table=2, dog=3 - But distances are misleading: d(cat,dog)=2, d(cat,table)=1 ] --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers - Discrete representations: cat=1, table=2, dog=3 - But distances are misleading: d(cat,dog)=2, d(cat,table)=1 One-dimensional representations force ordering: ... < CAT < TABLE < DOG < ... ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector ] .right-column[ N-dimensional representations allow words to all be at the same distance: .center[
] ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector ] .right-column[ N-dimensional representations allow words to all be at the same distance: .center[
] **One-hot vectors** .center[
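A minimal numpy sketch (my own toy example, not from the original slides) of this property: with one-hot vectors, every pair of distinct words is at exactly the same distance.

```
import numpy as np

vocab = ["cat", "table", "dog"]
# one-hot vectors = rows of the identity matrix, one dimension per word
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# every pair of distinct words is at the same Euclidean distance (sqrt(2))
print(np.linalg.norm(one_hot["cat"] - one_hot["dog"]))    # 1.41...
print(np.linalg.norm(one_hot["cat"] - one_hot["table"]))  # 1.41...
```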
] ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector #### Embeddings ] .right-column[ - But having all words at the same distance is not ideal - And we face the curse of dimensionality ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector #### Embeddings ] .right-column[ - But having all words at the same distance is not ideal - And we face the curse of dimensionality We want to find word vectors that encode part of lexical semantics: .center[
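As an illustration only (the 2-dimensional values below are hand-picked, not taken from any real model), dense vectors can place related words close together:

```
import numpy as np

# hypothetical dense embeddings, chosen by hand for illustration
emb = {"cat": np.array([0.9, 0.1]),
       "dog": np.array([0.8, 0.2]),
       "table": np.array([0.1, 0.9])}

print(np.linalg.norm(emb["cat"] - emb["dog"]))    # small distance: cat and dog are close
print(np.linalg.norm(emb["cat"] - emb["table"]))  # large distance: cat and table are far apart
```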
] ] --- .center[ ## Embeddings ] .center[
] - Long history - Acceleration in the last 2 years - Colors = types of approaches --- .center[ ## Embeddings ] .center[
] Prehistory (?): vector space models --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis ] .right-column[ "You shall know a word by the company it keeps" [Firth, 1957] ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis ] .right-column[ "You shall know a word by the company it keeps" [Firth, 1957] - Distributional semantics is a theory of meaning - Vector Space Models are an implementation of DS - Neural embeddings are too! ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ] .right-column[ - The term-document matrix gives word-document co-occurrence counts:
.tablematrix[
Lemma | Doc1 | Doc2
-----------|------|-----
cat | 5 | 2
dog | 7 | 0
table | 2 | 6
feline | 3 | 0
]
- Dot-product between 2 vectors: $$X \cdot Y = \sum\_i X\_i Y\_i$$ ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ] .right-column[ - terms are similar if they tend to occur in the same documents - the dot product of two rows measures the similarity between the corresponding terms:
```
import numpy

# term vectors = rows of the term-document matrix
cat = numpy.array([5, 2])
dog = numpy.array([7, 0])
table = numpy.array([2, 6])

numpy.dot(cat, dog)    # 35: cat and dog tend to occur in the same documents
numpy.dot(cat, table)  # 22: cat and table co-occur less
```
] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ] .right-column[ Main issues with this basic term-document matrix: - Dimensions quickly become very large - Contains lots of noise ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ] .right-column[ ### Latent Semantic Analysis Deerwester et al., 1990: - Singular Value Decomposition $$X\_{M\times N} = U\_{M\times k} \Sigma\_{k\times k} V\_{k\times N}^T$$ - $U$ projects the original term vectors into a $k$-dimensional subspace, with $k=\min(M,N)$ for the full SVD - each row $t_i$ of $U$ corresponds to one term - each column $d_j$ of $V^T$ corresponds to one document - $\Sigma$ is diagonal = singular values: we keep only the largest ones (truncated SVD) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ] .right-column[ ### Latent Semantic Analysis - New term vectors = $\Sigma^{(k)} t_i$ (see the numpy sketch below) - Dimensions get combined into the subspace: - handles synonymy: (cat, feline) becomes (1.9*cat + 0.2*feline) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ] .right-column[ ### Latent Semantic Analysis - Deerwester et al., 1990 - Landauer, 1997: good results on the TOEFL synonym questions - Turney, 2010: shows that dimensions encode lexical or topical meanings ] --- .center[ ## Embeddings ] .center[
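Following up on the LSA slides above, here is a minimal numpy sketch of the idea on the toy term-document matrix (the truncation rank is an illustrative choice, not part of the original course material):

```
import numpy as np

# toy term-document matrix (rows: cat, dog, table, feline)
X = np.array([[5, 2],
              [7, 0],
              [2, 6],
              [3, 0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # keep the k largest singular values (here all of them, the matrix is tiny)
terms = U[:, :k] * s[:k]  # new term vectors: rows of U scaled by the singular values
print(terms.round(2))
```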
] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ] .right-column[ ### Random Indexing - LSA issues: - SVD is costly - Need to retrain when adding documents ! - Sahlgren, 2006 - Fast and online method - Starts from random low-dimensional term vectors - sum vectors that co-occur ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ] .right-column[ ### Other Vector Space Models - Other derivatives of LSA: - Hyperspace Analogue to Language (HAL) (Burgess, 1997) - BEAGLE (Jones, 2007) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ##### Word context ] .right-column[ ### Term-context (co-occurrence) matrix - 2 words co-occur if they are in the same sentence / window - Extensions: syntactic context... - Pointwise mutual information (Church, 1989) - Do words x and y co-occur more often than if they were independent ? (see the sketch below) - Or TF-IDF (Spärck Jones, 1972) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ##### Word context ##### GloVe ] .right-column[ ### GloVe - Pennington, 2014 - Trains a log-linear model on the word co-occurrence matrix $X$: $$J=\sum\_{i,j} f(X\_{ij})( w\_i^Tw\_j + b\_i + b\_j - \log X\_{ij} )^2$$ - Intuition: the dot product of two word vectors (plus biases) should become equal to $\log X\_{ij}$ - then, $(w\_i-w\_j)^Tw\_k \simeq \log \frac{X\_{ik}}{X\_{jk}}$ - $\simeq$ do $i$ and $j$ share the same contexts $k$ ? - as good as word2vec! ] --- .center[ ## Embeddings ] .center[
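Following up on the term-context slides above, a small sketch (toy counts of my own) of pointwise mutual information on a word co-occurrence matrix; positive PMI (PPMI) is the variant most often used in practice:

```
import numpy as np

# toy word-word co-occurrence counts
X = np.array([[0., 10., 1.],
              [10., 0., 2.],
              [1., 2., 0.]])

P = X / X.sum()                    # joint probabilities P(i, j)
Pi = P.sum(axis=1, keepdims=True)  # marginals P(i)
Pj = P.sum(axis=0, keepdims=True)  # marginals P(j)

with np.errstate(divide="ignore"):
    pmi = np.log(P / (Pi * Pj))    # PMI(i, j) = log [ P(i,j) / (P(i) P(j)) ]
ppmi = np.maximum(pmi, 0)          # keep only the positive values (PPMI)
print(ppmi.round(2))
```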
] Step 2: the deep learning explosion --- .center[ ## Bayesian perspective ] .left-column[ #### LDA ] .right-column[ ### Latent Dirichlet Allocation - Blei, Ng & Jordan, 2003 - Define a model that randomly generates a topic, and then a text about this topic - Infer the parameters that best explain the text corpus .center[
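A minimal LDA sketch with gensim (the library presented in the tools section below); the toy corpus and the number of topics are arbitrary choices for illustration:

```
from gensim import corpora, models

texts = [["cat", "dog", "feline", "pet"],
         ["table", "chair", "kitchen"],
         ["dog", "pet", "cat"]]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts per document

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())  # contribution of each word to each topic
```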
] ] --- .center[ ## Bayesian perspective ] .left-column[ #### LDA ] .right-column[ ### Latent Dirichlet Allocation - Word vector: contribution of the word to each topic (dimension) - Basis of the **Topic Models** field ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert ] .right-column[ ### Word embeddings - Bengio proposed the term "word embedding" in 2003, as a by-product of a neural language model - But Collobert showed in "A unified architecture for natural language processing" (2008) that, when trained on a sufficiently large dataset, they carry semantic meaning and may be used in downstream tasks. ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert ] .right-column[ .center[ ### Collobert embeddings
] ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert ] .right-column[ .center[ ### Collobert embeddings
] ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert #### Mikolov ] .right-column[ ### Word2Vec - Mikolov, 2013 - Most famous word embeddings, because: - Released with a very fast C implementation - New approximations to make it faster (negative sampling, hierarchical softmax...) - Training on large datasets becomes super-easy - Big companies started to pretrain W2V on huge datasets and distribute them for transfer learning ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert #### Mikolov ] .right-column[ ### Word2Vec .center[
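A minimal sketch of training Word2Vec with gensim on a toy corpus (parameter names follow gensim 3.x, e.g. `size`; recent versions renamed it `vector_size`):

```
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "ate", "my", "homework"]]

# sg=1: skip-gram, negative=5: negative sampling, size: embedding dimension
model = Word2Vec(sentences, size=50, window=3, min_count=1, sg=1, negative=5)

print(model.wv["cat"][:5])                # first dimensions of the 'cat' vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```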
] ] --- .center[ ## Cosine distance ] .left-column[ #### Cosine ] .right-column[ ### Cosine similarity - How to measure the similarity between word vectors ? - Issue with the dot-product: longer vectors -> larger values - Most common choice: the **cosine similarity** - can be computed efficiently with the dot product (see the sketch below): $$\cos(a,b)=\frac{a \cdot b}{||a|| \times ||b||}$$ ] --- .center[ ## So far, so good ? ] - Word embeddings capture part of lexical semantics - They are helpful in downstream tasks (**transfer learning**) Examples: - Predicting sentiment - Computing POS tags, detecting Named Entities - Syntactic parsing - Question-Answering, translation, summarization... - ... --- .center[ ## So far, so good ? ] - Word embeddings capture part of lexical semantics - They are helpful in downstream tasks (**transfer learning**) - But... - What about Out-of-vocabulary words ? - What about polysemy ? - What about context-dependent meaning ? - What about multiple languages ? - What about multi-word expressions ? - What about sentence embeddings ? --- .center[ ## Embeddings ] .center[
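The cosine similarity of the previous slides, written directly in numpy (a sketch; scipy and scikit-learn also provide it):

```
import numpy as np

def cosine_similarity(a, b):
    # dot product normalized by the lengths of the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([5, 2]), np.array([7, 0])))  # cat vs dog
print(cosine_similarity(np.array([5, 2]), np.array([2, 6])))  # cat vs table
```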
] --- .center[ ## Contextual Embeddings ] How to handle polysemy ? - With context-dependent word embeddings - 2018: the "NLP's ImageNet moment" - Problem: - You have to distribute a complete **model**, which you have to run on your data and which returns a vector per word - this constrains the programming language - much harder to "fine-tune" - But still, this model has been trained on a huge dataset, and the returned embeddings encode all of this information --- .center[ ## The rise of Language Models ] .left-column[ #### LM #### n-gram ] .right-column[ - Language Models are the basis of all modern embeddings ! - Given the past words, an LM predicts the next word: - Basic LM: n-grams ] .center[
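A toy count-based bigram language model, as a sketch of the n-gram idea (the corpus and the absence of smoothing are simplifications of my own):

```
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of consecutive word pairs
unigrams = Counter(corpus[:-1])             # counts of the conditioning words

def p_next(word, history):
    # P(word | history) estimated by the relative bigram frequency
    return bigrams[(history, word)] / unigrams[history]

print(p_next("cat", "the"))  # how likely is 'cat' right after 'the' ?
print(p_next("dog", "the"))
```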
] --- .center[ ## The rise of Language Models ] .left-column[ #### LM #### n-gram #### NN-LM ] .right-column[ - simple NN, recurrent NN, transformers... ] .center[
] --- .center[ ## Contextual Embeddings: ELMo ] .left-column[ #### ELMO ] .right-column[ - From *AllenNLP* (2018): huge improvements - Character-based - Trained to predict the next word (Language Model) - Bi-directional, but both directions are trained separately ] --- .center[ ## Contextual Embeddings: ELMo ] .left-column[ #### ELMO ] .right-column[ - the word embedding is a weighted combination of the hidden representations from every layer ]
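This is not actual ELMo code, just a numpy sketch of that weighted combination: the hidden states of a word at each layer are mixed with softmax-normalized weights and a scalar, both learned by the downstream task.

```
import numpy as np

num_layers, dim = 3, 4
layers = np.random.randn(num_layers, dim)  # hypothetical hidden states of one word, one row per layer

s = np.array([0.2, 0.3, 0.5])  # softmax-normalized layer weights (learned by the task)
gamma = 1.0                    # task-specific scaling factor (learned by the task)

elmo_vector = gamma * np.sum(s[:, None] * layers, axis=0)
print(elmo_vector)
```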
--- .center[ ## Contextual Embeddings: BERT ] .left-column[ #### ELMO #### BERT ] .right-column[ - Exploits *Transformers*: attention instead of recurrence - Replaces the LM objective with "fill in the masked words" - Trains both directions simultaneously - Represents the input as **subwords** ]
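A minimal sketch of extracting BERT contextual vectors with the pytorch-transformers library presented at the end of this course (class and argument names may differ across library versions):

```
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# subword tokenization, then one contextual vector per subword token
input_ids = torch.tensor([tokenizer.encode("The cat sat on the mat")])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (1, number of tokens, 768)
print(last_hidden_states.shape)
```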
--- .center[ ## Contextual Embeddings: GPT ] .left-column[ #### ELMO #### BERT #### GPT ] .right-column[ - GPT is a classical left-to-right Language Model, but based on transformers. - subword units: **Byte-Pair Encoding** - Fine-tune the base model on the target task for transfer learning OpenAI: "GPT2: the AI that's too dangerous to release" - GPT-1 = ULMFiT + Transformer - GPT-2 = GPT-1 + Reddit data + GPUs ] --- .center[ ## Contextual Embeddings: XLNet ] .left-column[ #### ELMO #### BERT #### GPT #### XLNet ] .right-column[ - XLNet = Google/CMU - Based on BERT: "improves upon BERT on 20 tasks" - Gets rid of the artificial MASK token - Uses the *Transformer-XL* = Transformer with recurrence (hidden states are passed between sequences) ] --- .center[ ## Contextual Embeddings: XLNet ] .left-column[ #### ELMO #### BERT #### GPT #### XLNet ] .right-column[ - Killer idea: Permutation Language Model - Predict tokens in random order, accumulate them to build the context - Forces the model to use both directions simultaneously
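And a similar sketch of using GPT-2 (from the GPT slide above) as a plain left-to-right language model to predict the next token, again with pytorch-transformers and greedy argmax decoding for simplicity:

```
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = torch.tensor([tokenizer.encode("The cat sat on the")])
with torch.no_grad():
    logits = model(input_ids)[0]  # shape: (1, number of tokens, vocabulary size)

next_id = int(torch.argmax(logits[0, -1]))  # most likely next token
print(tokenizer.decode([next_id]))
```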
] --- .center[ ## Embeddings ] .center[
] Step 3: word, contextual word, what about full text ? --- .center[ ## Sentence Embeddings: Averaging ] .left-column[ #### Averaging ] .right-column[ Just average the word embeddings in the sentence ! - Old baseline, but still [hard-to-beat](https://openreview.net/forum?id=SyK00v5xx) ] --- .center[ ## Sentence Embeddings: Language Model ] .left-column[ #### Averaging #### NN-LM ] .right-column[ - Bengio (2003): a NN-LM learns simultaneously - word representations - word sequence probabilities - Google has released [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/1) - trained on Google News (200B words) - maps any sentence into a 128-dimensional embedding ] --- .center[ ## Sentence Embeddings: Doc2Vec ] .left-column[ #### Averaging #### NN-LM #### Doc2Vec ] .right-column[ - Proposed by Le & Mikolov: [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053) .center[
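A small gensim sketch of Doc2Vec on a toy corpus (hyperparameters are illustrative, and attribute names vary a little across gensim versions):

```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
        TaggedDocument(words=["the", "dog", "ate", "my", "homework"], tags=[1])]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# embedding inferred for a new, unseen sentence
print(model.infer_vector(["a", "cat", "on", "a", "mat"])[:5])
```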
] ] --- .center[ ## Sentence Embeddings: Skip-thought ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought ] .right-column[ - An encoder-decoder that re-generates the surrounding sentences .center[
] ] --- .center[ ## Sentence Embeddings: Quick-thought ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought #### Quick-Thought ] .right-column[ - Replaces the decoder with a classifier .center[
] ] --- .center[ ## Sentence Embeddings: InferSent ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought #### Quick-Thought #### InferSent ] .right-column[ - Supervised encoder, trained on the Stanford Natural Language Inference (SNLI) dataset .center[
] ] --- .center[ ## Sentence Embeddings: Universal Sentence Encoder ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought #### Quick-Thought #### InferSent #### Universal ] .right-column[ - 2018: [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) - 2 fast models trained on many tasks: - Transformer - Deep Averaging Network - Produces 512-dimensional embeddings for any text ] --- .center[ ## State-of-the-art
| | Words Embed. | Sentences Embed. |
|------------------|-------------------------|------------------|
| Strong baselines | FastText | Bag-Of-Words |
| State-of-the-art | BERT, ELMO, GPT2, XLNET | Unsup: Skip-Thoughts, Quick-Thoughts<br/>Supervised: InferSent<br/>Multi-Task: MILA/MSR's general purpose sent<br/>Multi-Task: Google's Universal Sent |
] --- .center[ ## Tools ] - Gensim - SpaCy - FastText - SentEval: Transfer learning tasks to evaluate embeddings - HuggingFace: pytorch-transformers --- .center[ ## Gensim ] .left-column[ #### Gensim ] .right-column[ - Oldest python lib for embeddings (started in 2008), from Radim Řehůřek (CZ) - Designed for semantic/topic modelling - Includes models: LDA, LSI, TFIDF, W2V, DOC2VEC, FastText... - Includes corpora: text8... Get all the available datasets and models:
```
# list the datasets and pretrained models available from the gensim downloader
import gensim.downloader as api
api.info()
```
- See https://radimrehurek.com/gensim ] --- .center[ ## SpaCy ] .left-column[ #### Gensim #### SpaCy ] .right-column[ - from 2015 - focused on modern NLP, including deep learning models (works with tensorflow, pytorch...) - Includes recent pretrained models: BERT, ULMFiT, XLNET... - Very active in 2018/2019 - See https://spacy.io/ ] --- .center[ ## FastText (Facebook) ] .left-column[ #### Gensim #### SpaCy #### FastText ] .right-column[ - Fast training of Skipgram / CBOW word embeddings - Written in C++ but can be used from python - [Combines several tricks](https://arxiv.org/pdf/1712.09405.pdf) to improve embeddings - subsample frequent words: $p_{discard} = 1-\sqrt{\alpha/f_w}$ - position-dependent features (CBOW): train a weight per position in the context window, then compute a weighted average of the word vectors in the context - phrase representations: merge ngrams with high mutual information into a single token - add subword information: - decompose words into char-ngrams - one embedding per char-ngram - final word vector = $w + \frac{1}{N} \sum_{n=1}^{N} c_n$ ] --- .center[ ## FastText ] .left-column[ #### Gensim #### SpaCy #### FastText ] .right-column[ - Includes a text classifier: - Embeddings + linear model + softmax - Simple, but competitive with state-of-the-art - Extremely fast .center[
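A sketch of this text classifier through its Python bindings (the `fasttext` package); file names are placeholders, and the training file contains one `__label__xxx`-prefixed example per line:

```
import fasttext

# train.txt: one example per line, e.g. "__label__positive I loved this movie"
model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)

print(model.predict("what a great film"))  # predicted label and its probability
model.save_model("classifier.bin")
```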
] - Has a Python implementation, but it's not officially supported - Distributes word vectors for 157 languages, and multi-lingual word vectors in 44 languages ] --- .center[ ## SentEval ] .left-column[ #### Gensim #### SpaCy #### FastText #### SentEval ] .right-column[ - From Facebook (Alexis Conneau): a toolkit to evaluate the quality of sentence embeddings - see https://github.com/facebookresearch/SentEval - Includes skipthought, Google-USE and their own InferSent encoders - Makes it "easy" to evaluate transfer learning with embeddings on more than 20 tasks: MR, TREC, SST... ] --- .center[ ## HuggingFace ] .left-column[ #### Gensim #### SpaCy #### FastText #### SentEval #### HuggingFace ] .right-column[ - HuggingFace is a company making chatbots - Released the **pytorch-transformers** library - https://github.com/huggingface/pytorch-transformers - Includes the most recent contextual word embeddings: - BERT (from Google) - GPT (from OpenAI) - GPT-2 (from OpenAI) - Transformer-XL (from Google/CMU) - XLNet (from Google/CMU) - XLM (from Facebook) ] --- .center[ ## Conclusions about the tools ] - Research on embeddings in 2018/2019 is **extremely** active - New models appear every few months (*The ImageNet effect*) - Open-source implementations are released nearly immediately - So the software landscape for embeddings will still evolve ! - Don't become an "expert" with one tool, or you'll get stuck - Rather, look for the most appropriate tool at the moment for your task --- .center[ ## Conclusions about the tools ] - State-of-the-art overview: [http://ruder.io/state-of-transfer-learning-in-nlp/](http://ruder.io/state-of-transfer-learning-in-nlp/) - We do rely on big companies in NLP: - More data **always** gives better models (cf. ruder.io GLUE curve) - Baidu (ERNIE), NVidia (GPT2-8B), Google (XLNet), Facebook (RoBERTa), AllenAI... - Can we do research without them ? - Multi-lingual embeddings for low-resource languages - They release (for now) rich models; we exploit them in many target tasks --- .center[ ## Conclusions about the tools ] - Recommendations: - **Do** use the best embeddings for your research tasks in NLP - Don't train embeddings ! You can't... - trend: python; tensorflow or pytorch for maximum flexibility .center[
] --- name: last-page class: middle, center, inverse "it is very likely that in a year’s time NLP practitioners will download pretrained language models rather than pretrained word embeddings" [https://thegradient.pub/nlp-imagenet](https://thegradient.pub/nlp-imagenet) ## That's all folks (for now)! Slideshow created using [remark](http://github.com/gnab/remark).