class: center, middle background-image:url(images/data-background-light.jpg) # Lexical Resources ## Master TAL, Nancy, 2019-2020 .footnote[.bold[[Christophe Cerisara](mailto:cerisara@loria.fr) CNRS / LORIA]] --- .center[ ## Topic of the day ] ### Word semantic representations We have seen: - How to extract lexicons from texts - How to capture local syntactic information (n-grams) - How to augment a lexicon with multi-word expressions - How to capture (weak) information about morphology Today's focus: lexical semantics --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers ] --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers - Discrete representations: cat=1, table=2, dog=3 ] --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers - Discrete representations: cat=1, table=2, dog=3 - But distances are misleading: d(cat,dog)=2, d(cat,table)=1 ] --- .center[ ## Word representations ] .left-column[ #### Discrete ] .right-column[ - Computer programs manipulate numbers - Discrete representations: cat=1, table=2, dog=3 - But distances are misleading: d(cat,dog)=2, d(cat,table)=1 One-dimensional representations force ordering: ... < CAT < TABLE < DOG < ... ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector ] .right-column[ N-dimensional representations allow words to all be at the same distance: .center[
] ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector ] .right-column[ N-dimensional representations allow words to all be at the same distance: .center[
] **One-hot vectors** .center[
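A minimal numpy sketch (my own toy example, not from the original slides) of this property: with one-hot vectors, every pair of distinct words is at exactly the same distance.

```
import numpy as np

vocab = ["cat", "table", "dog"]
# one-hot vectors = rows of the identity matrix, one dimension per word
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# every pair of distinct words is at the same Euclidean distance (sqrt(2))
print(np.linalg.norm(one_hot["cat"] - one_hot["dog"]))    # 1.41...
print(np.linalg.norm(one_hot["cat"] - one_hot["table"]))  # 1.41...
```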
] ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector #### Embeddings ] .right-column[ - But having all words at the same distance is not ideal - And we face the curse of dimensionality ] --- .center[ ## Word representations ] .left-column[ #### Discrete #### One-hot vector #### Embeddings ] .right-column[ - But having all words at the same distance is not ideal - And we face the curse of dimensionality We want to find word vectors that encode part of lexical semantics: .center[
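As an illustration only (the 2-dimensional values below are hand-picked, not taken from any real model), dense vectors can place related words close together:

```
import numpy as np

# hypothetical dense embeddings, chosen by hand for illustration
emb = {"cat": np.array([0.9, 0.1]),
       "dog": np.array([0.8, 0.2]),
       "table": np.array([0.1, 0.9])}

print(np.linalg.norm(emb["cat"] - emb["dog"]))    # small distance: cat and dog are close
print(np.linalg.norm(emb["cat"] - emb["table"]))  # large distance: cat and table are far apart
```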
] ] --- .center[ ## Embeddings ] .center[
] - Long history - Acceleration in the last 2 years - Colors = types of approaches --- .center[ ## Embeddings ] .center[
] Prehistory (?): vector space models --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis ] .right-column[ "You shall know a word by the company it keeps" [Firth, 1957] ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis ] .right-column[ "You shall know a word by the company it keeps" [Firth, 1957] - Distributional semantics is a theory of meaning - Vector Space Models are an implementation of DS - Neural embeddings are too! ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ] .right-column[ - The term-document matrix gives word-document co-occurrence counts:
.tablematrix[
Lemma | Doc1 | Doc2
-----------|------|-----
cat | 5 | 2
dog | 7 | 0
table | 2 | 6
feline | 3 | 0
]
- Dot-product between 2 vectors: $$X \cdot Y = \sum\_i X\_i Y\_i$$ ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ] .right-column[ - terms are similar if they tend to occur in the same documents - the dot product of two rows measures the similarity between the corresponding terms:
```
import numpy

# term vectors = rows of the term-document matrix
cat = numpy.array([5, 2])
dog = numpy.array([7, 0])
table = numpy.array([2, 6])

numpy.dot(cat, dog)    # 35: cat and dog tend to occur in the same documents
numpy.dot(cat, table)  # 22: cat and table co-occur less
```
] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ] .right-column[ Main issues with this basic term-document matrix: - Dimensions quickly become very large - Contains lots of noise ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ] .right-column[ ### Latent Semantic Analysis Deerwester et al., 1990: - Singular Value Decomposition $$X\_{M\times N} = U\_{M\times k} \Sigma\_{k\times k} V\_{k\times N}^T$$ - $U$ projects the original term vectors into a $k$-dimensional subspace, with $k=\min(M,N)$ for the full SVD - each row $t_i$ of $U$ corresponds to one term - each column $d_j$ of $V^T$ corresponds to one document - $\Sigma$ is diagonal = singular values: we keep only the largest ones (truncated SVD) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ] .right-column[ ### Latent Semantic Analysis - New term vectors = $\Sigma^{(k)} t_i$ (see the numpy sketch below) - Dimensions get combined into the subspace: - handles synonymy: (cat, feline) becomes (1.9*cat + 0.2*feline) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ] .right-column[ ### Latent Semantic Analysis - Deerwester et al., 1990 - Landauer, 1997: good results on the TOEFL synonym questions - Turney, 2010: shows that dimensions encode lexical or topical meanings ] --- .center[ ## Embeddings ] .center[
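Following up on the LSA slides above, here is a minimal numpy sketch of the idea on the toy term-document matrix (the truncation rank is an illustrative choice, not part of the original course material):

```
import numpy as np

# toy term-document matrix (rows: cat, dog, table, feline)
X = np.array([[5, 2],
              [7, 0],
              [2, 6],
              [3, 0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # keep the k largest singular values (here all of them, the matrix is tiny)
terms = U[:, :k] * s[:k]  # new term vectors: rows of U scaled by the singular values
print(terms.round(2))
```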
] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ] .right-column[ ### Random Indexing - LSA issues: - SVD is costly - Need to retrain when adding documents ! - Sahlgren, 2006 - Fast and online method - Starts from random low-dimensional term vectors - sum vectors that co-occur ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ] .right-column[ ### Other Vector Space Models - Other derivatives of LSA: - Hyperspace Analogue to Language (HAL) (Burgess, 1997) - BEAGLE (Jones, 2007) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ##### Word context ] .right-column[ ### Term-context (co-occurrence) matrix - 2 words co-occur if they are in the same sentence / window - Extensions: syntactic context... - Pointwise mutual information (Church, 1989) - Do words x and y co-occur more often than if they were independent ? (see the sketch below) - Or TF-IDF (Spärck Jones, 1972) ] --- .center[ ## Distributional semantics ] .left-column[ #### Distributional hypothesis #### Vector Space Models ##### LSA ##### Random Indexing ##### Word context ##### GloVe ] .right-column[ ### GloVe - Pennington, 2014 - Trains a log-linear model on the word co-occurrence matrix $X$: $$J=\sum\_{i,j} f(X\_{ij})( w\_i^Tw\_j + b\_i + b\_j - \log X\_{ij} )^2$$ - Intuition: the dot product of two word vectors (plus biases) should become equal to $\log X\_{ij}$ - then, $(w\_i-w\_j)^Tw\_k \simeq \log \frac{X\_{ik}}{X\_{jk}}$ - $\simeq$ do $i$ and $j$ share the same contexts $k$ ? - as good as word2vec! ] --- .center[ ## Embeddings ] .center[
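Following up on the term-context slides above, a small sketch (toy counts of my own) of pointwise mutual information on a word co-occurrence matrix; positive PMI (PPMI) is the variant most often used in practice:

```
import numpy as np

# toy word-word co-occurrence counts
X = np.array([[0., 10., 1.],
              [10., 0., 2.],
              [1., 2., 0.]])

P = X / X.sum()                    # joint probabilities P(i, j)
Pi = P.sum(axis=1, keepdims=True)  # marginals P(i)
Pj = P.sum(axis=0, keepdims=True)  # marginals P(j)

with np.errstate(divide="ignore"):
    pmi = np.log(P / (Pi * Pj))    # PMI(i, j) = log [ P(i,j) / (P(i) P(j)) ]
ppmi = np.maximum(pmi, 0)          # keep only the positive values (PPMI)
print(ppmi.round(2))
```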
] Step 2: the deep learning explosion --- .center[ ## Bayesian perspective ] .left-column[ #### LDA ] .right-column[ ### Latent Dirichlet Allocation - Blei, Ng & Jordan, 2003 - Define a model that randomly generates a topic, and then a text about this topic - Infer the parameters that best explain the text corpus .center[
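A minimal LDA sketch with gensim (the library presented in the tools section below); the toy corpus and the number of topics are arbitrary choices for illustration:

```
from gensim import corpora, models

texts = [["cat", "dog", "feline", "pet"],
         ["table", "chair", "kitchen"],
         ["dog", "pet", "cat"]]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts per document

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda.print_topics())  # contribution of each word to each topic
```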
] ] --- .center[ ## Bayesian perspective ] .left-column[ #### LDA ] .right-column[ ### Latent Dirichlet Allocation - Word vector: contribution of the word to each topic (dimension) - Basis of the **Topic Models** field ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert ] .right-column[ ### Word embeddings - Bengio proposed the term "word embedding" in 2003, as a by-product of a neural language model - But Collobert showed in "A unified architecture for natural language processing" (2008) that, when trained on a sufficiently large dataset, they carry semantic meaning and may be used in downstream tasks. ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert ] .right-column[ .center[ ### Collobert embeddings
] ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert ] .right-column[ .center[ ### Collobert embeddings
] ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert #### Mikolov ] .right-column[ ### Word2Vec - Mikolov, 2013 - Most famous word embeddings, because: - Released with a very fast C implementation - New approximations to make it faster (negative sampling, hierarchical softmax...) - Training on large datasets becomes super-easy - Big companies started to pretrain W2V on huge datasets and distribute them for transfer learning ] --- .center[ ## Neural perspective ] .left-column[ #### Collobert #### Mikolov ] .right-column[ ### Word2Vec .center[
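A minimal sketch of training Word2Vec with gensim on a toy corpus (parameter names follow gensim 3.x, e.g. `size`; recent versions renamed it `vector_size`):

```
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "ate", "my", "homework"]]

# sg=1: skip-gram, negative=5: negative sampling, size: embedding dimension
model = Word2Vec(sentences, size=50, window=3, min_count=1, sg=1, negative=5)

print(model.wv["cat"][:5])                # first dimensions of the 'cat' vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```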
] ] --- .center[ ## Cosine distance ] .left-column[ #### Cosine ] .right-column[ ### Cosine similarity - How to measure the similarity between word vectors ? - Issue with the dot-product: longer vectors -> larger values - Most common choice: the **cosine similarity** - can be computed efficiently with the dot product (see the sketch below): $$\cos(a,b)=\frac{a \cdot b}{||a|| \times ||b||}$$ ] --- .center[ ## So far, so good ? ] - Word embeddings capture part of lexical semantics - They are helpful in downstream tasks (**transfer learning**) Examples: - Predicting sentiment - Computing POS tags, detecting Named Entities - Syntactic parsing - Question-Answering, translation, summarization... - ... --- .center[ ## So far, so good ? ] - Word embeddings capture part of lexical semantics - They are helpful in downstream tasks (**transfer learning**) - But... - What about Out-of-vocabulary words ? - What about polysemy ? - What about context-dependent meaning ? - What about multiple languages ? - What about multi-word expressions ? - What about sentence embeddings ? --- .center[ ## Embeddings ] .center[
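The cosine similarity of the previous slides, written directly in numpy (a sketch; scipy and scikit-learn also provide it):

```
import numpy as np

def cosine_similarity(a, b):
    # dot product normalized by the lengths of the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([5, 2]), np.array([7, 0])))  # cat vs dog
print(cosine_similarity(np.array([5, 2]), np.array([2, 6])))  # cat vs table
```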
] --- .center[ ## Contextual Embeddings ] How to handle polysemy ? - With context-dependent word embeddings - 2018: the "NLP's ImageNet moment" - Problem: - You have to distribute a complete **model**, which you have to run on your data and which returns a vector per word - this constrains the programming language - much harder to "fine-tune" - But still, this model has been trained on a huge dataset, and the returned embeddings encode all of this information --- .center[ ## The rise of Language Models ] .left-column[ #### LM #### n-gram ] .right-column[ - Language Models are the basis of all modern embeddings ! - Given the past words, an LM predicts the next word: - Basic LM: n-grams ] .center[
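A toy count-based bigram language model, as a sketch of the n-gram idea (the corpus and the absence of smoothing are simplifications of my own):

```
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of consecutive word pairs
unigrams = Counter(corpus[:-1])             # counts of the conditioning words

def p_next(word, history):
    # P(word | history) estimated by the relative bigram frequency
    return bigrams[(history, word)] / unigrams[history]

print(p_next("cat", "the"))  # how likely is 'cat' right after 'the' ?
print(p_next("dog", "the"))
```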
] --- .center[ ## The rise of Language Models ] .left-column[ #### LM #### n-gram #### NN-LM ] .right-column[ - simple NN, recurrent NN, transformers... ] .center[
] --- .center[ ## Contextual Embeddings: ELMo ] .left-column[ #### ELMO ] .right-column[ - From *AllenNLP* (2018): huge improvements - Character-based - Trained to predict the next word (Language Model) - Bi-directional, but both directions are trained separately ] --- .center[ ## Contextual Embeddings: ELMo ] .left-column[ #### ELMO ] .right-column[ - the word embedding is a weighted combination of the hidden representations from every layer ]
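This is not actual ELMo code, just a numpy sketch of that weighted combination: the hidden states of a word at each layer are mixed with softmax-normalized weights and a scalar, both learned by the downstream task.

```
import numpy as np

num_layers, dim = 3, 4
layers = np.random.randn(num_layers, dim)  # hypothetical hidden states of one word, one row per layer

s = np.array([0.2, 0.3, 0.5])  # softmax-normalized layer weights (learned by the task)
gamma = 1.0                    # task-specific scaling factor (learned by the task)

elmo_vector = gamma * np.sum(s[:, None] * layers, axis=0)
print(elmo_vector)
```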
--- .center[ ## Contextual Embeddings: BERT ] .left-column[ #### ELMO #### BERT ] .right-column[ - Exploits *Transformers*: attention instead of recurrence - Replaces the LM objective with "fill in the masked words" - Trains both directions simultaneously - Represents the input as **subwords** ]
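A minimal sketch of extracting BERT contextual vectors with the pytorch-transformers library presented at the end of this course (class and argument names may differ across library versions):

```
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# subword tokenization, then one contextual vector per subword token
input_ids = torch.tensor([tokenizer.encode("The cat sat on the mat")])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (1, number of tokens, 768)
print(last_hidden_states.shape)
```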
--- .center[ ## Contextual Embeddings: GPT ] .left-column[ #### ELMO #### BERT #### GPT ] .right-column[ - GPT is a classical left-to-right Language Model, but based on transformers. - subword units: **Byte-Pair Encoding** - Fine-tune the base model on the target task for transfer learning OpenAI: "GPT2: the AI that's too dangerous to release" - GPT-1 = ULMFiT + Transformer - GPT-2 = GPT-1 + Reddit data + GPUs ] --- .center[ ## Contextual Embeddings: XLNet ] .left-column[ #### ELMO #### BERT #### GPT #### XLNet ] .right-column[ - XLNet = Google/CMU - Based on BERT: "improves upon BERT on 20 tasks" - Gets rid of the artificial MASK token - Uses the *Transformer-XL* = Transformer with recurrence (hidden states are passed between sequences) ] --- .center[ ## Contextual Embeddings: XLNet ] .left-column[ #### ELMO #### BERT #### GPT #### XLNet ] .right-column[ - Killer idea: Permutation Language Model - Predict tokens in random order, accumulate them to build the context - Forces the model to use both directions simultaneously
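And a similar sketch of using GPT-2 (from the GPT slide above) as a plain left-to-right language model to predict the next token, again with pytorch-transformers and greedy argmax decoding for simplicity:

```
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = torch.tensor([tokenizer.encode("The cat sat on the")])
with torch.no_grad():
    logits = model(input_ids)[0]  # shape: (1, number of tokens, vocabulary size)

next_id = int(torch.argmax(logits[0, -1]))  # most likely next token
print(tokenizer.decode([next_id]))
```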
] --- .center[ ## Embeddings ] .center[
] Step 3: word, contextual word, what about full text ? --- .center[ ## Sentence Embeddings: Averaging ] .left-column[ #### Averaging ] .right-column[ Just average the word embeddings in the sentence ! - Old baseline, but still [hard-to-beat](https://openreview.net/forum?id=SyK00v5xx) ] --- .center[ ## Sentence Embeddings: Language Model ] .left-column[ #### Averaging #### NN-LM ] .right-column[ - Bengio (2003): a NN-LM learns simultaneously - word representations - word sequence probabilities - Google has released [nnlm-en-dim128](https://tfhub.dev/google/nnlm-en-dim128/1) - trained on Google News (200B words) - maps any sentence into a 128-dimensional embedding ] --- .center[ ## Sentence Embeddings: Doc2Vec ] .left-column[ #### Averaging #### NN-LM #### Doc2Vec ] .right-column[ - Proposed by Le & Mikolov: [Distributed Representations of Sentences and Documents](https://arxiv.org/abs/1405.4053) .center[
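A small gensim sketch of Doc2Vec on a toy corpus (hyperparameters are illustrative, and attribute names vary a little across gensim versions):

```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
        TaggedDocument(words=["the", "dog", "ate", "my", "homework"], tags=[1])]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# embedding inferred for a new, unseen sentence
print(model.infer_vector(["a", "cat", "on", "a", "mat"])[:5])
```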
] ] --- .center[ ## Sentence Embeddings: Skip-thought ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought ] .right-column[ - An encoder-decoder that re-generates the surrounding sentences .center[
] ] --- .center[ ## Sentence Embeddings: Quick-thought ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought #### Quick-Thought ] .right-column[ - Replaces the decoder with a classifier .center[
] ] --- .center[ ## Sentence Embeddings: InferSent ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought #### Quick-Thought #### InferSent ] .right-column[ - Supervised encoder, trained on the Stanford Natural Language Inference (SNLI) dataset .center[
] ] --- .center[ ## Sentence Embeddings: Universal Sentence Encoder ] .left-column[ #### Averaging #### Doc2Vec #### NN-LM #### Skip-Thought #### Quick-Thought #### InferSent #### Universal ] .right-column[ - 2018: [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175) - 2 fast models trained on many tasks: - Transformer - Deep Averaging Network - Produces 512-dimensional embeddings for any text ] --- .center[ ## State-of-the-art
| | Words Embed. | Sentences Embed. |
|------------------|-------------------------|------------------|
| Strong baselines | FastText | Bag-Of-Words |
| State-of-the-art | BERT, ELMO, GPT2, XLNET | Unsup: Skip-Thoughts, Quick-Thoughts<br/>Supervised: InferSent<br/>Multi-Task: MILA/MSR's general purpose sent<br/>Multi-Task: Google's Universal Sent |
] --- .center[ ## Tools ] - Gensim - SpaCy - FastText - SentEval: Transfer learning tasks to evaluate embeddings - HuggingFace: pytorch-transformers --- .center[ ## Gensim ] .left-column[ #### Gensim ] .right-column[ - Oldest python lib for embeddings (started in 2008), from Radim Řehůřek (CZ) - Designed for semantic/topic modelling - Includes models: LDA, LSI, TFIDF, W2V, DOC2VEC, FastText... - Includes corpora: text8... Get all the available datasets and models:
```
# list the datasets and pretrained models available from the gensim downloader
import gensim.downloader as api
api.info()
```
- See https://radimrehurek.com/gensim ] --- .center[ ## SpaCy ] .left-column[ #### Gensim #### SpaCy ] .right-column[ - from 2015 - focused on modern NLP, including deep learning models (works with tensorflow, pytorch...) - Includes recent pretrained models: BERT, ULMFiT, XLNET... - Very active in 2018/2019 - See https://spacy.io/ ] --- .center[ ## FastText (Facebook) ] .left-column[ #### Gensim #### SpaCy #### FastText ] .right-column[ - Fast training of Skipgram / CBOW word embeddings - Written in C++ but can be used from python - [Combines several tricks](https://arxiv.org/pdf/1712.09405.pdf) to improve embeddings - subsample frequent words: $p_{discard} = 1-\sqrt{\alpha/f_w}$ - position-dependent features (CBOW): train a weight per position in the context window, then compute a weighted average of the word vectors in the context - phrase representations: merge ngrams with high mutual information into a single token - add subword information: - decompose words into char-ngrams - one embedding per char-ngram - final word vector = $w + \frac{1}{N} \sum_{n=1}^{N} c_n$ ] --- .center[ ## FastText ] .left-column[ #### Gensim #### SpaCy #### FastText ] .right-column[ - Includes a text classifier: - Embeddings + linear model + softmax - Simple, but competitive with state-of-the-art - Extremely fast .center[
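A sketch of this text classifier through its Python bindings (the `fasttext` package); file names are placeholders, and the training file contains one `__label__xxx`-prefixed example per line:

```
import fasttext

# train.txt: one example per line, e.g. "__label__positive I loved this movie"
model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)

print(model.predict("what a great film"))  # predicted label and its probability
model.save_model("classifier.bin")
```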
] - Has a Python implementation, but it's not officially supported - Distributes word vectors for 157 languages, and multi-lingual word vectors in 44 languages ] --- .center[ ## SentEval ] .left-column[ #### Gensim #### SpaCy #### FastText #### SentEval ] .right-column[ - From Facebook (Alexis Conneau): a toolkit to evaluate the quality of sentence embeddings - see https://github.com/facebookresearch/SentEval - Includes skipthought, Google-USE and their own InferSent encoders - Makes it "easy" to evaluate transfer learning with embeddings on more than 20 tasks: MR, TREC, SST... ] --- .center[ ## HuggingFace ] .left-column[ #### Gensim #### SpaCy #### FastText #### SentEval #### HuggingFace ] .right-column[ - HuggingFace is a company making chatbots - Released the **pytorch-transformers** library - https://github.com/huggingface/pytorch-transformers - Includes the most recent contextual word embeddings: - BERT (from Google) - GPT (from OpenAI) - GPT-2 (from OpenAI) - Transformer-XL (from Google/CMU) - XLNet (from Google/CMU) - XLM (from Facebook) ] --- .center[ ## Conclusions about the tools ] - Research on embeddings in 2018/2019 is **extremely** active - New models appear every few months (*The ImageNet effect*) - Open-source implementations are released nearly immediately - So the software landscape for embeddings will still evolve ! - Don't become an "expert" with one tool, or you'll get stuck - Rather, look for the most appropriate tool at the moment for your task --- .center[ ## Conclusions about the tools ] - State-of-the-art overview: [http://ruder.io/state-of-transfer-learning-in-nlp/](http://ruder.io/state-of-transfer-learning-in-nlp/) - We do rely on big companies in NLP: - More data **always** gives better models (cf. ruder.io GLUE curve) - Baidu (ERNIE), NVidia (GPT2-8B), Google (XLNet), Facebook (RoBERTa), AllenAI... - Can we do research without them ? - Multi-lingual embeddings for low-resource languages - They release (for now) rich models; we exploit them in many target tasks --- .center[ ## Conclusions about the tools ] - Recommendations: - **Do** use the best embeddings for your research tasks in NLP - Don't train embeddings ! You can't... - trend: python; tensorflow or pytorch for maximum flexibility .center[
] --- name: last-page class: middle, center, inverse "it is very likely that in a year’s time NLP practitioners will download pretrained language models rather than pretrained word embeddings" [https://thegradient.pub/nlp-imagenet](https://thegradient.pub/nlp-imagenet) ## That's all folks (for now)! Slideshow created using [remark](http://github.com/gnab/remark).