class: center, middle
background-image: url(images/data-background-light.jpg)

# Lexical Resources

## Master TAL, Nancy, 2019-2020

.footnote[.bold[[Christophe Cerisara](mailto:cerisara@loria.fr) CNRS / LORIA]]

---

.center[
## What is a lexical resource?
]
.left-column[
#### Definition
]
.right-column[
- Any content **usable by a computer** that gives information about **lexicons**
- lexicon = the vocabulary of a person, language, or branch of knowledge
- Obviously, old printed encyclopedias are lexical resources... but they cannot be processed by a computer!
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
- All texts available to computers (on the internet, on USB keys, on hard drives, on DVDs...)
- Raw texts are by far the most useful source of information about lexicons
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
- All texts available to computers (on the internet, on USB keys, on hard drives, on DVDs...)
- Raw texts are by far the most useful source of information about lexicons
- 2 ways to define a lexicon:
  - Ask an expert to write it down from their knowledge
  - Extract the lexicon from texts in the domain
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
- All texts available to computers (on the internet, on USB keys, on hard drives, on DVDs...)
- Raw texts are by far the most useful source of information about lexicons
- 2 ways to define a lexicon:
  - Ask an expert to write it down from their knowledge
  - Extract the lexicon from texts in the domain
- "good data is more data": the more text you can analyze, the richer the information you'll get
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
- DBpedia, common sense
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
- DBpedia, common sense
- WordNet, FrameNet, ConceptNet...
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
- DBpedia, common sense
- WordNet, FrameNet, ConceptNet...
- pretrained lexical models
  - ngrams
  - word embeddings
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
- Lexical resources: distributional
  - Historical perspective
  - Word2Vec embeddings
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
- Lexical resources: distributional
  - Historical perspective
  - Word2Vec embeddings
- Lexical resources: hand-made
  - WordNet, FrameNet
  - Wiktionary
  - Wikipedia
  - ISO standards
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
- Lexical resources: distributional
  - Historical perspective
  - Word2Vec embeddings
- Lexical resources: hand-made
  - WordNet, FrameNet
  - Wiktionary
  - Wikipedia
  - ISO standards
- Multi-lingual resources:
  - Multi-lingual BERT
  - Parallel corpora
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:

- Know the basics of Python
  - How to edit & run a Python file
  - Python variables, lists, strings, functions, loops...
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:

- Know the basics of Python
  - How to edit & run a Python file
  - Python variables, lists, strings, functions, loops...
- Access to a computer (in & outside class)
  - With python + numpy + scipy + nltk installed
  - Recommended way: **anaconda**
- Internet access in & outside class
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:

- Know the basics of Python
  - How to edit & run a Python file
  - Python variables, lists, strings, functions, loops...
- Access to a computer (in & outside class)
  - With python + numpy + scipy + nltk installed
  - Recommended way: **anaconda**
- Internet access in & outside class
- Any question:
  - cerisara@loria.fr
]
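---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
To check your setup before the first exercise, a minimal sketch (any recent versions of these packages should do):

```
# check that the required packages are importable, and print their versions
import numpy, scipy, nltk
print(numpy.__version__, scipy.__version__, nltk.__version__)

# fetch the nltk tokenizer models used later for pre-processing
nltk.download("punkt")
```
]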
---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
  - their combination => compositional (sentence) semantics
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
  - their combination => compositional (sentence) semantics
  - discover multi-word expressions
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
  - their combination => compositional (sentence) semantics
  - discover multi-word expressions
  - decompose them => morphological information
  - ...
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Size matters => processing must be computationally efficient
]
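---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
For instance, the co-occurrence counts mentioned above can be accumulated in a single pass over the corpus; a minimal sketch (the two-sentence toy corpus and the window size of 2 are arbitrary choices):

```
from collections import Counter

sentences = [["the", "movie", "was", "great"],
             ["the", "acting", "was", "poor"]]

cooc = Counter()
for sent in sentences:
    for i, w in enumerate(sent):
        # count pairs of words at most 2 positions apart
        for v in sent[i+1:i+3]:
            cooc[(w, v)] += 1

print(cooc.most_common(3))
```
]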
---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- How to extract lexical information from raw texts?
  - ngrams: more syntax-oriented
  - word embeddings: more semantics-oriented
  - can be at word level or subword level, to capture morpho-syntactic information
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
- Scrape the texts (see other course)
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
- Scrape the texts (see other course)
- Preprocess the texts (see other course)
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
- Scrape the texts (see other course)
- Preprocess the texts (see other course)
- Extract the lexical information
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Which domain? What type of language?

- Are you interested in capturing generic information, or in a specialized domain (healthcare...)?
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Which domain? What type of language?

- Are you interested in capturing generic information, or in a specialized domain (healthcare...)?
- Large variability in language:
  - Casual: forums, conversations...
  - Micro-blogs
  - Formal: books...
  - Journalistic: news
  - Educational: MOOCs, tutorials...
  - ...
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!
- Check whether there is a license, like *Creative Commons*; otherwise: "all rights reserved"
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!

- Check whether there is a license, like *Creative Commons*; otherwise: "all rights reserved"
- Beyond legal aspects, there are growing concerns about privacy & the right to be forgotten
  - Anonymization does not guarantee privacy!
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!

- Check whether there is a license, like *Creative Commons*; otherwise: "all rights reserved"
- Beyond legal aspects, there are growing concerns about privacy & the right to be forgotten
  - Anonymization does not guarantee privacy!
- Twitter provides an API to download some data, but forbids you from keeping it on your hard drive
- You cannot redistribute texts without an explicit CC-BY license

Wait... hasn't Google been scraping the whole web for years?
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Unsafe sources of texts:

- Social media
- Most web pages

Safe sources of texts:

- Wikipedia & derivatives (CC BY-SA)
- Scientific papers: arXiv, PubMed, HAL...
- Owner-released datasets: AskUbuntu archives, reddit archives, (Common Crawl), (WebTimeMachine)...
- Gutenberg, Gallica...
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
And this is only a very small part of the data available on the web
]
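---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
If you do crawl a site yourself, at least check its robots.txt first; a minimal sketch (note that robots.txt states a crawling policy, it says nothing about copyright):

```
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
# is a generic crawler allowed to fetch this page?
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Lexicon"))
```
]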
---

.center[
## Scraping texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
There are several ways to download corpora:

- APIs: not standard, may change, heavy for servers
- dump archives (wikipedia, reddit...)
- peer-to-peer (academic torrents)
- OAI-PMH
- ...

See the course on basic NLP techniques (Yannick Parmentier)
]

---

.center[
## Scraping texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
We're going to use the Movie Review corpus for now: a corpus of public film reviews widely used for research.
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
  - Segment into sentences and words (tokenize); take care of punctuation
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
  - Segment into sentences and words (tokenize); take care of punctuation
  - Optionally normalize (remove URLs, numbers, smileys; lowercase...)
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
  - Segment into sentences and words (tokenize); take care of punctuation
  - Optionally normalize (remove URLs, numbers, smileys; lowercase...)
  - Optionally compute features (lemmas, stemming, POS tags, NER...)

See Claire Gardent's course.
]
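---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
A minimal pre-processing sketch with nltk (the regexes and the example string are illustrative choices, not a prescribed pipeline):

```
import re
# requires the nltk "punkt" models (see the setup check earlier)
from nltk.tokenize import sent_tokenize, word_tokenize

raw = "Great movie! See http://example.com for the trailer. 10/10"

# normalize: remove URLs and numbers, then lowercase
text = re.sub(r"https?://\S+", " ", raw)
text = re.sub(r"\b\d+(/\d+)?\b", " ", text).lower()

# segment into sentences, then into words
for sent in sent_tokenize(text):
    print(word_tokenize(sent))
```
]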
---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Download the preprocessed 500-line Movie Review corpus (7 MB) here:

https://synalp.loria.fr/resL5.zip

Look at the file data/l5/MR500train
]

---

.center[
## The Movie Review corpus
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
i have seen this film at least times and i am still excited by it the acting is perfect and the romance between joe and jean keeps me on the edge of my seat plus i still think bryan brown is the tops brilliant film

----

this movie features charlie spradling dancing in a strip club beyond that it features a truly bad script with dull unrealistic dialogue that it got as many positive votes suggests some people may be joking

----

this film contain far too much meaningless violence too much shooting and blood the acting seems very unrealistic and is generally poor the only reason to see this film is if you like very old cars
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
**How many different words are there in the corpus MR500train?**
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Print one word per line, sort them and count them:

Linux & Mac:

```
cut -c2- MR500train | awk '{for (i=1;i<=NF;i++) print $i}' | sort | uniq -c | wc -l
```

Windows:

- Use the Ubuntu subsystem
- Or run a lubuntu VM with VirtualBox
- Or do it in Python
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
```
from collections import Counter

with open("MR500train", "r") as f:
    lines = f.readlines()

co = Counter()
for l in lines:
    # skip the leading label characters, then count space-separated words
    co.update(l[2:].strip().split(" "))

print(co.most_common(10))
print(len(co))
```
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
```
...
  4 zodiac
  2 zoey
  1 zoeys
 21 zombie
 16 zombies
  1 zombieverse
...
```
]

---

.center[
## Analyzing the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
**Plot the frequency of each word, in decreasing order**
]

---

.center[
## Analyzing the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Linux, Mac:

```
cut -c2- MR500train | awk '{for (i=1;i<=NF;i++) print $i}' | sort | uniq -c | awk '{print $1}' | sort -n -r > tt
echo 'plot "tt" w histo' | gnuplot -p
```

Python:

```
from collections import Counter
import matplotlib.pyplot as mp

with open("MR500train", "r") as f:
    lines = f.readlines()

co = Counter()
for l in lines:
    co.update(l[2:].strip().split(" "))

# frequencies sorted in decreasing order
y = list(co.values())
y.sort(reverse=True)
x = range(len(y))
mp.plot(x, y)
mp.show()
```
]

---

.center[
## Zipf's law
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
]
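---

.center[
## Zipf's law
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Zipf's law: the frequency of the $r$-th most frequent word is roughly proportional to $1/r$, so the rank-frequency curve is roughly a straight line in log-log scale. A minimal sketch to check this on MR500train (reusing the counts from the previous slides):

```
from collections import Counter
import matplotlib.pyplot as mp

with open("MR500train", "r") as f:
    co = Counter(w for l in f for w in l[2:].strip().split(" "))

y = sorted(co.values(), reverse=True)
# under Zipf's law, log-frequency vs log-rank is roughly linear
mp.loglog(range(1, len(y) + 1), y)
mp.show()
```
]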
---

.center[
## Unigrams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Raw counts depend on the size of the data
- Normalized, they give the **1-gram**:

$$P(w) = \frac {N(w)} {N(*)}$$

- Probability that a word occurs in the language
]

---

.center[
## Bigrams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Count the sequences $N(a,b)$
- Divide by all sequences $N(a,*)$
- The 2-gram gives the probability that $b$ follows $a$:

$$P(b|a) = \frac {N(a,b)} {N(a,*)}$$

Note:

$$P(b|a) = \frac {P(a,b)}{P(a)} = \frac{\frac{N(a,b)}{N(\*,\*)}}{\frac{N(a)}{N(\*)}}$$
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Generalization to a sequence of length $n$:

$$P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \frac{N(w\_{t-n+1},\dots,w\_{t-1},w\_t)}{N(w\_{t-n+1},\dots,w\_{t-1},\*)}$$
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
- Identify writing styles (regional)
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
- Identify writing styles (regional)
- Identify badly written sentences
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
- Identify writing styles (regional)
- Identify badly written sentences
- Useful in plagiarism detection
- ...
]
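---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
A minimal sketch of these estimates on a toy corpus, applying $P(b|a) = N(a,b)/N(a,\*)$ and using it to suggest the following word (the sentences are made up; unigram counts stand in for $N(a,\*)$, which ignores sentence-final effects):

```
from collections import Counter

sentences = [["the", "movie", "was", "great"],
             ["the", "acting", "was", "poor"]]

uni, bi = Counter(), Counter()
for sent in sentences:
    uni.update(sent)
    bi.update(zip(sent, sent[1:]))

# P(b|a) = N(a,b) / N(a,*)
def p(b, a):
    return bi[(a, b)] / uni[a]

print(p("movie", "the"))                    # 0.5
# suggest the most likely word after "was"
print(max(uni, key=lambda b: p(b, "was")))  # great
```
]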
---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
]

---

.center[
## Training N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Easy to train:
  - accumulate counts
  - can be done online
]

---

.center[
## Training N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Easy to train:
  - accumulate counts
  - can be done online
- The most difficult part is scraping & pre-processing the texts, so:
  - Google N-grams: https://books.google.com/ngrams
  - Trained on 1,000G tokens: https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
  - Free (trained on 430M words): https://www.ngrams.info/
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 1: N-gram smoothing
  - Add a pseudo-count for every possible sequence

$$P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \frac{1+N(w\_{t-n+1},\dots,w\_{t-1},w\_t)}{\sum\_x \left( 1+ N(w\_{t-n+1},\dots,w\_{t-1},x) \right)}$$
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 1: N-gram smoothing
  - Add a pseudo-count for every possible sequence

$$P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \frac{1+N(w\_{t-n+1},\dots,w\_{t-1},w\_t)}{\sum\_x \left( 1+ N(w\_{t-n+1},\dots,w\_{t-1},x) \right)}$$

- Other smoothings: Good-Turing, Kneser-Ney...
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Problem: all unseen sequences have the same probability
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Problem: all unseen sequences have the same probability
- Smoothing may be used in conjunction with backoff:
  - linear interpolation:

$$\hat P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \lambda P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) + $$
$$(1-\lambda) P(w\_t|w\_{t-n+2},\dots,w\_{t-1})$$
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Problem: all unseen sequences have the same probability
- Smoothing may be used in conjunction with backoff:
  - linear interpolation:

$$\hat P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \lambda P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) + $$
$$(1-\lambda) P(w\_t|w\_{t-n+2},\dots,w\_{t-1})$$

- Other backoffs: Katz...
]
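---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
A minimal sketch of add-one smoothing for 2-grams, following the formula above with $x$ ranging over the vocabulary (toy corpus as before):

```
from collections import Counter

sentences = [["the", "movie", "was", "great"],
             ["the", "acting", "was", "poor"]]

uni, bi = Counter(), Counter()
for sent in sentences:
    uni.update(sent)
    bi.update(zip(sent, sent[1:]))
V = len(uni)  # vocabulary size

# add-one smoothed 2-gram: (1 + N(a,b)) / (V + N(a,*))
def p_add1(b, a):
    return (1 + bi[(a, b)]) / (V + uni[a])

# unseen sequences now get a small, non-zero probability
print(p_add1("zombie", "the"))  # 0.125
print(p_add1("movie", "the"))   # 0.25
```
]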
---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
  - Robust to typographic mistakes
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
  - Robust to typographic mistakes
  - But requires much more data than word n-grams!
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
  - Robust to typographic mistakes
  - But requires much more data than word n-grams!
  - Often combined with word n-grams
]

---

.center[
## Limitations
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- The number of potential n-grams increases exponentially with $n$
- Longer n-grams become very sparse:
  - bad statistics
  - cannot capture long dependencies
- In practice: maximum 5-grams
]

---

.center[
## Exercise: 2-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
**Extract all 2-grams from MR500train**

- How many different 2-grams are there?
- What are the top-5 most frequent?
]

---

.center[
## Exercise: 2-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
```
cut -c3- MR500train | awk '{for (i=1;i