class: center, middle
background-image: url(images/data-background-light.jpg)

# Lexical Resources

## Master TAL, Nancy, 2019-2020

.footnote[.bold[[Christophe Cerisara](mailto:cerisara@loria.fr) CNRS / LORIA]]

---

.center[
## What is a lexical resource?
]
.left-column[
#### Definition
]
.right-column[
- Any content **usable by a computer** that gives information about **lexicons**
- lexicon = the vocabulary of a person, language, or branch of knowledge
- Obviously, old printed encyclopedias are lexical resources... but they cannot be processed by a computer!
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
- All texts available to computers (on the internet, on USB keys, on hard drives, on DVDs...)
- Raw texts are by far the most useful source of information about lexicons
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
- All texts available to computers (on the internet, on USB keys, on hard drives, on DVDs...)
- Raw texts are by far the most useful source of information about lexicons
- 2 ways to define a lexicon:
  - Ask an expert to write it down from their knowledge
  - Extract the lexicon from texts in the domain
]

---

.center[
## What are the first and most important lexical resources?
]
.left-column[
#### Definition
]
.right-column[
- All texts available to computers (on the internet, on USB keys, on hard drives, on DVDs...)
- Raw texts are by far the most useful source of information about lexicons
- 2 ways to define a lexicon:
  - Ask an expert to write it down from their knowledge
  - Extract the lexicon from texts in the domain
- "good data is more data": the more text you can analyze, the richer the information you'll get
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
- DBpedia, common sense
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
- DBpedia, common sense
- WordNet, FrameNet, ConceptNet...
]

---

.center[
## Other resources?
]
.left-column[
#### Definition
]
.right-column[
- parallel corpora
- Lists of words, with or without additional information (lemma, POS, pronunciation, translation...)
- lexical/semantic networks
- (specialized) dictionaries (with definitions, usage examples...)
- thesauri, ontologies
- collaborative dictionaries: Wikipedia, Wiktionary
- DBpedia, common sense
- WordNet, FrameNet, ConceptNet...
- pretrained lexical models
  - ngrams
  - word embeddings
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
- Lexical resources: distributional
  - Historical perspective
  - Word2Vec embeddings
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
- Lexical resources: distributional
  - Historical perspective
  - Word2Vec embeddings
- Lexical resources: hand-made
  - WordNet, FrameNet
  - Wiktionary
  - Wikipedia
  - ISO standards
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
- Lexical resources: ngrams
  - How to train them
  - Existing resources (Google n-grams)
- Lexical resources: distributional
  - Historical perspective
  - Word2Vec embeddings
- Lexical resources: hand-made
  - WordNet, FrameNet
  - Wiktionary
  - Wikipedia
  - ISO standards
- Multi-lingual resources:
  - Multi-lingual BERT
  - Parallel corpora
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:

- Know the basics of Python
  - How to edit & run a Python file
  - Python variables, lists, strings, functions, loops...
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:

- Know the basics of Python
  - How to edit & run a Python file
  - Python variables, lists, strings, functions, loops...
- Access to a computer (in & outside class)
  - With python + numpy + scipy + nltk installed
  - Recommended way: **anaconda**
- Internet access in & outside class
]

---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
Course requirements:

- Know the basics of Python
  - How to edit & run a Python file
  - Python variables, lists, strings, functions, loops...
- Access to a computer (in & outside class)
  - With python + numpy + scipy + nltk installed
  - Recommended way: **anaconda**
- Internet access in & outside class
- Any question:
  - cerisara@loria.fr
]
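---

.center[
## Course plan
]
.left-column[
#### Definition
#### Course plan
]
.right-column[
To check your setup before the first exercise, a minimal sketch (any recent versions of these packages should do):

```
# check that the required packages are importable, and print their versions
import numpy, scipy, nltk
print(numpy.__version__, scipy.__version__, nltk.__version__)

# fetch the nltk tokenizer models used later for pre-processing
nltk.download("punkt")
```
]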
---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
  - their combination => compositional (sentence) semantics
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
  - their combination => compositional (sentence) semantics
  - discover multi-word expressions
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- We can extract a lot of information from them:
  - list of word forms
  - their diachronic usage => lexical drift
  - co-occurrence of words => lexical semantics
  - compute word embeddings => synonyms, antonyms, lexical analogies
  - their relations => syntagmatic, paradigmatic relations
  - their combination => compositional (sentence) semantics
  - discover multi-word expressions
  - decompose them => morphological information
  - ...
]

---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Size matters => processing must be computationally efficient
]
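---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
For instance, the co-occurrence counts mentioned above can be accumulated in a single pass over the corpus; a minimal sketch (the two-sentence toy corpus and the window size of 2 are arbitrary choices):

```
from collections import Counter

sentences = [["the", "movie", "was", "great"],
             ["the", "acting", "was", "poor"]]

cooc = Counter()
for sent in sentences:
    for i, w in enumerate(sent):
        # count pairs of words at most 2 positions apart
        for v in sent[i+1:i+3]:
            cooc[(w, v)] += 1

print(cooc.most_common(3))
```
]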
---

.center[
## Most important lexical resource: raw texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- How to extract lexical information from raw texts?
  - ngrams: more syntax-oriented
  - word embeddings: more semantics-oriented
  - can be at word level or subword level, to capture morpho-syntactic information
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
- Scrape the texts (see other course)
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
- Scrape the texts (see other course)
- Preprocess the texts (see other course)
]

---

.center[
## How to extract lexical information from raw texts?
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
In practice, one has to handle the following issues:

- Choose one or multiple sources of texts
- Scrape the texts (see other course)
- Preprocess the texts (see other course)
- Extract the lexical information
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Which domain? What type of language?

- Are you interested in capturing generic information, or in a specialized domain (healthcare...)?
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Which domain? What type of language?

- Are you interested in capturing generic information, or in a specialized domain (healthcare...)?
- Large variability in language:
  - Casual: forums, conversations...
  - Micro-blogs
  - Formal: books...
  - Journalistic: news
  - Educational: MOOCs, tutorials...
  - ...
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!
- Check whether there is a license, like *Creative Commons*; otherwise: "all rights reserved"
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!

- Check whether there is a license, like *Creative Commons*; otherwise: "all rights reserved"
- Beyond legal aspects, there are growing concerns about privacy & the right to be forgotten
  - Anonymization does not guarantee privacy!
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
### Do I have the right to scrape the text?

Just because it's public doesn't mean you can copy it!

- Check whether there is a license, like *Creative Commons*; otherwise: "all rights reserved"
- Beyond legal aspects, there are growing concerns about privacy & the right to be forgotten
  - Anonymization does not guarantee privacy!
- Twitter provides an API to download some data, but forbids you from keeping it on your hard drive
- You cannot redistribute texts without an explicit CC-BY license

Wait... hasn't Google been scraping the whole web for years?
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Unsafe sources of texts:

- Social media
- Most web pages

Safe sources of texts:

- Wikipedia & derivatives (CC BY-SA)
- Scientific papers: arXiv, PubMed, HAL...
- Owner-released datasets: AskUbuntu archives, reddit archives, (Common Crawl), (WebTimeMachine)...
- Gutenberg, Gallica...
]

---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
And this is only a very small part of the data available on the web
]
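---

.center[
## Selecting sources of texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
If you do crawl a site yourself, at least check its robots.txt first; a minimal sketch (note that robots.txt states a crawling policy, it says nothing about copyright):

```
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
# is a generic crawler allowed to fetch this page?
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Lexicon"))
```
]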
---

.center[
## Scraping texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
There are several ways to download corpora:

- APIs: not standard, may change, heavy for servers
- dump archives (wikipedia, reddit...)
- peer-to-peer (academic torrents)
- OAI-PMH
- ...

See the course on basic NLP techniques (Yannick Parmentier)
]

---

.center[
## Scraping texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
We're going to use the Movie Review corpus for now: a corpus of public film reviews widely used for research.
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
  - Segment into sentences and words (tokenize); take care of punctuation
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
  - Segment into sentences and words (tokenize); take care of punctuation
  - Optionally normalize (remove URLs, numbers, smileys; lowercase...)
]

---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
- Texts often come with metadata, in XML, JSON...
- You have to extract the texts (+ interesting metadata) with adequate parsers
- Then:
  - Filter out garbage (texts in another language, errors...)
  - Segment into sentences and words (tokenize); take care of punctuation
  - Optionally normalize (remove URLs, numbers, smileys; lowercase...)
  - Optionally compute features (lemmas, stemming, POS tags, NER...)

See Claire Gardent's course.
]
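---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
A minimal pre-processing sketch with nltk (the regexes and the example string are illustrative choices, not a prescribed pipeline):

```
import re
# requires the nltk "punkt" models (see the setup check earlier)
from nltk.tokenize import sent_tokenize, word_tokenize

raw = "Great movie! See http://example.com for the trailer. 10/10"

# normalize: remove URLs and numbers, then lowercase
text = re.sub(r"https?://\S+", " ", raw)
text = re.sub(r"\b\d+(/\d+)?\b", " ", text).lower()

# segment into sentences, then into words
for sent in sent_tokenize(text):
    print(word_tokenize(sent))
```
]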
---

.center[
## Pre-processing texts
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Download the preprocessed 500-line Movie Review corpus (7 MB) here:

https://synalp.loria.fr/resL5.zip

Look at the file data/l5/MR500train
]

---

.center[
## The Movie Review corpus
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
i have seen this film at least times and i am still excited by it the acting is perfect and the romance between joe and jean keeps me on the edge of my seat plus i still think bryan brown is the tops brilliant film

----

this movie features charlie spradling dancing in a strip club beyond that it features a truly bad script with dull unrealistic dialogue that it got as many positive votes suggests some people may be joking

----

this film contain far too much meaningless violence too much shooting and blood the acting seems very unrealistic and is generally poor the only reason to see this film is if you like very old cars
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
**How many different words are there in the corpus MR500train?**
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Print one word per line, sort them and count them:

Linux & Mac:

```
cut -c2- MR500train | awk '{for (i=1;i<=NF;i++) print $i}' | sort | uniq -c | wc -l
```

Windows:

- Use the Ubuntu subsystem
- Or run a lubuntu VM with VirtualBox
- Or do it in Python
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
```
from collections import Counter

with open("MR500train", "r") as f:
    lines = f.readlines()

co = Counter()
for l in lines:
    # skip the leading label characters, then count space-separated words
    co.update(l[2:].strip().split(" "))

print(co.most_common(10))
print(len(co))
```
]

---

.center[
## Building the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
```
...
  4 zodiac
  2 zoey
  1 zoeys
 21 zombie
 16 zombies
  1 zombieverse
...
```
]

---

.center[
## Analyzing the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
**Plot the frequency of each word, in decreasing order**
]

---

.center[
## Analyzing the lexicon
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Linux, Mac:

```
cut -c2- MR500train | awk '{for (i=1;i<=NF;i++) print $i}' | sort | uniq -c | awk '{print $1}' | sort -n -r > tt
echo 'plot "tt" w histo' | gnuplot -p
```

Python:

```
from collections import Counter
import matplotlib.pyplot as mp

with open("MR500train", "r") as f:
    lines = f.readlines()

co = Counter()
for l in lines:
    co.update(l[2:].strip().split(" "))

# frequencies sorted in decreasing order
y = list(co.values())
y.sort(reverse=True)
x = range(len(y))
mp.plot(x, y)
mp.show()
```
]

---

.center[
## Zipf's law
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
]
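---

.center[
## Zipf's law
]
.left-column[
#### Definition
#### Course plan
#### Texts
]
.right-column[
Zipf's law: the frequency of the $r$-th most frequent word is roughly proportional to $1/r$, so the rank-frequency curve is roughly a straight line in log-log scale. A minimal sketch to check this on MR500train (reusing the counts from the previous slides):

```
from collections import Counter
import matplotlib.pyplot as mp

with open("MR500train", "r") as f:
    co = Counter(w for l in f for w in l[2:].strip().split(" "))

y = sorted(co.values(), reverse=True)
# under Zipf's law, log-frequency vs log-rank is roughly linear
mp.loglog(range(1, len(y) + 1), y)
mp.show()
```
]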
---

.center[
## Unigrams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Raw counts depend on the size of the data
- Normalized, they give the **1-gram**:

$$P(w) = \frac {N(w)} {N(*)}$$

- Probability that a word occurs in the language
]

---

.center[
## Bigrams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Count the sequences $N(a,b)$
- Divide by all sequences $N(a,*)$
- The 2-gram gives the probability that $b$ follows $a$:

$$P(b|a) = \frac {N(a,b)} {N(a,*)}$$

Note:

$$P(b|a) = \frac {P(a,b)}{P(a)} = \frac{\frac{N(a,b)}{N(\*,\*)}}{\frac{N(a)}{N(\*)}}$$
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Generalization to a sequence of length $n$:

$$P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \frac{N(w\_{t-n+1},\dots,w\_{t-1},w\_t)}{N(w\_{t-n+1},\dots,w\_{t-1},\*)}$$
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
- Identify writing styles (regional)
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
- Identify writing styles (regional)
- Identify badly written sentences
]

---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
Ngrams contain rich lexical information:

- Which words are frequent? Which are rare?
- Clues to find multi-word expressions: "in short", "William Shakespeare"...
  - See: https://github.com/nert-nlp/streusle/
- Study word usage across time
- Suggest the following words
- Identify writing styles (regional)
- Identify badly written sentences
- Useful in plagiarism detection
- ...
]
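---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
A minimal sketch of these estimates on a toy corpus, applying $P(b|a) = N(a,b)/N(a,\*)$ and using it to suggest the following word (the sentences are made up; unigram counts stand in for $N(a,\*)$, which ignores sentence-final effects):

```
from collections import Counter

sentences = [["the", "movie", "was", "great"],
             ["the", "acting", "was", "poor"]]

uni, bi = Counter(), Counter()
for sent in sentences:
    uni.update(sent)
    bi.update(zip(sent, sent[1:]))

# P(b|a) = N(a,b) / N(a,*)
def p(b, a):
    return bi[(a, b)] / uni[a]

print(p("movie", "the"))                    # 0.5
# suggest the most likely word after "was"
print(max(uni, key=lambda b: p(b, "was")))  # great
```
]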
---

.center[
## N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
]

---

.center[
## Training N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Easy to train:
  - accumulate counts
  - can be done online
]

---

.center[
## Training N-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Easy to train:
  - accumulate counts
  - can be done online
- The most difficult part is scraping & pre-processing the texts, so:
  - Google N-grams: https://books.google.com/ngrams
  - Trained on 1,000G tokens: https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
  - Free (trained on 430M words): https://www.ngrams.info/
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 1: N-gram smoothing
  - Add a pseudo-count for every possible sequence

$$P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \frac{1+N(w\_{t-n+1},\dots,w\_{t-1},w\_t)}{\sum\_x \left( 1+ N(w\_{t-n+1},\dots,w\_{t-1},x) \right)}$$
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 1: N-gram smoothing
  - Add a pseudo-count for every possible sequence

$$P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \frac{1+N(w\_{t-n+1},\dots,w\_{t-1},w\_t)}{\sum\_x \left( 1+ N(w\_{t-n+1},\dots,w\_{t-1},x) \right)}$$

- Other smoothings: Good-Turing, Kneser-Ney...
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Problem: all unseen sequences have the same probability
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Problem: all unseen sequences have the same probability
- Smoothing may be used in conjunction with backoff:
  - linear interpolation:

$$\hat P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \lambda P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) + $$
$$(1-\lambda) P(w\_t|w\_{t-n+2},\dots,w\_{t-1})$$
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Problem: all unseen sequences have the same probability
- Smoothing may be used in conjunction with backoff:
  - linear interpolation:

$$\hat P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) = \lambda P(w\_t|w\_{t-n+1},\dots,w\_{t-1}) + $$
$$(1-\lambda) P(w\_t|w\_{t-n+2},\dots,w\_{t-1})$$

- Other backoffs: Katz...
]
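---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
A minimal sketch of add-one smoothing for 2-grams, following the formula above with $x$ ranging over the vocabulary (toy corpus as before):

```
from collections import Counter

sentences = [["the", "movie", "was", "great"],
             ["the", "acting", "was", "poor"]]

uni, bi = Counter(), Counter()
for sent in sentences:
    uni.update(sent)
    bi.update(zip(sent, sent[1:]))
V = len(uni)  # vocabulary size

# add-one smoothed 2-gram: (1 + N(a,b)) / (V + N(a,*))
def p_add1(b, a):
    return (1 + bi[(a, b)]) / (V + uni[a])

# unseen sequences now get a small, non-zero probability
print(p_add1("zombie", "the"))  # 0.125
print(p_add1("movie", "the"))   # 0.25
```
]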
---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
  - Robust to typographic mistakes
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
  - Robust to typographic mistakes
  - But requires much more data than word n-grams!
]

---

.center[
## Rare/unseen sequences
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- Sol 2: N-grams of sub-words
  - Character n-grams
  - Good for agglutinative languages...
  - Capture common prefixes, suffixes...
  - Very good at language detection
  - Handle proper names
  - Robust to typographic mistakes
  - But requires much more data than word n-grams!
  - Often combined with word n-grams
]

---

.center[
## Limitations
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
- The number of potential n-grams increases exponentially with $n$
- Longer n-grams become very sparse:
  - bad statistics
  - cannot capture long dependencies
- In practice: maximum 5-grams
]

---

.center[
## Exercise: 2-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
**Extract all 2-grams from MR500train**

- How many different 2-grams are there?
- What are the top-5 most frequent?
]

---

.center[
## Exercise: 2-grams
]
.left-column[
#### Definition
#### Course plan
#### Texts
#### N-grams
]
.right-column[
```
cut -c3- MR500train | awk '{for (i=1;i