Lexical Resources

class: center, middle
background-image:url(images/data-background-light.jpg)

# Lexical Resources

## Master TAL, Nancy, 2019-2020

#### Christophe Cerisara

.footnote[.bold[cerisara@loria.fr - CNRS / LORIA]]

---

.center[
## Use case for lexical resources
]

- Lexical resources:
  - Word embeddings -> used in all NLP tasks
  - WordNet
  - VerbNet
  - FrameNet

- Today:
  - PropBank
  - Use case: SRL

---

.center[
## Semantic Role Labeling
]

- aka *Shallow semantic parsing*
- NLP task
- In a sentence:
  - Find predicates (segment of text)
  - For each predicate:
    - Identify the predicate (opt)
    - Find its arguments (segments of text)
    - Find their roles (agent, patient...)

---

.center[
## Semantic Role Labeling
]

Example:

---

.center[
## Semantic Role Labeling
]

- First SRL system:
  - Daniel Gildea & Daniel Jurafsky (UC Berkeley, Stanford)
  - Based on FrameNet

Source: "Speech and Language Processing", D. Jurafsky & J. H. Martin, Chapter 20, Oct. 2019

---

.center[
## Why bother with roles?
]

- Identifying roles => Understanding a sentence
- Example:
  - A company analyses text reports and wants to fill in a table: *Company A acquires company B*
  - Identifying the roles enables filling this table

---

.center[
## Why bother with roles?
]

- Identifying roles => Generalize over surface syntactic realisations
- Example:
  - John [AGENT] broke the window [THEME]
  - The rock [INSTRUMENT] broke the window [THEME]
  - The window [THEME] broke
  - The window [THEME] was broken by John [AGENT]

---

.center[
## On the notion of role
]

- Very difficult to define a role
- Two approaches:
  - Specific to a verb (noun) or a group of verbs
  - Generalized semantic roles: PROTO-AGENT, PROTO-PATIENT

- **FrameNet**: roles specific to a general semantic idea = frame
- **PropBank**: use both proto-roles and verb-specific roles

---

.center[
## PropBank
]

- **Propositional Bank** = sentences annotated with roles
- English PropBank provides predicate-argument annotation for the entire Penn Treebank (Wall Street Journal)
- Roles defined for an individual verb sense
- Each specific role is noted with a number: Arg0, Arg1, Arg2...

- In general
  - Arg0 = PROTO-AGENT
  - Arg1 = PROTO-PATIENT

---

.center[
## PropBank
]

- Definition of the roles in **frame files**
- Ex: agree.01
  - Arg0: Agreer
  - Arg1: Proposition
  - Arg2: Other entity agreeing
  - Ex: The group agreed it would make an offer

- Ex: fall.01
  - Arg1: Logical subject, patient, thing falling
  - Arg2: Extent, amount fallen
  - Arg3: start point
  - Arg4: end point
  - Ex: Sales [A1] fell to \$25 [A4] from \$27 [A3]
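
---

.center[
## PropBank
]

A roleset can be pictured as a mapping from numbered arguments to verb-specific descriptions. The sketch below only illustrates that idea: the `ROLESETS` dictionary and the `describe` helper are hypothetical names, not the real frame-file format or an NLTK API.

```python
# Hypothetical in-memory view of two rolesets (not the real frame-file format).
ROLESETS = {
    "agree.01": {
        "Arg0": "Agreer",
        "Arg1": "Proposition",
        "Arg2": "Other entity agreeing",
    },
    "fall.01": {
        "Arg1": "Logical subject, patient, thing falling",
        "Arg2": "Extent, amount fallen",
        "Arg3": "start point",
        "Arg4": "end point",
    },
}


def describe(roleset_id, labelled_spans):
    """Pair each (text, label) span with the role description of the roleset."""
    roles = ROLESETS[roleset_id]
    return [(label, roles[label], text) for text, label in labelled_spans]


# "The group agreed it would make an offer"
spans = [("The group", "Arg0"), ("it would make an offer", "Arg1")]
for label, meaning, text in describe("agree.01", spans):
    print(f"{label} ({meaning}): {text}")
```

The same label (Arg1) means "Proposition" for agree.01 and "thing falling" for fall.01, which is why the roleset is needed to interpret a numbered argument.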

---

.center[
## PropBank
]

- PropBank includes non-numbered arguments **ArgM-TMP ArgM-LOC …**
- = Modifiers and adjuncts
- They are stable across predicates
- Not always listed in frame files

---

.center[
## PropBank
]

- PropBank focuses on verbs
- **NomBank** on nouns:
  - Ex: Apple [A0] 's agreement [pred] with IBM [A2]
- **FrameNet** generalizes across different verbs
  - Ex: The price of bananas [A1] increased 5% [A2]
  - Ex: The price of bananas [A1] rose 5% [A2]

- PropBank is the most extensive resource
  - Used in CoNLL shared tasks (CoNLL-2005, 2009, 2012)

---

.center[
## Semantic Role Labeling
]

- Current approaches to SRL use supervised machine learning, trained on FrameNet and PropBank.

**Method 1: Feature-based algorithm**
- Parse the sentence
- Traverse the parse to find predicates
- For each predicate,
  - Examine each node in the parse tree
  - Compute a feature vector for this (node, predicate) pair
  - Predict a label: 0 (no role), A1, A2...
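
---

.center[
## Semantic Role Labeling
]

The loop of Method 1 can be sketched on a toy tree. Everything here is an assumption made for illustration: the nested-tuple tree, the two features, and the hand-written `classify` rule that stands in for a trained classifier.

```python
# Toy constituency tree: (label, child, child, ...); leaves are words.
TREE = ("S",
        ("NP", ("NNP", "John")),
        ("VP", ("VBD", "broke"),
               ("NP", ("DT", "the"), ("NN", "window"))))


def nodes(tree, path=()):
    """Yield (path, subtree) for every non-leaf node, in pre-order."""
    if isinstance(tree, str):
        return
    yield path, tree
    for k, child in enumerate(tree[1:]):
        yield from nodes(child, path + (k,))


def words(tree):
    """Return the words covered by a subtree."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]


def features(node_path, node, pred_path):
    """Tiny feature vector; comparing paths is a rough proxy for word order."""
    return {"phrase_type": node[0],
            "position": "before" if node_path < pred_path else "after"}


def classify(feats):
    """Hand-written stand-in for a trained classifier (an assumption)."""
    if feats["phrase_type"] != "NP":
        return "0"
    return "A0" if feats["position"] == "before" else "A1"


def label_arguments(tree):
    results = []
    # Find predicates (here: any node with a verb POS tag).
    predicates = [(p, n) for p, n in nodes(tree) if n[0].startswith("VB")]
    for pred_path, pred in predicates:
        # Examine every node and predict a label for (node, predicate).
        for node_path, node in nodes(tree):
            label = classify(features(node_path, node, pred_path))
            if label != "0":
                results.append((words(pred)[0], label, " ".join(words(node))))
    return results


print(label_arguments(TREE))
# [('broke', 'A0', 'John'), ('broke', 'A1', 'the window')]
```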

---

.center[
## Semantic Role Labeling
]

Common enhancements of this algo:
- Pruning: use heuristics to remove unlikely arguments
- Separate identification / classification:
  - First, a classifier decides whether it's an argument or not
  - Second, another classifier infers the role
  - This enables designing specific features per classifier.
- Global optimization:
  - Viterbi algorithm
  - Re-ranking
  - Integer Linear Programming

---

.center[
## Semantic Role Labeling
]

- Typical features:
  - Predicate: *issued*
  - phrase type: *NP*
  - headword of constituent: *Examiner*
  - headword POS: *NNP*
  - path from predicate to constituent
  - voice: *passive*
  - position: *before*
  - …
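
---

.center[
## Semantic Role Labeling
]

The path feature is the least obvious of these features. A minimal sketch, assuming a toy nested-tuple tree in which nodes are addressed by tuples of child indices; `subtree` and `path_feature` are hypothetical helpers, not library functions.

```python
# Toy constituency tree: (label, child, child, ...); leaves are words.
TREE = ("S",
        ("NP", ("NNP", "John")),
        ("VP", ("VBD", "broke"),
               ("NP", ("DT", "the"), ("NN", "window"))))


def subtree(tree, path):
    """Follow a tuple of child indices down from the root."""
    for k in path:
        tree = tree[k + 1]  # index 0 holds the node label
    return tree


def path_feature(tree, pred_path, const_path):
    """Labels climbed from the predicate up to the lowest common ancestor,
    then down to the constituent."""
    common = 0
    while (common < min(len(pred_path), len(const_path))
           and pred_path[common] == const_path[common]):
        common += 1
    up = [subtree(tree, pred_path[:d])[0]
          for d in range(len(pred_path), common, -1)]
    top = subtree(tree, pred_path[:common])[0]
    down = [subtree(tree, const_path[:d])[0]
            for d in range(common + 1, len(const_path) + 1)]
    return "↑".join(up + [top]) + "".join("↓" + label for label in down)


print(path_feature(TREE, (1, 0), (0,)))    # VBD↑VP↑S↓NP  (subject NP)
print(path_feature(TREE, (1, 0), (1, 1)))  # VBD↑VP↓NP    (object NP)
```

The two noun phrases get different paths, which is what lets a classifier tell a likely subject from a likely object.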

---

.center[
## Semantic Role Labeling
]

- **Method 2: Neural algorithm**

- bi-LSTM IOB tagger (Same model used for POS-tagging, NER...)

---

.center[
## Semantic Role Labeling
]

- e.g.: (He et al., 2017)

- State-of-the-art: 6 to 8 layers with highway connections

Limitation: the tagger does not take global constraints into account
- e.g. I-ARG0 can only follow I-ARG0 or B-ARG0
- Sol:
  - Replace softmax with CRF layer
  - Use Viterbi to decode from the CRF or from the softmax
  - Optionally train n-grams over tags (when there is no CRF)
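
---

.center[
## Semantic Role Labeling
]

Hard IOB constraints can be enforced at decoding time. The sketch below assumes the per-token log-scores of the softmax are already available; the tag set and the numbers are made up for illustration.

```python
import math

TAGS = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1"]


def allowed(prev, cur):
    """IOB constraint: I-X may only follow B-X or I-X."""
    if cur.startswith("I-"):
        return prev in ("B-" + cur[2:], "I-" + cur[2:])
    return True


def viterbi(scores):
    """scores[t][tag] is a log-score; return the best sequence that is valid."""
    # A sequence cannot start with an I- tag.
    best = {tag: (-math.inf if tag.startswith("I-") else scores[0][tag])
            for tag in TAGS}
    backpointers = []
    for t in range(1, len(scores)):
        new_best, pointers = {}, {}
        for cur in TAGS:
            score, prev = max((best[p], p) for p in TAGS if allowed(p, cur))
            new_best[cur] = score + scores[t][cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(best, key=best.get)
    sequence = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        sequence.append(tag)
    return sequence[::-1]


scores = [
    {"O": -0.4, "B-ARG0": -1.2, "I-ARG0": -3.0, "B-ARG1": -4.0, "I-ARG1": -4.0},
    {"O": -2.0, "B-ARG0": -1.5, "I-ARG0": -0.5, "B-ARG1": -4.0, "I-ARG1": -4.0},
    {"O": -0.3, "B-ARG0": -3.0, "I-ARG0": -2.0, "B-ARG1": -3.0, "I-ARG1": -4.0},
]
greedy = [max(s, key=s.get) for s in scores]
print(greedy)           # ['O', 'I-ARG0', 'O']  (invalid: I-ARG0 follows O)
print(viterbi(scores))  # ['B-ARG0', 'I-ARG0', 'O']
```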

---

.center[
## SRL evaluation
]

- Each argument must be assigned exactly the correct word sequence
- Compute precision / recall / F1
- Datasets:
  - CoNLL 2005
  - CoNLL 2012
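
---

.center[
## SRL evaluation
]

Exact-match scoring can be sketched with sets. This is a simplified illustration, not the official CoNLL scorer: arguments are assumed to be (predicate, start, end, role) tuples with an exclusive end index, and `prf1` is a hypothetical helper.

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over sets of (predicate, start, end, role)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# "John broke the window": token indices, end exclusive.
gold = {("broke", 0, 1, "A0"), ("broke", 2, 4, "A1")}
# The predicted A1 span misses "the", so it counts as an error.
pred = {("broke", 0, 1, "A0"), ("broke", 3, 4, "A1")}
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```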

---

.center[
## SRL evaluation
]

[https://www.cs.upc.edu/~srlconll/](https://www.cs.upc.edu/~srlconll/)

- semantic arguments include Agent, Patient, Instrument, etc.
- adjunctive arguments indicating Locative, Temporal, Manner, Cause, etc.

[A0 He ] [AM-MOD would ] [AM-NEG n't ] [V accept ] [A1 anything of value ] from [A2 those he was writing about ] .
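
---

.center[
## SRL evaluation
]

The bracketed notation above is easy to read back into (label, words) pairs. A minimal sketch with a regular expression; `parse_brackets` is a hypothetical helper and assumes brackets are never nested.

```python
import re


def parse_brackets(text):
    """Return (label, words) for every "[LABEL words ]" chunk in the string."""
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(r"\[(\S+)\s+([^\]]*)\]", text)]


line = ("[A0 He ] [AM-MOD would ] [AM-NEG n't ] [V accept ] "
        "[A1 anything of value ] from [A2 those he was writing about ] .")
for label, words in parse_brackets(line):
    print(label, "->", words)
```

Words outside any bracket ("from" and the final period) carry no role and are skipped.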

---

.center[
## SRL evaluation
]

- Usage:

```
   from nltk.corpus import propbank
```

- PropBank is composed of annotations. Getting all annotations:

```
   ins = propbank.instances()
```

---

.center[
## SRL evaluation
]

- Each annotation contains pointers towards the position in the sentence

```
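   i = ins[103]
   i.fileid      # file containing the sentence
   i.sentnum     # sentence index within that file
   i.wordnum     # index of the predicate word in the sentence
   i.predicate   # TreePointer to the predicate
   i.arguments   # list of (location pointer, argument ID) pairs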
```

- Exercise: How many sentences do you have access to?

---

.center[
## SRL evaluation
]

- Each annotation contains pointers towards the position in the sentence

```
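   i = ins[103]
   i.fileid      # file containing the sentence
   i.sentnum     # sentence index within that file
   i.wordnum     # index of the predicate word in the sentence
   i.predicate   # TreePointer to the predicate
   i.arguments   # list of (location pointer, argument ID) pairs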
```

- Exercise: How many sentences do you have access to?
- <div>

```
len(set((x.fileid, x.sentnum) for x in ins))
```

</div>

---

.center[
## SRL evaluation
]

- You can (sometimes) access the tree itself:

```
   tr=i.tree
```

- And use the TreePointers to extract elements from the tree:

```
   i.predicate.select(tr)
```

- Exercise: How many predicates with a tree do you have access to?

---

.center[
## SRL evaluation
]

- You can (sometimes) access the tree itself:

```
   tr=i.tree
```

- And use the TreePointers to extract elements from the tree:

```
   i.predicate.select(tr)
```

- Exercise: How many predicates with a tree do you have access to?
- <div>

```
   len([x for x in ins if x.tree is not None])
   9353
```

</div>

---

.center[
## SRL evaluation
]

- Each argument gives a pair (location, argument ID)

```
   for (loc,aid) in i.arguments:
      print("%s %s" % (aid,loc.select(tr)))
```

- The tree can be flattened:

```
   loc,aid = i.arguments[0]
   print("%s %s" % (aid,loc.select(tr).pformat(500)))
```

```
   print("%s %s" % (aid,loc.select(tr).flatten()))
```

- Exercise: using help(),
  - what does the 500 in pformat(500) mean?
  - how can you export a tree to LaTeX?

---

.center[
## SRL evaluation
]

- PropBank is based on VerbNet
- Extensions:
  - NomBank
  - Unified Verb Index: mapping PropBank/VerbNet/FrameNet: https://verbs.colorado.edu/verb-index/vn3.3/
  - Unified PropBank: https://github.com/propbank/propbank-release applied to OntoNotes and English Web Treebank
  - Universal PropBank: https://github.com/System-T/UniversalPropositions
  - Non-English: Hindi, Chinese, Arabic, Finnish, Portuguese, Basque, Turkish

---

.center[
## SRL evaluation
]

Parsing results (Nov. 2018):

- https://arxiv.org/pdf/1804.08199.pdf
- https://github.com/strubell/LISA

---

.center[
## SRL evaluation
]

Easiest (?) tool nowadays to perform Semantic Role Labeling:

[https://pypi.org/project/practnlptools/1.0/](https://pypi.org/project/practnlptools/1.0/)

- Based on SENNA (Ronan Collobert)

---

.center[
## CoNLL Format
]

- Free (but automatically annotated) corpus: https://www.informatik.tu-darmstadt.de/ukp/research_6/data/semantic_role_resources/knowledge_based_semantic_role_labeling/index.en.jsp
- Small version: https://members.loria.fr/CCerisara/smallsrl.zip
- CoNLL tabular format, includes dep parses, roles...

Exercise on smallsrl:
- Read the README
- How many predicates are there in this small corpus?
- How many roles?
- Which roles are the most frequent?
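
---

.center[
## CoNLL Format
]

A possible starting point for reading a CoNLL-style file: one token per line, tab-separated columns, a blank line between sentences. The column positions and the sample rows below are assumptions made for illustration; the real layout of smallsrl is described in its README.

```python
from collections import Counter


def read_conll(lines):
    """Yield sentences as lists of rows; each row is a list of fields."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


# Hypothetical layout: ID, FORM, POS, HEAD, DEPREL, PRED,
# then one role column per predicate; "_" means empty.
PRED_COL, FIRST_ROLE_COL = 5, 6


def count_predicates_and_roles(sentences):
    n_predicates, roles = 0, Counter()
    for sentence in sentences:
        for row in sentence:
            if row[PRED_COL] != "_":
                n_predicates += 1
            roles.update(r for r in row[FIRST_ROLE_COL:] if r != "_")
    return n_predicates, roles


# Made-up sample; with a real file, pass an open file object instead.
sample = [
    "1\tJohn\tNNP\t2\tSBJ\t_\tA0",
    "2\tbroke\tVBD\t0\tROOT\tbreak.01\t_",
    "3\tthe\tDT\t4\tNMOD\t_\t_",
    "4\twindow\tNN\t2\tOBJ\t_\tA1",
    "",
]
n, roles = count_predicates_and_roles(read_conll(sample))
print(n, roles.most_common())
```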

---

.center[
## TP
]

Using the PropBank data available in NLTK:
- Write a baseline rule-based SRL parser:
  - Assume the predicate is known
  - Assign the A0 role to the first noun/pronoun on the left
  - And A1 role to the first noun/pronoun on the right
  - Compute F1
- Another rule-based baseline:
  - Assign the A0 role to the SUBJ subtree
  - A1 to the OBJ subtree
  - Compute F1
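
---

.center[
## TP
]

The first baseline can be tried on a toy tagged sentence before moving to PropBank. This is a sketch under assumptions: "first on the left/right" is read as the noun or pronoun nearest to the predicate, roles are single words, and `baseline_srl` is a hypothetical name.

```python
# Penn Treebank tag prefixes for nouns and pronouns
# (note that "PRP" also matches possessive pronouns, PRP$).
NOMINAL = ("NN", "PRP")


def baseline_srl(tagged, pred_index):
    """A0 = nearest noun/pronoun left of the predicate, A1 = nearest on its right."""
    roles = {}
    for k in range(pred_index - 1, -1, -1):
        if tagged[k][1].startswith(NOMINAL):
            roles["A0"] = k
            break
    for k in range(pred_index + 1, len(tagged)):
        if tagged[k][1].startswith(NOMINAL):
            roles["A1"] = k
            break
    return roles


sentence = [("John", "NNP"), ("broke", "VBD"), ("the", "DT"), ("window", "NN")]
print(baseline_srl(sentence, 1))  # {'A0': 0, 'A1': 3}
```

With real data, the predicted word indices still have to be turned into spans before they can be compared with the gold arguments.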

---
name: last-page
class: middle, center, inverse

## That's all folks (for now)!

Slideshow created using [remark](http://github.com/gnab/remark).