class: center, middle
background-image:url(images/data-background-light.jpg)

# Lexical Resources

## Master TAL, Nancy, 2019-2020

#### Christophe Cerisara

.footnote[.bold[cerisara@loria.fr - CNRS / LORIA]]

---

.center[
## Use case for lexical resources
]

- Lexical resources:
  - Word embeddings -> used in all NLP tasks
  - WordNet
  - VerbNet
  - FrameNet
- Today:
  - PropBank
  - Use case: SRL

---

.center[
## Semantic Role Labeling
]

- aka *Shallow semantic parsing*
- NLP task
- In a sentence:
  - Find predicates (segment of text)
  - For each predicate:
    - Identify the predicate (optional)
    - Find its arguments (segments of text)
    - Find their roles (agent, patient...)

---

.center[
## Semantic Role Labeling
]

Example:

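
One way to picture the result of these steps is as a small data structure: a predicate position plus a list of labelled argument spans. The sketch below is only illustrative; the sentence, the `Argument` class and the `render` helper are assumptions made for this example, not part of any SRL toolkit.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    role: str   # role label, e.g. "AGENT"
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)


def render(tokens, predicate_index, arguments):
    """Return the sentence with the predicate and arguments in brackets."""
    labels = {(predicate_index, predicate_index + 1): "PRED"}
    labels.update({(a.start, a.end): a.role for a in arguments})
    out, i = [], 0
    while i < len(tokens):
        # Look for a labelled span starting at the current token
        span = next(((s, e) for (s, e) in labels if s == i), None)
        if span is None:
            out.append(tokens[i])
            i += 1
        else:
            s, e = span
            out.append("[%s %s]" % (labels[span], " ".join(tokens[s:e])))
            i = e
    return " ".join(out)


# Illustrative analysis of one sentence
tokens = "John broke the window".split()
predicate_index = 1
arguments = [Argument("AGENT", 0, 1), Argument("THEME", 2, 4)]
print(render(tokens, predicate_index, arguments))
```
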
---

.center[
## Semantic Role Labeling
]

- First SRL system:
  - Daniel Gildea & Daniel Jurafsky (UC Berkeley, Stanford)
  - Based on FrameNet

Book: "Speech and Language Processing", D. Jurafsky & J. H. Martin, Chapter 20, Oct. 2019

---

.center[
## Why bother with roles?
]

- Identifying roles => understanding a sentence
- Example:
  - A company analyses text reports and wants to fill in a table: *Company A acquires company B*
  - Identifying the roles enables filling in this table

---

.center[
## Why bother with roles?
]

- Identifying roles => generalizing over surface syntactic realisations
- Example:
  - John [AGENT] broke the window [THEME]
  - The rock [INSTRUMENT] broke the window [THEME]
  - The window [THEME] broke
  - The window [THEME] was broken by John [AGENT]

---

.center[
## On the notion of role
]

- Very difficult to define a role
- Two approaches:
  - Specific to a verb (noun) or a group of verbs
  - Generalized semantic roles: PROTO-AGENT, PROTO-PATIENT
- **FrameNet**: roles specific to a general semantic idea = frame
- **PropBank**: uses both proto-roles and verb-specific roles

---

.center[
## PropBank
]

- **Propositional Bank** = sentences annotated with roles
- English PropBank provides predicate-argument annotation for the entire Penn Treebank (Wall Street Journal)
- Roles defined for an individual verb sense
- Each specific role is noted with a number: Arg0, Arg1, Arg2...
- In general
  - Arg0 = PROTO-AGENT
  - Arg1 = PROTO-PATIENT

---

.center[
## PropBank
]

- Definition of the roles in **frame files**
- Ex: agree.01
  - Arg0: Agreer
  - Arg1: Proposition
  - Arg2: Other entity agreeing
  - Ex: The group agreed it would make an offer
- Ex: fall.01
  - Arg1: Logical subject, patient, thing falling
  - Arg2: Extent, amount fallen
  - Arg3: start point
  - Arg4: end point
  - Ex: Sales [A1] fell to \$25 [A4] from \$27 [A3]

---

.center[
## PropBank
]

- PropBank includes non-numbered arguments **ArgM-TMP ArgM-LOC …**
  - = Modifiers
  - They are stable across predicates
  - Not always listed in frame files

---

.center[
## PropBank
]

- PropBank focuses on verbs
- **NomBank** on nouns:
  - Ex: Apple [A0] 's agreement [pred] with IBM [A2]
- **FrameNet** generalizes across different verbs
  - Ex: The price of bananas [A1] increased 5% [A2]
  - Ex: The price of bananas [A1] rose 5% [A2]
- PropBank is the most extensive resource
  - Used in CoNLL shared tasks (CoNLL-2005, 2009, 2012)

---

.center[
## Semantic Role Labeling
]

- Current approaches to SRL use supervised machine learning, trained on FrameNet and PropBank.

**Method 1: Feature-based algorithm**

- Parse the sentence
- Traverse the parse to find predicates
- For each predicate,
  - Examine each node in the parse tree
  - Compute a feature vector for this (node, predicate) pair
  - Predict either 0, A1, A2...

---

.center[
## Semantic Role Labeling
]

Common enhancements of this algorithm:

- Pruning: use heuristics to remove unlikely arguments
- Separate identification / classification:
  - First, a classifier decides whether it's an argument or not
  - Second, another classifier infers the role
  - This enables designing specific features per classifier.
- Global optimization:
  - Viterbi algorithm
  - Re-ranking
  - Integer Linear Programming

---

.center[
## Semantic Role Labeling
]

- Typical features:
  - Predicate: *issued*
  - phrase type: *NP*
  - headword of constituent: *Examiner*
  - headword POS: *NNP*
  - path from predicate to constituent
  - voice: *passive*
  - position: *before*
  - …

---

.center[
## Semantic Role Labeling
]

- **Method 2: Neural algorithm**
  - bi-LSTM IOB tagger (same model used for POS-tagging, NER...)
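
To make the IOB encoding concrete, here is a minimal sketch of how labelled argument spans can be turned into one tag per token, which is the output format such a tagger predicts. The function name `spans_to_iob`, the half-open span convention and the choice to include the preposition in each span are assumptions made for illustration.

```python
def spans_to_iob(n_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into IOB tags."""
    tags = ["O"] * n_tokens                 # tokens outside any span
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for k in range(start + 1, end):
            tags[k] = "I-" + label          # remaining tokens of the span
    return tags


# Illustrative sentence with hand-written spans (not taken from a corpus file)
tokens = ["Sales", "fell", "to", "$25", "from", "$27"]
spans = [(0, 1, "A1"), (1, 2, "V"), (2, 4, "A4"), (4, 6, "A3")]
tags = spans_to_iob(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

The tagger is then trained to predict one such tag for every token of the sentence.
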
---

.center[
## Semantic Role Labeling
]

- e.g.: (He et al., 2017)
- State-of-the-art: 6 to 8 layers with highway connections

Limitation: the tagger does not take global constraints into account

- e.g. I-ARG0 can only follow I-ARG0 or B-ARG0
- Solutions:
  - Replace softmax with a CRF layer
  - Use Viterbi to decode from the CRF or from the softmax
  - Possibly train tag n-grams (when there is no CRF)

---

.center[
## SRL evaluation
]

- each argument must be assigned exactly the correct word sequence
- compute precision / recall / F1
- Datasets:
  - CoNLL 2005
  - CoNLL 2012

---

.center[
## SRL evaluation
]

[https://www.cs.upc.edu/~srlconll/](https://www.cs.upc.edu/~srlconll/)

- semantic arguments include Agent, Patient, Instrument, etc.
- adjunctive arguments indicate Locative, Temporal, Manner, Cause, etc.

[A0 He ] [AM-MOD would ] [AM-NEG n't ] [V accept ] [A1 anything of value ] from [A2 those he was writing about ] .

---

.center[
## SRL evaluation
]

- Usage:

```
from nltk.corpus import propbank
```

- PropBank is composed of annotations. Getting all annotations:

```
ins = propbank.instances()
```

---

.center[
## SRL evaluation
]

- Each annotation contains pointers towards the position in the sentence

```
i = ins[103]
i.fileid
i.sentnum
i.wordnum
i.predicate   # TreePointer
i.arguments   # list of (TreePointer, argument ID) pairs
```

- Exercise: How many sentences do you have access to?

---

.center[
## SRL evaluation
]

- Each annotation contains pointers towards the position in the sentence

```
i = ins[103]
i.fileid
i.sentnum
i.wordnum
i.predicate   # TreePointer
i.arguments   # list of (TreePointer, argument ID) pairs
```

- Exercise: How many sentences do you have access to?
- Answer:
```
# Count distinct (file, sentence) pairs over all PropBank instances
len(set((x.fileid, x.sentnum) for x in ins))
```
---

.center[
## SRL evaluation
]

- You can (sometimes) get access to the tree itself:

```
tr = i.tree
```

- And use the TreePointers to extract elements from the tree:

```
i.predicate.select(tr)
```

- Exercise: How many predicates with a tree do you have access to?

---

.center[
## SRL evaluation
]

- You can (sometimes) get access to the tree itself:

```
tr = i.tree
```

- And use the TreePointers to extract elements from the tree:

```
i.predicate.select(tr)
```

- Exercise: How many predicates with a tree do you have access to?
- Answer:
```
# Count the instances whose parse tree is available
len([1 for x in ins if x.tree is not None])
# 9353
```
---

.center[
## SRL evaluation
]

- Each argument gives a pair (location, argument ID)

```
for (loc, aid) in i.arguments:
    print("%s %s" % (aid, loc.select(tr)))
```

- The tree can be flattened:

```
loc, aid = i.arguments[0]
print("%s %s" % (aid, loc.select(tr).pformat(500)))
```

or

```
print("%s %s" % (aid, loc.select(tr).flatten()))
```

- Exercise: using help(),
  - what does the 500 in pformat(500) mean?
  - how can you export a tree to LaTeX?

---

.center[
## SRL evaluation
]

- PropBank is based on VerbNet
- Extensions:
  - NomBank
  - Unified Verb Index: mapping PropBank/VerbNet/FrameNet: https://verbs.colorado.edu/verb-index/vn3.3/
  - Unified PropBank: https://github.com/propbank/propbank-release applied to OntoNotes and English Web Treebank
  - Universal PropBank: https://github.com/System-T/UniversalPropositions
  - Non-English: Hindi, Chinese, Arabic, Finnish, Portuguese, Basque, Turkish

---

.center[
## SRL evaluation
]

Parsing results: Nov. 2018:

- https://arxiv.org/pdf/1804.08199.pdf
- https://github.com/strubell/LISA

---

.center[
## SRL evaluation
]

Easiest (?) tool nowadays to perform Semantic Role Labeling:

[https://pypi.org/project/practnlptools/1.0/](https://pypi.org/project/practnlptools/1.0/)

- Based on SENNA (Ronan Collobert)

---

.center[
## CoNLL Format
]

- Free (but automatically annotated) corpus: https://www.informatik.tu-darmstadt.de/ukp/research_6/data/semantic_role_resources/knowledge_based_semantic_role_labeling/index.en.jsp
- Small version: https://members.loria.fr/CCerisara/smallsrl.zip
- CoNLL tabular format, includes dependency parses, roles...

Exercise on smallsrl:

- Read the README
- How many predicates in this small corpus?
- How many roles?
- Most frequent roles?

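
As a possible starting point for this exercise, here is a minimal sketch of counting predicates and roles in a CoNLL-style tabular file. It assumes a CoNLL-2009-like layout (column 13 is `FILLPRED`, column 14 is `PRED`, the following columns hold one argument column per predicate); the actual layout of smallsrl is described in its README and may differ. The inline sample and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical sample, whitespace-separated, CoNLL-2009-like columns:
# ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED APRED1 ...
SAMPLE = """\
1 John john john NNP NNP _ _ 2 2 SBJ SBJ _ _ A0
2 broke break break VBD VBD _ _ 0 0 ROOT ROOT Y break.01 _
3 the the the DT DT _ _ 4 4 NMOD NMOD _ _ _
4 window window window NN NN _ _ 2 2 OBJ OBJ _ _ A1

1 Sales sale sale NNS NNS _ _ 2 2 SBJ SBJ _ _ A1
2 fell fall fall VBD VBD _ _ 0 0 ROOT ROOT Y fall.01 _
"""


def count_predicates_and_roles(text):
    """Return (number of predicates, Counter of role labels)."""
    n_predicates = 0
    roles = Counter()
    for line in text.splitlines():
        cols = line.split()
        if not cols:                 # blank line = sentence boundary
            continue
        if cols[12] == "Y":          # FILLPRED marks predicate tokens
            n_predicates += 1
        # Every non-empty cell in the argument columns is one role
        roles.update(c for c in cols[14:] if c != "_")
    return n_predicates, roles


n_predicates, roles = count_predicates_and_roles(SAMPLE)
print(n_predicates, sum(roles.values()), roles.most_common(1))
```

For a real file, the same function can be applied to the file contents once the column indices have been checked against the README.
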
---

.center[
## Practical session (TP)
]

Using the PropBank data available in NLTK

- Write a baseline rule-based SRL parser:
  - Assume the predicate is known
  - Assign the A0 role to the first noun/pronoun on the left
  - And the A1 role to the first noun/pronoun on the right
  - Compute F1
- Another rule-based baseline:
  - Assign the A0 role to the SUBJ subtree
  - A1 to the OBJ subtree
  - Compute F1

---
name: last-page
class: middle, center, inverse

## That's all folks (for now)!

Slideshow created using [remark](http://github.com/gnab/remark).