Imitation Learning for Structured Prediction and Automated Fact Checking

Andreas Vlachos
a.vlachos@sheffield.ac.uk

Department of Computer Science
University of Sheffield

### Part 0: Why this title?

These are my two main research topics:

  • imitation learning for structured prediction
  • automated fact checking

Some context is needed: **machine learning for natural language processing**

Why Natural Language Processing?

Why ML for NLP?

### Words of caution

When exploring a task, it is often useful to experiment with some simple rules to test our assumptions.

For some tasks, rule-based approaches still rule:

  • coreference resolution
  • natural language generation

If we don't know how to perform a task ourselves, it is unlikely that an ML algorithm will figure it out (no learning without bias).

Part 1: Imitation learning for structured prediction in natural language processing

Joint work with:

  • Gerasimos Lampouras (UCL, Sheffield)
  • Sebastian Riedel, Jason Naradowsky, James Goodman (UCL)
  • Daniel Beck, Isabelle Augenstein (Sheffield)
  • Stephen Clark (Cambridge)
  • Mark Craven (Wisconsin-Madison)

Structured prediction in NLP

word: I    studied  in  London  with  Sebastian  Riedel
PoS:  PRP  VBD      IN  NNP     IN    NNP        NNP
NER:  O    O        O   B-LOC   O     B-PER      I-PER

  • part of speech (PoS) tagging
  • named entity recognition (NER)

Input: a sentence $\mathbf{x}=[x_1...x_N]$
Output: a sequence of labels $\mathbf{y}=[y_{1}\ldots y_{N}] \in {\cal Y}^N$

More Structured Prediction

  • syntactic parsing
  • semantic parsing, question answering, etc.

Input: a sentence $\mathbf{x}=[x_1...x_N]$
Output: a graph $\mathbf{G}=(V,E) \in {\cal G_{\mathbf{x}}}$

Even More Structured Prediction

Natural language generation (NLG), but also summarization, decoding in machine translation, etc.

Input: a meaning representation/database record
Output: a word sequence $\mathbf{w}=[w_1 \ldots w_N],\; w_i \in {\cal V} \cup \{\text{END}\}$

Imitation Learning for Structured Prediction

We assume gold-standard outputs for supervised training,

but we train a classifier to predict the actions that construct the output.

The actions themselves are not in the gold standard;
imitation learning is thus rather semi-supervised.

Incremental structured prediction

Breaking structured prediction into actions

Incremental structured prediction

A classifier predicts one label at a time, given the previous ones (a minimal sketch follows the list below):

\begin{align}
\hat y_1 &= \mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x})\\
\hat y_2 &= \mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_1)\\
&\;\;\vdots\\
\hat y_N &= \mathop{\arg \max}_{y \in {\cal Y}} f(y, \mathbf{x}, \hat y_{1} \ldots \hat y_{N-1})
\end{align}

giving $\mathbf{\hat y} = [\hat y_1 \ldots \hat y_N]$.
  • use our favourite classifier
  • no restrictions on features
  • prone to error propagation (i.i.d. assumption broken)
  • local model not trained wrt the task-level loss
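A minimal sketch of this greedy loop in Python; `score` (playing the role of $f$ above) and `LABELS` (the label set ${\cal Y}$) are assumed helpers, not anything defined in the talk:

```python
def greedy_predict(x, score, LABELS):
    """Predict one label per token, conditioning on the previous predictions."""
    y_hat = []
    for _ in range(len(x)):
        # Each step is a plain classification decision over the label set,
        # given the sentence and the (possibly wrong) earlier predictions.
        best = max(LABELS, key=lambda y: score(y, x, list(y_hat)))
        y_hat.append(best)
    return y_hat
```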

Imitation learning for Part of Speech tagging

Gold standard

expert policy: at each word return the correct PoS tag

Imitation learning for Part of Speech tagging

Standard training (exact imitation of the expert)

| word | label   | features                   |
|------|---------|----------------------------|
| I    | Pronoun | token=I, prev=NULL...      |
| can  | Modal   | token=can, prev=Pronoun... |
| fly  | Verb    | token=fly, prev=Modal...   |

Imitation learning for Part of Speech tagging

Labels as costs

| word | Pronoun | Modal | Verb | Noun | features                   |
|------|---------|-------|------|------|----------------------------|
| I    | 0       | 1     | 1    | 1    | token=I, prev=NULL...      |
| can  | 1       | 0     | 1    | 1    | token=can, prev=Pronoun... |
| fly  | 1       | 1     | 0    | 1    | token=fly, prev=Modal...   |
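Each row is a cost-sensitive classification instance: every label receives a cost rather than there being a single gold answer. One simple reduction to ordinary weighted multiclass learning (an assumed recipe for illustration, not necessarily the exact one used in our systems) is to train on the minimum-cost label, weighted by the gap between the worst and the best choice:

```python
# Hypothetical reduction from a cost vector to one weighted multiclass example.
def to_weighted_example(features, costs):
    """costs maps each label to its cost, e.g. {"Pronoun": 1, "Modal": 0, ...}."""
    best = min(costs, key=costs.get)              # cheapest action ("Modal" for "can")
    weight = max(costs.values()) - costs[best]    # gap between worst and best label
    return features, best, weight

print(to_weighted_example({"token": "can", "prev": "Pronoun"},
                          {"Pronoun": 1, "Modal": 0, "Verb": 1, "Noun": 1}))
# ({'token': 'can', 'prev': 'Pronoun'}, 'Modal', 1)
```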

Imitation learning for Part of Speech tagging

Breaking down action costing

  • rollin to get a trajectory through the sentence
  • for each possible label:
    • rollout till the end
    • cost the complete output with the task loss
  • construct a training instance (sketched below)
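A rough sketch of this costing loop for tagging, assuming `rollin_policy` and `rollout_policy` each map (sentence, prefix of labels) to the next label, and that the task loss is Hamming loss (the number of incorrect tags), as in the following slides:

```python
def hamming_loss(predicted, gold):
    """Task loss for tagging: number of incorrect tags."""
    return sum(p != g for p, g in zip(predicted, gold))

def costed_instances(x, gold, rollin_policy, rollout_policy, LABELS):
    instances = []
    prefix = []                                     # roll-in trajectory so far
    for i in range(len(x)):
        costs = {}
        for label in LABELS:
            # Tentatively take `label` at position i, then roll out to the end.
            seq = prefix + [label]
            for _ in range(i + 1, len(x)):
                seq.append(rollout_policy(x, seq))
            costs[label] = hamming_loss(seq, gold)  # cost the complete output
        instances.append((x, list(prefix), costs))  # one cost-sensitive instance
        prefix.append(rollin_policy(x, prefix))     # continue the roll-in
    return instances
```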

Imitation learning for Part of Speech tagging

Breaking down action costing

  • rollin and rollout follow the expert policy
  • loss is the number of incorrect tags
  • correct label has 0 cost, the rest 1
| word | Pronoun | Modal | Verb | Noun | features                   |
|------|---------|-------|------|------|----------------------------|
| can  | 1       | 0     | 1    | 1    | token=can, prev=Pronoun... |

Imitation learning for Part of Speech tagging

Mix (i.e. roll a die between) the expert policy and the previously learned classifier during rollin and rollout (sketched after the table below):

| word | Pronoun | Modal | Verb | Noun | features                   |
|------|---------|-------|------|------|----------------------------|
| can  | 1       | 0     | 2    | 1    | token=can, prev=Pronoun... |
| fly  | 1       | 1     | 0    | 1    | token=fly, prev=Verb...    |
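A hedged sketch of this mixture (DAgger/LOLS-style); the call signatures and the parameter `beta` are assumptions for illustration:

```python
import random

def mixed_policy(expert_policy, learned_policy, beta):
    """Follow the expert with probability beta, else the learned classifier."""
    def policy(x, prefix):
        if random.random() < beta:        # "roll a die" at every action
            return expert_policy(x, prefix)
        return learned_policy(x, prefix)
    return policy
```

Typically `beta` starts at 1 (pure expert) and is decayed across training iterations, so later rollins and rollouts increasingly reflect the system's own, imperfect predictions; that is how tagging *can* as Verb can end up with cost 2 in the table above, since the rollout then also mistags *fly*.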

Is it reinforcement learning?

Yes, in a sense: we train a policy to maximize rewards / minimize losses.

But learning is facilitated by an expert.

What about Recurrent Neural Networks?

They also predict a sequence of actions incrementally, and face similar problems (Ranzato et al., 2016):

  • trained at the word rather than sentence level
  • assume previous predictions are correct

Abstract meaning representation parsing

  • Designed for semantics-based MT
  • Many applications: summarization, generation, etc.
  • Long, complex action sequences (>100 steps, $10^4$ labels)

Comparison on AMR rel pre1.0

| system                 | precision | recall | F-score |
|------------------------|-----------|--------|---------|
| imitation learning     | 68        | 73     | 70      |
| Flanigan et al. (2014) | 52        | 66     | 58      |
| Werling et al. (2015)  | 59        | 66     | 62      |
| Peng et al. (2015)     | 57        | 59     | 58      |
| Artzi et al. (2015)    | 66        | 67     | 66      |
| Wang et al. (2015)     | 69        | 71     | 70      |
| Pust et al. (2015)     | –         | –      | 66      |

Natural Language Generation (NLG)

  • Reversed semantic parsing, similar to translation
  • Unlike MT, labeled data is rather limited
  • Evaluation with BLEU

Human evaluation

| system                    | SFO-hotel fluency | SFO-hotel inform | SFO-rest fluency | SFO-rest inform | BAGEL fluency | BAGEL inform |
|---------------------------|-------------------|------------------|------------------|-----------------|---------------|--------------|
| imitation learning        | 4.68              | 5.19             | 4.23             | 5.36            | 4.79          | 5.24         |
| Wen et al. (2015)         | 4.41              | 5.36             | 4.49             | 5.29            | –             | –            |
| Dusek and Jurcicek (2015) | –                 | –                | –                | –               | 5.15          | 4.53         |
### Questions? See our [EACL 2017 tutorial](https://sheffieldnlp.github.io/ImitationLearningTutorialEACL2017/) for more!
### Fact checking

> "The United Kingdom has ten times our number of immigrants", *Matteo Renzi*

[*True* or *False*?](http://factcheckeu.org/factchecks/show/868/matteo-renzi)

Part 2: Automated fact checking

Joint work with:

  • James Thorne (Sheffield)
  • Dhruv Ghulati, Rob Stojnic (Factmata)
  • Christos Christodoulopoulos, Arpit Mittal (Amazon)
  • Sebastian Riedel, Will Ferreira (UCL)
### Desiderata for automated fact-checking

  • Generalization to a variety of domains and relations
  • Verdict justification
  • Learning with little or no explicitly labeled training data

[Vlachos and Riedel (LT-CSS 2014)](http://www.aclweb.org/anthology/W/W14/W14-2508.pdf)

Fact checking simple numerical statements

Vlachos and Riedel (EMNLP 2015)

Knowledge base population evaluation

Lower MAPE (mean absolute percentage error) and higher Coverage are better.
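For reference, MAPE is the standard mean absolute percentage error between predicted and true values (notation ours, not from the talk):

$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat v_i - v_i}{v_i}\right|$

where $v_i$ is the true numerical value and $\hat v_i$ the predicted one, often reported as a percentage.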

Factmata

with Dhruv Ghulati and Sebastian Riedel

Emergent.info automation

Ferreira and Vlachos (NAACL 2016)

Emergent.info automation

Three-way classification: a headline can be for, against, or observing a claim.

We developed a classifier with manually engineered features, achieving 73% accuracy, 26% higher than a mature textual entailment system.
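As a toy illustration only (the published system uses richer, manually engineered features over the headline and the claim; nothing below reproduces it), the task can be set up as ordinary three-way text classification:

```python
# Toy sketch of three-way stance classification (for / against / observing).
# Data and features here are purely illustrative, not the Emergent dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

claim = "Led Zeppelin reunion cancelled"
headlines = [
    "Robert Plant ripped up an $800m Led Zeppelin reunion deal",    # for
    "Has Robert Plant really turned down a Led Zeppelin reunion?",  # observing
    "Led Zeppelin reunion is still on, the band says",              # against
]
stances = ["for", "observing", "against"]

texts = [claim + " ||| " + h for h in headlines]   # pair the claim with each headline
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(texts, stances)
print(model.predict(texts))
```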

The Fake News Challenge

Our paper evolved into the Fake News Challenge:

  • Extending to comparing claims with full articles
  • Extra class unrelated: the article discusses a different topic
  • 50 participating teams; the best system reached 82% weighted accuracy

### Fact Extraction and VERification

Emergent has 300 claims; we need more! Together with Amazon we are building FEVER:

  • 100,000 claims
  • constructed by mutating Wikipedia sentences
  • annotated as supported/refuted/unverified, with evidence from Wikipedia
  • both steps conducted manually

FEVER example

Shared task being planned for 2018

### Vision: Learn to imitate the human fact checkers

  • Fact checking is a complex task: the output might be true/false, but we also care about the justification (structure)
  • Human fact checkers take a (possibly) long sequence of actions to construct the justification
  • And they publish it for free, to benefit the public
  • Let's help them with AI!
### Questions?