Structured prediction in natural language processing with imitation learning

Andreas Vlachos
a.vlachos@sheffield.ac.uk

Department of Computer Science
University of Sheffield

Structure in Natural Language Processing

Sequential structures are very common:

I studied in Sheffield with Lucia Specia (tokens)
PRP VBD IN NNP IN NNP NNP (part-of-speech tags)
O O O LOC O PER PER (named-entity labels)
Εγώ σπούδασα στο Σέφιλντ με τη Λουσία Σπέσια (machine translation into Greek)

  • part of speech tagging
  • named entity recognition
  • machine translation

(More) structure in Natural Language Processing

as well as more complex graphs:

  • semantic parsing
  • natural language interfaces to databases
  • syntactic parsing

Incremental modeling

A classifier that predicts one label at a time, given the previous predictions (a minimal code sketch follows the list below): \begin{align} \hat y_1 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y;\mathbf{x}),\\ \hat y_2 &=\mathop{\arg \max}_{y \in {\cal Y}} f(y;\mathbf{x}, \hat y_1),\\ &\;\;\vdots\\ \hat y_N &=\mathop{\arg \max}_{y \in {\cal Y}} f(y;\mathbf{x}, \hat y_{1} \ldots \hat y_{N-1}) \end{align} so that $\mathbf{\hat y} = \hat y_{1} \ldots \hat y_{N}$.
  • use our favourite classifier
  • no restrictions on features
  • prone to error propagation (i.i.d. assumption broken)
  • local model not trained wrt the task-level loss
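
A minimal sketch of this greedy loop in Python; the `score` function and the feature template are placeholders standing in for whatever trained classifier is used:

```python
def greedy_decode(tokens, tagset, score):
    """Predict one label at a time, conditioning on the previous predictions.
    score(features, label) is a stand-in for any trained classifier."""
    predictions = []
    for n, token in enumerate(tokens):
        # features may freely use all previous predictions
        features = {"token": token,
                    "prev": predictions[n - 1] if n > 0 else "NULL"}
        # argmax over the label set, as in the equations above
        predictions.append(max(tagset, key=lambda y: score(features, y)))
    return predictions
```

An early mistake changes the features of every later decision, which is exactly the error propagation noted above.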

Joint modeling

A model (e.g. conditional random field) that scores complete outputs (e.g. label sequences): $$\mathbf{\hat y} =\hat y_{1}\ldots \hat y_{N} = \mathop{\arg \max}_{y_{1}\ldots y_{N} \in {\cal Y}^N} f(y_{1}\ldots y_{N};\mathbf{x})$$
  • no error propagation
  • exhaustive exploration of the search space
  • large/complex search spaces are challenging
  • efficient dynamic programming restricts modelling flexibility (i.e. Markov assumptions; see the Viterbi sketch after this list)
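
For contrast, a sketch of exact joint decoding with first-order Viterbi; the assumption that `score` sees only the previous tag is the Markov restriction from the last bullet:

```python
def viterbi_decode(tokens, tagset, score):
    """Exact search over all |tagset|**N sequences in O(N * |tagset|**2).
    score(tokens, n, prev_tag, tag) may only condition on the previous tag."""
    best = [{tag: (score(tokens, 0, None, tag), [tag]) for tag in tagset}]
    for n in range(1, len(tokens)):
        column = {}
        for tag in tagset:
            column[tag] = max(
                (prev_score + score(tokens, n, prev, tag), path + [tag])
                for prev, (prev_score, path) in best[-1].items())
        best.append(column)
    return max(best[-1].values())[1]  # highest-scoring complete sequence
```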

Imitation learning for Part of Speech tagging

Gold standard

Expert policy: at each word returns the correct PoS tag

a.k.a. dynamic oracle, or "data as demonstrator"
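
In code the expert policy is trivial; the policy interface used here (and reused in the later snippets) is my own assumption:

```python
def make_expert_policy(gold_tags):
    """Dynamic oracle for PoS tagging: whatever has been predicted so far,
    the cheapest next action is the gold tag at the current position."""
    def expert(tokens, position, predictions_so_far):
        return gold_tags[position]
    return expert
```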

Imitation learning for Part of Speech tagging

Standard classifier training (exact imitation of the expert)

word label features
I Pronoun token=I, prev=NULL...
can Modal token=can, prev=Pronoun...
fly Verb token=fly, prev=Modal...

Imitation learning for Part of Speech tagging

Labels as costs

word Pronoun Modal Verb Noun features
I 0 1 1 1 token=I, prev=NULL...
can 1 0 1 1 token=can, prev=Pronoun...
fly 1 1 0 1 token=fly, prev=Modal...
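
Under exact imitation this cost table is just the gold label expanded into a 0/1 vector, as in this trivial sketch:

```python
def exact_imitation_costs(gold_label, tagset):
    """Cost 0 for the gold label, cost 1 for every other action."""
    return {label: 0 if label == gold_label else 1 for label in tagset}
```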

Imitation learning for Part of Speech tagging

Breaking down action costing

  • rollin to get a trajectory through the sentence
  • try each possible label
  • rollout till the end
  • evaluate the complete output with the task loss
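
Putting these four steps together, a sketch of costing the actions at one position (same assumed interfaces as the earlier snippets, with the gold sequence passed to the task loss explicitly):

```python
def action_costs(tokens, gold, position, rollin_prefix, tagset,
                 rollout_policy, task_loss):
    """Cost every possible label at `position`: keep the rollin prefix fixed,
    try the label, roll out to the end of the sentence, then score the
    complete output against the gold standard with the task-level loss."""
    costs = {}
    for label in tagset:
        sequence = list(rollin_prefix) + [label]
        for n in range(position + 1, len(tokens)):
            sequence.append(rollout_policy(tokens, n, sequence))
        costs[label] = task_loss(sequence, gold)
    return costs
```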

Imitation learning for Part of Speech tagging

Breaking down action costing

Assuming that rollin and rollout follow the expert policy and that the loss is the number of incorrect tags, the correct label has cost 0 and every other label has cost 1.

Imitation learning for Part of Speech tagging

Mix (i.e. roll a die) the expert policy with the previously learned classifier during rollin and rollout (a sketch of the mixing follows the table below):

word Pronoun Modal Verb Noun features
can 1 0 2 1 token=can, prev=Pronoun...
fly 1 1 0 1 token=fly, prev=Verb...
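
One way to implement the die roll; the mixing probability `beta` and keeping it fixed are assumptions, not necessarily the schedule used in the talk:

```python
import random

def mixture_policy(expert, classifier, beta=0.5):
    """Follow the expert with probability beta and the previously learned
    classifier otherwise; the same mixture is used for rollin and rollout."""
    def policy(tokens, position, predictions_so_far):
        if classifier is None or random.random() < beta:
            return expert(tokens, position, predictions_so_far)
        return classifier(tokens, position, predictions_so_far)
    return policy
```

Rollouts with the learned classifier are what allow costs greater than 1 in the table above: a single wrong action can trigger further mistakes downstream, all of which are counted by the task loss.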

Generic imitation learning

\begin{align}
& \textbf{Input:} \; D_{train} = \{(\mathbf{x}^1,\mathbf{y}^1)\ldots(\mathbf{x}^M,\mathbf{y}^M)\}, \; \text{expert}\; \pi^{\star}, \; \text{loss function} \; \ell\\
& \textbf{Output:} \; \text{classifier} \; H\\
& \text{training examples}\; {\cal E} = \emptyset\\
& \mathbf{while}\; \text{termination condition not reached}\; \mathbf{do}\\
& \quad \text{set rollin policy} \; \pi^{in} = mix(H,\pi^{\star})\\
& \quad \text{set rollout policy} \; \pi^{out} = mix(H,\pi^{\star})\\
& \quad \mathbf{for} \; (\mathbf{x},\mathbf{y}) \in D_{train} \; \mathbf{do}\\
& \quad \quad \text{rollin to predict} \; \hat y_1\dots\hat y_N = \pi^{in}(\mathbf{x})\\
& \quad \quad \mathbf{for} \; \hat y_n \in \hat y_1\dots\hat y_N \; \mathbf{do}\\
& \quad \quad \quad \text{extract features}\; f=\phi(\mathbf{x},\hat y_1\dots \hat y_n)\\
& \quad \quad \quad \text{rollout to obtain costs}\; c \; \text{for all possible actions using}\; \ell\\
& \quad \quad \quad {\cal E} = {\cal E} \cup \{(f,c)\}\\
& \quad \text{learn} \; H\; \text{from}\; {\cal E}
\end{align}
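
An outline of this loop in Python for the tagging case; `make_expert`, `extract_features`, `learn_classifier` and the fixed mixing probability `beta` are assumed interfaces, and published algorithms (e.g. DAgger, SEARN, LOLS) differ in how they schedule the mixture and aggregate examples:

```python
import random

def imitation_learning(train_data, make_expert, extract_features, tagset,
                       task_loss, learn_classifier, iterations=5, beta=0.9):
    examples, H = [], None
    for _ in range(iterations):                       # termination condition
        for tokens, gold in train_data:
            expert = make_expert(gold)

            def mix(toks, n, prefix):                 # rollin/rollout policy
                if H is None or random.random() < beta:
                    return expert(toks, n, prefix)
                return H(toks, n, prefix)

            rollin = []                               # rollin trajectory
            for n in range(len(tokens)):
                rollin.append(mix(tokens, n, rollin))

            for n in range(len(tokens)):
                features = extract_features(tokens, n, rollin[:n])
                costs = {}
                for label in tagset:                  # rollout per action
                    seq = rollin[:n] + [label]
                    for m in range(n + 1, len(tokens)):
                        seq.append(mix(tokens, m, seq))
                    costs[label] = task_loss(seq, gold)
                examples.append((features, costs))
        H = learn_classifier(examples)                # cost-sensitive learning
    return H
```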

Biomedical event extraction

Input is sentences with protein names tagged

Biomedical event extraction

Predict the triggers

Biomedical event extraction

Predict the Themes if #triggers > 0

Biomedical event extraction

Predict the Causes if #Themes > 0

Biomedical event extraction

Non-decomposable loss function:

  • correct edges predicted for incorrect triggers are worse than predicting nothing
  • incorrect triggers without edges are ignored

Rollouts with the classifier help us learn how to handle these cases
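
To make those two bullets concrete, a purely illustrative loss along these lines; the event representation and the weights are assumptions, not the loss actually used in the system:

```python
def event_loss(gold_events, pred_events, edge_cost=1.0, spurious_cost=2.0):
    """gold_events / pred_events map a trigger index to (label, set of edges).
    Edges attached to an incorrect trigger cost more than predicting nothing,
    while an incorrect trigger with no edges is ignored."""
    loss = 0.0
    for trigger, (label, edges) in pred_events.items():
        gold = gold_events.get(trigger)
        if gold is not None and gold[0] == label:
            loss += edge_cost * len(edges ^ gold[1])   # wrong or missing edges
        else:
            loss += spurious_cost * len(edges)         # edges on a wrong trigger
    for trigger, (label, edges) in gold_events.items():
        if trigger not in pred_events or pred_events[trigger][0] != label:
            loss += edge_cost * len(edges)             # events missed entirely
    return loss
```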

Breaking it down

RollIn until the word "production"

Breaking it down

Try labeling "production" as PosReg

Breaking it down

RollOut with the (dynamic) expert (don't add edges): cost = 3

Breaking it down

RollOut with the classifier trained previously: cost = 6

Biomedical event extraction

Abstract Meaning Representation (AMR)

  • Designed for semantics-based machine translation
  • Many applications: summarization, generation, etc.
  • see recent tutorial by Schneider et al. (2015)

Results

What about Recurrent Neural Networks?

They also predict a sequence of actions incrementally:

and face similar problems (Ranzato et al., 2016; see the sketch after this list):

  • trained at the word level rather than the sentence level
  • assume previous predictions are correct (exposure bias)
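
A schematic of where exposure bias comes from; all names below are placeholders, not a real RNN API:

```python
def training_step(gold_words, rnn_step, word_loss, teacher_forcing=True):
    """Word-level training loop: with teacher forcing the network always
    conditions on the gold history, so it never sees its own mistakes."""
    state, previous, total = None, "<s>", 0.0
    for target in gold_words:
        state, predicted = rnn_step(state, previous)
        total += word_loss(predicted, target)                 # word-, not sentence-level
        previous = target if teacher_forcing else predicted   # exposure bias
    return total
```

Mixing gold and predicted histories during training, as in the imitation learning loop above, is one way to address both issues.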

Questions?