A Critical Review of ACL 2019 Best Long Paper

Causality

What motivated me to write this blog post is my re-read of this year’s ACL best long paper, Bridging the Gap between Training and Inference for Neural Machine Translation. When I learned it had received the honor, I went back to re-read it, trying to find the conceptual advance (which I had not perceived during my first skim) the paper makes on the issue of exposure bias, or train/test mismatch, which already has a rich literature.

I used to work on this issue, having been drawn to the conceptual fascination of the problem by Sam Wiseman’s Beam Search Optimization paper and then Mohammad Norouzi’s mind-sweeping RAML paper, and further appealed to by earlier works like Samy Bengio’s Scheduled Sampling and Marc’Aurelio Ranzato’s Sequence-level Training.

At that time, my first grasp of how to fix the troubling teacher forcing, i.e. maximum likelihood (MLE) training, of sequence-to-sequence models was:

Use statistics (word rank, probability nucleus) from the model’s probability scores \(P_\theta(\cdot \vert \hat{y}_{<t}, x)\) to construct a sample \(\hat{y}\) with a relatively high model score, while evaluating \(\hat{y}\) against \(y^*\) to get sequence-level supervision.
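
To make this concrete, here is a minimal sketch (not from the paper; `step_fn`, `bos_id`, `eos_id` and the unigram-F1 stand-in metric are my own placeholder choices) of rolling out a sample \(\hat{y}\) from the model’s step-wise distributions and scoring it against \(y^*\) to obtain a sequence-level signal.

```python
import torch

def sample_and_score(step_fn, y_star, bos_id, eos_id, max_len=50):
    """Roll out a sample y_hat from the model's own distributions
    P_theta(. | y_hat_{<t}, x), then score the whole sequence against y*.

    step_fn: callable taking the current prefix (list[int]) and returning a
             [V] tensor of logits -- a stand-in for one decoder step of a
             seq2seq model already conditioned on the source x (assumption).
    y_star:  list[int], reference token ids.
    Returns the sampled ids and a sequence-level reward (unigram F1 here,
    a placeholder for BLEU or any other metric of choice).
    """
    y_hat = [bos_id]
    for _ in range(max_len):
        probs = torch.softmax(step_fn(y_hat), dim=-1)   # P_theta(. | y_hat_{<t}, x)
        next_id = torch.multinomial(probs, num_samples=1).item()
        y_hat.append(next_id)
        if next_id == eos_id:
            break

    # Sequence-level supervision: compare the whole sample to the reference.
    hyp, ref = set(y_hat[1:]), set(y_star)
    overlap = len(hyp & ref)
    p = overlap / max(len(hyp), 1)
    r = overlap / max(len(ref), 1)
    reward = 2 * p * r / max(p + r, 1e-8)
    return y_hat, reward
```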

This conceptual framework is known as the Learning-to-Search (L2S) regime and can, to my knowledge, be traced back at least to Hal Daumé’s search-based optimization formulation of Structured Prediction (SP) problems (conference paper, and journal paper); at that time he already realised SP’s connection to Reinforcement Learning (RL), though he later tried to clarify their differences. The regime is called L2S because learning follows an inference-then-optimize paradigm, so that training simply matches the test condition.

Owing to SP’s connection to RL, a number of papers, including Sequence-level Training, Actor-Critic (AC), and GAN-like inverse-RL works, were then proposed by different groups. The main idea is to:

Close the gap between training and test by optimising the inference (decoding) process during training, that is, to optimise according to a certain on-policy reward, which can be metric-relevant (sentence-level, or decomposed to be step-wise) or even learned (by the Discriminator in a GAN or the Critic in AC-based RL).
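
The “decomposed to be step-wise” option can be read as reward shaping: the per-step reward is the increment of the sentence-level metric when the t-th token is appended (as done, for instance, in the actor-critic line of work). A small sketch under that reading, where the `metric` callable is an arbitrary placeholder:

```python
def stepwise_rewards(y_hat, y_star, metric):
    """Decompose a sentence-level metric into per-step rewards by taking
    increments: r_t = metric(y_hat_{<=t}, y*) - metric(y_hat_{<t}, y*).
    The rewards telescope, so their sum equals metric(y_hat, y*) minus the
    score of the empty prefix; only the credit assignment becomes step-wise.

    y_hat, y_star: token id lists; metric: callable(prefix, reference) -> float.
    """
    rewards, prev = [], metric([], y_star)
    for t in range(1, len(y_hat) + 1):
        cur = metric(y_hat[:t], y_star)
        rewards.append(cur - prev)
        prev = cur
    return rewards
```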

In fact, most such algorithms or models can be reformulated in the following mathematical form:

\[\nabla_\theta = Q(\hat{y}_t; \hat{y}, x, y^*) \cdot \nabla_\theta P_\theta(\hat{y}_t \vert x, \hat{y}_{<t})\]
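
In most implementations this appears in its score-function (REINFORCE-style) variant, with \(\nabla_\theta \log P_\theta\) in place of \(\nabla_\theta P_\theta\) (the two differ only by a factor of \(1/P_\theta\), which can be absorbed into \(Q\)). A minimal PyTorch sketch of that variant as a surrogate loss; the tensor names are my own placeholders:

```python
import torch

def policy_gradient_loss(log_probs, q_values):
    """Surrogate loss whose gradient matches the score-function form:
    sum_t Q(y_hat_t) * grad_theta log P_theta(y_hat_t | x, y_hat_{<t}).

    log_probs: [T] tensor, log P_theta(y_hat_t | x, y_hat_{<t}) of the
               sampled tokens, still attached to the computation graph.
    q_values:  [T] tensor of Q terms (a sequence-level reward broadcast to
               all steps, shaped step-wise rewards, or a critic's estimates);
               detached so gradients flow only through the policy.
    """
    return -(q_values.detach() * log_probs).sum()

# Usage sketch (assuming log_probs were collected during an autoregressive decode):
# loss = policy_gradient_loss(log_probs, q_values)
# loss.backward()   # yields the Q * grad log P update for the sampled y_hat
```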