Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation

Bryan Eikema and Wilker Aziz


Neural Machine Translation

We give an NMT model some source-language text x, and it predicts the probability that any target-language text y is a translation of x.

Another way of saying this is: given a source sentence, NMT predicts a probability distribution over translation candidates.


Distribution over Translation Candidates

You can imagine such an object as a bar plot:

The 10 most probable translation candidates of a given sentence, ordered by probability. The 3 most probable candidates are clearly inadequate, essentially incomplete translations. Although these are the most probable candidates, they account for less than 10 percent of the probability mass. It is fair to conclude they are rather rare, despite being the most probable options available.
Most probable candidates and their probabilities

  • For NMT, any sequence y made of known target-language tokens and ending in a special end-of-sequence symbol is a valid translation candidate.

Quiz

Let's analyse this example for a bit longer:
(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

  • What is the most probable translation (i.e., the mode of the distribution)?
  • What is the probability that a translation should be non-empty?
  • What is the probability that a translation should contain the word mode?

Deciding under Uncertainty

We tend to think of NMT models as predicting the correct translation of x, but, as far as the model is concerned, there is no such thing as a single correct translation.

NMT packs its knowledge into an entire distribution over candidates. To pick a translation, we (not the model) decide to place all of our bets on a single outcome (e.g., the mode).

  • To decide under uncertainty, we need a criterion (i.e., a decision rule).
  • An NMT model is not a decision rule; it cannot tell you how to decide.
  • But we can use the uncertainty NMT quantifies to make an informed decision.

MAP Decoding

The most common decision rule in NMT is known as maximum-a-posteriori (MAP) decoding. It tells us to pick the mode of the distribution, no matter how improbable it is.

(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

  • MAP decoding picks: </s>
MAP decoding is a misnomer in the context of NMT, since NMT does not employ a prior over translations and thus does not require posterior inference.
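
In symbols, and with the same notation used later for MBR, MAP decoding solves:

y^\star = \operatorname*{argmax}_{h \in \mathcal Y} ~ p(h|x, \theta)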

Inadequacy of the Mode

The mode of the distribution is the single most probable outcome. Yet, in a large enough sample space, the mode may be extremely rare.

  • Modes in NMT are oftentimes as rare as one in millions.
  • NMT models store the statistics/patterns they learn from training data in the distributions they predict, not in any one specific outcome.
  • The mode can only be a good summary of an entire distribution when an NMT model has no reason to be uncertain.
    • Uncertainty is unavoidable: ambiguity in natural language, lack of context, change in domain, lack of training data, etc.

Beliefs

While no single outcome is more probable than the mode, there are many patterns that are far more probable than the mode.

(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

It's fair to claim that the model does not really want an empty translation, that "mode" is preferred to "fashion", that we need an adjective, etc.


A Model as a Representation of What's Known

If you had to decide between </s> and the mode isn't adequate </s>, and all you knew about language and translation is what an NMT model tells you:
(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

What would you pick and why?


Utility

If we interpret translation candidates as atomic and unrelated outcomes, all NMT does is express a preference over complete translations. This preference is oftentimes very weak.

If we instead interpret them as combinatorial structures, we can appreciate their structural similarity (e.g., some translations are equally long, make similar word choices, or use similar word order).

A utility function quantifies this similarity in a way that matters for a decision maker.

  • We say that u(y, h; x) quantifies the benefit of choosing h as the translation of x when y is known to be a plausible translation of it.
  • Examples: edit distance, ChrF, BEER, TER, COMET, human judgment, etc.
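
As a concrete illustration (not part of the original slides), here is a minimal sketch of a utility function that wraps sentence-level ChrF from the sacrebleu package; the helper name chrf_utility is our own:

```python
import sacrebleu

def chrf_utility(y: str, h: str) -> float:
    """u(y, h; x): benefit of choosing hypothesis h when y is assumed to be a plausible translation.

    Sentence-level ChrF ranges from 0 to 100; higher means h shares more character n-grams with y.
    """
    return sacrebleu.sentence_chrf(h, [y]).score
```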

Uncertainty About Utility

When deciding whether or not h is a reasonable translation of x, we do not have access to translations we already know to be reasonable choices.

But we have NMT models that give an approximate view of what good choices look like.


Expected Utility

If all I know is that y translates x with probability p(y|x, θ), then my expectation of h's utility is the weighted average utility against every valid translation under the model:

In technical terms we have,

\begin{aligned}
\overbrace{\mu_u(h; x, \theta)}^{\text{expected utility of }h} &= p(y^{(1)}|x, \theta)\, \overbrace{u(y^{(1)}, h; x)}^{\text{utility wrt }y^{(1)}} + p(y^{(2)}|x, \theta)\, \overbrace{u(y^{(2)}, h; x)}^{\text{utility wrt }y^{(2)}} + \cdots \\
&= \sum_{y \in \mathcal Y} p(y|x, \theta)\, u(y, h; x) \qquad \text{also denoted by } \mathbb E[u(Y, h; x) \mid \theta]
\end{aligned}

where each and every possible translation is, in turn and with some probability, assumed to be the reference translation.
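
As a minimal sketch (ours, assuming a finite list of pseudo-references with their probabilities, like the table on the next slide), expected utility is just this weighted sum:

```python
def expected_utility(h, weighted_refs, utility):
    """sum_y p(y|x) * u(y, h; x) over an enumerated set of pseudo-references.

    weighted_refs: iterable of (y, p_y) pairs
    utility:       a function u(y, h) -> float, e.g. the chrf_utility sketch above
    """
    return sum(p_y * utility(y, h) for y, p_y in weighted_refs)

# e.g., with the values from the example table (truncated to the first two rows):
# expected_utility("</s>", [("</s>", 0.0645), ("the mode </s>", 0.0605)], chrf_utility)
```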


Example

Let's judge two candidates: </s> and the mode isn't adequate </s>.

For utility, we will use ChrF, which values candidates that match character n-grams of a good translation.

h                             y                                 p(y|x)    u(y, h;x)    p(y|x) * u(y, h;x)
----------------------------  ------------------------------  --------  -----------  --------------------
</s>                          </s>                              0.0645       100.00                  6.45
                              the mode </s>                     0.0605        29.71                  1.80
                              the mode is </s>                  0.0477        24.93                  1.19
                              the mode is inadequate </s>       0.0469        13.84                  0.65
                              the mode is not adequate </s>     0.0441        13.25                  0.58
                              the mode is awkward </s>          0.0412        15.97                  0.66
                              the mode is empty </s>            0.0397        17.79                  0.71
                              the mode is deficient </s>        0.0390        14.48                  0.56
                              the mode is poor </s>             0.0359        18.87                  0.68
                              the fashion isn't fitting </s>    0.0342        12.21                  0.42
                              [...]
                              [SUM]                                                                 25.92

the mode isn't adequate </s>  </s>                              0.0645        37.93                  2.45
                              the mode </s>                     0.0605        58.62                  3.55
                              the mode is </s>                  0.0477        62.16                  2.96
                              the mode is inadequate </s>       0.0469        77.17                  3.62
                              the mode is not adequate </s>     0.0441        82.98                  3.66
                              the mode is awkward </s>          0.0412        45.80                  1.89
                              the mode is empty </s>            0.0397        49.20                  1.96
                              the mode is deficient </s>        0.0390        44.47                  1.73
                              the mode is poor </s>             0.0359        49.81                  1.79
                              the fashion isn't fitting </s>    0.0342        23.08                  0.79
                              [...]
                              [SUM]                                                                 40.87

Minimum Bayes Risk Decoding

MBR decoding tells us to choose the candidate whose expected utility is maximum:

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in \mathcal Y} ~ \mathbb E\left[u(Y, h; x) \mid \theta \right] \end{aligned}

  • Decision maker: chooses the utility function u
  • NMT model: contributes beliefs (i.e., the probability of y given x for every possible y)
  • Search algorithm: enumerates candidate translations h ∈ 𝒴

But, we don't need to pick a utility function when we use MAP decoding, right?

Actually, MBR decoding with u(y, h; x) = 1 if, and only if, y = h (and 0 otherwise) is exactly MAP decoding.

So, when we make decisions via MAP decoding, we implicitly decide via MBR where our utility function is the exact match function.
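
To see why, note that with this exact-match utility the expected utility of a hypothesis collapses to its model probability,

\mathbb E[u(Y, h; x) \mid \theta] = \sum_{y \in \mathcal Y} p(y|x, \theta)\, \mathbb 1[y = h] = p(h|x, \theta)

so maximising expected utility is the same as maximising p(h|x, θ).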


Intractabilities of MBR Decoding

There are two sources of intractability in MBR decoding

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in \mathcal Y} ~ \mathbb{E}[u(Y, h; x) \mid \theta] \end{aligned}

  • The objective function (expected utility) requires an intractable sum
    this is different from MAP decoding, where the objective is tractable
  • The hypothesis space 𝒴 is unbounded
    just like in MAP decoding

Summarising the Model's Beliefs

The space of all translation candidates is unbounded, making it impossible for us to exactly compute the expected utility of any given candidate.

But expectations can be estimated in a principled manner via Monte Carlo.

We use the sample mean

\hat \mu_u(h; x, \theta) = \frac{1}{S} \sum_{s=1}^S u(y^{(s)}, h; x)

where y^{(s)} is sampled from the model with probability p(y^{(s)}|x, θ).
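
A minimal sketch of this estimator (ours), assuming we already hold a list of samples drawn from the model:

```python
def estimate_expected_utility(h, samples, utility):
    """Monte Carlo estimate (1/S) * sum_s u(y^(s), h; x), with y^(s) ~ p(y|x, theta).

    samples: a list of S translations drawn from the model (pseudo-references)
    utility: a function u(y, h) -> float
    """
    return sum(utility(y, h) for y in samples) / len(samples)
```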


What's a Sample?

Think of the NMT model as a bag of tokens, where each token is a translation: if you put your hand in and draw a token, there is a probability p(y|x, θ) that you will draw y.

  • Drawing samples like that is easy in NMT because of the way the model decomposes the probability of a complete sequence as a product of probabilities, one for each target word in context, from left to right (see the sketch below).
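
A sketch of such ancestral sampling (ours); next_token_distribution is a hypothetical hook into the NMT model that returns the next-token probabilities given the source and the prefix generated so far:

```python
import random

EOS = "</s>"  # end-of-sequence symbol

def sample_translation(x, next_token_distribution, max_len=100):
    """Draw y ~ p(y|x, theta) token by token, left to right."""
    y = []
    while len(y) < max_len:
        # assumed to return a dict {token: probability} for the next position
        dist = next_token_distribution(x, y)
        token = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
        y.append(token)
        if token == EOS:
            break
    return " ".join(y)
```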

Example

Let's judge two candidates: </s> and the mode isn't adequate </s>.

We will estimate each candidate's expected utility using 20 samples from the model.

For utility, we will use ChrF.


Monte Carlo Estimation of Expected Utility (I)

h                             y ~ Y|x                                   u(y, h;x)
----------------------------  --------------------------------------  -----------
</s>                          the mode is deficient </s>                    14.48
                              the mode is very probable </s>                12.71
                              isn't a thing </s>                            21.48
                              uncool </s>                                   32.88
                              the mode is </s>                              24.93
                              the mode is </s>                              24.93
                              </s>                                         100.00
                              yes ! </s>                                    41.86
                              uncool </s>                                   32.88
                              the mode is actually rare </s>                12.71
                              </s>                                         100.00
                              the mode is what is is </s>                   15.19
                              uncool mode </s>                              23.07
                              the mode is inadequate </s>                   13.84
                              the mode is inadequate </s>                   13.84
                              the mode is inadequate </s>                   13.84
                              the mode is awkward </s>                      15.97
                              the mode is strange </s>                      15.97
                              the fashion isn't fitting </s>                12.21
                              the mode is awkward </s>                      15.97

                              [AVG]                                         27.94

Monte Carlo Estimation of Expected Utility (II)

h                             y ~ Y|x                                   u(y, h;x)
----------------------------  --------------------------------------  -----------
the mode isn't adequate </s>  what ? </s>                                   21.01
                              the mode is </s>                              62.16
                              the mode is awkward </s>                      45.80
                              the mode is deficient </s>                    44.47
                              fashion isn't a thing </s>                    29.31
                              is the </s>                                   35.67
                              the mode is inadequate </s>                   77.17
                              the is the </s>                               33.96
                              uncool </s>                                   18.16
                              </s>                                          37.93
                              what ? </s>                                   21.01
                              the mode is </s>                              62.16
                              the mode </s>                                 58.62
                              the mode is very probable </s>                40.76
                              the mode is what is is </s>                   43.96
                              mode is strange </s>                          38.55
                              the mode is awkward </s>                      45.80
                              the mode </s>                                 58.62
                              sometimes NMT does strange things </s>        14.09
                              the fashion isn't fitting </s>                23.08

                              [AVG]                                         40.61

Revisiting Intractabilities of MBR Decoding

There are two sources of intractability in MBR decoding

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in \mathcal Y} ~ \mathbb{E}[u(Y, h; x) \mid \theta] \end{aligned}

  • The objective function (expected utility) requires an intractable sum
    but unbiased estimation is tractable via MC
  • The hypothesis space 𝒴 is unbounded
    but, like in MAP decoding, we can use a small subset

Hypothesis Space

Ideally, we want to search through a small space of hypotheses which mostly contains candidates that are highly beneficial in expectation.

  • This sort of approximately best-first enumeration is what beam search does for MAP decoding.
  • Unlike probability, expected utility cannot be computed incrementally from left-to-right, thus approximate best-first enumeration is a lot more difficult.
  • Some tractable heuristics:
    • a set of unbiased samples
    • a set of samples aimed at high-probability outcomes (e.g., nucleus sampling)
    • the most probable outcomes (output of k-best beam search)

Sampling-Based MBR: MBR N×N

Obtain N independent samples from an NMT model

  • use unique samples as candidates
  • use all samples as pseudo-references
  • share samples across candidates for efficiency

That is,

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in (y^{(1)}, \ldots, y^{(N)})} ~ \frac{1}{N} \sum_{n=1}^N u(y^{(n)}, h; x) \qquad y^{(n)} \sim Y \mid \theta, x \end{aligned}
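
Putting the pieces together, a minimal sketch of MBR N×N (ours): the same N samples serve both as candidates and as shared pseudo-references, costing O(N²) utility evaluations:

```python
def mbr_nxn(samples, utility):
    """Sampling-based MBR: among the unique samples, pick the one whose Monte Carlo
    estimate of expected utility (over all N samples) is highest."""
    def estimate(h):
        return sum(utility(y, h) for y in samples) / len(samples)
    return max(set(samples), key=estimate)
```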

Shared Samples

y ~ Y|x                           u(y, "</s>";x)    u(y, "the mode isn't adequate </s>";x)
------------------------------  ----------------  ----------------------------------------
the mode is deficient </s>                 14.48                                     44.47
the mode is very probable </s>             12.71                                     40.76
isn't a thing </s>                         21.48                                     38.07
uncool </s>                                32.88                                     18.16
the mode is </s>                           24.93                                     62.16
the mode is </s>                           24.93                                     62.16
</s>                                      100.00                                     37.93
yes ! </s>                                 41.86                                     19.55
uncool </s>                                32.88                                     18.16
the mode is actually rare </s>             12.71                                     42.80
</s>                                      100.00                                     37.93
the mode is what is is </s>                15.19                                     43.96
uncool mode </s>                           23.07                                     29.39
the mode is inadequate </s>                13.84                                     77.17
the mode is inadequate </s>                13.84                                     77.17
the mode is inadequate </s>                13.84                                     77.17
the mode is awkward </s>                   15.97                                     45.80
the mode is strange </s>                   15.97                                     50.25
the fashion isn't fitting </s>             12.21                                     23.08
the mode is awkward </s>                   15.97                                     45.80

[AVG]                                      27.94                                     44.60

MBR N×N

Unlike approximate MAP decoding, approximate MBR decoding improves with computation.


Limitations

It's difficult to explore a large sample space because the decoder runs in time O(N² × U), where U is the time for a single assessment of utility.

Strategy: disentangle the sample size from the search space.

The sample size controls the variance of our estimates of expected utility; a large search space increases our chances of enumerating good and bad translations, but expected utility is robust to inadequate samples.
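
A sketch of this disentangling (ours): candidates and pseudo-references come from two different sets, so the cost is O(N × S) utility calls, with S kept small (e.g., 100):

```python
def mbr_nxs(candidates, references, utility):
    """candidates: N hypotheses (e.g., a large pool of samples or beam-search outputs)
    references: S samples from the model, used only to estimate expected utility"""
    def estimate(h):
        return sum(utility(y, h) for y in references) / len(references)
    return max(candidates, key=estimate)
```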


MBR N×S

  • bigger search space helps
  • MC estimation seems robust enough with 100 samples

Limitation

Utility functions can be slow: they may require external tools, complex alignment algorithms, or expensive neural models.

Strategy: prune bad candidates using a proxy utility.


MBR C2F

Rank using expected skip-bigram F1, filter down to T candidates, then re-rank those using expected BEER.
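
A sketch of the coarse-to-fine idea (ours); proxy_utility and fine_utility stand in for expected skip-bigram F1 and BEER, which we do not implement here:

```python
def mbr_coarse_to_fine(candidates, references, proxy_utility, fine_utility, T=20):
    """Rank all candidates with a cheap proxy utility, keep the top T,
    then re-rank only those T with the expensive utility."""
    def estimate(h, utility):
        return sum(utility(y, h) for y in references) / len(references)
    top_t = sorted(candidates, key=lambda h: estimate(h, proxy_utility), reverse=True)[:T]
    return max(top_t, key=lambda h: estimate(h, fine_utility))
```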


Comparison

MBR is robust

  • it works with large search spaces
  • it improves the beam ranking
  • we can combine enumeration strategies for a boost

Remarks

  • Probability is tractable but a poor proxy for utility (it requires ad-hoc patches).
  • Expected utility is intractable, but principled estimation is simple and affordable.
  • Beam search enumerates candidates in approximately best-first order,
    which is hard to do for MBR decoding. But we are working on it :D
  • It's possible (even beneficial) to bias the search space towards high-probability candidates (e.g., via beam search or nucleus sampling).
    Despite this finding, relying on high-probability candidates alone is risky (especially out of domain).
  • MBR gives us an additional knob to express qualitative values: the utility function.

Overview of Pre-Print

  • MBR is robust
  • but MBR is slow
  • we speed it up considerably, making it linear in the size of the candidate set
  • we test various utilities (and BEER comes out best)
  • we combine it with other approximations, such as beam search and nucleus sampling, as they find good (and small) hypothesis spaces

Beyond the pre-print (or what we are up to)

  • Approximate best-first search for MBR
  • MBR for neural utilities
  • Utilities that control for certain attributes

Thanks!


Some additional slides


Hypothetical Q&A

  • I've sampled from NMT before, it didn't look good. How about that?
    • A sample is not a decision; it's a summary of the model's beliefs expressed in data space. Unless the model is extremely confident, a single sample is unlikely to be a good translation.
  • I've obtained lots of samples from NMT before, then ranked them by probability, that didn't look good either. How about that?
    • That's actually fine too. Unless the model is extremely confident, model probability is just not a good measure of overall utility.
  • Wait, are you telling me to sample or not?
    • Yes, but sampling is something we do to gather information; we still need to decide and, for that, we need to pick a utility function.

Why does it work?

Let's illustrate with a simple problem in continuous space: deciding under uncertainty when the model is a mixture of two Gaussians.


Why does it work?

I will try to illustrate it with a toy example (mixture of 2 Gaussians).

  • The MAP solution is the candidate h with highest probability density.
  • MAP assigns value to h regardless of how much probability is distributed to outcomes that are similar to it.


Why does it work?

Now I introduce a utility function (rbf) and compute its value in expectation.

  • The MBR solution is the candidate h with highest expected utility.
  • Expected utility is sensitive to how much probability is distributed over similar outcomes. In this case, "similar" means rbf-similar.

The radial basis function (rbf) exponentiates the negative of the squared distance.
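
A small numerical sketch of this toy setting (ours; the mixture weights, means, and the rbf length-scale are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n):
    """Draw n samples from a mixture of two Gaussians (toy stand-in for the model)."""
    first = rng.random(n) < 0.6
    return np.where(first, rng.normal(-2.0, 0.3, n), rng.normal(2.0, 1.5, n))

def rbf(y, h, scale=1.0):
    """Utility: exponentiated negative squared distance between y and h."""
    return np.exp(-((y - h) ** 2) / scale)

samples = sample_mixture(10_000)                # pseudo-references y ~ model
grid = np.linspace(-5.0, 5.0, 201)              # candidate decisions h
expected = np.array([rbf(samples, h).mean() for h in grid])
print("MBR choice (max expected rbf utility):", grid[expected.argmax()])
```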

Why does it work?

MBR re-expresses probability in terms of utility. The decision maker can use the utility to express preferences. For example, if we bypass rbf's exponentiation, we allow outcomes that are farther from h to still exert a lot of influence on our decisions.


Can it break?

Yes, of course. Here I show a utility function that is not very discriminative: it considers h beneficial with respect to y even when y is relatively far from h.


Overview of Decoders

Decoder               Objective Function                 Hypothesis Space
--------------------  ---------------------------------  ------------------------
MAP decoding          probability                        all sentences
- beam search         probability (adjusted for length)  most probable candidates
MBR decoding          expected utility                   all sentences
- sampling-based MBR  MC estimate of expected utility    probable candidates
