Let's judge two candidates: </s> and the mode isn't adequate </s>.
For utility, we will use ChrF, which rewards candidates that match the character n-grams of a good translation.
h                             y                               p(y|x)   u(y, h; x)   p(y|x) * u(y, h; x)
----------------------------  ------------------------------  -------  -----------  --------------------
</s>                          </s>                            0.0645       100.00                  6.45
                              the mode </s>                   0.0605        29.71                  1.80
                              the mode is </s>                0.0477        24.93                  1.19
                              the mode is inadequate </s>     0.0469        13.84                  0.65
                              the mode is not adequate </s>   0.0441        13.25                  0.58
                              the mode is awkward </s>        0.0412        15.97                  0.66
                              the mode is empty </s>          0.0397        17.79                  0.71
                              the mode is deficient </s>      0.0390        14.48                  0.56
                              the mode is poor </s>           0.0359        18.87                  0.68
                              the fashion isn't fitting </s>  0.0342        12.21                  0.42
                              [...]
                              [SUM]                                                               25.92
the mode isn't adequate </s>  </s>                            0.0645        37.93                  2.45
                              the mode </s>                   0.0605        58.62                  3.55
                              the mode is </s>                0.0477        62.16                  2.96
                              the mode is inadequate </s>     0.0469        77.17                  3.62
                              the mode is not adequate </s>   0.0441        82.98                  3.66
                              the mode is awkward </s>        0.0412        45.80                  1.89
                              the mode is empty </s>          0.0397        49.20                  1.96
                              the mode is deficient </s>      0.0390        44.47                  1.73
                              the mode is poor </s>           0.0359        49.81                  1.79
                              the fashion isn't fitting </s>  0.0342        23.08                  0.79
                              [...]
                              [SUM]                                                               40.87
Minimum Bayes Risk Decoding
MBR decoding tells us to choose the candidate whose expected utility is maximum:
$$y^\star = \operatorname{argmax}_{h \in \mathcal{Y}} \mathbb{E}[u(Y, h; x) \mid \theta]$$
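Written out, the expectation is a sum over all translations, weighted by the model's beliefs:

$$\mathbb{E}[u(Y, h; x) \mid \theta] = \sum_{y \in \mathcal{Y}} p_{Y|X}(y \mid x, \theta) \, u(y, h; x)$$

This is exactly the quantity tabulated in the worked example above: each row contributes p(y|x) * u(y, h; x), and [SUM] accumulates the products.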
Decision maker: chooses the utility function u
NMT model: contributes beliefs (i.e., the probability of y given x for every possible y)
Search algorithm: enumerates candidate translations h∈Y, assesses their expected utility, and picks the best
An Origin Story
Consider the exact-match utility 1_y(h), which is 1 when y and h are the same and 0 otherwise. Let's compute its expected value under the model:

$$\mathbb{E}[\mathbb{1}_Y(h) \mid \theta] = \sum_{y \in \mathcal{Y}} p_{Y|X}(y \mid x, \theta) \, \mathbb{1}_y(h) = p_{Y|X}(h \mid x, \theta)$$

What do we get if we solve $\operatorname{argmax}_{h \in \mathcal{Y}} p_{Y|X}(h \mid x, \theta)$? MAP decoding.
When we decide via MAP decoding, we implicitly decide via MBR using a utility function that treats translations as atomic categories: anything short of an exact match earns no credit.
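A tiny sketch makes this concrete: under exact match, a candidate's expected utility collapses to its model probability, so the two argmax problems coincide. The probabilities below are taken from the worked example above.

```python
# Exact-match utility: 1 iff candidate h equals outcome y.
def exact_match(y, h):
    return float(y == h)

# A few of the model's beliefs from the worked example.
p = {"</s>": 0.0645, "the mode </s>": 0.0605, "the mode is inadequate </s>": 0.0469}

# Expected exact-match utility of h is just p(h|x) ...
expected_u = {h: sum(p_y * exact_match(y, h) for y, p_y in p.items()) for h in p}
assert all(abs(expected_u[h] - p[h]) < 1e-12 for h in p)

# ... so maximising expected utility is maximising probability (MAP decoding).
print(max(p, key=expected_u.get))  # "</s>"
```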
Intractabilities of MBR Decoding
There are two sources of intractability in MBR decoding:

$$y^\star = \operatorname{argmax}_{h \in \mathcal{Y}} \mathbb{E}[u(Y, h; x) \mid \theta]$$

The objective function (expected utility) requires an intractable sum; this is different from MAP decoding, where the objective is tractable.
The hypothesis space Y is unbounded, just like in MAP decoding.
Summarising the Model's Beliefs
The space of all translation candidates is unbounded, making it impossible to compute the expected utility of any given candidate exactly.
But expectations can be estimated in a principled manner via Monte Carlo.
We use the sample mean

$$\hat{\mu}_u(h; x, \theta) = \frac{1}{S} \sum_{s=1}^{S} u(y^{(s)}, h; x)$$

where each $y^{(s)}$ is sampled from the model with probability $p(y^{(s)} \mid x, \theta)$.
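As a minimal sketch, here is the estimator in Python. The utility below is a character-bigram F1, a crude stand-in for ChrF (the real ChrF uses higher-order n-grams and different weighting); the samples are assumed to come from the model.

```python
from collections import Counter

def char_ngrams(s, n=2):
    """Multiset of character n-grams of a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def utility(y, h, n=2):
    """Character-bigram F1: a crude stand-in for ChrF."""
    gy, gh = char_ngrams(y, n), char_ngrams(h, n)
    if not gy or not gh:
        return float(y == h)
    overlap = sum((gy & gh).values())
    precision = overlap / sum(gh.values())
    recall = overlap / sum(gy.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def mc_expected_utility(h, samples):
    """Sample mean (1/S) * sum_s u(y^(s), h): an unbiased estimate
    of the expected utility of candidate h."""
    return sum(utility(y, h) for y in samples) / len(samples)
```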
What's a Sample?
Think of the NMT model as a bag of tokens, where each token is a translation: if you put your hand in and draw a token, there is a probability p(y|x, θ) that you will draw y.
Drawing samples like that is easy in NMT because the model decomposes the probability of a complete sequence into a product of probabilities, one for each target word in context, from left to right:

$$Y_j \mid \theta, x, y_{<j} \sim \mathrm{Cat}(f(x, y_{<j}; \theta))$$
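A minimal sketch of this left-to-right (ancestral) sampling loop; `model.next_token_probs` is a hypothetical hook standing in for f(x, y_<j; θ), returning the vocabulary and the categorical probabilities for the next token.

```python
import random

def ancestral_sample(model, x, eos="</s>", max_len=100):
    """Draw one unbiased sample y ~ p(Y | x, θ) by sampling each token
    Y_j ~ Cat(f(x, y_<j; θ)) from left to right until </s> (or max_len)."""
    y = []
    for _ in range(max_len):
        # Hypothetical API: next-token distribution given source and prefix.
        vocab, probs = model.next_token_probs(x, y)
        token = random.choices(vocab, weights=probs, k=1)[0]
        y.append(token)
        if token == eos:
            break
    return y
```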
Example
Let's judge two candidates: </s> and the mode isn't adequate </s>.
We will estimate each candidate's expected utility using 20 samples from the model.
For utility, we will use ChrF.
Monte Carlo Estimation of Expected Utility (I)
h                             y ~ Y|x                                  u(y, h; x)
----------------------------  ---------------------------------------  ----------
</s>                          the mode is awkward </s>                      15.97
                              the mode is not very probable </s>           11.33
                              the fashion isn't fitting </s>               12.21
                              the mode is </s>                             24.93
                              mode is weird </s>                           21.48
                              is </s>                                      58.01
                              the mode is empty </s>                       17.79
                              the mode is not adequate </s>                13.25
                              the mode is poor </s>                        18.87
                              NMT does strange things </s>                 13.25
                              rare rare rare rare ! </s>                   15.19
                              sometimes NMT does strange things </s>        9.59
                              mode mode mode mode </s>                     15.97
                              fashionable </s>                             21.48
                              the mode is deficient </s>                   14.48
                              the mode </s>                                29.71
                              the the the the the the the </s>             12.71
                              mode is not cool </s>                        18.87
                              the mode is deficient </s>                   14.48
                              the mode is not adequate </s>                13.25
                              [AVG]                                        18.64
Monte Carlo Estimation of Expected Utility (II)
h                             y ~ Y|x                             u(y, h; x)
----------------------------  ----------------------------------  ----------
the mode isn't adequate </s>  the mode is poor </s>                    49.81
                              modes aren't adequate </s>               69.07
                              the mode is awkward </s>                 45.80
                              the mode is awkward </s>                 45.80
                              fashion isn't a thing </s>               29.31
                              cool </s>                                18.05
                              unfashionable </s>                       21.98
                              the mode is awkward </s>                 45.80
                              the mode is awkward </s>                 45.80
                              fashion isn't a thing </s>               29.31
                              the mode is inadequate </s>              77.17
                              the mode is inadequate </s>              77.17
                              the is the </s>                          33.96
                              the mode </s>                            58.62
                              the mode is a mode </s>                  55.00
                              the mode is deficient </s>               44.47
                              the mode is actually rare </s>           42.80
                              aren't adequate </s>                     75.46
                              the mode is not very probable </s>       41.56
                              the mode is not very probable </s>       41.56
                              [AVG]                                    47.42
Revisiting Intractabilities of MBR Decoding
There are two sources of intractability in MBR decoding:

$$y^\star = \operatorname{argmax}_{h \in \mathcal{Y}} \mathbb{E}[u(Y, h; x) \mid \theta]$$

The objective function (expected utility) requires an intractable sum, but unbiased estimation is tractable via MC.
The hypothesis space Y is unbounded, but, like in MAP decoding, we can use a small subset.
Hypothesis Space
Ideally, we want to search through a small space of hypotheses which mostly contains candidates that are highly beneficial in expectation.
This sort of approximately best-first enumeration is what beam search does for MAP decoding.
A candidate's expected utility cannot be computed incrementally from left to right, so approximate best-first enumeration is much more difficult.
Some tractable heuristics:
a set of unbiased samples
a set of samples aimed at high-probability outcomes (e.g., nucleus sampling)
the most probable outcomes (output of k-best beam search)
Unlike approximate MAP decoding, approximate MBR decoding improves with computation.
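Putting the pieces together, a minimal sketch of the resulting decoder, reusing the `mc_expected_utility` estimator sketched earlier; note that the candidate set and the sample set are separate arguments.

```python
def mbr_decode(candidates, samples):
    """Return the candidate with the highest Monte Carlo estimate of
    expected utility. `candidates` is the hypothesis space (unbiased
    samples, nucleus samples, or a k-best list); `samples` are unbiased
    draws from the model used to summarise its beliefs."""
    return max(candidates, key=lambda h: mc_expected_utility(h, samples))
```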
Limitations
It's difficult to explore a large sample space because the decoder runs in time O(N² × U), where U is the time for a single assessment of utility (the quadratic term arises when the same N samples serve both as candidates and as the support for estimation).
Strategy: Disentangle sample size from search space.
Sample size controls the variance of our estimates of expected utility; a large search space increases our chances of enumerating good (and bad) translations, but expected utility is robust to inadequate samples.
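As a usage sketch of this disentanglement (with a hypothetical `beam_search` helper, and `ancestral_sample` and `mbr_decode` from the earlier sketches):

```python
# N = 20 candidates define the search space; S = 100 samples control the
# variance of the utility estimates. The two knobs are now independent.
candidates = model.beam_search(x, k=20)                     # hypothetical helper
samples = [ancestral_sample(model, x) for _ in range(100)]  # unbiased draws
y_star = mbr_decode(candidates, samples)
```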
MBR$_{N \times S}$
bigger search space helps
MC estimation seems robust enough with 100 samples
Limitation
Utility functions can be slow: they may require external tools, complex alignment algorithms, or expensive neural models.
Strategy: Prune bad candidates using a proxy utility.
MBR$_{\mathrm{C2F}}$ (coarse-to-fine)
Rank using a proxy objective (e.g., expected skip-bigram F1), filter down to T candidates, then pick the translation that maximises the target objective (e.g., expected BEER).
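A minimal sketch of this two-stage pruning, assuming `proxy_u` and `target_u` are utility callables with signature u(y, h) (e.g., skip-bigram F1 and BEER, respectively):

```python
def mbr_coarse_to_fine(candidates, samples, proxy_u, target_u, top_t=20):
    """Stage 1: rank all candidates by expected proxy utility and keep the
    top T. Stage 2: re-rank the survivors with the expensive target utility."""
    def expected(u, h):
        return sum(u(y, h) for y in samples) / len(samples)
    survivors = sorted(candidates, key=lambda h: expected(proxy_u, h),
                       reverse=True)[:top_t]
    return max(survivors, key=lambda h: expected(target_u, h))
```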
Comparison
MBR is robust
it works with large search spaces
it improves the beam ranking
we can combine enumeration strategies for a boost
Remarks
Probability is tractable but a poor proxy for utility (it requires ad-hoc patches, such as length adjustment).
Expected utility is intractable, but principled estimation is simple and affordable.
Beam search enumerates candidates in approximately best-first order,
which is hard to do for MBR decoding. But we are working on it :D
It's possible (even beneficial) to bias the search space towards high-probability candidates (e.g., via beam search or nucleus sampling). Despite this finding, using high-probability candidates alone is risky (especially out of domain).
MBR gives us an additional knob to express qualitative values: the utility function.
Overview of Pre-Print
MBR is robust (but slow)
we disentangle the sources of intractability in MBR
and speed it up considerably, making it linear in the size of the candidate set
we test various lexical-based utilities (and BEER comes out best)
we combine MBR with other approximations, such as beam search and nucleus sampling, as they find good (and small) hypothesis spaces
our related work section is probably the most coherent and comprehensive account of MBR literature you can find
Thanks!
Some additional slides
Hypothetical Q&A
I've sampled from NMT before, it didn't look good. How about that?
A sample is not a decision; it's a summary of the model's beliefs expressed in data space. Unless the model is extremely confident, a single sample is unlikely to be a good translation.
I've obtained lots of samples from NMT before, then ranked them by probability, that didn't look good either. How about that?
That's actually fine too. Unless the model is extremely confident, model probability is just not a good measure of overall utility.
Wait, are you telling me to sample or not?
Yes, but sampling is something we do to gather information; we still need to decide, and for that we need to pick a utility function.
Why does it work?
Let's illustrate with a simple problem in continuous space: deciding under uncertainty when the model is a mixture of two Gaussians.
Why does it work?
I will try to illustrate it with a toy example (mixture of 2 Gaussians).
The MAP solution is the candidate h with highest probability density.
MAP assigns value to h regardless of how much probability is distributed to outcomes that are similar to it.
Why does it work?
Now I introduce a utility function (an RBF kernel) and compute its value in expectation.
The MBR solution is the candidate h with highest expected utility.
Expected utility is sensitive to how much probability is distributed over similar outcomes. In this case, "similar" means RBF-similar.
Why does it work?
MBR re-expresses probability in terms of utility. The decision maker can use the utility to express preferences. For example, if we bypass the RBF's exponentiation, we allow outcomes that are farther from h to still exert substantial influence on our decisions.
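This toy setting is easy to reproduce numerically. Below is a runnable sketch; the mixture weights, means, and standard deviations are made up for illustration (a narrow spike next to a broad mass). MAP picks the spike (highest density), while MBR with an RBF utility picks where the probability mass concentrates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: mixture of two Gaussians (made-up parameters for illustration):
# a narrow spike at -2 and a broad component at +2 carrying most of the mass.
w, mu, sd = np.array([0.3, 0.7]), np.array([-2.0, 2.0]), np.array([0.1, 1.0])

def density(h):
    """Mixture density p(h)."""
    return sum(wk * np.exp(-0.5 * ((h - mk) / sk) ** 2) / (sk * np.sqrt(2 * np.pi))
               for wk, mk, sk in zip(w, mu, sd))

def rbf(y, h, scale=1.0):
    """RBF utility: how similar outcome y is to candidate h."""
    return np.exp(-((y - h) ** 2) / (2 * scale ** 2))

# Monte Carlo estimate of each candidate's expected RBF utility.
comp = rng.choice(2, size=10_000, p=w)
samples = rng.normal(mu[comp], sd[comp])
grid = np.linspace(-5.0, 5.0, 1001)
expected_u = np.array([rbf(samples, h).mean() for h in grid])

print("MAP solution:", grid[np.argmax([density(h) for h in grid])])  # the spike, near -2
print("MBR solution:", grid[np.argmax(expected_u)])                  # the broad mass, near +2
```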
Can it break?
Yes, of course. Here I show a utility function that is not very discriminative: it considers h beneficial with respect to y even when y is relatively far from h.
Overview of Decoders
Decoder               Objective Function                 Hypothesis Space
--------------------  ---------------------------------  ------------------------
MAP decoding          probability                        all sentences
- beam search         probability (adjusted for length)  most probable candidates
MBR decoding          expected utility                   all sentences
- sampling-based MBR  MC estimate of expected utility    probable candidates
y ~ Y|x                         u(y, "</s>"; x)  u(y, "the mode isn't adequate </s>"; x)
------------------------------  ---------------  ---------------------------------------
the mode is deficient </s>                14.48                                    44.47
the mode is very probable </s>            12.71                                    40.76
isn't a thing </s>                        21.48                                    38.07
uncool </s>                               32.88                                    18.16
the mode is </s>                          24.93                                    62.16
the mode is </s>                          24.93                                    62.16
</s>                                     100.00                                    37.93
yes ! </s>                                41.86                                    19.55
uncool </s>                               32.88                                    18.16
the mode is actually rare </s>            12.71                                    42.80
</s>                                     100.00                                    37.93
the mode is what is is </s>               15.19                                    43.96
uncool mode </s>                          23.07                                    29.39
the mode is inadequate </s>               13.84                                    77.17
the mode is inadequate </s>               13.84                                    77.17
the mode is inadequate </s>               13.84                                    77.17
the mode is awkward </s>                  15.97                                    45.80
the mode is strange </s>                  15.97                                    50.25
the fashion isn't fitting </s>            12.21                                    23.08
the mode is awkward </s>                  15.97                                    45.80
[AVG]                                     27.94                                    44.60