Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation

Bryan Eikema and Wilker Aziz


Neural Machine Translation

We give an NMT model some source-language text x, and it predicts the probability that any target-language text y is a translation of x.

Another way of saying this is: given a source sentence, NMT predicts a probability distribution over translation candidates.


Distribution over Translation Candidates

You can imagine such an object as a bar plot:

The 10 most probable translation candidates of a given sentence, ordered by probability. The 3 most probable candidates are clearly inadequate, essentially incomplete translations. Although these are the most probable candidates, they account for less than 10 percent of the probability mass. It is fair to conclude they are rather rare, despite being the most probable options available.
Most probable candidates and their probabilities

  • For NMT, any sequence y made of known target-language tokens and ending in a special end-of-sequence symbol is a valid translation candidate.

Quiz

Let's analyse this example for a bit longer:
(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

  • What is the most probable translation (i.e., the mode of the distribution)?
  • What is the probability that a translation should be non-empty?
  • What is the probability that a translation should contain the word mode?

Deciding under Uncertainty

We tend to think of NMT models as predicting the correct translation of x, but, as far as the model is concerned, there is no such thing as a single correct translation.

NMT packs its knowledge into an entire distribution over candidates. To pick a translation, we (not the model) decide to place all of our bets on a single outcome (e.g., the mode).

  • To decide under uncertainty, we need a criterion (i.e., a decision rule).
  • An NMT model is not a decision rule; it cannot tell you how to decide.
  • But we can use the uncertainty NMT quantifies to make an informed decision.

MAP Decoding

The most common decision rule in NMT is known as maximum-a-posteriori (MAP) decoding. It tells us to pick the mode of the distribution, no matter how improbable it is.

(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

  • MAP decoding picks: </s>
MAP decoding is a misnomer in the context of NMT, since NMT does not employ a prior over translations and thus does not require posterior inference.
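
In symbols, and with the same notation used later for MBR, MAP decoding solves:

y^\star = \operatorname*{argmax}_{h \in \mathcal Y} ~ p(h|x, \theta)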

Inadequacy of the Mode

The mode of the distribution is the single most probable outcome. Yet, in a large enough sample space, the mode may be extremely rare.

  • Modes in NMT are oftentimes as rare as one in millions.
  • NMT models store the statistics/patterns they learn from training data in the distributions they predict, not in any one specific outcome.
  • The mode can only be a good summary of an entire distribution when an NMT model has no reason to be uncertain.
    • Uncertainty is unavoidable: ambiguity in natural language, lack of context, change in domain, lack of training data, etc.

Beliefs

While no single outcome is more probable than the mode, there are many patterns that are far more probable than the mode.

(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

It's fair to claim that the model does not really want an empty translation, that "mode" is preferred to "fashion", that we need an adjective, etc.


A Model as a Representation of What's Known

If you had to decide between </s> and the mode isn't adequate </s>, and all you knew about language and translation is what an NMT model tells you:
(Same bar plot as before: the 10 most probable translation candidates and their probabilities.)

What would you pick and why?


Utility

If we interpret translation candidates as atomic and unrelated outcomes, all NMT does is express a preference over complete translations. This preference is oftentimes very weak.

If we instead interpret them as combinatorial structures, we can appreciate their structural similarity (e.g., some translations are equally long, make similar word choices, or use similar word order).

A utility function quantifies this similarity in a way that matters for a decision maker.

  • We say that u(y, h; x) quantifies the benefit of choosing h as the translation of x when y is known to be a plausible translation of it.
  • Examples: edit distance, ChrF, BEER, TER, COMET, human judgment, etc.
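
As a concrete illustration (not part of the original slides), here is a minimal sketch of a utility function that wraps sentence-level ChrF from the sacrebleu package; the helper name chrf_utility is our own:

```python
import sacrebleu

def chrf_utility(y: str, h: str) -> float:
    """u(y, h; x): benefit of choosing hypothesis h when y is assumed to be a plausible translation.

    Sentence-level ChrF ranges from 0 to 100; higher means h shares more character n-grams with y.
    """
    return sacrebleu.sentence_chrf(h, [y]).score
```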

Uncertainty About Utility

When deciding whether or not h is a reasonable translation of x, we do not have access to translations we already know to be reasonable choices.

But we have NMT models that give an approximate view of what good choices look like.


Expected Utility

If all I know is that y translates x with probability p(y|x, θ), then my expectation of h's utility is the weighted average utility against every valid translation under the model:

In technical terms we have,

\begin{aligned}
\overbrace{\mu_u(h; x, \theta)}^{\text{expected utility of }h} &= p(y^{(1)}|x, \theta)\, \overbrace{u(y^{(1)}, h; x)}^{\text{utility wrt }y^{(1)}} + p(y^{(2)}|x, \theta)\, \overbrace{u(y^{(2)}, h; x)}^{\text{utility wrt }y^{(2)}} + \cdots \\
&= \sum_{y \in \mathcal Y} p(y|x, \theta)\, u(y, h; x) \qquad \text{also denoted by } \mathbb E[u(Y, h; x) \mid \theta]
\end{aligned}

where each and every possible translation is, in turn and with some probability, assumed to be the reference translation.
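
As a minimal sketch (ours, assuming a finite list of pseudo-references with their probabilities, like the table on the next slide), expected utility is just this weighted sum:

```python
def expected_utility(h, weighted_refs, utility):
    """sum_y p(y|x) * u(y, h; x) over an enumerated set of pseudo-references.

    weighted_refs: iterable of (y, p_y) pairs
    utility:       a function u(y, h) -> float, e.g. the chrf_utility sketch above
    """
    return sum(p_y * utility(y, h) for y, p_y in weighted_refs)

# e.g., with the values from the example table (truncated to the first two rows):
# expected_utility("</s>", [("</s>", 0.0645), ("the mode </s>", 0.0605)], chrf_utility)
```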


Example

Let's judge two candidates: </s> and the mode isn't adequate </s>.

For utility, we will use ChrF, which values candidates that match character n-grams of a good translation.

h                             y                                 p(y|x)    u(y, h;x)    p(y|x) * u(y, h;x)
----------------------------  ------------------------------  --------  -----------  --------------------
</s>                          </s>                              0.0645       100.00                  6.45
                              the mode </s>                     0.0605        29.71                  1.80
                              the mode is </s>                  0.0477        24.93                  1.19
                              the mode is inadequate </s>       0.0469        13.84                  0.65
                              the mode is not adequate </s>     0.0441        13.25                  0.58
                              the mode is awkward </s>          0.0412        15.97                  0.66
                              the mode is empty </s>            0.0397        17.79                  0.71
                              the mode is deficient </s>        0.0390        14.48                  0.56
                              the mode is poor </s>             0.0359        18.87                  0.68
                              the fashion isn't fitting </s>    0.0342        12.21                  0.42
                              [...]
                              [SUM]                                                                 25.92

the mode isn't adequate </s>  </s>                              0.0645        37.93                  2.45
                              the mode </s>                     0.0605        58.62                  3.55
                              the mode is </s>                  0.0477        62.16                  2.96
                              the mode is inadequate </s>       0.0469        77.17                  3.62
                              the mode is not adequate </s>     0.0441        82.98                  3.66
                              the mode is awkward </s>          0.0412        45.80                  1.89
                              the mode is empty </s>            0.0397        49.20                  1.96
                              the mode is deficient </s>        0.0390        44.47                  1.73
                              the mode is poor </s>             0.0359        49.81                  1.79
                              the fashion isn't fitting </s>    0.0342        23.08                  0.79
                              [...]
                              [SUM]                                                                 40.87

Minimum Bayes Risk Decoding

MBR decoding tells us to choose the candidate whose expected utility is maximum:

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in \mathcal Y} ~ \mathbb E\left[u(Y, h; x) \mid \theta \right] \end{aligned}

  • Decision maker: chooses the utility function u
  • NMT model: contributes beliefs (i.e., the probability of y given x for every possible y)
  • Search algorithm: enumerates candidate translations h ∈ 𝒴

But, we don't need to pick a utility function when we use MAP decoding, right?

Actually, MBR decoding with u(y, h; x) = 1 if, and only if, y = h (and 0 otherwise) is exactly MAP decoding.

So, when we make decisions via MAP decoding, we implicitly decide via MBR where our utility function is the exact match function.
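
To see why, note that with this exact-match utility the expected utility of a hypothesis collapses to its model probability,

\mathbb E[u(Y, h; x) \mid \theta] = \sum_{y \in \mathcal Y} p(y|x, \theta)\, \mathbb 1[y = h] = p(h|x, \theta)

so maximising expected utility is the same as maximising p(h|x, θ).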


Intractabilities of MBR Decoding

There are two sources of intractability in MBR decoding

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in \mathcal Y} ~ \mathbb{E}[u(Y, h; x) \mid \theta] \end{aligned}

  • The objective function (expected utility) requires an intractable sum
    this is different from MAP decoding, where the objective is tractable
  • The hypothesis space 𝒴 is unbounded
    just like in MAP decoding

Summarising the Model's Beliefs

The space of all translation candidates is unbounded, making it impossible for us to exactly compute the expected utility of any given candidate.

But expectations can be estimated in a principled manner via Monte Carlo.

We use the sample mean

\hat \mu_u(h; x, \theta) = \frac{1}{S} \sum_{s=1}^S u(y^{(s)}, h; x)

where y^{(s)} is sampled from the model with probability p(y^{(s)}|x, θ).
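
A minimal sketch of this estimator (ours), assuming we already hold a list of samples drawn from the model:

```python
def estimate_expected_utility(h, samples, utility):
    """Monte Carlo estimate (1/S) * sum_s u(y^(s), h; x), with y^(s) ~ p(y|x, theta).

    samples: a list of S translations drawn from the model (pseudo-references)
    utility: a function u(y, h) -> float
    """
    return sum(utility(y, h) for y in samples) / len(samples)
```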


What's a Sample?

Think of the NMT model as a bag of tokens, where each token is a translation: if you put your hand in and draw a token, there is a probability p(y|x, θ) that you will draw y.

  • Drawing samples like that is easy in NMT because of the way the model decomposes the probability of a complete sequence as a product of probabilities, one for each target word in context, from left to right (see the sketch below).
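
A sketch of such ancestral sampling (ours); next_token_distribution is a hypothetical hook into the NMT model that returns the next-token probabilities given the source and the prefix generated so far:

```python
import random

EOS = "</s>"  # end-of-sequence symbol

def sample_translation(x, next_token_distribution, max_len=100):
    """Draw y ~ p(y|x, theta) token by token, left to right."""
    y = []
    while len(y) < max_len:
        # assumed to return a dict {token: probability} for the next position
        dist = next_token_distribution(x, y)
        token = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
        y.append(token)
        if token == EOS:
            break
    return " ".join(y)
```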

Example

Let's judge two candidates: </s> and the mode isn't adequate </s>.

We will estimate each candidate's expected utility using 20 samples from the model.

For utility, we will use ChrF.


Monte Carlo Estimation of Expected Utility (I)

h                             y ~ Y|x                                   u(y, h;x)
----------------------------  --------------------------------------  -----------
</s>                          the mode is deficient </s>                    14.48
                              the mode is very probable </s>                12.71
                              isn't a thing </s>                            21.48
                              uncool </s>                                   32.88
                              the mode is </s>                              24.93
                              the mode is </s>                              24.93
                              </s>                                         100.00
                              yes ! </s>                                    41.86
                              uncool </s>                                   32.88
                              the mode is actually rare </s>                12.71
                              </s>                                         100.00
                              the mode is what is is </s>                   15.19
                              uncool mode </s>                              23.07
                              the mode is inadequate </s>                   13.84
                              the mode is inadequate </s>                   13.84
                              the mode is inadequate </s>                   13.84
                              the mode is awkward </s>                      15.97
                              the mode is strange </s>                      15.97
                              the fashion isn't fitting </s>                12.21
                              the mode is awkward </s>                      15.97

                              [AVG]                                         27.94

Monte Carlo Estimation of Expected Utility (II)

h                             y ~ Y|x                                   u(y, h;x)
----------------------------  --------------------------------------  -----------
the mode isn't adequate </s>  what ? </s>                                   21.01
                              the mode is </s>                              62.16
                              the mode is awkward </s>                      45.80
                              the mode is deficient </s>                    44.47
                              fashion isn't a thing </s>                    29.31
                              is the </s>                                   35.67
                              the mode is inadequate </s>                   77.17
                              the is the </s>                               33.96
                              uncool </s>                                   18.16
                              </s>                                          37.93
                              what ? </s>                                   21.01
                              the mode is </s>                              62.16
                              the mode </s>                                 58.62
                              the mode is very probable </s>                40.76
                              the mode is what is is </s>                   43.96
                              mode is strange </s>                          38.55
                              the mode is awkward </s>                      45.80
                              the mode </s>                                 58.62
                              sometimes NMT does strange things </s>        14.09
                              the fashion isn't fitting </s>                23.08

                              [AVG]                                         40.61

Revisiting Intractabilities of MBR Decoding

There are two sources of intractability in MBR decoding

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in \mathcal Y} ~ \mathbb{E}[u(Y, h; x) \mid \theta] \end{aligned}

  • The objective function (expected utility) requires an intractable sum
    but unbiased estimation is tractable via MC
  • The hypothesis space 𝒴 is unbounded
    but, like in MAP decoding, we can use a small subset

Hypothesis Space

Ideally, we want to search through a small space of hypotheses which mostly contains candidates that are highly beneficial in expectation.

  • This sort of approximately best-first enumeration is what beam search does for MAP decoding.
  • Unlike probability, expected utility cannot be computed incrementally from left-to-right, thus approximate best-first enumeration is a lot more difficult.
  • Some tractable heuristics:
    • a set of unbiased samples
    • a set of samples aimed at high-probability outcomes (e.g., nucleus sampling)
    • the most probable outcomes (output of k-best beam search)

Sampling-Based MBR: MBR N×N

Obtain N independent samples from an NMT model

  • use unique samples as candidates
  • use all samples as pseudo-references
  • share samples across candidates for efficiency

That is,

\begin{aligned} y^\star &= \operatorname*{argmax}_{h \in (y^{(1)}, \ldots, y^{(N)})} ~ \frac{1}{N} \sum_{n=1}^N u(y^{(n)}, h; x) \qquad y^{(n)} \sim Y \mid \theta, x \end{aligned}
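
Putting the pieces together, a minimal sketch of MBR N×N (ours): the same N samples serve both as candidates and as shared pseudo-references, costing O(N²) utility evaluations:

```python
def mbr_nxn(samples, utility):
    """Sampling-based MBR: among the unique samples, pick the one whose Monte Carlo
    estimate of expected utility (over all N samples) is highest."""
    def estimate(h):
        return sum(utility(y, h) for y in samples) / len(samples)
    return max(set(samples), key=estimate)
```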

Shared Samples

y ~ Y|x                           u(y, "</s>";x)    u(y, "the mode isn't adequate </s>";x)
------------------------------  ----------------  ----------------------------------------
the mode is deficient </s>                 14.48                                     44.47
the mode is very probable </s>             12.71                                     40.76
isn't a thing </s>                         21.48                                     38.07
uncool </s>                                32.88                                     18.16
the mode is </s>                           24.93                                     62.16
the mode is </s>                           24.93                                     62.16
</s>                                      100.00                                     37.93
yes ! </s>                                 41.86                                     19.55
uncool </s>                                32.88                                     18.16
the mode is actually rare </s>             12.71                                     42.80
</s>                                      100.00                                     37.93
the mode is what is is </s>                15.19                                     43.96
uncool mode </s>                           23.07                                     29.39
the mode is inadequate </s>                13.84                                     77.17
the mode is inadequate </s>                13.84                                     77.17
the mode is inadequate </s>                13.84                                     77.17
the mode is awkward </s>                   15.97                                     45.80
the mode is strange </s>                   15.97                                     50.25
the fashion isn't fitting </s>             12.21                                     23.08
the mode is awkward </s>                   15.97                                     45.80

[AVG]                                      27.94                                     44.60

MBR N×N

Unlike approximate MAP decoding, approximate MBR decoding improves with computation.


Limitations

It's difficult to explore a large sample space because the decoder runs in time O(N² × U), where U is the time for a single assessment of utility.

Strategy: disentangle the sample size from the search space.

The sample size controls the variance of our estimates of expected utility; a large search space increases our chances of enumerating good and bad translations, but expected utility is robust to inadequate samples.
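
A sketch of this disentangling (ours): candidates and pseudo-references come from two different sets, so the cost is O(N × S) utility calls, with S kept small (e.g., 100):

```python
def mbr_nxs(candidates, references, utility):
    """candidates: N hypotheses (e.g., a large pool of samples or beam-search outputs)
    references: S samples from the model, used only to estimate expected utility"""
    def estimate(h):
        return sum(utility(y, h) for y in references) / len(references)
    return max(candidates, key=estimate)
```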


MBR N×S

  • bigger search space helps
  • MC estimation seems robust enough with 100 samples

Limitation

Utility functions can be slow: they may require external tools, complex alignment algorithms, or expensive neural models.

Strategy: prune bad candidates using a proxy utility.


MBR C2F

Rank using expected skip-bigram F1, filter down to T candidates, then re-rank those using expected BEER.
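
A sketch of the coarse-to-fine idea (ours); proxy_utility and fine_utility stand in for expected skip-bigram F1 and BEER, which we do not implement here:

```python
def mbr_coarse_to_fine(candidates, references, proxy_utility, fine_utility, T=20):
    """Rank all candidates with a cheap proxy utility, keep the top T,
    then re-rank only those T with the expensive utility."""
    def estimate(h, utility):
        return sum(utility(y, h) for y in references) / len(references)
    top_t = sorted(candidates, key=lambda h: estimate(h, proxy_utility), reverse=True)[:T]
    return max(top_t, key=lambda h: estimate(h, fine_utility))
```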


Comparison

MBR is robust

  • it works with large search spaces
  • it improves the beam ranking
  • we can combine enumeration strategies for a boost

Remarks

  • Probability is tractable but a poor proxy for utility (it requires ad-hoc patches).
  • Expected utility is intractable, but principled estimation is simple and affordable.
  • Beam search enumerates candidates in approximately best-first order,
    which is hard to do for MBR decoding. But we are working on it :D
  • It's possible (even beneficial) to bias the search space towards high-probability candidates (e.g., via beam search or nucleus sampling).
    Despite this finding, relying on high-probability candidates alone is risky (especially out of domain).
  • MBR gives us an additional knob to express qualitative values: the utility function.

Overview of Pre-Print

  • MBR is robust
  • but MBR is slow
  • we speed it up considerably, making it linear in the size of the candidate set
  • we test various utilities (and BEER comes out best)
  • we combine it with other approximations, such as beam search and nucleus sampling, as they find good (and small) hypothesis spaces

Beyond the pre-print (or what we are up to)

  • Approximate best-first search for MBR
  • MBR for neural utilities
  • Utilities that control for certain attributes

Thanks!


Some additional slides


Hypothetical Q&A

  • I've sampled from NMT before, it didn't look good. How about that?
    • A sample is not a decision; it's a summary of the model's beliefs expressed in data space. Unless the model is extremely confident, a single sample is unlikely to be a good translation.
  • I've obtained lots of samples from NMT before, then ranked them by probability, that didn't look good either. How about that?
    • That's actually fine too. Unless the model is extremely confident, model probability is just not a good measure of overall utility.
  • Wait, are you telling me to sample or not?
    • Yes, but sampling is something we do to gather information; we still need to decide and, for that, we need to pick a utility function.

Why does it work?

Let's illustrate with a simple problem in continuous space: deciding under uncertainty when the model is a mixture of two Gaussians.


Why does it work?

I will try to illustrate it with a toy example (mixture of 2 Gaussians).

  • The MAP solution is the candidate h with highest probability density.
  • MAP assigns value to h regardless of how much probability is distributed to outcomes that are similar to it.


Why does it work?

Now I introduce a utility function (rbf) and compute its value in expectation.

  • The MBR solution is the candidate h with highest expected utility.
  • Expected utility is sensitive to how much probability is distributed over similar outcomes. In this case, "similar" means rbf-similar.

The radial basis function (rbf) exponentiates the negative of the squared distance.
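
A small numerical sketch of this toy setting (ours; the mixture weights, means, and the rbf length-scale are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n):
    """Draw n samples from a mixture of two Gaussians (toy stand-in for the model)."""
    first = rng.random(n) < 0.6
    return np.where(first, rng.normal(-2.0, 0.3, n), rng.normal(2.0, 1.5, n))

def rbf(y, h, scale=1.0):
    """Utility: exponentiated negative squared distance between y and h."""
    return np.exp(-((y - h) ** 2) / scale)

samples = sample_mixture(10_000)                # pseudo-references y ~ model
grid = np.linspace(-5.0, 5.0, 201)              # candidate decisions h
expected = np.array([rbf(samples, h).mean() for h in grid])
print("MBR choice (max expected rbf utility):", grid[expected.argmax()])
```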

Why does it work?

MBR re-expresses probability in terms of utility. The decision maker can use the utility to express preferences. For example, if we bypass rbf's exponentiation, we allow outcomes that are farther from h to still exert a lot of influence on our decisions.


Can it break?

Yes, of course. Here I show a utility function that is not very discriminative: it considers h beneficial with respect to y even when y is relatively far from h.


Overview of Decoders

Decoder               Objective Function                 Hypothesis Space
--------------------  ---------------------------------  ------------------------
MAP decoding          probability                        all sentences
- beam search         probability (adjusted for length)  most probable candidates
MBR decoding          expected utility                   all sentences
- sampling-based MBR  MC estimate of expected utility    probable candidates
