
In the noisy channel model, for a given input sentence D we are given a set of candidate translations A1, A2, A3, ..., Ai, ... We want to pick the best one, i.e. argmax(P(Ai|D)). To maximize this, we apply Bayes' rule: P(Ai|D) = P(D|Ai)*P(Ai)/P(D). We can then drop P(D), because it is constant across all candidates and does not affect the argmax.

From this it follows that argmax(P(Ai|D)) = argmax(P(D|Ai)*P(Ai)), so we search for the best translation using the right-hand side.
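As a concrete sketch of the decision rule above: a noisy-channel decoder just scores every candidate by P(D|Ai)*P(Ai) and takes the argmax. All probabilities below are invented purely for illustration.

```python
# Toy noisy-channel decoder for one input sentence D.
# Each candidate translation A_i carries two (made-up) scores:
#   - P(D|A_i): the channel / translation model
#   - P(A_i):   the language model
candidates = {
    "the house is small": (0.20, 0.010),
    "the house is little": (0.25, 0.002),
    "small is the house":  (0.30, 0.0001),
}

def noisy_channel_score(a):
    p_d_given_a, p_a = candidates[a]
    # P(D) is omitted: it is the same for every candidate,
    # so it cannot change which candidate wins the argmax.
    return p_d_given_a * p_a

best = max(candidates, key=noisy_channel_score)
print(best)  # "the house is small"
```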

I understand that P(Ai) is easier to predict, but why is P(Ai|D) harder to model than P(D|Ai)?

Sir Cornflakes
Vilda

1 Answer


If we consider only P(D|Ai), that estimate is roughly as unreliable as estimating P(Ai|D) directly: both must be learned from a bilingual corpus, which is limited, so neither is very good on its own.

But on the right-hand side of the equation we also have P(Ai), and that makes the result much more reliable: P(Ai) is estimated from a monolingual corpus, which is far larger and more representative.

Thus, in practice, P(Ai|D) estimated directly from a bilingual corpus is not as good as the product P(D|Ai)*P(Ai).
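The answer's point can be sketched with invented numbers: the channel model P(D|Ai) alone may prefer a disfluent hypothesis, while multiplying in a language model P(Ai) estimated from monolingual text restores a sensible ranking. The probabilities here are hypothetical.

```python
# Two candidate translations with (made-up) channel and LM scores:
#   (P(D|A_i), P(A_i))
candidates = {
    "the house is small": (0.20, 0.010),
    "small is the house": (0.30, 0.0001),  # high channel score, disfluent
}

# Ranking by the channel model alone prefers the disfluent hypothesis.
channel_only = max(candidates, key=lambda a: candidates[a][0])

# Multiplying in P(A_i) from monolingual data corrects the ranking.
combined = max(candidates, key=lambda a: candidates[a][0] * candidates[a][1])

print(channel_only)  # "small is the house"
print(combined)      # "the house is small"
```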

Septinel