In the noisy channel model, given an input sentence D, we consider a set of candidate translations A1, A2, A3, ..., Ai, .... We want to pick the best translation, i.e., argmax over i of P(Ai|D). Applying Bayes' rule: P(Ai|D) = P(D|Ai)*P(Ai)/P(D). We can drop P(D), because it is the same for every candidate and so doesn't affect the argmax.
From this it follows that argmax(P(Ai|D)) = argmax(P(D|Ai)*P(Ai)), so we find the best translation using the second form.
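Just to make the decision rule concrete, here is a minimal sketch in Python with made-up probabilities (the candidate names and numbers are purely illustrative); scores are combined in log space, as real systems do to avoid underflow:

```python
import math

# Hypothetical candidates: Ai -> (P(D|Ai), P(Ai)),
# i.e. (channel/translation model score, language model score).
candidates = {
    "A1": (0.30, 0.10),
    "A2": (0.20, 0.40),
    "A3": (0.05, 0.50),
}

def score(pair):
    # log P(D|Ai) + log P(Ai), equivalent to maximizing P(D|Ai)*P(Ai)
    p_d_given_a, p_a = pair
    return math.log(p_d_given_a) + math.log(p_a)

best = max(candidates, key=lambda a: score(candidates[a]))
print(best)  # A2, since 0.20*0.40 = 0.08 beats 0.30*0.10 and 0.05*0.50
```

Note that P(D) never appears: it would add the same constant to every candidate's score.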
I understand that P(Ai) is easier to predict, but why is P(Ai|D) harder to model than P(D|Ai)?