"Is it because in the candidate generation stage, the relative order of items does not matter?"
Yes, this is exactly what appears to be going on, although YouTube is using the softmax in an unconventional way. The candidate generation model simply selects the few hundred candidate videos that are subsequently ranked by the ranking model.
I think Section 3 of the paper you referenced does a good job of explaining what's going on:
"At serving time we need to compute the most likely N
classes (videos) in order to choose the top N to present
to the user... Since calibrated
likelihoods from the softmax output layer are not needed
at serving time, the scoring problem reduces to a nearest
neighbor search in the dot product space for which general
purpose libraries can be used."
As far as I can tell, this kind of recommender architecture is only beneficial at the scale at which an organization like YouTube operates, and it has more to do with the practicality of organizing computational infrastructure than with model performance. I'm sure the difference in performance versus a more "traditional" architecture is negligible as far as metrics like MAP@k are concerned.
EDIT: Found the same question already answered with much more detail than shown here.