"Is it because in the candidate generation stage, the relative order of items does not matter?"
Yes, this is exactly what appears to be going on, although YouTube is using the softmax in an unconventional way. The candidate generation model simply selects the few hundred candidate videos that are subsequently ranked by the ranking model.
I think Section 3 of the paper you referenced does a good job of explaining what's going on:
"At serving time we need to compute the most likely N
classes (videos) in order to choose the top N to present
to the user... Since calibrated
likelihoods from the softmax output layer are not needed
at serving time, the scoring problem reduces to a nearest
neighbor search in the dot product space for which general
purpose libraries can be used."
As far as I can tell, this kind of recommender architecture is only beneficial at the scale at which an organization like YouTube operates, and it has more to do with the practicality of organizing computational infrastructure than with model performance. I'm sure the difference in performance versus a more "traditional" architecture is negligible as far as metrics like MAP@k are concerned.
EDIT: Found the same question already answered with much more detail than shown here.