
I came across this research paper released by YouTube on how they use deep neural networks for recommendations. It's located here: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf

In the paper, the candidate generation neural network model outputs a softmax with 256 dimensions, which acts as an "output embedding" of each of the 1M video classes.

How would this be implemented in TensorFlow, for example? Isn't a softmax output supposed to be one-dimensional? If the model outputs an "embedding" like this, as they say it does, how would the training data's labels be formatted as 256-dimensional vectors? In other words, how do they compute the 256-dimensional vector for each of the videos in their training dataset?

Thank you so much for your time and help, guys!

user1414202

1 Answer


You are confusing "dimensions" with the "order" of a tensor. A softmax over 256 different categories is a 256-dimensional vector, but it is also a tensor of order 1 (whilst a matrix is a tensor of order 2). The paper is using the technical terms correctly, so the 256-dimensional vector is just a normal vector with 256 scalar entries.

Therefore a 256-dimensional softmax in TensorFlow is typically an output layer that looks something like this:

y = tf.nn.softmax(tf.matmul(h, W) + b)

where h is the last hidden layer, W is the n x 256 weight matrix, and b is the 256-dimensional bias vector.
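For concreteness, here is a minimal NumPy sketch of such an output layer (NumPy rather than TensorFlow, to keep it self-contained; the hidden size n = 64 and the random values are illustrative assumptions, not the paper's numbers):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 64                      # size of the last hidden layer (illustrative)
num_classes = 256           # dimensionality of the softmax output

h = rng.normal(size=(1, n))             # last hidden activation
W = rng.normal(size=(n, num_classes))   # n x 256 weight matrix
b = np.zeros(num_classes)               # 256-dimensional bias vector

y = softmax(h @ W + b)

# y is an order-1 tensor (a vector) with 256 dimensions that sums to 1.
print(y.shape)         # (1, 256)
print(float(y.sum()))  # ~1.0
```

The point is simply that "256-dimensional" here means a vector with 256 entries, not a tensor of order 256.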

In the paper, the candidate generation neural network model outputs a softmax with 256 dimensions, which acts as an "output embedding" of each of the 1M video classes

That is a description of the training process that compresses 1M different inputs down to a 256-dimensional output for use as an embedding for recommendation matches. The softmax is at the output, and as far as I can see it is just a normal softmax classifier output as seen in many other classifier networks (except the result is not technically being used to classify anything). I am not clear on what supervision data was used or what the input representation was. However, I don't think it likely that the 1M "classes" ever appear as e.g. a one-hot encoding, because that would not scale out usefully to the many other millions of videos - the point of the embedding is to turn disparate features of the videos into something that can be used as a similarity measure, and that can be run on any video stored on YouTube.
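To illustrate how such an embedding could be used for similarity, one common reading (a toy sketch, not the paper's actual serving scheme) is that each column of the trained softmax weight matrix acts as that class's "output embedding", and similar classes can then be retrieved with a dot-product nearest-neighbour search. The sizes and the random matrix below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 8        # toy stand-in for the paper's 256-dim layer
num_classes = 1000   # toy stand-in for the ~1M video corpus

# After training, each column of the softmax weight matrix can be read
# as that class's "output embedding" (a random matrix stands in here).
W = rng.normal(size=(embed_dim, num_classes))

def nearest_classes(class_id, k=5):
    """Return the k classes whose output embeddings score highest
    (by dot product) against the given class's embedding."""
    query = W[:, class_id]
    scores = query @ W              # dot product against every column
    scores[class_id] = -np.inf      # exclude the query itself
    return np.argsort(scores)[-k:][::-1]

neighbours = nearest_classes(42)
print(neighbours)   # indices of the 5 most similar "videos"
```

In a real system the search over millions of columns would need an approximate nearest-neighbour scheme rather than this brute-force scan.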

Neil Slater
  • Hi, thanks for your help Neil! In the paper, it says that they are applying a softmax over 1M classes (each of which is a YouTube video). However, they say that the final output layer also acts as an "output embedding layer," with dimension 256 (which means that each class has an "embedding" vector of size 256). So, wouldn't the weights and biases be of shape n x num_classes, and not n x 256? Please correct me if I'm wrong. Thanks! – user1414202 Apr 08 '17 at 21:15
  • Also, would the weights of the softmax layer, which you have provided as an example, represent the embeddings for the different classes. If so, can we use these embeddings to find similar classes in an N-dimensional space, as they do in the original layer? Thanks @neil-slater – user1414202 Apr 08 '17 at 21:23
  • I have skim-read the paper, and see no indication that the embedding is anything other than a normal softmax as I described. I could not realistically re-create the full architecture from that paper myself so perhaps I have missed something. However, the "millions of classes" looks to me like a problem statement, and not a reference to the network architecture. As far as I can see, the network uses a normal softmax with 256 classes (as a "summary"), it is other parts of the architecture and pipeline which help it upscale to match from such a large number of candidate videos. – Neil Slater Apr 08 '17 at 21:29
  • 1
    Hi, it's on page 4, Section 3.5, first few lines of that section. Also, it says that they do a softmax over millions of classes (each of which is a youtube video), which is why they use a sampled softmax during training in the first place (as computing full softmax for a million+ classes will be unfeasible). Really thanks for taking the time to help me! You're awesome! – user1414202 Apr 08 '17 at 21:31
  • 1
    on the paper it says the following (directly quoted): "To efficiently train such a model with millions of classes, we rely on a technique to sample negative classes from the background distribution (“candidate sampling”) and then correct for this sampling via importance weighting [10]." From this statement, it does seem that they're training a model with millions of classes, right? Thanks – user1414202 Apr 08 '17 at 23:23
  • @user3692525: The recommendation engine is where each video is a separate entity, and that is the "class" that is being selected, but that is a property of the higher-level pipeline, and does not translate to the architecture of the neural network. The neural network is not being trained to categorise millions of classes. It is because there are millions of classes that the neural network is being used within the larger process to manage the problem down to a 256-dimensional vector. – Neil Slater Apr 08 '17 at 23:29
  • @neil-slater, ooooh that makes sense now. So, that's why the neural network architectures included in the paper have a final hidden layer of 256... But, how would we vaguely express this statement in TensorFlow: "The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding)."? Does this mean that the softmax's output is of the shape [num_videos, 256]? Sorry to disturb you, but thanks for your help! – user1414202 Apr 08 '17 at 23:41
  • @neil-slater, if the NN is supposed to output a 256-dimensional vector, wouldn't it make sense for them to treat this as a regression problem, not a classification problem? After all, how can you classify a 256-dimensional vector via softmax? That would be more suited for regression, right? Can you kindly briefly clarify this? Thanks... – user1414202 Apr 08 '17 at 23:50
  • @user3692525: Yes I have more often seen functions other than softmax used for embedding vectors. I am not sure what the benefit of softmax is here, but I would guess it constrains things similar to using regularisation. The nn training is attempting to map input vectors that "go together" have similar output values, so yes this is more like a form of regression but using an output layer more traditionally used for classifying. – Neil Slater Apr 09 '17 at 08:19
  • "The softmax layer outputs a multinomial distribution over the same 1M video classes with a dimension of 256 (which can be thought of as a separate output video embedding)." -> This output matches what I put in the answer. There is nothing special about the softmax layer in this network, it is just same as the softmax layer used in countless MNIST digits examples. The clever parts happen elsewhere in the system. – Neil Slater Apr 09 '17 at 08:29
  • Oh ok @neil-slater. Your answer is reasonable, and I've marked it. Also, do you know how, in tensorflow, we can train the network to partition the labels (which is a specific video id in this case) into 256 groups (which would be the actual 256-dim labels that are fed to the network to match the softmax's 256-d output)? – user1414202 Apr 09 '17 at 08:52
  • In other words, the paper says that this classification model "[scores] millions of items under a strict serving latency of tens of milliseconds requires an approximate scoring scheme sublinear in the number of classes. Previous systems at YouTube relied on hashing [24] and the classifier described here uses a similar approach." Do you know or have any ideas on how to implement this hashing approach in TensorFlow (which I believe partitions the millions of videos into 256 groups)? Thanks! – user1414202 Apr 09 '17 at 08:53
  • @user3692525: Sorry I do not understand the paper well enough to attempt an implementation of a system similar to YouTube's. I'd reckon that to be a project taking a few weeks. – Neil Slater Apr 09 '17 at 09:00
  • Anyways, @neil-slater, thank you very much for your consideration into my question and your thoughtful replies/answers! – user1414202 Apr 09 '17 at 09:21
  • First of all, @NeilSlater is completely wrong. It is very clear that they are doing softmax over 1M classes, as per the discussion in 3.1 about extreme multiclass, where they cite the WASABI paper where they do the same thing. However, the original question still stands: how does one get "softmax vectors"? I've never seen that before... – eggie5 Aug 07 '18 at 03:13
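The "candidate sampling" technique quoted in the comments above can be sketched in toy form: instead of normalising over all classes, each training step normalises over the true class plus a handful of sampled negative classes. This is a simplified illustration with made-up sizes, and it omits the importance-weighting correction the paper mentions (TensorFlow's tf.nn.sampled_softmax_loss implements the full idea):

```python
import numpy as np

rng = np.random.default_rng(2)
embed_dim, num_classes = 8, 100_000   # toy stand-ins for 256 and ~1M

W = rng.normal(size=(embed_dim, num_classes)) * 0.01
user_vector = rng.normal(size=embed_dim)   # the "user" embedding from the net
true_class = 12345
num_negatives = 20

# Sample negative classes from a (here uniform) background distribution,
# skipping the true class so it appears exactly once among the candidates.
negatives = rng.choice(num_classes - 1, size=num_negatives, replace=False)
negatives[negatives >= true_class] += 1
candidates = np.concatenate(([true_class], negatives))

# Score only the sampled candidates, not all 100k classes.
logits = user_vector @ W[:, candidates]
logits -= logits.max()
probs = np.exp(logits) / np.exp(logits).sum()

# Cross-entropy loss with index 0 (the true class) as the label
# within the sampled candidate set.
loss = -np.log(probs[0])
print(round(float(loss), 3))
```

The cost per step is proportional to the number of sampled candidates rather than to the full class count, which is what makes training over millions of classes feasible.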