
I can't find anywhere what the columns of the matrix in the hidden layer of the skip-gram model represent.

If the rows of the matrix represent the words, shouldn't the columns also represent words, or contexts such as documents? Can anyone tell me whether that statement is correct? Most papers use the term "features", and I am not sure what they mean by a feature.

I have added an image below to clarify which matrix I am talking about.

[image: the hidden-layer weight matrix of the skip-gram model]


1 Answer


Short answer: they correspond to the number of neurons in your network's hidden layer. They don't necessarily represent a predefined attribute of the word or corpus you're working with.

Slightly more in depth: in machine learning, the word feature is a fairly abstract concept. Given an item that you want to classify (a word, in this scenario), you can say that the item is defined by its set of features.

So, to mention an example not related to your problem: if we were classifying a set of mushrooms and wanted to train a model to learn whether a given mushroom is poisonous or safe to eat, we could look at the attributes that describe each mushroom. These could be things such as cap size, cap color, and odor. Or we could choose a different set of features, such as gill size, stalk size, and stalk color. Or we could use all the features available to us (i.e. every piece of data that has been gathered about the mushrooms). You can look more into this particular example here.
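
To make the idea of choosing a feature set concrete, here is a tiny hypothetical sketch (the attribute names and values below are invented for illustration, not taken from the actual mushroom dataset):

# Two alternative ways to describe the very same mushroom.
# Which attributes we include as features is entirely our choice.
feature_set_1 = {"cap_size": 5.2, "cap_color": "brown", "odor": "almond"}
feature_set_2 = {"gill_size": "broad", "stalk_size": 3.1, "stalk_color": "white"}

Either dictionary is a perfectly valid description of the mushroom; a classifier trained on one set may simply perform better or worse than one trained on the other.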

In some cases, we may not even know which features define the objects we want to classify. How do we know whether we are supposed to consider this or that set of features? Well, we could take a different approach, and train a model to learn what the features are.

This approach is relevant here. Given a word, how do we know how to predict that word's neighbors? What attributes of the word, or of the word's context, are relevant to us? How can we even learn what those attributes are? Well, we could set up a neural network with a large enough set of neurons (300 should do the trick!) and train that network. This is what we're doing here: the nodes in our network don't necessarily correspond to a predefined attribute of the word or corpus you're working with. We don't know what the relevant features are; that's one of the reasons we are using a neural network in the first place. We want the network to do the dirty work for us and learn what the features are.

Our end goal is that, given a word, we end up with a vector associated with that word. That vector should be interpreted as coordinates in an n-dimensional space, where n is the number of features we chose to use. Note that how many features to consider, i.e. how many neurons to have in our network, is a choice. We can use 300 neurons, but we could use any arbitrary number, such as 3, 100, or even 1,000.
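
To connect this back to the matrix in your image, here is a minimal numpy sketch, assuming a toy vocabulary of 10,000 words, 300 hidden neurons, and a word at index 42 (all three numbers are arbitrary):

import numpy as np

vocab_size = 10_000  # rows: one per word in the corpus
n_features = 300     # columns: one per hidden neuron (a choice we make)

# The hidden-layer weight matrix; random here, learned during training.
# After training, row i holds the learned vector for word i.
W = np.random.rand(vocab_size, n_features)

# One-hot encode a word, say the word at index 42.
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0

# Multiplying the one-hot vector by W simply selects row 42 of W:
# that 300-dimensional row is the word's vector.
word_vector = one_hot @ W
assert np.allclose(word_vector, W[42])

So the rows index words, while the columns index hidden neurons, i.e. the learned features.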

So... Why are we doing this in the first place?

  • Well, first of all, if our corpus has 10,000 different words, then any given word is described by a one-hot vector of length 10,000. Not very efficient!
  • Secondly, remember that in the end we want to have something that we can use to compare words. We would like to know if any two words (say, France and Paris) are similar, and if possible, have a measure of how similar they are. Placing all of our words in a vector space is very convenient, because then we can just measure the distance between two points to find out how similar any two words are (a small sketch of this follows the list).
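
As a sketch of that second point: cosine similarity is one measure commonly used over word vectors (the France/Paris numbers below are invented, and the 3-dimensional space is just for readability):

import numpy as np

def cosine_similarity(u, v):
    # 1.0 means the vectors point in the same direction; 0.0 means orthogonal.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

france = np.array([0.7, 0.1, 0.5])  # hypothetical word vectors
paris = np.array([0.6, 0.2, 0.5])
print(cosine_similarity(france, paris))  # about 0.99 -> very similar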

So, having said all that, let's come up with an example. Let's say that we want to use a 3-dimensional space to represent our words. That means that we choose (again, this is a choice) to have 3 features. So we want to map every word in our corpus to a point in a 3-dimensional space. If we had a corpus with words such as Batman, Joker, Spiderman, and Thanos, in the end we could end up with something like this (this example is taken from here):

‘Batman’    = [0.9, 0.8, 0.2]
‘Joker’     = [0.8, 0.3, 0.1]
‘Spiderman’ = [0.2, 0.9, 0.8]
‘Thanos’    = [0.3, 0.1, 0.9]
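
To make the comparison idea concrete with these vectors, here is a short sketch (same cosine similarity measure as above; the vectors are the example's, and the similarity scores are just what falls out of them):

import numpy as np

words = {
    'Batman':    np.array([0.9, 0.8, 0.2]),
    'Joker':     np.array([0.8, 0.3, 0.1]),
    'Spiderman': np.array([0.2, 0.9, 0.8]),
    'Thanos':    np.array([0.3, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for name, vec in words.items():
    print(f"Batman vs {name}: {cosine_similarity(words['Batman'], vec):.2f}")

With these numbers, Joker comes out as the word most similar to Batman, and Thanos as the least similar, which matches the intuition behind the example.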

It's important to understand that these numbers may not represent anything specific. We chose to use a model with 3 features, but we don't know what those 3 features are. We just fed the words to a neural network and it spat out the vectors; it's up to us to make sense of them, if we can. (You can imagine that in a model with 300 features we would end up with vectors of length 300, so it would be very hard for a human to look at them and learn anything from them.)

So, what if we tried to make sense of the vectors we obtained? What does the first feature represent? Directly quoting the article I linked above:

  1. It seems that the 1st feature represents the belongingness to the DC Universe. See that ‘Batman’ and ‘Joker’ have higher values for their 1st feature because they do belong to DC Universe.
  2. Maybe the 2nd element in the word2vec representation here captures the hero/villain features. That’s why ‘Batman’ and ‘Spiderman’ have higher values and ‘Joker’ and ‘Thanos’ have smaller values.
  3. One might say that the 3rd component of the word vectors represents supernatural powers/abilities. We all know that ‘Batman’ and ‘Joker’ have no such superpowers and that’s why their vectors have small numbers at the 3rd position.
  • The input is a one-hot encoded vector and we take the dot product between the row vector and the matrix. We will get a vector with 300 dimensions (because the input is one-hot encoded). Now, what does the value at the i-th position of that output vector represent? More specifically, you can have 300 words in the neighbourhood of any word, but how would you find what those neighbour words are if the columns represent the number of neurons? This matters because we will use softmax on this vector to compute probabilities and we have to pick the word with the highest probability. – Khan Saab Jan 02 '20 at 21:13
  • `More specifically, you can have 300 words in the neighbourhood of any word...` That's not exactly right; the number of features isn't related to the number of neighboring words. You can choose any number you want to use as your vector length, so let's say you choose to have 3 features. Then, given a specific word, you may end up with something like [0.3, 0.9, 0.2]. Those are meant to be interpreted as coordinates in a 3-dimensional space, and aren't necessarily related to any real-world observable feature. I'll edit my answer shortly to include an example as I'm reaching the character limit – Ismael Padilla Jan 02 '20 at 21:33
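
To illustrate the point in these comments, here is a minimal sketch of the skip-gram forward pass with toy sizes and untrained random weights (the layer shapes are the standard skip-gram ones; the specific numbers are arbitrary). The 300 hidden values are activations, not words; a neighbor word only appears as an index into the softmax output, which has one entry per vocabulary word:

import numpy as np

vocab_size = 10_000  # toy vocabulary size
n_hidden = 300       # hidden neurons / features (an arbitrary choice)

rng = np.random.default_rng(0)
W1 = rng.random((vocab_size, n_hidden))  # input -> hidden (the matrix in question)
W2 = rng.random((n_hidden, vocab_size))  # hidden -> output

x = np.zeros(vocab_size)  # one-hot input for some center word
x[42] = 1.0

hidden = x @ W1       # 300 values: hidden activations, NOT neighbor words
scores = hidden @ W2  # one raw score per word in the vocabulary
probs = np.exp(scores - scores.max())
probs /= probs.sum()  # softmax: P(context word | center word), length 10,000

# The predicted neighbor is recovered from the OUTPUT layer as a word index:
print(int(np.argmax(probs)))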