Short answer: they represent the number of neurons in your network. They don't necessarily correspond to a predefined attribute of the word or corpus you're working with.
Slightly more in depth: in machine learning, the word feature is more of an abstract concept. Given an item that you want to classify (words, in this scenario), you can say that the item is defined by its set of features.
So, to mention an example not related to your problem: if we were classifying a set of mushrooms and wanted to train a model to learn whether a given mushroom is poisonous or safe to eat, we could look at the things that make up the mushroom. These could be things such as cap size, cap color, and odor. Or we could choose a different set of features, such as gill size, stalk size, and stalk color. Or we could use all the features available to us (i.e. every piece of data that has been gathered about the mushrooms). You can look more into this particular example here.
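To make that concrete, one such choice of features might look like this in code (every name and value here is hypothetical, purely to illustrate "an item described by a set of features"):

```python
# One possible feature set for a hypothetical mushroom...
mushroom = {
    "cap_size_cm": 5.0,
    "cap_color":   "red",     # categorical values would be encoded numerically
    "odor":        "almond",
}

# ...and a different, equally valid choice of features for the same mushroom.
mushroom_alt = {
    "gill_size":     "broad",
    "stalk_size_cm": 7.5,
    "stalk_color":   "white",
}
```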
In some cases, we may not even know what features define the objects we want to classify. How do we know whether we are supposed to consider this or that set of features? Well, we could take a different approach, and train a model to learn what the features are.
This approach is relevant here. Given a word, how do we know how to predict that word's neighbors? What attributes of the word, or of the word's context, are relevant to us? How can we even learn what those attributes are? Well, we could set up a neural network with a large enough set of neurons (300 should do the trick!) and train that network. This is what we're doing here: the nodes in our network don't necessarily correspond to a predefined attribute of the word or corpus you're working with. We don't know what the relevant specific features are; that's one of the reasons why we are using a neural network in the first place. We want the network to do the dirty work for us and learn what the features are.
Our end goal is that, given a word, we end up with a vector associated with that word. That vector should be interpreted as coordinates in an n-dimensional space, where n is the number of features we chose to use. Note that how many features to consider, i.e. how many neurons to have in our network, is a choice. We can use 300 neurons, but we could use any arbitrary number we want, such as 3, 100, or even 1000.
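As a rough sketch of what that choice looks like in practice, here is how you might train such a model with gensim (the toy corpus, and every parameter except `vector_size`, are just illustrative placeholders):

```python
from gensim.models import Word2Vec

# A hypothetical toy corpus: a list of tokenized sentences.
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

# vector_size is exactly the choice discussed above: the number of
# features (neurons) per word. 300 is common, but 3 or 1000 work too.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1)

vector = model.wv["paris"]  # the learned feature vector for "paris"
print(vector.shape)         # (300,)
```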
So... Why are we doing this in the first place?
- Well, first of all, if our corpus has 10,000 different words, then any given word is described by a vector of length 10,000 (one entry per word in the vocabulary). Not very efficient!
- Secondly, remember that in the end we want to have something that we can use to compare words. We would like to know whether any two words (say, France and Paris) are similar, and if possible, have a measure of how similar they are. Placing all of our words in a vector space is very convenient, because then we can just measure the distance between two points to find out how similar any two words are (as sketched below).
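To make the distance idea concrete, here is a minimal sketch that reuses the hypothetical `model` from the snippet above (gensim computes cosine similarity between word vectors directly):

```python
# Cosine similarity between two in-vocabulary words; values closer to 1.0
# mean the two vectors point in roughly the same direction.
print(model.wv.similarity("france", "paris"))

# We can also rank the nearest neighbours of a word in the vector space.
print(model.wv.most_similar("paris", topn=3))
```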
So, having said all that, let's come up with an example. Let's say that we want to use a 3-dimensional space to represent our words. That means that we choose (again, this is a choice) to have 3 features, and we want to map every word in our corpus to a point in a 3-dimensional space. If we had a corpus with words such as Batman, Joker, Spiderman, and Thanos, we could end up with something like this (this example is taken from here):
‘Batman’ = [0.9, 0.8, 0.2]
‘Joker’ = [0.8, 0.3, 0.1]
‘Spiderman’ = [0.2, 0.9, 0.8]
‘Thanos’ = [0.3, 0.1, 0.9]
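As a quick sanity check on these toy numbers, here is a small numpy sketch (the vectors are copied from the list above; the cosine helper is ours, not from the article):

```python
import numpy as np

words = {
    "Batman":    np.array([0.9, 0.8, 0.2]),
    "Joker":     np.array([0.8, 0.3, 0.1]),
    "Spiderman": np.array([0.2, 0.9, 0.8]),
    "Thanos":    np.array([0.3, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in exactly the same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for other in ("Joker", "Spiderman", "Thanos"):
    print("Batman vs", other, "=",
          round(cosine_similarity(words["Batman"], words[other]), 2))
```

With these numbers, Batman comes out closest to Joker (about 0.93) and furthest from Thanos (about 0.46), which lines up nicely with the interpretation below.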
It's important to understand that these numbers may not represent anything specific. We chose to use a model with 3 features, but we don't know what those 3 features are. We just fed the words to a neural network, and it spat out the vectors; it's up to us to make sense of them, if we can (you can imagine that in a model with 300 features, we would end up with vectors of length 300, so it would be very hard for a human to look at them and learn something from them).
So, what if we tried to make sense of the vectors we obtained? What does the first feature represent? Directly quoting the article I linked above:
- It seems that the 1st feature represents the belongingness to the DC Universe. See that ‘Batman’ and ‘Joker’ have higher values for their 1st feature because they do belong to the DC Universe.
- Maybe the 2nd element in the word2vec representation here captures the hero/villain features. That’s why ‘Batman’ and ‘Spiderman’ have higher values and ‘Joker’ and ‘Thanos’ have smaller values.
- One might say that the 3rd component of the word vectors represents the supernatural powers/abilities. We all know that ‘Batman’ and ‘Joker’ have no such superpowers, and that’s why their vectors have small numbers in the 3rd position.
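If you want to eyeball such interpretations yourself, one crude way is to rank the words by each vector component (this reuses the `words` dict from the numpy snippet above; the feature labels are only the article's guesses, not anything the model knows about):

```python
# Rank the words by each vector component; the labels are speculative.
for i, guess in enumerate(["DC Universe?", "hero?", "superpowers?"]):
    ranking = sorted(words, key=lambda w: words[w][i], reverse=True)
    print(f"feature {i} ({guess}):", ranking)
```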