I've been reading about word2vec and its ability to encode words as vector representations. The vectors of words that usually appear in similar contexts end up clustered close together.
For example, if we have 10,000 unique words in our dataset, we feed the network 10,000 input features (a one-hot encoding of the current word). The network can have, say, 300 hidden neurons, each with a linear activation function.
The output layer has 10,000 neurons followed by a softmax, so each "softmaxed output" represents the probability of picking the corresponding word from our 10k dictionary.
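To make sure I'm describing the setup correctly, here is a minimal sketch of that architecture. PyTorch is just my choice for illustration, and the sizes are the ones from above; this isn't meant to be the exact word2vec training code.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # 10k unique words, one-hot encoded on the input side
EMBED_DIM = 300       # hidden layer size from the description above

class Word2VecNet(nn.Module):
    """One-hot input -> 300 linear hidden units -> 10k softmax outputs."""
    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM):
        super().__init__()
        self.hidden = nn.Linear(vocab_size, embed_dim, bias=False)  # linear activation
        self.output = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, one_hot_batch):
        h = self.hidden(one_hot_batch)        # the 300-dimensional "encoding"
        logits = self.output(h)               # 10k scores
        return torch.softmax(logits, dim=-1)  # probability of each dictionary word

# toy usage: a batch of 4 one-hot word vectors
x = torch.zeros(4, VOCAB_SIZE)
x[torch.arange(4), torch.tensor([1, 42, 7, 999])] = 1.0
probs = Word2VecNet()(x)                      # shape (4, 10000), each row sums to 1
```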

Question 1: I can see the benefit of the "semantically grouped" words in that final softmaxed output layer. However, it seems far too large: it still has a dimension of 10,000 probabilities.
Can I instead use the 300-neuron hidden result in other networks of mine, say in my LSTM, etc.?
Then the weights of my LSTM wouldn't take up so much space on disk. However, I would need to decode the 300-dimensional state back into my 10,000-dimensional one so I can look the word up, roughly as sketched below.
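This is roughly what I have in mind, again just a sketch with made-up names: the embedding matrix is a random placeholder standing in for the trained word2vec hidden-layer weights, and the nearest-neighbour decoding is my own guess at how the lookup back into the 10k dictionary could work.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, LSTM_HIDDEN = 10_000, 300, 512

# Placeholder: in practice this would be the trained word2vec weight matrix.
embedding_matrix = torch.randn(VOCAB_SIZE, EMBED_DIM)

lstm = nn.LSTM(input_size=EMBED_DIM, hidden_size=LSTM_HIDDEN, batch_first=True)
to_embed = nn.Linear(LSTM_HIDDEN, EMBED_DIM)  # project LSTM states back to embedding space

def decode_to_word_ids(vectors):
    """Nearest-neighbour lookup: cosine similarity against the 10k embeddings."""
    vecs = nn.functional.normalize(vectors, dim=-1)
    table = nn.functional.normalize(embedding_matrix, dim=-1)
    return (vecs @ table.T).argmax(dim=-1)    # indices into the 10k dictionary

# toy sequence of 5 word ids, fed as 300-d vectors instead of one-hot vectors
word_ids = torch.tensor([[3, 17, 256, 9, 4000]])
inputs = embedding_matrix[word_ids]           # shape (1, 5, 300)
outputs, _ = lstm(inputs)                     # shape (1, 5, 512)
predicted_ids = decode_to_word_ids(to_embed(outputs))  # shape (1, 5)
```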
Question 2: Are the "encoded" vectors actually sitting in the hidden layer or in the output layer?