I've been reading about word2vec and its ability to encode words as vector representations. The vectors of words that usually appear in similar contexts end up clustered close together.
For example, if we have 10,000 unique words in our dataset, we feed the network 10,000 input features (a one-hot encoding of the current word). The network can have, say, 300 hidden neurons, each with a linear activation function.
The output layer has 10,000 neurons followed by a softmax, so each "softmaxed output" represents the probability of picking the corresponding word from our 10k dictionary.
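To make sure I'm describing the setup correctly, here is a minimal sketch of that architecture. PyTorch is just my choice for illustration, and the sizes are the ones from above; this isn't meant to be the exact word2vec training code.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # 10k unique words, one-hot encoded on the input side
EMBED_DIM = 300       # hidden layer size from the description above

class Word2VecNet(nn.Module):
    """One-hot input -> 300 linear hidden units -> 10k softmax outputs."""
    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM):
        super().__init__()
        self.hidden = nn.Linear(vocab_size, embed_dim, bias=False)  # linear activation
        self.output = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, one_hot_batch):
        h = self.hidden(one_hot_batch)        # the 300-dimensional "encoding"
        logits = self.output(h)               # 10k scores
        return torch.softmax(logits, dim=-1)  # probability of each dictionary word

# toy usage: a batch of 4 one-hot word vectors
x = torch.zeros(4, VOCAB_SIZE)
x[torch.arange(4), torch.tensor([1, 42, 7, 999])] = 1.0
probs = Word2VecNet()(x)                      # shape (4, 10000), each row sums to 1
```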

Question 1: I can see the benefit of the "semantically grouped" words in that final softmaxed output layer. However, it seems far too large: it still has a dimension of 10,000 probabilities.
Can I instead use the 300-neuron hidden result in other networks of mine, say in my LSTM, etc.?
Then the weights of my LSTM wouldn't take up so much space on disk. However, I would need to decode the 300-dimensional state back into my 10,000-dimensional one so I can look the word up, roughly as sketched below.
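This is roughly what I have in mind, again just a sketch with made-up names: the embedding matrix is a random placeholder standing in for the trained word2vec hidden-layer weights, and the nearest-neighbour decoding is my own guess at how the lookup back into the 10k dictionary could work.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, LSTM_HIDDEN = 10_000, 300, 512

# Placeholder: in practice this would be the trained word2vec weight matrix.
embedding_matrix = torch.randn(VOCAB_SIZE, EMBED_DIM)

lstm = nn.LSTM(input_size=EMBED_DIM, hidden_size=LSTM_HIDDEN, batch_first=True)
to_embed = nn.Linear(LSTM_HIDDEN, EMBED_DIM)  # project LSTM states back to embedding space

def decode_to_word_ids(vectors):
    """Nearest-neighbour lookup: cosine similarity against the 10k embeddings."""
    vecs = nn.functional.normalize(vectors, dim=-1)
    table = nn.functional.normalize(embedding_matrix, dim=-1)
    return (vecs @ table.T).argmax(dim=-1)    # indices into the 10k dictionary

# toy sequence of 5 word ids, fed as 300-d vectors instead of one-hot vectors
word_ids = torch.tensor([[3, 17, 256, 9, 4000]])
inputs = embedding_matrix[word_ids]           # shape (1, 5, 300)
outputs, _ = lstm(inputs)                     # shape (1, 5, 512)
predicted_ids = decode_to_word_ids(to_embed(outputs))  # shape (1, 5)
```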
Question 2: Are the "encoded" vectors actually sitting in the hidden layer or in the output layer?