As Steffen pointed out, the example matrix encodes the number of times a word appears in a text. The position of each count in the matrix is given by the word (its column) and by the text (its row).
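For concreteness, here is a minimal sketch of that dictionary-based encoding in Python (the texts are made-up examples):

```python
# Dictionary-based encoding: build the vocabulary first, then count.
texts = ["john likes movies", "mary likes movies too"]

# Every distinct word gets its own column index.
vocab = {}
for text in texts:
    for word in text.split():
        vocab.setdefault(word, len(vocab))

# rows = texts, columns = words; each cell counts occurrences.
matrix = [[0] * len(vocab) for _ in texts]
for row, text in enumerate(texts):
    for word in text.split():
        matrix[row][vocab[word]] += 1
```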
Now, the hashing trick works the same way, except that you don't have to define a dictionary mapping each word to its column position beforehand.
Instead, the hashing function determines both the range of possible column positions (its output is bounded by a minimum and a maximum value) and the exact column for the word you want to encode. For example, if the word "likes" is hashed to the number 5674, then column 5674 will hold the counts for the word "likes".
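A minimal sketch of that word-to-column mapping, assuming MD5 as the hashing function and 2**20 columns (both arbitrary choices; the 5674 above is just an illustration):

```python
import hashlib

N_COLUMNS = 2 ** 20  # fixes the range of possible column positions

def column_of(word):
    # Stable hash of the word into a column index. (Python's built-in
    # hash() is salted per process, so a cryptographic digest is used.)
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_COLUMNS

print(column_of("likes"))  # always the same column in [0, N_COLUMNS)
```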
This way you don't need to build a dictionary before analyzing the text. If you use a sparse matrix as your text matrix, you don't even have to worry about the exact matrix size in advance. Just by scanning the text, on the fly, you convert words into column positions with the hashing function, and the text matrix gets populated with data (frequencies, for example) according to which document you are currently analyzing (the row position).
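Here is a minimal sketch of that on-the-fly population, assuming scipy is available and reusing the MD5-based column_of() from above (scipy does ask for a declared shape, but the huge column count costs nothing because the matrix is sparse):

```python
import hashlib
from scipy.sparse import lil_matrix

N_COLUMNS = 2 ** 20  # range of the hashing function, as above

def column_of(word):
    # Stable hash of the word into a column index.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % N_COLUMNS

texts = ["john likes movies", "mary likes movies too"]  # made-up documents
matrix = lil_matrix((len(texts), N_COLUMNS), dtype=int)

for row, text in enumerate(texts):            # row position = document
    for word in text.split():
        matrix[row, column_of(word)] += 1     # column position = hashed word
```

This is essentially what scikit-learn's HashingVectorizer does, with some extra refinements (e.g. a signed hash to mitigate collisions).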