Why is it a big problem to have sparsity issues in Natural Language processing (NLP)?

Question

I was watching a stanford lecture on deep learning with NLP and he pointed out a bunch of issues with co-ocurrance matrices (which I assume apply to n-gram models and any other count method like that). The slide that pointed out the issues were:

Problems with simple co-ocurrence vectors

Increase in size with vocabulary

Very high dimensional: require a lot of storage

Subsequent classification models have sparsity issues

$\implies$ Models are less robust

(Richard Socher, 4/1/15)

I think the first two issues sort of make sense, it seems that computing the a co-occurance matrix can be as big as $$ C^V_k$$ where V is the vocabulary and k is the window. That expression is usually really big I believe (plug in stirlings or something, its exponential I believe unless k is really small). If its really big its hard to store, updated, compute predictions etc.

The issue I don't understand is the Sparsity. Why is it a problem if its sparse, I thought sparsity was usually a nice feature in ML (like l1 norm etc). I also don't understand what robustness has to do with sparsity. I usually think of robustness sort of like having low variance (not changing to much if the data changes). How is sparsity affecting this overfitting/underfitting issue? I don't see the relation.

The only issue I could possibly see is if the model for predicting somehow uses fractions (i.e. a PGM) and the the denominator is zero then numerical errors happen (like with n-grams), or if the numerator is zero, then saying something is impossible is usually a bit of a stretch...is that what the issue is?

His use of the word classifier made me suspect that these models might use logistic regression or something...then he goes on to talk about SVD. How is SVD helping wrt to robustness? (its clear it makes things smaller).

In short, what is the issue with sparsity and why does it affect the robustness and generalization so much?

As its fairly easy to find more elaborate answers, here just a quick-and-dirty explanation: if for most combinations of class, feature pairs you have very little or, more likely, zero observations, how do you factor that in statistically? Zero probabilities make everything else zero, too, and is the difference between, say 5 and 10 observations of a feature, class combination really that significant or just a sampling quirk? (Note that by "feature" I mean the typical word, n-gram, token, or something like that.) — fnl, Apr 20 '17 at 09:30
@fnl maybe I don't appreciate the issue, but whats wrong with 5-10 data points being the difference? Isn't language supposed to be sort of subtle and small difference should matter? — Charlie Parker, Jan 23 '18 at 14:28
If you happen to have a uniquely identifying keyword where the difference in label P is significant, sure. But generally, being dependent on keywords, not semantics, is unwanted. — fnl, Jan 23 '18 at 19:29
@fnl but why is that unwanted? Sorry Im having difficulties understanding. — Charlie Parker, Jan 24 '18 at 04:08
No problem - that's what this site is for! The point being that a keyword might be a bad indicator of the label. Take sentiment classification: in "how lovely, {brand}, that's it, really!", the adjective love is clearly positive. But in "oh my lovely {brand}, really, that's it?", the expression is negative. The classifier must understand more than keywords to truly succeed at this task. — fnl, Jan 24 '18 at 22:27

Aaron · Answer 1 · 2017-05-09T04:10:54.360

There are two kinds of sparsity: data sparsity and model sparsity. Model sparsity can be good because it means that there is a concise explanation for the effect that we are modeling. Data sparsity is usually bad because it means that we are missing information that might be important. That slide is talking about data sparsity.

Using the SVD to compress the matrix gives us dense low-dimensional vectors for each word. This is a way of sharing information between similar words to help deal with the data sparsity. Another thing that people sometimes do to deal with sparsity is to use sub-word units instead of words or to use stemming or lexicalization to reduce the vocabulary size.

Data sparsity is more of an issue in NLP than in other machine learning fields because we typically deal with large vocabularies where it is impossible to have enough data to actually observe examples of all the things that people can say. There will be many real phrases that we will just never see in the training data.

sorry for being dense can you explain a bit more why this is data sparsity and not model sparsity? Isn't one-hot vectors a model choice that we did? — Charlie Parker, May 06 '17 at 15:53
Data sparsity is caused by items that are unobserved in the training data. Very common in NLP because the input space has a very high dimensionality. Model sparsity comes about because we use a concise explanation of the effect that we are modeling. Model sparsity is be design and data sparsity is a limitation. — Aaron, May 09 '17 at 04:13

score 2 · Answer 2 · answered Jun 20 '17 at 03:39

A 1 hot word vector in a large dictionary or a co-occurrence matrix in a large corpus are both usually very sparse matrices. This is a problem because a one hot vector, [0 0 1 0 .. 0] vector cannot compute relations to another one hot vector, i.e. dot product. So this would cause two words such as hotel and motel to have no mathematical relations with one another if all of the word vectors were one hot vectors.

A co-occurrence matrix will solve this problem of relations, because hotel and motel will have similar co-occurrence results in a large corpus. But this matrix is too large in size, and computationally is inefficient to store in memory. For a corpus with a dictionary of 100,000 unique tokens, this is a (100,000 x 100,000) matrix which is 10 billion numbers.

Using something like word2vec or glove, you solve both of these problems. Word2vec uses a feed forward neural network with a single hidden layer. We will end up using the weights from the hidden layer as the word vector representations. The length of vectors is a hyperparameter, (100, 300, etc.). This way words such as hotel and motel can do vector operations to find the similarity or closeness to another vector. Also, this reduces the size of each word vector from 100,000 from the example above to 300 or 100, depending on the dimension of the word vector set.

score 0 · Answer 3 · answered Jan 23 '18 at 05:53

0

In my opinion, the slide here is just stating that the co-occurrence matrix will have a lot of zeros in it. This is a bad thing because it is very CPU- and memory-inefficient. Imagine having to compute 0 times something 90% of the time and always get 0 as the answer. Also imagine having to store a lot of 0s in order to encode this matrix in memory.

answered Jan 23 '18 at 05:53

Michael Ma

109

I can see how its physically inefficient but not statistically inefficient (to have lots of zeros/sparsity). If anything knowing this should help guide better design of models. – Charlie Parker Jan 23 '18 at 14:28

Why is it a big problem to have sparsity issues in Natural Language processing (NLP)?

3 Answers3

Linked