I was watching a Stanford lecture on deep learning for NLP, and the lecturer pointed out a bunch of issues with co-occurrence matrices (which I assume also apply to n-gram models and any other count-based method like that). The slide that listed the issues was:
Problems with simple co-occurrence vectors
- Increase in size with vocabulary
- Very high dimensional: require a lot of storage
- Subsequent classification models have sparsity issues
$\implies$ Models are less robust
(Richard Socher, 4/1/15)
I think the first two issues sort of make sense: it seems that a co-occurrence matrix can be as big as $\binom{V}{k}$, where $V$ is the vocabulary size and $k$ is the window size. That expression is usually really big, I believe (plug in Stirling's approximation or something; it's exponential unless $k$ is really small). If it's really big, it's hard to store, update, compute predictions with, etc.
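For concreteness, here is my own toy sketch (not from the lecture) of a window-based co-occurrence count on a tiny corpus. Even here, most of the $V \times V$ cells stay zero, which I assume is the sparsity the slide is referring to:

```python
from collections import Counter

# Toy corpus and vocabulary; everything below is my own illustrative example.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
V = len(vocab)
idx = {w: i for i, w in enumerate(vocab)}
k = 2  # symmetric window size

# Count how often each pair of words appears within the same window.
counts = Counter()
for i, w in enumerate(corpus):
    for j in range(max(0, i - k), min(len(corpus), i + k + 1)):
        if j != i:
            counts[(idx[w], idx[corpus[j]])] += 1

nonzero = len(counts)
print(f"V = {V}, full matrix has V*V = {V * V} cells, "
      f"{nonzero} nonzero ({nonzero / (V * V):.0%} dense)")
```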
The issue I don't understand is sparsity. Why is it a problem if the matrix is sparse? I thought sparsity was usually a nice feature in ML (like with the $\ell_1$ norm, etc.). I also don't understand what robustness has to do with sparsity. I usually think of robustness as something like having low variance (not changing too much if the data changes). How does sparsity affect this overfitting/underfitting issue? I don't see the relation.
The only issue I could possibly see is if the model used for prediction somehow uses fractions (e.g. a PGM) and the denominator is zero, in which case numerical errors happen (like with n-grams); or if the numerator is zero, in which case saying something is impossible is usually a bit of a stretch... is that what the issue is?
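Here is a toy version of the zero-count problem I have in mind (my own hypothetical example, not something from the lecture), using a raw maximum-likelihood bigram estimate:

```python
# Hypothetical counts gathered from a tiny corpus.
bigram_counts = {("sat", "on"): 2, ("on", "the"): 2}
context_counts = {"sat": 2, "on": 2}

def p(word, context):
    # Raw maximum-likelihood estimate: count(context, word) / count(context).
    num = bigram_counts.get((context, word), 0)
    den = context_counts.get(context, 0)
    return num / den  # blows up when the context was never seen

print(p("on", "sat"))   # 1.0
print(p("mat", "on"))   # 0.0 -- the model calls this event impossible
try:
    print(p("dog", "purred"))  # unseen context -> zero denominator
except ZeroDivisionError:
    print("unseen context: division by zero")
```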
His use of the word "classifier" made me suspect that these models might use logistic regression or something... then he goes on to talk about SVD. How is SVD helping with respect to robustness? (It's clear it makes things smaller.)
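For reference, here is a minimal sketch of what I understand the SVD step to be (my own code and made-up numbers, not the lecture's): factor the sparse count matrix and keep the top-$d$ singular directions to get small, dense word vectors. I can see how this shrinks things; what I don't see is the robustness part.

```python
import numpy as np

V, d = 1000, 50
rng = np.random.default_rng(0)
# Stand-in for a real co-occurrence matrix: non-negative counts, mostly zeros.
X = rng.poisson(0.01, size=(V, V)).astype(float)

# Truncated SVD: keep only the top-d singular directions as word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :d] * S[:d]

print(np.mean(X == 0))             # raw matrix: overwhelmingly zeros
print(word_vectors.shape)          # (1000, 50): small and dense
print(np.mean(word_vectors == 0))  # essentially no zeros left
```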
In short, what is the issue with sparsity and why does it affect the robustness and generalization so much?