
In text mining, if we have computed $n$-gram counts for, say, $n = 1, \ldots, 4$, is there a principled way to combine them, other than just concatenating the tf-idf matrices for each one (which is equivalent to an unweighted sum of kernels, if we were to construct a kernel matrix for each)? For example, Google's Ngram Viewer:

http://books.google.com/ngrams/datasets

shows that they calculated everything from unigrams up to 5-grams, but they don't say how they combine them.
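(For concreteness, here is a minimal sketch of the concatenation baseline I mean, using scikit-learn's `TfidfVectorizer`; its `ngram_range` parameter builds the unigram-through-4-gram vocabulary in one pass, which amounts to stacking the per-$n$ tf-idf matrices column-wise. The toy corpus is purely a placeholder.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# ngram_range=(1, 4) puts unigrams through 4-grams into a single
# vocabulary, so the resulting tf-idf matrix is the column-wise
# concatenation of the four per-n matrices, all weighted equally
# (modulo the joint l2 row normalization applied by default).
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
X = vectorizer.fit_transform(docs)

# The linear kernel on X is then the unweighted sum of the per-n kernels.
K = (X @ X.T).toarray()
print(X.shape, K.shape)
```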

tdc

1 Answer


Not sure if this is what you're looking for, but you might want to look at Katz backoff. This entails training vanilla $n$-gram models for $1 \le n \le N$, then estimating the probability of an $n$-gram by "backing off" to the $(n-1)$-gram model whenever the $n$-gram in question was not observed more often than some frequency threshold.
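A minimal sketch of the backoff idea, in case it helps. This is not full Katz backoff: the real method uses Good-Turing discounting and a normalizing backoff weight $\alpha$ so the distribution sums to one, whereas here a fixed discount and no renormalization give a score rather than a proper probability. All names (`ngram_counts`, `backoff_score`, the threshold `k`, the `discount`) are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count all n-grams, 1 <= n <= max_n, in a token list."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def backoff_score(ngram, counts, k=1, discount=0.5):
    """Katz-style backoff estimate of P(w_n | w_1 .. w_{n-1}).

    If the full n-gram was seen more than k times, use a discounted
    relative-frequency estimate; otherwise back off to the (n-1)-gram
    model.  Full Katz backoff would use Good-Turing discounts and a
    backoff weight alpha; the fixed `discount` here is a deliberate
    simplification.
    """
    n = len(ngram)
    if n == 1:
        total = sum(counts[1].values())
        return counts[1][ngram] / total if total else 0.0
    history = ngram[:-1]
    c, hist_count = counts[n][ngram], counts[n - 1][history]
    if c > k and hist_count:
        return (c - discount) / hist_count
    return backoff_score(ngram[1:], counts, k, discount)

tokens = "the cat sat on the mat and the cat lay on the mat".split()
counts = ngram_counts(tokens, 3)
print(backoff_score(("the", "cat", "sat"), counts))  # trigram seen once: backs off
print(backoff_score(("on", "the", "mat"), counts))   # uses discounted trigram count
```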

Fred Foo
  • If possible, please provide a more self-contained answer, e.g., with a brief description of the salient points of what Katz backoff is and why it's relevant. Otherwise, this may be better placed as a comment. – cardinal Jan 13 '12 at 15:11
  • Interesting ... I'll take a look – tdc Jan 13 '12 at 16:21