
In text mining, if we have computed $n$-gram counts for, say, $n = 1, \ldots, 4$, is there a principled way to combine them, other than just concatenating the tf-idf matrices for each one (which is equivalent to an unweighted sum of kernels, if we were to construct a kernel matrix for each)? For example, Google's Ngram Viewer:

http://books.google.com/ngrams/datasets

shows that they calculated everything from unigrams up to 5-grams, but they don't say how they combine them.
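(For concreteness, here is a minimal sketch of the concatenation baseline I mean, using scikit-learn's `TfidfVectorizer`; its `ngram_range` parameter builds the unigram-through-4-gram vocabulary in one pass, which amounts to stacking the per-$n$ tf-idf matrices column-wise. The toy corpus is purely a placeholder.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# ngram_range=(1, 4) puts unigrams through 4-grams into a single
# vocabulary, so the resulting tf-idf matrix is the column-wise
# concatenation of the four per-n matrices, all weighted equally
# (modulo the joint l2 row normalization applied by default).
vectorizer = TfidfVectorizer(ngram_range=(1, 4))
X = vectorizer.fit_transform(docs)

# The linear kernel on X is then the unweighted sum of the per-n kernels.
K = (X @ X.T).toarray()
print(X.shape, K.shape)
```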

tdc

1 Answer


Not sure if this is what you're looking for, but you might want to look at Katz backoff. This entails training vanilla $n$-gram models for $1 \le n \le N$, then estimating the probability of an $n$-gram by "backing off" to the $(n-1)$-gram model whenever the $n$-gram in question was not observed more often than some frequency threshold.
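A minimal sketch of the backoff idea, in case it helps. This is not full Katz backoff: the real method uses Good-Turing discounting and a normalizing backoff weight $\alpha$ so the distribution sums to one, whereas here a fixed discount and no renormalization give a score rather than a proper probability. All names (`ngram_counts`, `backoff_score`, the threshold `k`, the `discount`) are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count all n-grams, 1 <= n <= max_n, in a token list."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def backoff_score(ngram, counts, k=1, discount=0.5):
    """Katz-style backoff estimate of P(w_n | w_1 .. w_{n-1}).

    If the full n-gram was seen more than k times, use a discounted
    relative-frequency estimate; otherwise back off to the (n-1)-gram
    model.  Full Katz backoff would use Good-Turing discounts and a
    backoff weight alpha; the fixed `discount` here is a deliberate
    simplification.
    """
    n = len(ngram)
    if n == 1:
        total = sum(counts[1].values())
        return counts[1][ngram] / total if total else 0.0
    history = ngram[:-1]
    c, hist_count = counts[n][ngram], counts[n - 1][history]
    if c > k and hist_count:
        return (c - discount) / hist_count
    return backoff_score(ngram[1:], counts, k, discount)

tokens = "the cat sat on the mat and the cat lay on the mat".split()
counts = ngram_counts(tokens, 3)
print(backoff_score(("the", "cat", "sat"), counts))  # trigram seen once: backs off
print(backoff_score(("on", "the", "mat"), counts))   # uses discounted trigram count
```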

Fred Foo
  • If possible, please provide a more self-contained answer, e.g., with a brief description of the salient points of what Katz backoff is and why it's relevant. Otherwise, this may be better placed as a comment. – cardinal Jan 13 '12 at 15:11
  • Interesting ... I'll take a look – tdc Jan 13 '12 at 16:21