I'm looking into using of the shelf compression algorithm to approximate the Kolmogorov complexity of a document corpus D and the complexity of D+d, where d is an extra document. I've got a strange result where size(compress(D)) > size(compress(D+d)). Is this possible if compress actually calculated (the uncomputable) Kolmogorov complexity of D and D+d?
1 Answers
I'm no expert on Kolmogorov complexity, but I think it could happen; here is some intuition. The $D$ and $D + d$ are represented by two strings $S(D)$ and $S(D + d)$, and $S(D)$ is a prefix of $S(D + d)$. So you're asking whether a prefix of a string can have higher Kolmogorov complexity than the string itself. I think this is the case:
Let $P$ be some randomly generated string of high complexity. Now let $X$ be the result of repeating $P$ a number of times, say 10 times. Let $Y$ a prefix of $X$ that is some characters shorter. Then a program to output $X$ would just say "repeat string $P$ 10 times", whereas the program to output $Y$ would have to say: "repeat string $P$ 10 times, and then subtract a couple of characters from the end". If $P$ is sufficiently complex then the intuition is that indeed the shortest way to generate $Y$ would be in this way, hence the program that generates $X$ is shorter than the one which generates $Y$, and the Kolmogorov complexity of $X$ would be lower than of its prefix $Y$.
- 5,265
- 21
- 39
-
1Yep... Similarly, you can have a program of constant size output all of the digits of pi, but you can also have a sequence of programs, one for each integer n, that outputs the first n digits of pi. This is a way to encode integers, so the Kolmogorov complexity of this sequence of programs must grow with log(n). – Aaron Roth Jan 13 '11 at 13:30