
Let $s$ denote a string over a finite alphabet, $n_s = |s|$ be the length of $s$, and $n_s^{*}$ denote the minimum description size of $s$ under a given computational model (TM, CFG, etc.). Are there compression algorithms that, when restricted to input $s$ where $n_s / n_s^*$ is large, achieve provably tighter approximations than algorithms with unrestricted input?

In other words, can a compression algorithm specifically tailored for low-complexity inputs outperform the best general-input algorithm on inputs of low complexity?
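To make the ratio $n_s / n_s^*$ concrete, here is a small illustration (not part of the question itself), using zlib as a crude stand-in for a description method: a highly regular string compresses to a tiny fraction of its length, while pseudorandom data barely compresses at all.

```python
import os
import zlib

# A string generated by a tiny program: n_s = 100000 but n_s^* is O(1),
# so n_s / n_s^* is large.
low_complexity = b"ab" * 50_000
# Pseudorandom bytes are incompressible with high probability: n_s^* ~ n_s.
high_complexity = os.urandom(100_000)

for name, s in [("low", low_complexity), ("high", high_complexity)]:
    c = zlib.compress(s, 9)
    print(f"{name}: |s| = {len(s)}, |compressed| = {len(c)}")
```

Of course zlib's output length only upper-bounds the true description size, but the gap between the two cases is exactly the regime the question asks about.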

Admittedly, the question isn't completely precise, but I'm hoping to find results of roughly the same flavor.

Edit: See domotorp's answer below for a proof of the impossibility of a large class of low-complexity compression algorithms under the TM model. Note that this result does not bar the possibility of compression algorithms where the function $A$ (see the answer for context) is non-computable. For example, setting $A(|x|) = 100 \cdot BB^{-1}(|x|)$, where $BB^{-1}$ is the inverse busy beaver function, could make for a useful compression guarantee and is not covered by the proof.

Jack G.
  • I expect the answer to be "no", because any such algorithm would have to synthesize Turing machines from arbitrary strings (think, e.g., of a TM that produces an infinite string of data using a pseudorandom generator: since the TM has $O(1)$ size and the text is unbounded, any reasonable "provably tight approximation" should involve an $O(1)$ output). – Andrea Farruggia Jan 19 '17 at 18:04
  • 1
    I agree with your hunch, but notice what I'm looking for is weaker than a provably tight approximation: I'm only looking for an improvement over the best general input algorithm. For the case of Turing Machines, the worst case is achieved with pseudorandom strings where $n^$ is $O(1)$ but the size of the compressed string is $O(n)$. An algorithm that returns a TM of size, say, $O(\sqrt{n})$ when given strings where $n^$ is $O(1)$ would provide a tighter approximation than the best general input algorithm. – Jack G. Jan 19 '17 at 22:16

1 Answer


No, at least not in the TM model, i.e., for Kolmogorov complexity. The key fact (Li and Vitányi, Theorem 2.7.1) is: given $g(x) \leq f(x) \leq |x|$ for all $x$, where $g$ is monotonic and unbounded and $f$ is recursive, it is not possible to recursively enumerate an infinite subset of $\{ x : K(x) > f(x)\}$.

Suppose for the sake of contradiction there exists a recursive function $h$, the low-complexity compression algorithm, which on input $x$ outputs a TM that reproduces $x$ and has the following additional property: Let $A:\mathbb{N} \rightarrow \mathbb{N}$ and $B:\mathbb{N} \rightarrow \mathbb{N}$ be given such that both are recursive, $A(|x|) \leq B(|x|) \leq |x|$ for all $x$, and $A$ is unbounded. Then $\forall x, K(x) \leq A(|x|) \implies |h(x)| \leq B(|x|)$.

Now apply the theorem with $f(x) = A(|x|)$. First, there are infinitely many $x$ such that $K(x) > B(|x|)$, since $B(|x|) \leq |x|$ and, by a counting argument, incompressible strings of every length exist. For any such $x$ we have $|h(x)| \geq K(x) > B(|x|)$, since $h(x)$ is itself a description of $x$. Second, for every string $x$, the contrapositive of the assumed property gives $|h(x)| > B(|x|) \implies K(x) > A(|x|)$. Therefore, by running $h$ on every string in turn and outputting each $x$ with $|h(x)| > B(|x|)$, we recursively enumerate an infinite subset of $\{x: K(x) > A(|x|)\}$, a contradiction.
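The enumeration procedure at the heart of the contradiction can be sketched as follows. The functions $h$ and $B$ are the hypothetical objects from the proof; they are stubbed here (with zlib and $n/2$) purely so the sketch runs, since the proof shows no genuine such $h$ can exist.

```python
import zlib

def h(x: bytes) -> bytes:
    """Stand-in for the hypothetical low-complexity compressor;
    only the output length matters for the argument."""
    return zlib.compress(x, 9)

def B(n: int) -> int:
    """Stand-in for the recursive bound, with B(n) <= n."""
    return n // 2

def enumerate_incompressible(strings):
    """Yield each x with |h(x)| > B(|x|). Under the assumed property
    of h, every such x satisfies K(x) > A(|x|), so this would be an
    r.e. infinite subset of {x : K(x) > A(|x|)} -- forbidden by the theorem."""
    for x in strings:
        if len(h(x)) > B(len(x)):
            yield x
```

Feeding this procedure an enumeration of all strings is exactly the construction the proof rules out; with the zlib stub it merely separates well-compressed inputs from poorly compressed ones.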

domotorp
  • 3
    Yes, finding or approximating $n^*$ is undecidable in the TM model, but it's not obvious to me that this would exclude the possibility of an algorithm that performs well on low complexity inputs without "knowing" whether each input is in fact low-complexity. – Jack G. Jan 19 '17 at 16:28
  • Suppose that the algorithm performs well on low-complexity input. Run it on any input. If it is a low-complexity input, it will compress it. If it is a high-complexity input, it cannot be compressed. So your algorithm would solve my undecidable problem. – domotorp Jan 19 '17 at 20:03
  • 1
    @domotorp: the algorithm might just perform poorly on the high-complexity input. The point is that you can't know if its output is a tight approximation of the Kolmogorov complexity or not. – Andrea Farruggia Jan 19 '17 at 20:30
  • @domotorp I think it's a bit more slippery than that. For example, the hypothetical low-complexity compression algorithm would not be able to decide whether $K(x)$ is smaller or larger than $|x|$. Let's say the algorithm returns a TM of size $O(\sqrt{|x|})$ whenever $K(x)$ is $O(1)$. Then for sufficiently large $|x|$ the algorithm could differentiate between $K(x) = O(1)$ and $K(x) = O(|x|)$ strings. But this is much weaker than differentiating $K(x) < |x|$ from $K(x) \geq |x|$. – Jack G. Jan 20 '17 at 02:03
  • 2
    $K(x)=O(1)$ for only finitely many strings, so it doesn't make much sense to study this problem. If, however, you replace $O(1)$ with any function $f$ that tends to $\infty$, as I wrote in my answer, then you get an undecidable problem that is much weaker than what you wrote in the last line of your last comment. – domotorp Jan 20 '17 at 08:04
  • @domotorp That's a good point. Do you have a reference for the undecidability of $K(x) \leq f(|x|)$ where $f$ tends to infinity? I'm wondering if the result holds for the inverse busy beaver function, which does tend to infinity, but slower than any computable function. In order to make your point, I think the result would need to hold for the inverse busy beaver function. – Jack G. Jan 20 '17 at 14:55
  • This can be found in most textbooks, like Li, Ming, Vitányi, Paul M.B.; An Introduction to Kolmogorov Complexity and Its Applications. I can also recommend for further reading this brilliant question: http://cstheory.stackexchange.com/questions/31282/can-we-not-output-the-kolmogorov-complexity – domotorp Jan 20 '17 at 19:10
  • Here's the result from Thm. 2.7.1 from the 2nd edition: Let $g(x)$ be unbounded and monotonic. Let $f(x)$ be a recursive function such that $g(x) \leq f(x) \leq |x|$ for all $x$. Then the set $\{ x : K(x) > f(x) \}$ is infinite but does not contain an infinite recursively enumerable subset. – Jack G. Jan 23 '17 at 04:21
  • Now suppose there exists a recursive function $g$ that, on input $x$, produces a TM that reproduces $x$ and, furthermore, $\forall x, K(x) \leq |x|^{0.1} \implies |g(x)| \leq |x|^{0.2}$ ... – Jack G. Jan 23 '17 at 04:50
  • I can't prove the consistency of such a $g$ with the above result, but I can show how the most straightforward proof of inconsistency fails: Suppose $g$ exists and see if you can enumerate an infinite number of strings $x$ such that $K(x) > f(x)$ for any appropriate $f$ by computing $g(x)$. If on input $x$, $|g(x)| \leq |x|^{0.2}$, then we have no suitable $f$ as a lower bound for $K(x)$. Otherwise, if $|g(x)| > |x|^{0.2}$, we know $K(x) \leq |g(x)|$, which does not determine $K(x)$ relative to $|x|^{0.2}$, the obvious choice of $f$. – Jack G. Jan 23 '17 at 04:50
  • 1
    I don't understand why this wouldn't work. If $K(x)>|x|^{0.2}$, then $|g(x)|>|x|^{0.2}$. If $|g(x)|>|x|^{0.2}$, then $K(x)>|x|^{0.1}$. Therefore for the obvious choice of $f(x)=|x|^{0.1}$ we get a contradiction. – domotorp Jan 23 '17 at 08:19
  • Yes, that's convincing, thank you! I edited your answer to reflect a generalized version of this contradiction. I'll accept it once it is reviewed. – Jack G. Jan 23 '17 at 16:33