
We know (see, e.g., Theorems 1 and 3 of [1]) that, roughly speaking and under suitable conditions, functions that can be computed by a Turing machine in polynomial time ("efficiently computable") can be expressed by neural networks of reasonable (polynomial) size, and thus can be learned with polynomial sample complexity ("learnable") under any input distribution.

Here "learnable" only concerns the sample complexity, regardless of the computational complexity.

I am wondering about a closely related question: does there exist a function that cannot be computed by a Turing machine in polynomial time ("not efficiently computable") but that can nevertheless be learned with polynomial sample complexity ("learnable") under any input distribution?

Minkov
  • I take issue with "and thus can be learned". There are very efficiently computable functions (say, DFAs) that are VERY difficult to learn, even approximately. – Aryeh Oct 05 '17 at 17:41
  • @Aryeh Yes, you are right. But if I understand correctly, the difficulty in learning DFAs lies in the computational complexity. Given unbounded computational power, could we still learn them with polynomially many samples via ERM? – Minkov Oct 05 '17 at 18:23
  • Yes, of course -- just find the smallest DFA consistent with the sample (for the realizable case). – Aryeh Oct 05 '17 at 21:02 [a brute-force sketch of this ERM rule appears below, after the comment thread]
  • Good question, and more subtle than appears at first. – Aryeh Oct 05 '17 at 23:19
  • @Aryeh Thanks for pointing out the example of DFA! – Minkov Oct 06 '17 at 00:34
  • This is probably missing the point, but what about the class of (say) $2^{-\sqrt{n}}$-biased Boolean functions? (I.e., more or less, a random function with each value being independently $1$ with probability $2^{-\sqrt{n}}$). For any $\varepsilon > 2^{-\sqrt{n}}$, PAC-learning under the uniform distribution is trivial (0 samples needed, the constant function $0$ is a good hypothesis), but it seems like any evaluation algorithm would need to spend superpolynomial time (as there is no structure to the function). I'm most likely misunderstanding the question, though. – Clement C. Oct 06 '17 at 20:17
  • Your terminology is a bit confusing. When we say “efficiently learnable,” we usually refer to computational efficiency. Just saying “learnable” is sufficient to imply sample efficiency. – Lev Reyzin Oct 07 '17 at 13:08
  • @ClementC. Good point! I think we need to impose the same requirement on the approximation accuracy of computation as for learning. In your example, if we only require $\epsilon$ approximation accuracy of computation, then the evaluation algorithm can directly return $0$ as the answer without any computation. – Minkov Oct 07 '17 at 21:36
  • @LevReyzin Thanks for pointing out. I have revised the question accordingly. – Minkov Oct 07 '17 at 21:36
  • @Minkov To PAC learn, you should learn with respect to any distribution. Otherwise the question is not interesting (as Clement points out). – Lev Reyzin Oct 07 '17 at 21:40
  • @LevReyzin You are absolutely right. But on the computation side, what would be a more natural way to impose the input distribution? – Minkov Oct 07 '17 at 23:07
  • You don't impose a distribution, you try to prove it for all distributions. – Lev Reyzin Oct 08 '17 at 01:47
  • Why are people voting to close? I think this is a deep and subtle question! – Aryeh Oct 09 '17 at 10:26
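
To make the ERM remark in the comments concrete (see the DFA exchange above), here is a minimal, purely illustrative Python sketch, not part of the original thread, of "find the smallest DFA consistent with the sample". The helper names (`smallest_consistent_dfa`, `dfa_accepts`) and the brute-force enumeration are assumptions made for illustration only: the rule is statistically sound in the realizable case but runs in exponential time, which is exactly the sample-complexity versus computational-complexity gap the comments point at.

```python
from itertools import product

def dfa_accepts(delta, accepting, word):
    """Run a DFA (transition table delta, set of accepting states) on word."""
    state = 0  # by convention, state 0 is the start state
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting

def smallest_consistent_dfa(sample, alphabet=(0, 1), max_states=4):
    """Brute-force ERM: return (k, delta, accepting) for a smallest DFA
    consistent with the labeled sample, or None if none has <= max_states states."""
    for k in range(1, max_states + 1):
        keys = [(q, a) for q in range(k) for a in alphabet]
        for targets in product(range(k), repeat=len(keys)):   # every transition table
            delta = dict(zip(keys, targets))
            for bits in product((False, True), repeat=k):     # every accepting set
                accepting = {q for q in range(k) if bits[q]}
                if all(dfa_accepts(delta, accepting, w) == bool(y) for w, y in sample):
                    return k, delta, accepting
    return None

# Toy usage: label = 1 iff the word has even length.
sample = [((0,), 0), ((1, 1), 1), ((0, 1, 0), 0), ((), 1)]
print(smallest_consistent_dfa(sample))  # a smallest (2-state) DFA consistent with the sample
```

Occam-style bounds give this rule polynomial sample complexity relative to the size of the target DFA; its running time is, of course, another matter.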

1 Answer


I will formalize a variant of this question where "efficiency" is replaced by "computability".

Let $C_n$ be the concept class of all languages $L\subseteq\Sigma^*$ recognizable by Turing machines with $n$ states or fewer. In general, for $x\in\Sigma^*$ and $f\in C_n$, the problem of evaluating $f(x)$ is undecidable.

However, suppose we have access to a (proper, realizable) PAC-learning oracle $A$ for $C_n$. That is, for any $\epsilon,\delta>0$, the oracle requests a labeled sample of size $m_0(n,\epsilon,\delta)$ such that, assuming the sample was drawn i.i.d. from an unknown distribution $D$, the oracle $A$ outputs a hypothesis $\hat f\in C_n$ which, with probability at least $1-\delta$, has $D$-generalization error at most $\epsilon$. We will show that this oracle is not Turing-computable.

In fact, we will show that a simpler problem is undecidable: that of determining, given a labeled sample $S$, whether there exists an $f\in C_n$ consistent with $S$ (the consistency problem). Suppose, toward a contradiction, that $K$ is a Turing machine that decides the consistency problem.

We adopt the following notational conventions. Identify $\Sigma^*$ with $\mathbb{N}=\{0,1,2,\ldots\}$ via the usual lexicographic ordering. For $x\in\{0,1\}^*$, we say that a TM $M$ "S-prints" $x$ if it accepts all of the strings in $\Sigma^*$ corresponding to the indices $i$ such that $x_i=1$ and does not accept (possibly by not halting) any of the strings corresponding to the indices $i$ such that $x_i=0$. Since (by assumption) $K$ decides the consistency problem, the function $\tilde K:x\mapsto k$, defined as the smallest $k$ such that some TM in $C_k$ S-prints $x$, is Turing-computable: for $k=1,2,\ldots$, query $K$ with the class $C_k$ and the sample that labels the $i$-th string of $\Sigma^*$ by $x_i$. It further follows that the function $g:k\mapsto x$, which maps $k\in\mathbb{N}$ to the lexicographically least string $x\in\{0,1\}^*$ such that $\tilde K(x)>k$, is also computable (such an $x$ exists because only finitely many TMs have at most $k$ states, so for sufficiently large lengths some string of that length is S-printed by none of them).
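
To make these two steps explicit, here is a schematic Python sketch (mine, not part of the original answer) of $\tilde K$ and $g$. It is written against a hypothetical callable `consistent(sample, k)` playing the role of $K$; the whole point of the argument is that no such computable procedure exists, so the sketch only records how $\tilde K$ and $g$ would be computed if it did.

```python
from itertools import product

# `consistent(sample, k)` is the HYPOTHETICAL decision procedure K from the
# argument above: it answers whether some TM with at most k states S-prints
# the labels in `sample` (a list of (index-into-Sigma^*, 0/1-label) pairs).

def K_tilde(x, consistent):
    """Smallest k such that some TM in C_k S-prints the 0/1 string x."""
    sample = [(i, int(bit)) for i, bit in enumerate(x)]  # label the i-th string of Sigma^* by x_i
    k = 1
    while not consistent(sample, k):  # terminates: some TM accepts exactly {i : x_i = 1}
        k += 1
    return k

def g(k, consistent):
    """Lexicographically least 0/1 string x with K_tilde(x) > k.
    Such an x exists because only finitely many TMs have at most k states."""
    length = 1
    while True:
        for bits in product("01", repeat=length):  # lexicographic order within each length
            x = "".join(bits)
            if K_tilde(x, consistent) > k:
                return x
        length += 1
```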

Now define the TM $M$ as follows: $M$ S-prints $g(|\langle M\rangle|)$, where $\langle M\rangle$ is the encoding of $M$, $|\cdot|$ denotes string length, and the recursion theorem is invoked to assert that such an $M$ exists. Then $M$ has some encoding length $\ell=|\langle M\rangle|$ and S-prints the string $x_M=g(\ell)\in\{0,1\}^*$. By construction, $\tilde K(x_M)>\ell$, so $x_M$ cannot be S-printed by any TM with $\ell$ or fewer states. Yet under any reasonable encoding a machine has no more states than the length of its description, so $M$ itself lies in $C_\ell$ and S-prints $x_M$, a contradiction.

Aryeh
  • Challenge: convert my "infinitary" argument via computability to a finitary one via efficiency. I think the answer to @minkov's question is negative: You cannot efficiently learn a function class which you cannot efficiently evaluate. I think this will continue to be true if you move beyond proper or realizable PAC. – Aryeh Oct 09 '17 at 10:29