
We know (see, e.g., Theorems 1 and 3 of [1]) that, roughly speaking and under suitable conditions, functions that can be computed by a Turing machine in polynomial time ("efficiently computable") can be expressed by neural networks of reasonable (polynomial) size, and thus can be learned with polynomial sample complexity ("learnable") under any input distribution.

Here "learnable" only concerns the sample complexity, regardless of the computational complexity.

I am wondering about a closely related question: does there exist a function that cannot be computed by a Turing machine in polynomial time ("not efficiently computable") but that can nevertheless be learned with polynomial sample complexity ("learnable") under any input distribution?

Minkov
  • I take issue with "and thus can be learned". There are very efficiently computable functions (say, DFAs) that are VERY difficult to learn, even approximately. – Aryeh Oct 05 '17 at 17:41
  • @Aryeh Yes, you are right. But if I understand correctly, the difficulty in learning DFAs lies in the computational complexity. Given unbounded computational power, could we still learn them with polynomially many samples via ERM? – Minkov Oct 05 '17 at 18:23
  • Yes, of course -- just find the smallest DFA consistent with the sample (for the realizable case). – Aryeh Oct 05 '17 at 21:02 [a brute-force sketch of this ERM rule appears below, after the comment thread]
  • Good question, and more subtle than appears at first. – Aryeh Oct 05 '17 at 23:19
  • @Aryeh Thanks for pointing out the example of DFA! – Minkov Oct 06 '17 at 00:34
  • This is probably missing the point, but what about the class of (say) $2^{-\sqrt{n}}$-biased Boolean functions? (I.e., more or less, a random function with each value being independently $1$ with probability $2^{-\sqrt{n}}$). For any $\varepsilon > 2^{-\sqrt{n}}$, PAC-learning under the uniform distribution is trivial (0 samples needed, the constant function $0$ is a good hypothesis), but it seems like any evaluation algorithm would need to spend superpolynomial time (as there is no structure to the function). I'm most likely misunderstanding the question, though. – Clement C. Oct 06 '17 at 20:17
  • Your terminology is a bit confusing. When we say “efficiently learnable,” we usually refer to computational efficiency. Just saying “learnable” is sufficient to imply sample efficiency. – Lev Reyzin Oct 07 '17 at 13:08
  • @ClementC. Good point! I think we need to impose the same requirement on the approximation accuracy of computation as for learning. In your example, if we only require $\epsilon$ approximation accuracy of computation, then the evaluation algorithm can directly return $0$ as the answer without any computation. – Minkov Oct 07 '17 at 21:36
  • @LevReyzin Thanks for pointing out. I have revised the question accordingly. – Minkov Oct 07 '17 at 21:36
  • @Minkov To PAC learn, you should learn with respect to any distribution. Otherwise the question is not interesting (as Clement points out). – Lev Reyzin Oct 07 '17 at 21:40
  • @LevReyzin You are absolutely right. But on the computation side, what would be a more natural way to impose the input distribution? – Minkov Oct 07 '17 at 23:07
  • You don't impose a distribution, you try to prove it for all distributions. – Lev Reyzin Oct 08 '17 at 01:47
  • Why are people voting to close? I think this is a deep and subtle question! – Aryeh Oct 09 '17 at 10:26
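
To make the ERM remark in the comments concrete (see the DFA exchange above), here is a minimal, purely illustrative Python sketch, not part of the original thread, of "find the smallest DFA consistent with the sample". The helper names (`smallest_consistent_dfa`, `dfa_accepts`) and the brute-force enumeration are assumptions made for illustration only: the rule is statistically sound in the realizable case but runs in exponential time, which is exactly the sample-complexity versus computational-complexity gap the comments point at.

```python
from itertools import product

def dfa_accepts(delta, accepting, word):
    """Run a DFA (transition table delta, set of accepting states) on word."""
    state = 0  # by convention, state 0 is the start state
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting

def smallest_consistent_dfa(sample, alphabet=(0, 1), max_states=4):
    """Brute-force ERM: return (k, delta, accepting) for a smallest DFA
    consistent with the labeled sample, or None if none has <= max_states states."""
    for k in range(1, max_states + 1):
        keys = [(q, a) for q in range(k) for a in alphabet]
        for targets in product(range(k), repeat=len(keys)):   # every transition table
            delta = dict(zip(keys, targets))
            for bits in product((False, True), repeat=k):     # every accepting set
                accepting = {q for q in range(k) if bits[q]}
                if all(dfa_accepts(delta, accepting, w) == bool(y) for w, y in sample):
                    return k, delta, accepting
    return None

# Toy usage: label = 1 iff the word has even length.
sample = [((0,), 0), ((1, 1), 1), ((0, 1, 0), 0), ((), 1)]
print(smallest_consistent_dfa(sample))  # a smallest (2-state) DFA consistent with the sample
```

Occam-style bounds give this rule polynomial sample complexity relative to the size of the target DFA; its running time is, of course, another matter.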

1 Answer


I will formalize a variant of this question where "efficiency" is replaced by "computability".

Let $C_n$ be the concept class of all languages $L\subseteq\Sigma^*$ recognizable by Turing machines with $n$ states or fewer. In general, for $x\in\Sigma^*$ and $f\in C_n$, the problem of evaluating $f(x)$ is undecidable.

However, suppose we have access to a (proper, realizable) PAC-learning oracle $A$ for $C_n$. That is, for any $\epsilon,\delta>0$, the oracle requests a labeled sample of size $m_0(n,\epsilon,\delta)$ such that, assuming the sample was drawn i.i.d. from an unknown distribution $D$, the oracle $A$ outputs a hypothesis $\hat f\in C_n$ which, with probability at least $1-\delta$, has $D$-generalization error at most $\epsilon$. We will show that this oracle is not Turing-computable.

In fact, we will show that a simpler problem is undecidable: that of determining, given a labeled sample $S$, whether there exists an $f\in C_n$ consistent with $S$ (the consistency problem). Suppose, toward a contradiction, that $K$ is a Turing machine that decides the consistency problem.

We adopt the following notational conventions. Identify $\Sigma^*$ with $\mathbb{N}=\{0,1,2,\ldots\}$ via the usual lexicographic ordering. For $x\in\{0,1\}^*$, we say that a TM $M$ "S-prints" $x$ if it accepts all of the strings in $\Sigma^*$ corresponding to the indices $i$ such that $x_i=1$ and does not accept (possibly by not halting) any of the strings corresponding to the indices $i$ such that $x_i=0$. Since (by assumption) $K$ decides the consistency problem, the function $\tilde K:x\mapsto k$, defined as the smallest $k$ such that some TM in $C_k$ S-prints $x$, is Turing-computable: for $k=1,2,\ldots$, query $K$ with the class $C_k$ and the sample that labels the $i$-th string of $\Sigma^*$ by $x_i$. It further follows that the function $g:k\mapsto x$, which maps $k\in\mathbb{N}$ to the lexicographically least string $x\in\{0,1\}^*$ such that $\tilde K(x)>k$, is also computable (such an $x$ exists because only finitely many TMs have at most $k$ states, so for sufficiently large lengths some string of that length is S-printed by none of them).
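
To make these two steps explicit, here is a schematic Python sketch (mine, not part of the original answer) of $\tilde K$ and $g$. It is written against a hypothetical callable `consistent(sample, k)` playing the role of $K$; the whole point of the argument is that no such computable procedure exists, so the sketch only records how $\tilde K$ and $g$ would be computed if it did.

```python
from itertools import product

# `consistent(sample, k)` is the HYPOTHETICAL decision procedure K from the
# argument above: it answers whether some TM with at most k states S-prints
# the labels in `sample` (a list of (index-into-Sigma^*, 0/1-label) pairs).

def K_tilde(x, consistent):
    """Smallest k such that some TM in C_k S-prints the 0/1 string x."""
    sample = [(i, int(bit)) for i, bit in enumerate(x)]  # label the i-th string of Sigma^* by x_i
    k = 1
    while not consistent(sample, k):  # terminates: some TM accepts exactly {i : x_i = 1}
        k += 1
    return k

def g(k, consistent):
    """Lexicographically least 0/1 string x with K_tilde(x) > k.
    Such an x exists because only finitely many TMs have at most k states."""
    length = 1
    while True:
        for bits in product("01", repeat=length):  # lexicographic order within each length
            x = "".join(bits)
            if K_tilde(x, consistent) > k:
                return x
        length += 1
```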

Now define the TM $M$ as follows: $M$ S-prints $g(|\langle M\rangle|)$, where $\langle M\rangle$ is the encoding of $M$, $|\cdot|$ denotes string length, and the recursion theorem is invoked to assert that such an $M$ exists. Then $M$ has some encoding length $\ell=|\langle M\rangle|$ and S-prints the string $x_M=g(\ell)\in\{0,1\}^*$. By construction, $\tilde K(x_M)>\ell$, so $x_M$ cannot be S-printed by any TM with $\ell$ or fewer states. Yet under any reasonable encoding a machine has no more states than the length of its description, so $M$ itself lies in $C_\ell$ and S-prints $x_M$, a contradiction.

Aryeh
  • Challenge: convert my "infinitary" argument via computability to a finitary one via efficiency. I think the answer to @minkov's question is negative: You cannot efficiently learn a function class which you cannot efficiently evaluate. I think this will continue to be true if you move beyond proper or realizable PAC. – Aryeh Oct 09 '17 at 10:29