I have some questions about the notations used in Section 9.2 Lack of Inherent Superiority of Any Classifier in Duda, Hart and Stork's Pattern Classification. First let me quote some relevant text from the book:
- For simplicity consider a two-category problem, where the training set $D$ consists of patterns $x^i$ and associated category labels $y_i = ± 1$ for $i = 1,..., n$ generated by the unknown target function to be learned, $F(x)$, where $y_i = F(x^i)$.
- Let $H$ denote the (discrete) set of hypotheses, or possible sets of parameters to be learned. A particular hypothesis $h(x) \in H$ could be described by quantized weights in a neural network, or parameters $\theta$ in a functional model, or sets of decisions in a tree, and so on.
- Furthermore, $P(h)$ is the prior probability that the algorithm will produce hypothesis $h$ after training; note that this is not the probability that $h$ is correct.
- Next, $P(h|D)$ denotes the probability that the algorithm will yield hypothesis $h$ when trained on the data $D$. In deterministic learning algorithms such as the nearest-neighbor and decision trees, $P(h|D)$ will be everywhere zero except for a single hypothesis $h$. For stochastic methods (such as neural networks trained from random initial weights, or stochastic Boltzmann learning), $P(h|D)$ can be a broad distribution.
- Let $E$ be the error for a zero-one or other loss function.
The expected off-training-set classification error when the true function is $F(x)$ and the probability for the $k$th candidate learning algorithm is $P_k(h(x)|D)$ is given by $$ \mathcal{E}_k(E|F,n) = \sum_{x\notin D} P(x) [1-\delta(F(x), h(x))] P_k(h(x)|D) $$
Theorem 9.1. (No Free Lunch) For any two learning algorithms $P_1 (h |D)$ and $P_2(h|D)$, the following are true, independent of the sampling distribution $P(x)$ and the number $n$ of training points:
1. Uniformly averaged over all target functions $F$, $\mathcal{E}_1 (E|F, n) - \mathcal{E}_2(E|F, n) = 0$;
2. For any fixed training set $D$, uniformly averaged over $F$, $\mathcal{E}_1 (E|F, D) - \mathcal{E}_2(E|F, D) = 0$.
Part 1 is actually saying $$\sum_F \sum_D P(D|F) [\mathcal{E}_1 (E|F, n) - \mathcal{E}_2(E|F, n)] = 0$$
Part 2 is actually saying $$\sum_F [\mathcal{E}_1 (E|F, D) - \mathcal{E}_2(E|F, D)] = 0$$
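To make my notation questions below concrete: if I make the average over hypotheses explicit (this expansion is my own guess, not the book's notation), I would write $$ \mathcal{E}_k(E|F,n) \stackrel{?}{=} \sum_{h \in H} P_k(h|D) \sum_{x\notin D} P(x)\, [1-\delta(F(x), h(x))]. $$ Whether this is the intended reading, and whether there should additionally be an average over all training sets $D$ of size $n$, is exactly what I am unsure about.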
My questions are:
- In the formula for $\mathcal{E}_k(E|F,n)$, i.e. $$ \mathcal{E}_k(E|F,n) = \sum_{x\notin D} P(x) [1-\delta(F(x), h(x))] P_k(h(x)|D), $$ can I replace $P_k(h(x)|D)$ with $P_k(h|D)$ and move it outside the sum $\sum_{x \notin D}$, since it is really a distribution of $h$ over $H$ given $D$ for the $k$th stochastic learning algorithm?
- Given that the $k$th candidate learning algorithm may be a stochastic method, why is there no sum over $h$, i.e. $\sum_{h \in H}$, in the formula for $\mathcal{E}_k(E|F,n)$?
- How are $\mathcal{E}_i (E|F, D)$ and $\mathcal{E}_i (E|F, n)$ different from each other?
  - Does $\mathcal{E}_i (E|F, D)$ mean the off-training-set error rate given a particular training set $D$?
  - Does $\mathcal{E}_i (E|F, n)$ mean the off-training-set error rate averaged over all training sets of a given size $n$? If yes, why does Part 1 of the NFL theorem average $\mathcal{E}_i (E|F, n)$ over training sets again via $\sum_D$, and why is there no average over all training sets of size $n$ in the formula for $\mathcal{E}_k(E|F,n)$?
- In Part 1 of the NFL theorem, does $\sum_D$ mean summing over all training sets of a fixed size $n$?
- If Part 1 is further summed over all possible training-set sizes $n \in \mathbb{N}$, is the result still 0?
- In the formula for $\mathcal{E}_k(E|F,n)$, if I change $\sum_{x \notin D}$ to $\sum_x$, i.e. $x$ is no longer restricted to lie outside the training set, will both parts of the NFL theorem still hold? (The numerical sketch at the end of this post is how I would check this.)
- If the true relation between $x$ and $y$ is not assumed to be a deterministic function $F$ with $y=F(x)$, but instead a conditional distribution $P(y|x)$, or a joint distribution $P(x,y)$, which is equivalent to knowing $P(y|x)$ and $P(x)$ (see also my other question), then I can change $\mathcal{E}_k (E|F,n)$ to $$ \mathcal{E}_k(E|P(x,y),n) = \mathbb{E}_{x,y} [1-\delta(y, h(x))] P_k(h(x)|D) $$ (with the same puzzling $P_k(h(x)|D)$ noted in my first two questions; I write this expectation out as explicit sums right after this list). Do the two parts of the NFL theorem still hold?
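For the last question, here is the expectation written out as explicit sums, keeping the book's restriction to $x \notin D$ (again, this expansion is my own guess at the notation, not the book's): $$ \mathcal{E}_k(E|P(x,y),n) \stackrel{?}{=} \sum_{h \in H} P_k(h|D) \sum_{x\notin D} \sum_{y = \pm 1} P(x)\, P(y|x)\, [1-\delta(y, h(x))]. $$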
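And here is the small numerical sketch I mentioned, which is how I would check my reading of Part 2 exhaustively. Everything in it is my own toy setup rather than anything from the book: a domain of three binary features with uniform $P(x)$, a hand-picked training set $D$, and two deterministic learners, so each $P_k(h|D)$ is a point mass on a single hypothesis. The loop sums the difference in off-training-set mistake counts between the two learned hypotheses over every target $F$ consistent with the fixed $D$ (the uniform average of $\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D)$ is just this total divided by a constant):

```python
# A toy, exhaustive check of my reading of Part 2 (my own setup, not the book's):
# domain X = {0,1}^3 with uniform P(x), one fixed training set D, and two
# deterministic learners, so each P_k(h|D) is a point mass on a single hypothesis.
from itertools import product

X = list(product([0, 1], repeat=3))                # all 8 possible patterns x
D = {(0, 0, 0): 1, (0, 1, 1): -1, (1, 0, 1): 1}    # a hand-picked training set
off_train = [x for x in X if x not in D]           # the 5 off-training-set points

def learn_majority(D):
    """Learner 1: predict the majority training label at every x."""
    vote = 1 if sum(D.values()) >= 0 else -1
    return lambda x: vote

def learn_1nn(D):
    """Learner 2: 1-nearest-neighbour in Hamming distance."""
    def h(x):
        nearest = min(D, key=lambda z: sum(a != b for a, b in zip(x, z)))
        return D[nearest]
    return h

def off_train_mistakes(h, F):
    """Number of off-training-set points where h disagrees with the target F."""
    return sum(h(x) != F[x] for x in off_train)

h1, h2 = learn_majority(D), learn_1nn(D)           # hypotheses depend only on D

# Sum the difference in mistake counts over every F consistent with D,
# i.e. over all 2^5 labelings of the off-training-set points.
total_diff = 0
for labels in product([1, -1], repeat=len(off_train)):
    F = dict(D)
    F.update(zip(off_train, labels))
    total_diff += off_train_mistakes(h1, F) - off_train_mistakes(h2, F)

print(total_diff)  # I expect 0 if I have understood Part 2 correctly
```

If my reading is right, the printed total should be exactly 0. To check Part 1 I would rerun the same loop over all training sets $D$ of a fixed size $n$ (weighted by $P(D|F)$), and to probe my question about $\sum_x$ versus $\sum_{x \notin D}$ I would replace `off_train` by all of `X` and see whether the difference still averages to 0.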
Thanks and regards!