I have some questions about the notations used in Section 9.2 Lack of Inherent Superiority of Any Classifier in Duda, Hart and Stork's Pattern Classification. First let me quote some relevant text from the book:
- For simplicity consider a two-category problem, where the training set $D$ consists of patterns $x^i$ and associated category labels $y_i = ± 1$ for $i = 1,..., n$ generated by the unknown target function to be learned, $F(x)$, where $y_i = F(x^i)$.
- Let $H$ denote the (discrete) set of hypotheses, or possible sets of parameters to be learned. A particular hypothesis $h(x) \in H$ could be described by quantized weights in a neural network, or parameters $\theta$ in a functional model, or sets of decisions in a tree, and so on.
- Furthermore, $P(h)$ is the prior probability that the algorithm will produce hypothesis $h$ after training; note that this is not the probability that $h$ is correct.
- Next, $P(h|D)$ denotes the probability that the algorithm will yield hypothesis $h$ when trained on the data $D$. In deterministic learning algorithms such as the nearest-neighbor and decision trees, $P(h|D)$ will be everywhere zero except for a single hypothesis $h$. For stochastic methods (such as neural networks trained from random initial weights, or stochastic Boltzmann learning), $P(h|D)$ can be a broad distribution.
- Let $E$ be the error for a zero-one or other loss function.
The expected off-training-set classification error when the true function is $F(x)$ and the probability for the $k$th candidate learning algorithm is $P_k(h(x)|D)$ is given by $$ \mathcal{E}_k(E|F,n) = \sum_{x\notin D} P(x) [1-\delta(F(x), h(x))] P_k(h(x)|D) $$
Theorem 9.1. (No Free Lunch) For any two learning algorithms $P_1 (h |D)$ and $P_2(h|D)$, the following are true, independent of the sampling distribution $P(x)$ and the number $n$ of training points:
1. Uniformly averaged over all target functions $F$, $\mathcal{E}_1 (E|F, n) - \mathcal{E}_2(E|F, n) = 0$;
2. For any fixed training set $D$, uniformly averaged over $F$, $\mathcal{E}_1 (E|F, D) - \mathcal{E}_2(E|F, D) = 0$.
Part 1 is actually saying $$\sum_F \sum_D P(D|F) [\mathcal{E}_1 (E|F, n) - \mathcal{E}_2(E|F, n)] = 0$$
Part 2 is actually saying $$\sum_F [\mathcal{E}_1 (E|F, D) - \mathcal{E}_2(E|F, D)] = 0$$
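To make my notation questions below concrete: if I make the average over hypotheses explicit (this expansion is my own guess, not the book's notation), I would write $$ \mathcal{E}_k(E|F,n) \stackrel{?}{=} \sum_{h \in H} P_k(h|D) \sum_{x\notin D} P(x)\, [1-\delta(F(x), h(x))]. $$ Whether this is the intended reading, and whether there should additionally be an average over all training sets $D$ of size $n$, is exactly what I am unsure about.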
My questions are:
- In the formula for $\mathcal{E}_k(E|F,n)$, i.e. $$ \mathcal{E}_k(E|F,n) = \sum_{x\notin D} P(x) [1-\delta(F(x), h(x))] P_k(h(x)|D), $$ can I replace $P_k(h(x)|D)$ with $P_k(h|D)$ and move it outside the sum $\sum_{x \notin D}$, since it is really a distribution of $h$ over $H$ given $D$ for the $k$th stochastic learning algorithm?
- Given that the $k$th candidate learning algorithm may be a stochastic method, why is there no sum over $h$, i.e. $\sum_{h \in H}$, in the formula for $\mathcal{E}_k(E|F,n)$?
- How are $\mathcal{E}_i (E|F, D)$ and $\mathcal{E}_i (E|F, n)$ different from each other?
  - Does $\mathcal{E}_i (E|F, D)$ mean the off-training-set error rate given a particular training set $D$?
  - Does $\mathcal{E}_i (E|F, n)$ mean the off-training-set error rate averaged over all training sets of a given size $n$? If yes, why does Part 1 of the NFL theorem average $\mathcal{E}_i (E|F, n)$ over training sets again via $\sum_D$, and why is there no average over all training sets of size $n$ in the formula for $\mathcal{E}_k(E|F,n)$?
- In Part 1 of the NFL theorem, does $\sum_D$ mean summing over all training sets of a fixed size $n$?
- If Part 1 is further summed over all possible training-set sizes $n \in \mathbb{N}$, is the result still 0?
- In the formula for $\mathcal{E}_k(E|F,n)$, if I change $\sum_{x \notin D}$ to $\sum_x$, i.e. $x$ is no longer restricted to lie outside the training set, will both parts of the NFL theorem still hold? (The numerical sketch at the end of this post is how I would check this.)
- If the true relation between $x$ and $y$ is not assumed to be a deterministic function $F$ with $y=F(x)$, but instead a conditional distribution $P(y|x)$, or a joint distribution $P(x,y)$, which is equivalent to knowing $P(y|x)$ and $P(x)$ (see also my other question), then I can change $\mathcal{E}_k (E|F,n)$ to $$ \mathcal{E}_k(E|P(x,y),n) = \mathbb{E}_{x,y} [1-\delta(y, h(x))] P_k(h(x)|D) $$ (with the same puzzling $P_k(h(x)|D)$ noted in my first two questions; I write this expectation out as explicit sums right after this list). Do the two parts of the NFL theorem still hold?
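For the last question, here is the expectation written out as explicit sums, keeping the book's restriction to $x \notin D$ (again, this expansion is my own guess at the notation, not the book's): $$ \mathcal{E}_k(E|P(x,y),n) \stackrel{?}{=} \sum_{h \in H} P_k(h|D) \sum_{x\notin D} \sum_{y = \pm 1} P(x)\, P(y|x)\, [1-\delta(y, h(x))]. $$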
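And here is the small numerical sketch I mentioned, which is how I would check my reading of Part 2 exhaustively. Everything in it is my own toy setup rather than anything from the book: a domain of three binary features with uniform $P(x)$, a hand-picked training set $D$, and two deterministic learners, so each $P_k(h|D)$ is a point mass on a single hypothesis. The loop sums the difference in off-training-set mistake counts between the two learned hypotheses over every target $F$ consistent with the fixed $D$ (the uniform average of $\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D)$ is just this total divided by a constant):

```python
# A toy, exhaustive check of my reading of Part 2 (my own setup, not the book's):
# domain X = {0,1}^3 with uniform P(x), one fixed training set D, and two
# deterministic learners, so each P_k(h|D) is a point mass on a single hypothesis.
from itertools import product

X = list(product([0, 1], repeat=3))                # all 8 possible patterns x
D = {(0, 0, 0): 1, (0, 1, 1): -1, (1, 0, 1): 1}    # a hand-picked training set
off_train = [x for x in X if x not in D]           # the 5 off-training-set points

def learn_majority(D):
    """Learner 1: predict the majority training label at every x."""
    vote = 1 if sum(D.values()) >= 0 else -1
    return lambda x: vote

def learn_1nn(D):
    """Learner 2: 1-nearest-neighbour in Hamming distance."""
    def h(x):
        nearest = min(D, key=lambda z: sum(a != b for a, b in zip(x, z)))
        return D[nearest]
    return h

def off_train_mistakes(h, F):
    """Number of off-training-set points where h disagrees with the target F."""
    return sum(h(x) != F[x] for x in off_train)

h1, h2 = learn_majority(D), learn_1nn(D)           # hypotheses depend only on D

# Sum the difference in mistake counts over every F consistent with D,
# i.e. over all 2^5 labelings of the off-training-set points.
total_diff = 0
for labels in product([1, -1], repeat=len(off_train)):
    F = dict(D)
    F.update(zip(off_train, labels))
    total_diff += off_train_mistakes(h1, F) - off_train_mistakes(h2, F)

print(total_diff)  # I expect 0 if I have understood Part 2 correctly
```

If my reading is right, the printed total should be exactly 0. To check Part 1 I would rerun the same loop over all training sets $D$ of a fixed size $n$ (weighted by $P(D|F)$), and to probe my question about $\sum_x$ versus $\sum_{x \notin D}$ I would replace `off_train` by all of `X` and see whether the difference still averages to 0.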
Thanks and regards!