In classification and regression tasks, we try to learn, from a training data set, a function mapping an independent variable $X$ to a dependent variable $Y$.
When analyzing the error rate of a learning algorithm for classification or regression theoretically, assumptions are usually made about the true relation between the independent variable $X$ and the dependent variable $Y$.
From my (admittedly vague) memory of the books I have read so far (probably "Mathematical Statistics: Basic Ideas and Selected Topics, Volume 1" by Bickel and Doksum), the true relation is assumed to be a distribution. So for each value $x$ of $X$, there can be more than one value of $Y$, governed by the conditional distribution $P(Y \mid X=x)$.
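To make concrete what I mean by a distributional relation: in regression one often writes
$$Y = f(X) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
so that $P(Y \mid X = x)$ is a whole distribution centered at $f(x)$ rather than a single value, and in binary classification one similarly specifies $P(Y = 1 \mid X = x) = \eta(x)$ for some function $\eta$.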
After recently reading Section 9.2, Lack of Inherent Superiority of Any Classifier, in Duda, Hart and Stork's Pattern Classification (also see my previous question), I found that it assumes the relation between $X$ and $Y$ to be a deterministic function $F$ with $F(X)=Y$, if I understand correctly. So it does not allow more than one value of $Y$ to be associated with each value of $X$.
I wonder what the purpose is of restricting the relation to a deterministic function and losing the generality of a distributional relation?
- In practice, if you have a training data set $(x_i, y_i), i=1,\dots,n$ with some $i \neq j$ such that $x_i = x_j$ but $y_i \neq y_j$, would you do some preprocessing, such as combining $(x_i, y_i)$ and $(x_j, y_j)$ into a single example, before feeding the data to a learning/training algorithm? (A rough sketch of what I mean is below.) I am asking this for classification and for regression separately.
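Here is a minimal sketch of the kind of preprocessing I have in mind, assuming the training data fits in a pandas DataFrame with hypothetical columns "x" and "y" (not anything from the books above, just to illustrate the question):

```python
# Sketch: collapse training examples that share the same x value,
# assuming a pandas DataFrame with columns "x" and "y".
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],
    "y": [0,   1,   0,   1,   1,   0],
})

# Regression: replace duplicate x's with the average of their y values.
reg_df = df.groupby("x", as_index=False)["y"].mean()

# Classification: replace duplicate x's with the majority label
# (taking the first mode if there is a tie).
clf_df = (
    df.groupby("x", as_index=False)["y"]
      .agg(lambda labels: labels.mode().iloc[0])
)

print(reg_df)
print(clf_df)
```

Whether something like this (averaging for regression, majority vote for classification) is standard practice, or whether one should just hand the raw duplicates to the learning algorithm, is exactly what I am unsure about.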
Thanks and regards!