In classification and regression tasks, we try to learn, from a training data set, a function mapping an independent variable $X$ to a dependent variable $Y$.
When analyzing the error rate of a learning algorithm for classification or regression theoretically, assumptions are usually made about the true relation between the independent variable $X$ and the dependent variable $Y$.
From my (admittedly vague) memory of the books I have read so far (probably "Mathematical Statistics: Basic Ideas and Selected Topics, Volume 1" by Bickel and Doksum), the true relation is assumed to be a distribution. So for each value $x$ of $X$, there can be more than one value of $Y$, governed by the conditional distribution $P(Y \mid X=x)$.
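To make concrete what I mean by a distributional relation: in regression one often writes
$$Y = f(X) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
so that $P(Y \mid X = x)$ is a whole distribution centered at $f(x)$ rather than a single value, and in binary classification one similarly specifies $P(Y = 1 \mid X = x) = \eta(x)$ for some function $\eta$.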
After recently reading Section 9.2, Lack of Inherent Superiority of Any Classifier, in Duda, Hart and Stork's Pattern Classification (also see my previous question), I found that it assumes the relation between $X$ and $Y$ to be a deterministic function $F$ with $F(X)=Y$, if I understand correctly. So it does not allow more than one value of $Y$ to be associated with each value of $X$.
I wonder what the purpose is of restricting the relation to a deterministic function and losing the generality of a distributional relation?
- In practice, if you have a training data set $(x_i, y_i), i=1,\dots,n$ with some $i \neq j$ such that $x_i = x_j$ but $y_i \neq y_j$, would you do some preprocessing, such as combining $(x_i, y_i)$ and $(x_j, y_j)$ into a single example, before feeding the data to a learning/training algorithm? (A rough sketch of what I mean is below.) I am asking this for classification and for regression separately.
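Here is a minimal sketch of the kind of preprocessing I have in mind, assuming the training data fits in a pandas DataFrame with hypothetical columns "x" and "y" (not anything from the books above, just to illustrate the question):

```python
# Sketch: collapse training examples that share the same x value,
# assuming a pandas DataFrame with columns "x" and "y".
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],
    "y": [0,   1,   0,   1,   1,   0],
})

# Regression: replace duplicate x's with the average of their y values.
reg_df = df.groupby("x", as_index=False)["y"].mean()

# Classification: replace duplicate x's with the majority label
# (taking the first mode if there is a tie).
clf_df = (
    df.groupby("x", as_index=False)["y"]
      .agg(lambda labels: labels.mode().iloc[0])
)

print(reg_df)
print(clf_df)
```

Whether something like this (averaging for regression, majority vote for classification) is standard practice, or whether one should just hand the raw duplicates to the learning algorithm, is exactly what I am unsure about.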
Thanks and regards!