
As far as I know, statistical/machine learning algorithms always assume that the data are independent and identically distributed ($iid$).

My question is: what can we do when this assumption is clearly violated? For instance, suppose that we have a data set with repeated measurements on the same units, so that both the cross-sectional and the time dimensions are important (what econometricians call a panel data set and statisticians call longitudinal data, which is distinct from a time series).

An example could be the following. In 2002, we collect the prices (henceforth $Y$) of 1000 houses in New York, together with a set of covariates (henceforth $X$). In 2005, we collect the same variables on the same houses. The same happens in 2009 and 2012. Say I want to understand the relationship between $X$ and $Y$. Were the data $iid$, I could easily fit a random forest (or any other supervised algorithm, for that matter), thus estimating the conditional expectation of $Y$ given $X$. However, there is clearly some auto-correlation in my data. How can I handle this?

3 Answers


There is nothing in the theory of statistical learning or machine learning that requires samples to be i.i.d.

When samples are i.i.d., you can write the joint probability of the samples given some model as a product, namely $P(\{x\}) = \prod_{i} P_i(x_i)$, which makes the log-likelihood a sum of the individual log-likelihoods. This simplifies the calculation, but is by no means a requirement.
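To spell out that step in the same notation: taking the logarithm of the product gives

$$\log P(\{x\}) = \log \prod_{i} P_i(x_i) = \sum_{i} \log P_i(x_i).$$

Without independence the joint probability can still be factorized, only now through the chain rule, so each factor is conditioned on the earlier samples (the ordering $1, \dots, n$ here is just an assumed labelling of the samples):

$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1}).$$

The likelihood is still well defined in the second case; it is simply harder to write down and compute.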

In your case, you can for example model the distribution of a pair $x_i, y_i$ with some bivariate distribution, say $z_i = (x_i, y_i)^T$, $z_i \sim \mathcal{N}(\mu, \Sigma)$, and then estimate the parameter $\Sigma$ from the likelihood $P(\{z\}) = \prod_{i} P(z_i \mid \mu, \Sigma)$.
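As a rough sketch of what this estimation looks like in practice, the snippet below simulates pairs from a bivariate normal and computes the maximum-likelihood estimates of $\mu$ and $\Sigma$. The "true" parameter values, the sample size, and all variable names are assumptions made up for illustration; they are not taken from the housing example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters, used only to simulate illustrative data.
mu_true = np.array([1.0, 2.0])
sigma_true = np.array([[1.0, 0.6],
                       [0.6, 2.0]])

# Each row of z is one pair z_i = (x_i, y_i).
z = rng.multivariate_normal(mu_true, sigma_true, size=5000)

# Maximum-likelihood estimates under the product likelihood:
# the sample mean and the covariance with a 1/n normaliser.
mu_hat = z.mean(axis=0)
centered = z - mu_hat
sigma_hat = centered.T @ centered / len(z)

print("mu_hat:", mu_hat)
print("sigma_hat:", sigma_hat)
```

The off-diagonal entry of `sigma_hat` is the estimated covariance between $x$ and $y$, which is the quantity that describes their linear relationship in this toy model.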

It is true that many out-of-the-box algorithm implementations implicitly assume independence between samples, so you are correct in identifying that you will have a problem applying them to your data as is. You will either have to modify the algorithm or find ones that are better suited for your case.
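One small example of such an adjustment, offered only as an illustration and not as a full solution, concerns how a model is evaluated rather than how it is fit: with repeated measurements, a row-wise random train/test split puts the same house on both sides. The sketch below holds out whole houses instead. The panel layout (1000 houses, four survey years), the 20% hold-out fraction, and the names `house_id`, `test_mask`, and `train_mask` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel in long format: 1000 houses, each observed in 4 years.
n_houses = 1000
years = np.array([2002, 2005, 2009, 2012])
house_id = np.repeat(np.arange(n_houses), len(years))
year = np.tile(years, n_houses)

# Hold out 20% of the houses, not 20% of the rows.
shuffled = rng.permutation(n_houses)
test_houses = shuffled[: n_houses // 5]
test_mask = np.isin(house_id, test_houses)
train_mask = ~test_mask

# Check that no house contributes rows to both sides of the split.
overlap = np.intersect1d(house_id[train_mask], house_id[test_mask])
print("rows in train/test:", train_mask.sum(), test_mask.sum())
print("houses in both sets:", overlap.size)
```

The masks can then be used to index whatever feature matrix and target vector share this row order. This addresses only leakage between training and test sets; it does not model the within-house correlation itself.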

J. Delaney
  • I see, this is a good answer. I need two clarifications, though. 1) You mean that I could estimate the parameter $\mu$? 2) Your solution works if I am willing to assume a (bivariate) parametric distribution for my data. As far as I understood, ML is all about avoiding such assumptions, and learning the best function from empirical evidence (as discussed in Breiman, 2001). – riccardo-df Feb 07 '22 at 14:04
  • 1) Yes, you could estimate both $\mu$ and $\Sigma$, but note that I gave this model only as an illustration; what is best for you will depend on the details of your data. 2) This is a big question by itself, but every ML model is driven by some assumptions, either explicit or implicit (for example, the choice of a loss function). Just as you pointed out, if a certain model assumes that samples are i.i.d., it will not work well for data that are not. There is no "universal" ML model, so you always have to think about which underlying assumptions hold for your data. – J. Delaney Feb 07 '22 at 14:34
  • Assuming iid when it's not satisfied is still a problem for ML. For example if there are non-completely-random dropouts, analyzing the data as if each row comes from a different subject will lead to prediction bias. – Frank Harrell Feb 07 '22 at 14:46
  • @J.Delaney that is exactly my concern. Going back to my example, say I want to fit a random forest (which does not impose any parametric distribution). Is there a way to take the repeated-measurements structure into account? Clearly, I cannot just fit a random forest as it is provided in many statistical packages. I have a guess for a simplified case where, for some reason, we can believe that there is no auto-correlation: including the time variable (e.g., year) in the set of explanatory variables. Still, this would treat the same units at different time instances as different units. – riccardo-df Feb 07 '22 at 14:54
  • It depends on what you want your model to predict, assuming that the prices change over time. Do you want it to predict the average price over the period, the price at a given year (when you supply it as a feature) or the price in the future? – J. Delaney Feb 07 '22 at 18:21
  • I am accepting this answer according to vox populi. – riccardo-df Feb 09 '22 at 08:54