What does it mean when statisticians talk about having more predictors than observations in a regression model? How could that even be possible? And why is it a problem in regression? Apologies, I am new to quantitative analysis and statistics, so I am not quite sure why this is the case. I would appreciate the simplest possible explanation.
-
Consider a dataset consisting of 100 images, each of which is 256 x 256. – Jakub Bartczuk Mar 18 '18 at 18:19
-
For a similar example, see the Olivetti faces dataset. – Jakub Bartczuk Mar 18 '18 at 18:20
-
Sorry, this example isn't very clear to me but thank you – user3424836 Mar 18 '18 at 18:24
-
For a simple example, imagine if you had 5 students, and you wanted to predict their height from other variables. So you measure their sex, town, number of letters in their last name, shoe size, hair length, and weight. If you put these all in one model, you would have six predictors and only five observations. – Sal Mangiafico Mar 18 '18 at 19:15
-
Thank you, this is very helpful. Your answer made it clear to me what the problem is. – user3424836 Mar 18 '18 at 22:38
-
The problem is that if you apply least squares with six predictors and five observations, you will get a perfect fit (all residuals would be 0), but it is likely not to be a perfect model. This is what statisticians call overfitting. The same problem would happen with 5 data points and 5 predictors. – Michael R. Chernick Mar 18 '18 at 23:45
2 Answers
I think that the confusion comes from the way the word "observation" is sometimes used. Say that you wanted to know how the expression of 20,000 genes was related to some continuous biological variable like blood pressure. You have data both on expression of 20,000 genes and on blood pressure for 10,000 individuals. You might think that this involves 10,000 * 20,001 = 200,010,000 observations. There certainly are that many individual data points. But when people say there are "more predictors than observations" in this case, they only count each individual person as an "observation"; an "observation" is then a vector of all data points collected on a single individual. It might be less confusing to say "cases" rather than "observations," but usage in practice often has hidden assumptions like this.
The problem with more predictors than cases (usually indicated as "$p>n$") is that there is then no unique solution to a standard linear regression problem. If rows of the data matrix represent cases and columns represent predictors, there are necessarily linear dependences among the columns: each of the remaining $(p-n)$ predictor columns can be written as a linear combination of $n$ of the others. So once you've found coefficients that fit the data, you can shift coefficient weight among the predictors arbitrarily without changing the fitted values, and infinitely many coefficient vectors fit equally well. Other approaches like LASSO or ridge regression, or a variety of other machine-learning approaches, provide ways to proceed in such cases.
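To make the perfect fit and the non-uniqueness concrete, here is a minimal sketch (not part of the original answer) using made-up data with numpy and scikit-learn; the sample sizes, seed, and ridge penalty are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 10, 50                        # many more predictors than cases
X = rng.normal(size=(n, p))          # made-up predictor matrix
y = rng.normal(size=n)               # made-up outcome

# Ordinary least squares "fits" the training data perfectly when p > n.
ols = LinearRegression().fit(X, y)
print(np.allclose(ols.predict(X), y))                  # True: residuals ~ 0

# The solution is not unique: adding any vector from the null space of X
# to the coefficients leaves every fitted value unchanged.
null_vec = np.linalg.svd(X)[2][-1]                     # orthogonal to all rows of X
beta_alt = ols.coef_ + 5.0 * null_vec
print(np.allclose(X @ beta_alt + ols.intercept_, y))   # still a perfect fit

# Penalized regression (here ridge) singles out one well-defined solution.
ridge = Ridge(alpha=1.0).fit(X, y)
```

The two coefficient vectors differ yet reproduce the observations equally well; penalization (ridge, LASSO) is what restores a unique, and hopefully more generalizable, answer.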
-
Thank you, this is very helpful. From what I gather based on your very comprehensive response, the problem with this kind of situation is that predictors would be correlated or collinear. Is my understanding of your explanation correct? – user3424836 Mar 18 '18 at 22:37
-
Any solution that minimizes the sum of squares will give a perfect fit. I think the overfitting problem is a far more serious issue than the non-uniqueness of the solution when the number of parameters (coefficients for predictor variables) exceeds the number of data points. Note also that when the number of parameters equals the number of data points there is a unique solution, and you still have a perfect fit. – Michael R. Chernick Mar 18 '18 at 23:54
-
The techniques like LASSO that @EdM mentions are all variable selection techniques that reduce the number of predictor variables. Most of them provide ways of deciding which predictor variables are most important. – Michael R. Chernick Mar 19 '18 at 00:00
-
This is very helpful. When you say non-uniqueness of the solution, does that refer to the dataset? – user3424836 Mar 19 '18 at 00:05
-
@user3424836 it has to do with the general structure of the data, not the further details of the data set. Any situation with $p>n$ will have this problem, whether you think about it as non-unique linear-regression solutions like I described or overfitting like Michael Chernick describes. – EdM Mar 19 '18 at 01:51
-
I don't think that is where the confusion comes from. The confusion comes from the fact that it is pretty rare to encounter datasets (at least in tabular format) for which there are fewer observations than there are features (table columns). – Mindaugas Bernatavičius Nov 10 '21 at 15:30
How could that even be possible?
More than possible: it is becoming more and more common as machine learning problems involve unstructured data such as text for Natural Language Processing (NLP), pictures, audio, and video.
NLP example: if a feature-generation process takes one or more text features and generates n-grams for each word sequence found in each observation, then the larger the sample size, the far larger the n-gram feature vector becomes, although it is inherently a sparse matrix.
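As a rough sketch of that feature blow-up (my own illustration, using scikit-learn's CountVectorizer as a stand-in for the feature-generation process described above; the tiny corpus is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Four "observations" (documents) in a made-up corpus.
docs = [
    "more predictors than observations",
    "regression needs more observations than predictors",
    "ngrams blow up the feature space very quickly",
    "sparse matrices store those features efficiently",
]

# Generate unigram and bigram count features for every document.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)      # stored as a SciPy sparse matrix

print(X.shape)                          # 4 rows, dozens of n-gram columns
print(X.shape[1] > X.shape[0])          # True: more predictors than observations
```

Even with only four documents the n-gram vocabulary already exceeds the number of rows, and the gap widens as the corpus (and its vocabulary) grows.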