We know that if any linear dependency exists among the features in the training data, the feature matrix becomes singular and hence we cannot solve for the coefficients. But apart from the features (columns), a matrix can still be singular if it has duplicate rows; more specifically, this happens if we provide duplicate examples in the training data. So, logically, it should also create a multicollinearity problem. But it seems to me that this issue is underrated compared to the issue where the features are linearly dependent. Will the linear dependency in rows not create a multicollinearity problem? Does this case have another name? Correct me if I am mistaken. And if duplicate or linearly dependent rows exist and create a problem, how does one spot and fix it?
2 Answers
As Christoph Hanck suggested, duplicate examples will not cause any multicollinearity problem. The multicollinearity "problem" is caused by duplicate "columns", not duplicate "rows", in the data matrix.
Intuitively, rows are data samples and columns are features/measures, so multicollinearity means we have redundant features/measures. Think of measuring a person's height in different units, e.g., in cm and in inches.
On the other hand, it is perfectly normal to have "duplicated samples": as the other answer suggests, there are many cases where people have the same gender and education status.
Assume we have a data matrix $X$. If two columns of $X$ are the same, or one column of $X$ can be derived as a linear combination of the other columns, then $X'X$ will not be full rank and multicollinearity will occur.
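A quick way to see the asymmetry in R (a minimal sketch; the variables are made up for illustration): duplicating a row leaves the column rank intact, while duplicating a column destroys it.
set.seed(1)
x1 <- rnorm(5)
x2 <- rnorm(5)
X_dup_row <- rbind(cbind(x1, x2), c(x1[1], x2[1]))  # repeat the first sample
X_dup_col <- cbind(x1, x2, x1)                      # repeat the first feature
qr(X_dup_row)$rank  # 2 = number of columns, so X'X is invertible
qr(X_dup_col)$rank  # 2 < 3 columns, so X'X is singular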
Regularization will help with multicollinearity: even a small amount of L2 regularization replaces $X'X$ with $X'X + \lambda I$, which is full rank for any $\lambda > 0$. (L1 regularization also copes with collinear features, though by selecting among them rather than by making $X'X$ invertible.)
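To make this concrete, here is a minimal sketch (not part of the original answer; the variable names are made up) of the ridge-style shift in R:
set.seed(1)
x <- rnorm(5)
X <- cbind(x, 2 * x)                 # second column is a multiple of the first
XtX <- crossprod(X)                  # X'X is singular; solve(XtX) would throw an error
lambda <- 0.01
solve(XtX + lambda * diag(ncol(X)))  # the ridge-shifted matrix is invertible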
Related posts
What is rank deficiency, and how to deal with it?
Least Squares Regression Step-By-Step Linear Algebra Computation
There is no problem with duplicate rows. Consider some applied examples, such as "returns to education", i.e., how much people earn given their education, gender, experience, etc. There can, and often will, be people in the dataset with the same gender, years of education and years of labor market experience. So no fix is necessary either.
Note that multicollinearity means the regressor matrix does not have full column rank; full column rank is in turn what we need for the $X'X$ matrix involved in the OLS estimator to be invertible or, put differently, for all coefficients to be identifiable. Duplicate rows do not affect the column rank, as the short check after the output below illustrates.
Finding duplicate rows, if still desired, can be done as follows:
set.seed(1)
n <- 10
education <- sample(10:15, n, replace = TRUE)  # years of education
gender <- sample(0:1, n, replace = TRUE)       # 0/1 gender dummy
(X <- cbind(education, gender))                # regressor matrix, printed below
duplicated(X)                                  # TRUE for rows that repeat an earlier row
> (X <- cbind(education, gender))
      education gender
 [1,]        10      0
 [2,]        13      0
 [3,]        10      0
 [4,]        11      1
 [5,]        14      1
 [6,]        12      1
 [7,]        15      1
 [8,]        11      0
 [9,]        12      0
[10,]        12      0
> duplicated(X)
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
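As a minimal check (continuing the simulated data above; y is an arbitrary made-up response), the duplicated rows do not stop OLS, because X keeps full column rank:
qr(X)$rank                        # 2: full column rank despite the duplicate rows
y <- rnorm(n)                     # arbitrary response, purely for illustration
coef(lm(y ~ education + gender))  # all coefficients are estimated without trouble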
$y = Aw$, and here $y$ is the output, $A$ is the feature matrix with dimension (num_of_examples, num_of_features) and $w$ is the coefficient vector which we want to find out to get this linear mapping. Then $A^{-1}y = A^{-1}Aw \Rightarrow w = A^{-1}y$. So, in order to find $w$, the matrix $A$ must be invertible. But even if it has linear dependency across rows, then we can still make a row all zeros and make the calculation impossible. – hafiz031 Oct 27 '21 at 07:10
education <- c(10,12,12); gender <- c(0,1,1); X <- cbind(education, gender); lm(rnorm(3)~X) to try and see. – Christoph Hanck Oct 27 '21 at 07:16
education <- c(10,12,12,12); gender <- c(0,1,1,1); X <- cbind(education, gender); lm(rnorm(4)~X) – Christoph Hanck Oct 27 '21 at 07:28