We know that if any linear dependency exists among the features in the training data, the feature matrix becomes singular and hence we cannot solve for the coefficients. But apart from the features (columns), a matrix can still be singular if it has duplicate rows; more specifically, this happens if we provide duplicate examples in the training data. So, logically, it should also create a multicollinearity problem. But it seems to me that this issue is underrated compared to the issue where the features are linearly dependent. Will the linear dependency in rows not create a multicollinearity problem? Does this case have another name? Correct me if I am mistaken. And if duplicate or linearly dependent rows exist and create a problem, how does one spot and fix it?
2 Answers
As Christoph Hanck suggested, duplicate examples will not cause any multicollinearity problem. The multicollinearity "problem" is caused by duplicate "columns", not duplicate "rows", in the data matrix.
Intuitively, rows are data samples and columns are features/measures, so multicollinearity means we have redundant features/measures. Think of measuring a person's height in different units, e.g., in cm and in inches.
On the other hand, it is perfectly normal to have "duplicated samples": as the other answer suggests, there are many cases where people have the same gender and education status.
Assume we have a data matrix $X$. If two columns of $X$ are the same, or one column of $X$ can be derived as a linear combination of the other columns, then $X'X$ will not be full rank and multicollinearity will occur.
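A quick way to see the asymmetry in R (a minimal sketch; the variables are made up for illustration): duplicating a row leaves the column rank intact, while duplicating a column destroys it.
set.seed(1)
x1 <- rnorm(5)
x2 <- rnorm(5)
X_dup_row <- rbind(cbind(x1, x2), c(x1[1], x2[1]))  # repeat the first sample
X_dup_col <- cbind(x1, x2, x1)                      # repeat the first feature
qr(X_dup_row)$rank  # 2 = number of columns, so X'X is invertible
qr(X_dup_col)$rank  # 2 < 3 columns, so X'X is singular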
Regularization will help with multicollinearity: even a small amount of L2 regularization replaces $X'X$ with $X'X + \lambda I$, which is full rank for any $\lambda > 0$. (L1 regularization also copes with collinear features, though by selecting among them rather than by making $X'X$ invertible.)
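To make this concrete, here is a minimal sketch (not part of the original answer; the variable names are made up) of the ridge-style shift in R:
set.seed(1)
x <- rnorm(5)
X <- cbind(x, 2 * x)                 # second column is a multiple of the first
XtX <- crossprod(X)                  # X'X is singular; solve(XtX) would throw an error
lambda <- 0.01
solve(XtX + lambda * diag(ncol(X)))  # the ridge-shifted matrix is invertible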
Related posts
What is rank deficiency, and how to deal with it?
Least Squares Regression Step-By-Step Linear Algebra Computation
There is no problem with duplicate rows. Consider some applied examples, such as "returns to education", i.e., how much people earn given their education, gender, experience, etc. There can, and often will, be people in the dataset with the same gender, years of education and years of labor market experience. So no fix is necessary either.
Note that multicollinearity means the regressor matrix does not have full column rank; full column rank is in turn what we need for the $X'X$ matrix involved in the OLS estimator to be invertible or, put differently, for all coefficients to be identifiable. Duplicate rows do not affect the column rank, as the short check after the output below illustrates.
Finding duplicate rows, if still desired, can be done as follows:
set.seed(1)
n <- 10
education <- sample(10:15, n, replace = TRUE)  # years of education
gender <- sample(0:1, n, replace = TRUE)       # 0/1 gender dummy
(X <- cbind(education, gender))                # regressor matrix, printed below
duplicated(X)                                  # TRUE for rows that repeat an earlier row
> (X <- cbind(education, gender))
      education gender
 [1,]        10      0
 [2,]        13      0
 [3,]        10      0
 [4,]        11      1
 [5,]        14      1
 [6,]        12      1
 [7,]        15      1
 [8,]        11      0
 [9,]        12      0
[10,]        12      0
> duplicated(X)
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
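As a minimal check (continuing the simulated data above; y is an arbitrary made-up response), the duplicated rows do not stop OLS, because X keeps full column rank:
qr(X)$rank                        # 2: full column rank despite the duplicate rows
y <- rnorm(n)                     # arbitrary response, purely for illustration
coef(lm(y ~ education + gender))  # all coefficients are estimated without trouble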
$y = Aw$, and here $y$ is the output, $A$ is the feature matrix with dimension (num_of_examples, num_of_features) and $w$ is the coefficient vector which we want to find out to get this linear mapping. Then $A^{-1}y = A^{-1}Aw \Rightarrow w = A^{-1}y$. So, in order to find $w$, the matrix $A$ must be invertible. But even if it has linear dependency across rows, then we can still make a row all zeros and make the calculation impossible. – hafiz031 Oct 27 '21 at 07:10
education <- c(10,12,12); gender <- c(0,1,1); X <- cbind(education, gender); lm(rnorm(3)~X) to try and see. – Christoph Hanck Oct 27 '21 at 07:16
education <- c(10,12,12,12); gender <- c(0,1,1,1); X <- cbind(education, gender); lm(rnorm(4)~X) – Christoph Hanck Oct 27 '21 at 07:28