
Given a large amount of data, how would I determine the degree of collinearity between all the variables, preferably without calculating a linear regression between each variable and every other variable? Specifically, I need to find the collinearity between any single variable and every other variable.

For this instance, we'll say the data has 10k variables and 100 million lines. I don't need a software package per se; a link to a relevant paper is fine.

  • How much data have you got? The best method, in my opinion, is condition indexes and proportion of variance explained. – Peter Flom Apr 19 '13 at 22:50
  • As Peter wrote, how big really? You can compute the condition number of your design matrix on a desktop for up to 20k variables in seconds. – user603 Apr 19 '13 at 23:15
  • @PeterFlom For the purpose of this question, I'll say the data has around 10k variables, and around 10 million points. I'll update my question. I'll also check out condition indexes. – saccharine Apr 20 '13 at 08:20
  • @user603 Around 10k variables and 100 million points. I somehow doubt that a 20k-variable dataset can have its condition number computed in seconds, considering the hard drive can only read on the order of 100 MB a second. – saccharine Apr 20 '13 at 08:21

1 Answer


The condition number is the statistic you want to look at. It's an exhaustive measure of the collinearity of your design space and much cheaper to compute than the VIF. The formula is:

$$2\log\left(\frac{d_1(X_s)}{d_p(X_s)}\right) \tag{1}$$

where $d_p$ ($d_1$) is the smallest (largest) singular value of $X_s$, and $X_s$ is the matrix of centred, re-scaled variables.

Reading your comment: this is what I suspected. To compute the condition number you don't need all 100e6 data points; a random sample of rows on the order of 100e3 is enough (by experimenting with simulated datasets you can convince yourself that to get reliable results you should target about 5*k rows, where k is the number of non-collinear variables, so even 100e3 is already very large).

That should already give you a pretty good idea of which variables are causing the collinearity.
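As a minimal sketch of that subsampling idea (the file name, the read step and the sample size below are placeholders I'm assuming, not part of your setup):

#read a random subsample of rows; here everything is read and
#subsampled in memory for simplicity -- with 100e6 rows you would
#sample at read time instead.
dat<-read.csv("mydata.csv")       #hypothetical file
idx<-sample(nrow(dat),1e5)        #~100e3 random rows
Xs<-scale(as.matrix(dat[idx,]))   #centred, re-scaled variables
d<-svd(Xs,nu=0,nv=0)$d            #singular values only
2*log(d[1]/d[length(d)])          #condition index (1)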

Also, there are specialized algorithms to compute only the first and last few singular values and the last few singular vectors; these can be obtained without computing the full SVD of $X$ (SVD-S). However, I don't know whether these are implemented in R, so to keep the example below accessible I'll just use a small example and the classical SVD from R.
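If you do want to try the partial approach, here is a rough sketch assuming the irlba package (and its somewhat experimental smallest option), reusing the subsampled matrix Xs from the sketch above; this is an assumption on my part, not the method used in the example below:

library(irlba)                         #hypothetical choice of package
top<-irlba(Xs,nv=3)                    #a few largest singular values/vectors
bot<-irlba(Xs,nv=3,smallest=TRUE)      #a few smallest ones (experimental option)
2*log(max(top$d)/min(bot$d))           #approximate condition index (1)
#bot$v holds the right singular vectors of the near-collinear directions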

A high condition number (typically when the ratio (1) is larger than 10) tells you that $X_s'X_s$ is ill-conditioned, i.e. that some variables can be re-written as a (near) linear combination of the other variables. Below I give a brief (small-scale) example of how the SVD can be used to uncover such relations.

n<-100
p<-20
#non-ill conditioned part of the dataset.
x<-matrix(rnorm(n*p),nc=p)
x<-scale(x)
#introduce a variable that causes x to be
#ill conditioned: y is an exact linear
#combination of the first 3 columns of x.
y<-x%*%c(rnorm(3),rep(0,p))[1:p]
y<-scale(y)
x<-cbind(x,y)
p<-ncol(x)

A<-svd(x,nu=0)
#x is ill-conditioned: this ratio is larger 
#than 10. (step 1)
2*log(A$d[1]/A$d[p])

#check what is causing it: (step 2)
round(A$v[,ncol(A$v)],2)
#you can write the last variable as (.23*x_1+.5*x_2-.45*x_3)/(-.7) [1]
#here the relation is exact because:
min(A$d)
#is (numerically) 0. If min(A$d)>0, its size tells you how much
#noise there is around relation [1].

#so I remove the last variable. (step 3)
x<-x[,-ncol(x)]
#no more ill-conditioning: the (p-1)th condition index of the
#original x approximates the condition index of the reduced x.
2*log(A$d[1]/A$d[p-1])

This covers the linear algebra of the problem when a single variable is causing the ill-conditioning. In most cases you will have more than one (near) exact relationship and you will have to repeat steps 1 through 3. In practice, the computational specifics will depend on how clever the approach you use to solve the SVD problem is.
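For instance, a rough sketch of iterating steps 1 through 3 on a generic centred, re-scaled design matrix (dropping the variable with the largest loading is just one way to break each bond; any variable in the relation would do):

x_cur<-x
repeat{
  A_cur<-svd(x_cur,nu=0)
  p_cur<-ncol(x_cur)
  if(2*log(A_cur$d[1]/A_cur$d[p_cur])<=10) break   #step 1: condition index below 10
  v_last<-A_cur$v[,p_cur]                          #step 2: last right singular vector
  x_cur<-x_cur[,-which.max(abs(v_last))]           #step 3: drop a variable in the relation
}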

You can get an idea of how many (near) exact relationships there are in your dataset by computing

$$2\log\left(\frac{d_1(X_s)}{d_j(X_s)}\right)$$

for all $j$'s. To do this you only need the singular values, which you can get at cost $O(p^2)$.
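Continuing the toy example, a quick sketch of that count (using the rule-of-thumb cutoff of 10 from above):

ci<-2*log(A$d[1]/A$d)   #index (1) against every singular value
sum(ci>10)              #number of (near) exact relationships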

user603
  • Can you link me something to read more on condition # and how it would help determining collinearity? Wikipedia seems to indicate that it doesn't provide the answers I need, which is the degree of collinearity between each variable. I'm attempting to simply group highly collinear variables together. – saccharine Apr 20 '13 at 08:46
  • Read Belsley, Conditioning Diagnostics: Collinearity and Weak Data in Regression. Or look up my dissertation: Flom, Collinearity diagnostics for multiple regression: A Monte Carlo study. Belsley also wrote another book that covers this, but it is more general. – Peter Flom Apr 20 '13 at 13:03
  • @user603 Can you explain what's going on in step 2? How exactly do you know that the last variable is causing the issue? Specifically, how did you arrive at "round(A$v[,ncol(A$v)],2)"? You're taking the right singular SVD of the last variable? How did you know to look for that? – saccharine Apr 22 '13 at 18:24
  • @PeterFlom Do you have a version of your paper open for downloading anywhere? Fordham Univ won't let me access it. – saccharine Apr 22 '13 at 18:30
  • @saccharine: by definition, the $j^{th}$ singular value is the mean of the squared deviations of the data projected on the $j^{th}$ right singular vector. In R, the singular vectors are ordered by decreasing values of the associated singular values. So, the $j^{th}$ singular vector gives you the linear combination of your data yielding the $j^{th}$ largest deviations. Is this clear? – user603 Apr 22 '13 at 18:31
  • @saccharine I do not; but if you go here you can order a copy; unfortunately, there is a fee, but I have no control over that. – Peter Flom Apr 22 '13 at 19:05
  • @user603 Yes, I believe I understand it now. Thanks. – saccharine Apr 22 '13 at 21:59
  • @user603 Sorry to follow up on this so late, but how can I tell which exact variables a variable is collinear with? Right now I'm simply comparing the sum square differences of the numbers in the singular vectors, but that feels really fuzzy. Removing the multicollinear variables isn't the problem I have, but specifically finding which exact variables are multicollinear with each other. – saccharine May 03 '13 at 23:50
  • @saccharine: collinearity is a linear bond between a set of variables. There is no 'one variable causing it'. You have to break that bond, which you can do by removing any one of the variables in the relation. An alternative approach is to bias all the coefficients away from the collinear solution (ridge regression). A third possibility is to combine some mix of variable deletion and constraints on the solution (lasso regression). Notice that in the lasso, too, if two (or more) variables are very correlated, which one ends up being removed is arbitrary. – user603 May 04 '13 at 08:07
  • @user603: Just discovered this answer +1, great! I just have one question: I searched a bit about the condition number but haven't found the formula you provide. In Rawlings et al. "Applied Regression Analysis", the following formula is provided: $\kappa(X)=\sqrt{(\lambda_{max}/\lambda_{min})}$. Where can I find the formulation you give and why does Rawlings et al. give a different formulation? Thank you. – COOLSerdash Jul 23 '13 at 07:54
  • @COOLSerdash: the index I used in my answer is the log of the square of the $\kappa(X)$ you use. They're monotone function of one another. I'm not sure what your question is. – user603 Jul 23 '13 at 08:38
  • @user603 Thanks! I just wondered why the formulas in the textbooks were different from yours. I didn't see the relationship. Is there an advantage of taking the log instead of the square root? – COOLSerdash Jul 23 '13 at 08:42
  • @COOLSerdash: no, two monotone functions are essentially similar (so long as you adapt the threshold accordingly). – user603 Jul 23 '13 at 09:00
  • @user603 Thank you very much. Everything is clear now. – COOLSerdash Jul 23 '13 at 09:00