Say I am running a regression
$Y \sim X_1 + X_2$, where $X_1$ and $X_2$ have drastically different ranges. Empirically, will this cause any issues with my prediction of $Y$? If the answer is yes, how do I handle it?
While I believe that the reason to centre/standardise your data should mostly be of a statistical nature (as discussed in detail in the link provided in my original comment), numerics do come into play.
Common implementations of linear-model regression are based on the QR decomposition, and if it fails without you noticing, you are in trouble. See, for example, the following example based on the workhorse routines of two "decent stats software" packages: R and MATLAB; the weapons of choice for many Stats-ML-AI warriors out there (myself included :D ).
set.seed(1234);
n <- 100
xx <- sort(runif(n))
y <- 0.5*(xx-0.5)+(xx-0.5)^2 + rnorm(n)*0.1
x <- xx+1001
(toymod <- lm(y~x+I(x^2))) # Notice that we do not get a single warning!
Call:
lm(formula = y ~ x + I(x^2))

Coefficients:
(Intercept)            x       I(x^2)
  -451.4685       0.4509           NA
If one is careful and checks the model summary, one might suspect something:
summary(toymod)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max
-0.20467 -0.08075 -0.00755  0.07144  0.30560

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -451.46848   43.00008   -10.5   <2e-16 ***
x              0.45088    0.04294    10.5   <2e-16 ***
I(x^2)              NA         NA      NA       NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1191 on 98 degrees of freedom
Multiple R-squared: 0.5294,  Adjusted R-squared: 0.5246
F-statistic: 110.3 on 1 and 98 DF, p-value: < 2.2e-16
but then again one needs to know what one is looking for (i.e. be able to understand that the model singularities come from the ill-conditioning of the design matrix $X$).
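A quick way to confirm that suspicion, if you know where to look, is to inspect the design matrix of the fitted model directly. A minimal sketch, reusing the toymod object from above (kappa and svd are base R):
X <- model.matrix(toymod)   # design matrix: (Intercept), x, I(x^2)
kappa(X, exact = TRUE)      # condition number; astronomically large here
svd(X)$d                    # singular values spanning many orders of magnitude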
The equivalent code will not fail in MATLAB:
rng(1234)
n = 100;
xx = linspace(0,1,n)';
y = 0.5*(xx-0.5)+(xx-0.5).^2 + randn(n,1)*0.1;
x = xx+1001;
toymod = fitlm([x x.^2], y)
toymod =

Linear regression model:
    y ~ 1 + x1 + x2

Estimated Coefficients:
                    Estimate          SE         tStat       pValue
                   __________    __________    _______    __________

    (Intercept)    1.2401e+06    1.2911e+05     9.6055    9.5382e-16
    x1                -2477.1        257.83    -9.6075    9.4446e-16
    x2                 1.2369       0.12872     9.6094    9.3519e-16

Number of observations: 100, Error degrees of freedom: 97
Root Mean Squared Error: 0.0979
R-squared: 0.77,  Adjusted R-Squared 0.765
F-statistic vs. constant model: 162, p-value = 1.14e-31
but that is only due to the fact that MATLAB uses a different LAPACK library. Clearly we can mess up MATLAB too, simply by increasing the magnitude of our $X$:
rng(1234)
n = 100;
xx = linspace(0,1,n)';
y = 0.5*(xx-0.5)+(xx-0.5).^2 + randn(n,1)*0.1;
x = xx+10001;
toymod = fitlm([x x.^2], y);
Warning: Regression design matrix is rank deficient to within machine precision.
> In TermsRegression>TermsRegression.checkDesignRank at 98
In LinearModel.LinearModel>LinearModel.fit at 868
In fitlm at 117
Luckily MATLAB makes a quite specific complaint about the design matrix used, but other than that it lets you go your merry way without any further issue.
As mentioned, we did not do anything insane. We simply forced lm and fitlm, on purpose, to use a model matrix $X$ whose condition number caused the QR decomposition to fail. Centring, standardising, or simply using another model altogether (a spline, for example) would have taken care of this numerically problematic situation.
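For completeness, a minimal sketch of the fix in R, continuing from the same x and y as above (the names xs, toymod_ok and toymod_poly are used only for this illustration): standardise x before forming the quadratic term, or let poly() build orthogonal polynomial terms for you.
xs <- scale(x)                       # centre and scale x
(toymod_ok <- lm(y ~ xs + I(xs^2)))  # both coefficients are now estimated, no NA
# poly() constructs orthogonal polynomial terms and avoids the collinearity altogether:
(toymod_poly <- lm(y ~ poly(x, 2)))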
I checked this by explicitly computing x^2 and using that in lm. Sure enough, R reported quantifiable results. But (1) they differ from those of Matlab and (2) car::vif reports VIFs around $10^8$. Standardizing changed the answer but it didn't solve the problem. Thus, you aren't really studying the effects of ranges of the variables: you are studying how these systems handle what is essentially collinearity. That makes the discrepancies understandable: the output will be extremely sensitive to details of the numerical algorithms.
– whuber
Mar 04 '15 at 02:05
Centring and standardising x should work and provide a correct fit. See for example: xnew = scale(x); (toymod <- lm(y~xnew + I(xnew^2))); plot(x,y); lines(x, fitted(toymod), type="l", col='red');
– usεr11852
Mar 04 '15 at 02:50
R is around $1.5\times 10^{-14}$, which is darn near zero: you've only got two significant (decimal) digits left to play with.
– whuber
Mar 04 '15 at 03:00
mod_1 <- lm(xnew ~ I(xnew^2)); mean(mod_1$residuals^2); # 0.9147074 and summary(mod_1)$sigma; # [1] 0.9661133.
– usεr11852
Mar 04 '15 at 03:29
n <- 100; x <- rnorm(n); z <- rnorm(n); z <- resid(lm(z ~ x)) * 1e100; y <- x - 1e-100 * z + rnorm(n); summary(lm(y ~ x+z)) This generates two orthogonal variables (whose ranges differ by 100 orders of magnitude!) and fits a model.
– whuber
Mar 04 '15 at 15:35
svd(model.matrix( (lm(y ~ x+z))))$d == 0) in the model you defined. The rank, for all intents and purposes, should be 1. When you have one singular value of the order of 1e100 and two others of the order of 1e0, that system is rank-deficient for any numerical purpose. We do not standardize at any point and therefore we fail.
– usεr11852
Mar 05 '15 at 02:43