I am having trouble with a very simple linear regression: I cannot get the skewness/kurtosis and homoscedasticity assumptions to be met, even after removing outliers, adding polynomial terms, and applying log and Box-Cox transformations.
I have two datasets (sample1 and sample2), both with the columns:
- someX: results of a measurement that differs between sample1 and sample2
- database1: Dollar amounts (in millions) from one database
- database2: Dollar amounts (in millions) from another database
The goal is to fit four simple regressions (sketched in code right after this list):
- Regression 1: database1 ~ someX (sample1)
- Regression 2: database2 ~ someX (sample1)
- Regression 3: database1 ~ someX (sample2)
- Regression 4: database2 ~ someX (sample2)
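For reference, a minimal sketch of the four baseline fits, assuming sample1 and sample2 are data frames with the columns described above:

# Baseline fits for the four regressions (untransformed, before any remedies)
mod1 <- lm(database1 ~ someX, data = sample1)  # Regression 1
mod2 <- lm(database2 ~ someX, data = sample1)  # Regression 2
mod3 <- lm(database1 ~ someX, data = sample2)  # Regression 3
mod4 <- lm(database2 ~ someX, data = sample2)  # Regression 4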
For each of the four I have tried several combinations of remedies: removing outliers/influential points, adding polynomial terms, applying log and Box-Cox transformations, and using per capita values (someX represents people). But I can only meet all assumptions for Regression 1 (log-log):
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2), data = sample1,
   subset = -c(160, 100, 132))
Coefficients:
    (Intercept)      log(someX)  I(log(someX)^2)
        0.56298         0.93676         -0.05951
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod1)
                     Value  p-value  Decision
Global Stat         5.6807  0.22429  Assumptions acceptable.
Skewness            0.3372  0.56144  Assumptions acceptable.
Kurtosis            3.7731  0.05208  Assumptions acceptable.
Link Function       1.4262  0.23239  Assumptions acceptable.
Heteroscedasticity  0.1442  0.70414  Assumptions acceptable.
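For completeness, this assessment was produced along these lines with the gvlma package (reconstructed from the Call fields shown above):

library(gvlma)
mod1 <- lm(log(database1) ~ log(someX) + I(log(someX)^2),
           data = sample1, subset = -c(160, 100, 132))
gvlma(mod1)  # printing the gvlma object shows the assessment table above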
For the other three, I cannot meet all assumptions. For example, for Regression 3 (log-log):
Call:
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2) +
   I(log(someX)^3), data = sample2)
Coefficients:
    (Intercept)      log(someX)  I(log(someX)^2)  I(log(someX)^3)
      -11.90320         7.60609         -1.20549          0.06387
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod3)
                       Value    p-value  Decision
Global Stat         30.29064  4.271e-06  Assumptions NOT satisfied!
Skewness            21.38611  3.755e-06  Assumptions NOT satisfied!
Kurtosis             1.33551  2.478e-01  Assumptions acceptable.
Link Function        0.05342  8.172e-01  Assumptions acceptable.
Heteroscedasticity   7.51559  6.117e-03  Assumptions NOT satisfied!
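Alongside the gvlma tests, I also check the standard base-R residual diagnostics; a minimal sketch for this model:

mod3 <- lm(log(database1) ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
           data = sample2)
par(mfrow = c(2, 2))  # arrange the four diagnostic panels in a 2x2 grid
plot(mod3)            # residuals vs. fitted, Q-Q, scale-location, leverage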
And here is Regression 3 with a Box-Cox transformation of the response:
Call:
lm(formula = database1.tran ~ log(someX) + I(log(someX)^2) +
   I(log(someX)^3), data = sample2)
Coefficients:
    (Intercept)      log(someX)  I(log(someX)^2)  I(log(someX)^3)
       -22.2267         13.2958          -2.0871           0.1108
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod3)
                       Value   p-value  Decision
Global Stat         15.26539  0.004181  Assumptions NOT satisfied!
Skewness             1.41446  0.234317  Assumptions acceptable.
Kurtosis             6.35226  0.011723  Assumptions NOT satisfied!
Link Function        0.03864  0.844168  Assumptions acceptable.
Heteroscedasticity   7.46004  0.006308  Assumptions NOT satisfied!
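Here database1.tran is the Box-Cox-transformed database1. As a sketch of one common way to construct such a response with MASS::boxcox (the lambda below is taken at the profile-likelihood maximum, which is an assumption, not necessarily the exact value I used):

library(MASS)
# Profile the Box-Cox log-likelihood for the cubic log-log model
bc <- boxcox(lm(database1 ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
                data = sample2), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # lambda maximizing the profile likelihood
sample2$database1.tran <- (sample2$database1^lambda - 1) / lambda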
I chose these polynomial terms for Regression 3 because they look like the best fit.
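A more formal way to compare candidate degrees (a sketch of an alternative, not what I originally did) is by AIC:

# Compare polynomial degrees 1-3 of log(someX) for Regression 3 by AIC
fits <- lapply(1:3, function(d)
  lm(log(database1) ~ poly(log(someX), d, raw = TRUE), data = sample2))
sapply(fits, AIC)  # smaller AIC suggests a better fit/complexity trade-off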
The reproducible code and all data are here: https://github.com/d-paulus/regression-examples
A binder of the code and data is here: https://mybinder.org/v2/gh/d-paulus/regression-examples/HEAD

Comments:

… gvlma output might claim). This is mainly because regression is used for so many different purposes that it would be impossible to impose universal assumptions of this nature. For a principled approach to transforming variables in regression, see https://stats.stackexchange.com/a/3530/919 for instance. – whuber Nov 26 '20 at 16:46

… gvlma output mentioned them. I thought they might be an indicator of problems in my data. Can I ignore those assumptions then? Thanks for pointing to the thread. I've experimented before with log and Box-Cox transformations because all my variables are heavily positively skewed. After transformation, the residual plots look better, but the Q-Q plots less so. Removing outliers/influential points or using per capita data does not help. And for Regression 3, heteroscedasticity persists even after transformation. – dave Nov 26 '20 at 17:47