I am having trouble with a very simple linear regression: I cannot get the skewness/kurtosis and homoscedasticity assumptions to be met, even after removing outliers, adding polynomial terms, and applying log and Box-Cox transformations.
I have two datasets (sample1 and sample2), both with the columns:
- someX: results of a measurement that differs between sample1 and sample2
- database1: Dollar amounts (in millions) from one database
- database2: Dollar amounts (in millions) from another database
The goal is to fit four simple regressions (sketched in code right after this list):
- Regression 1: database1 ~ someX (sample1)
- Regression 2: database2 ~ someX (sample1)
- Regression 3: database1 ~ someX (sample2)
- Regression 4: database2 ~ someX (sample2)
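For reference, a minimal sketch of the four baseline fits, assuming sample1 and sample2 are data frames with the columns described above:

# Baseline fits for the four regressions (untransformed, before any remedies)
mod1 <- lm(database1 ~ someX, data = sample1)  # Regression 1
mod2 <- lm(database2 ~ someX, data = sample1)  # Regression 2
mod3 <- lm(database1 ~ someX, data = sample2)  # Regression 3
mod4 <- lm(database2 ~ someX, data = sample2)  # Regression 4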
For each of the four I have tried several combinations of remedies: removing outliers/influential points, adding polynomial terms, applying log and Box-Cox transformations, and using per capita values (someX represents people). But I can only meet all assumptions for Regression 1 (log-log):
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2), data = sample1,
   subset = -c(160, 100, 132))
Coefficients:
    (Intercept)      log(someX)  I(log(someX)^2)
        0.56298         0.93676         -0.05951
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod1)
                     Value  p-value  Decision
Global Stat         5.6807  0.22429  Assumptions acceptable.
Skewness            0.3372  0.56144  Assumptions acceptable.
Kurtosis            3.7731  0.05208  Assumptions acceptable.
Link Function       1.4262  0.23239  Assumptions acceptable.
Heteroscedasticity  0.1442  0.70414  Assumptions acceptable.
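For completeness, this assessment was produced along these lines with the gvlma package (reconstructed from the Call fields shown above):

library(gvlma)
mod1 <- lm(log(database1) ~ log(someX) + I(log(someX)^2),
           data = sample1, subset = -c(160, 100, 132))
gvlma(mod1)  # printing the gvlma object shows the assessment table above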
For the other three, I cannot meet all assumptions. For example, for Regression 3 (log-log):
Call:
lm(formula = log(database1) ~ log(someX) + I(log(someX)^2) +
   I(log(someX)^3), data = sample2)
Coefficients:
    (Intercept)      log(someX)  I(log(someX)^2)  I(log(someX)^3)
      -11.90320         7.60609         -1.20549          0.06387
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod3)
                       Value    p-value  Decision
Global Stat         30.29064  4.271e-06  Assumptions NOT satisfied!
Skewness            21.38611  3.755e-06  Assumptions NOT satisfied!
Kurtosis             1.33551  2.478e-01  Assumptions acceptable.
Link Function        0.05342  8.172e-01  Assumptions acceptable.
Heteroscedasticity   7.51559  6.117e-03  Assumptions NOT satisfied!
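Alongside the gvlma tests, I also check the standard base-R residual diagnostics; a minimal sketch for this model:

mod3 <- lm(log(database1) ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
           data = sample2)
par(mfrow = c(2, 2))  # arrange the four diagnostic panels in a 2x2 grid
plot(mod3)            # residuals vs. fitted, Q-Q, scale-location, leverage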
And here is Regression 3 with a Box-Cox transformation of the response:
Call:
lm(formula = database1.tran ~ log(someX) + I(log(someX)^2) +
   I(log(someX)^3), data = sample2)
Coefficients:
    (Intercept)      log(someX)  I(log(someX)^2)  I(log(someX)^3)
       -22.2267         13.2958          -2.0871           0.1108
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = mod3)
                       Value   p-value  Decision
Global Stat         15.26539  0.004181  Assumptions NOT satisfied!
Skewness             1.41446  0.234317  Assumptions acceptable.
Kurtosis             6.35226  0.011723  Assumptions NOT satisfied!
Link Function        0.03864  0.844168  Assumptions acceptable.
Heteroscedasticity   7.46004  0.006308  Assumptions NOT satisfied!
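Here database1.tran is the Box-Cox-transformed database1. As a sketch of one common way to construct such a response with MASS::boxcox (the lambda below is taken at the profile-likelihood maximum, which is an assumption, not necessarily the exact value I used):

library(MASS)
# Profile the Box-Cox log-likelihood for the cubic log-log model
bc <- boxcox(lm(database1 ~ log(someX) + I(log(someX)^2) + I(log(someX)^3),
                data = sample2), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # lambda maximizing the profile likelihood
sample2$database1.tran <- (sample2$database1^lambda - 1) / lambda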
I chose these polynomial terms for Regression 3 because they look like the best fit.
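A more formal way to compare candidate degrees (a sketch of an alternative, not what I originally did) is by AIC:

# Compare polynomial degrees 1-3 of log(someX) for Regression 3 by AIC
fits <- lapply(1:3, function(d)
  lm(log(database1) ~ poly(log(someX), d, raw = TRUE), data = sample2))
sapply(fits, AIC)  # smaller AIC suggests a better fit/complexity trade-off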
The reproducible code and all data are here: https://github.com/d-paulus/regression-examples
A binder of the code and data is here: https://mybinder.org/v2/gh/d-paulus/regression-examples/HEAD

Comments:

… gvlma output might claim). This is mainly because regression is used for so many different purposes that it would be impossible to impose universal assumptions of this nature. For a principled approach to transforming variables in regression, see https://stats.stackexchange.com/a/3530/919 for instance. – whuber Nov 26 '20 at 16:46

… gvlma output mentioned them. I thought they might be an indicator of problems in my data. Can I ignore those assumptions then? Thanks for pointing to the thread. I've experimented before with log and Box-Cox transformations because all my variables are heavily positively skewed. After transformation, the residual plots look better, but the Q-Q plots less so. Removing outliers/influential points or using per capita data does not help. And for Regression 3, heteroscedasticity persists even after transformation. – dave Nov 26 '20 at 17:47