
I would like to include the sign of X in a linear regression to highlight the impact it has on Y (see the scatter plot below). I first thought of a dummy taking the value 1 if X is positive and 0 if negative, but I had difficulty interpreting it, especially because of the dummy variable trap. So I finally went with the following independent variables (a sketch of the fit follows the list):

  • absolute value of X
  • sign of X
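
For reference, the model was fit along these lines. This is a minimal sketch: the data frame df and its column names are placeholder assumptions, but the formula matches the coefficient names in the output below.

    import numpy as np
    import statsmodels.formula.api as smf

    # sign_X is +1 where X > 0 and -1 where X < 0 (np.sign is 0 at exactly 0)
    df["sign_X"] = np.sign(df["X"])

    # Main effects plus interaction: Intercept, sign_X, np.abs(X), sign_X:np.abs(X)
    model = smf.ols("Y ~ sign_X * np.abs(X)", data=df).fit()
    print(model.summary())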

The results are as follows:

                         OLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.255
Model:                            OLS   Adj. R-squared:                  0.254
Method:                 Least Squares   F-statistic:                     334.7
Date:                Sat, 25 Mar 2023   Prob (F-statistic):          6.51e-187
Time:                        09:08:30   Log-Likelihood:                 2567.9
No. Observations:                2938   AIC:                            -5128.
Df Residuals:                    2934   BIC:                            -5104.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.2208      0.004     50.206      0.000       0.212       0.229
sign_X              -0.0700      0.004    -15.913      0.000      -0.079      -0.061
np.abs(X)            0.0088      0.003      2.818      0.005       0.003       0.015
sign_X:np.abs(X)    -0.0157      0.003     -4.987      0.000      -0.022      -0.010
==============================================================================
Omnibus:                     2593.479   Durbin-Watson:                   0.948
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           153874.664
Skew:                           3.917   Prob(JB):                         0.00
Kurtosis:                      37.577   Cond. No.                         20.2
==============================================================================

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Am I allowed to do this? Wouldn't it be cleaner with a dummy? I feel quite unsure because the model implies a large jump in Y when X switches sign, and I don't think that would be the case with a dummy. Also, I have seen that the errors are autocorrelated, so I'll have to add variables to the model.

The equation of the regression line, evaluated over a grid x_ of X values, is:

    x_ = np.linspace(df["X"].min(), df["X"].max(), 500)  # grid over observed X
    p = model.params
    y_hat = (p["Intercept"]
             + p["sign_X"] * np.sign(x_)
             + p["np.abs(X)"] * np.abs(x_)
             + p["sign_X:np.abs(X)"] * np.sign(x_) * np.abs(x_))
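
To overlay the fitted line on the scatter (the plot details are illustrative):

    import matplotlib.pyplot as plt

    plt.scatter(df["X"], df["Y"], s=5, alpha=0.3, label="data")
    plt.plot(x_, y_hat, color="red", label="fit")
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.legend()
    plt.show()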

[Figure: scatter plot of Y against X with the fitted regression lines overlaid]

    I'll separate out from my answer a question about the Durbin-Watson statistic. Are these data time series in any sense? – Nick Cox Mar 25 '23 at 09:48
  • Welcome to CV! You are "allowed" to create any explanatory variables you like -- but your diagnostic plots show this is a particularly poor model of the data. A strong transformation of $Y$ would help -- something around the reciprocal will do -- likely followed by a transformation of $X$, perhaps the log. I illustrate one exploratory method of finding such transformations at https://stats.stackexchange.com/a/35717/919. Other options are available, but this kind of exploration is always a useful start. – whuber Mar 25 '23 at 13:24
  • Hi @whuber. In fact I tried some transformations of both X and Y, but since my goal was to highlight a rather 'simple' relationship for a client presentation, I decided not to choose one. As commented below, I finally went with a linear spline constrained to have a zero first derivative at max(X). – Paul Lefebvre Mar 27 '23 at 10:26

1 Answer

This is as much a substantive issue as a statistical one.

The argument for such a parameterisation is that there could be a jump at zero. On the other hand, getting a jump out of a model fit is not especially convincing unless you explore alternative parameterisations, say linear or cubic splines or scatterplot smoothers.
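
For instance, a linear spline with one knot takes only a couple of lines. This is a sketch only; the knot location k, the data frame df, and the column names are all illustrative assumptions:

    import numpy as np
    import statsmodels.formula.api as smf

    k = 2.0  # illustrative knot location
    df["hinge"] = np.maximum(df["X"] - k, 0.0)  # 0 below the knot, X - k above

    # Two joined line segments: the slope changes by the hinge coefficient at the knot
    spline_fit = smf.ols("Y ~ X + hinge", data=df).fit()
    print(spline_fit.summary())

Constraining the slope to vanish above the knot reduces this to Y ~ np.minimum(X, k): a sloped segment that goes flat at k.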

How much sense that makes for your context would be something that people familiar with your field might be able to comment on if you explained what X is. Anonymising it doesn't make anything clearer.

My short answer to the question is that it is not a matter of whether it is allowed. The issue is what parameterisation fits the data and the application both accurately and parsimoniously, which is the trade-off in virtually any model-fitting.

Even without knowing your context, the model fit looks implausible: the fitted values fall into utterly distinct groups, but the observed outcomes don't. There also appears to be curvature that the fitted lines don't capture. And if the outcome variable can't be negative, a model that ever predicts negative values is qualitatively wrong.
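
If Y is strictly positive, one standard way to rule out negative predictions is to model log Y instead. A minimal sketch, reusing the assumed df and column names from above:

    import numpy as np
    import statsmodels.formula.api as smf

    # Predictions back-transformed with exp() are positive by construction
    log_fit = smf.ols("np.log(Y) ~ X", data=df).fit()
    y_hat_positive = np.exp(log_fit.predict(df))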

Nick Cox
  • Thanks for the report. Did you try just an exponential of some kind, say by modelling log Y in terms of X? – Nick Cox Mar 27 '23 at 10:25
  • Thanks @Nick Cox for pointing out that the negative predictions make the whole model wrong. In fact, Y being a volatility, it cannot be negative. I finally ended up with a linear spline, constrained to have a zero first derivative at max(X). The result is a strong negative relationship when X < 2 and a constant average value of Y when X > 2, which is much more reliable and economically justified. – Paul Lefebvre Mar 27 '23 at 10:26
  • In fact, yes, I tried a log(Y) transformation. The relationship was quite linear but highly affected by outliers. That issue was resolved with the linear spline through the constant predicted value for X > 2. – Paul Lefebvre Mar 27 '23 at 10:42
  • I'd expect log y to dampen outliers! – Nick Cox Mar 27 '23 at 10:51