I would like to include the sign of X in a linear regression to highlight its impact on Y (see the scatter plot below). I first thought of a dummy taking the value 1 if X is positive and 0 if it is negative, but I had difficulty interpreting it, especially because of the dummy variable trap. So I finally went with the following independent variables:
- absolute value of X
- sign of X
The results are as follows:
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 0.255
Model: OLS Adj. R-squared: 0.254
Method: Least Squares F-statistic: 334.7
Date: Sat, 25 Mar 2023 Prob (F-statistic): 6.51e-187
Time: 09:08:30 Log-Likelihood: 2567.9
No. Observations: 2938 AIC: -5128.
Df Residuals: 2934 BIC: -5104.
Df Model: 3
Covariance Type: nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.2208      0.004     50.206      0.000       0.212       0.229
sign_X              -0.0700      0.004    -15.913      0.000      -0.079      -0.061
np.abs(X)            0.0088      0.003      2.818      0.005       0.003       0.015
sign_X:np.abs(X)    -0.0157      0.003     -4.987      0.000      -0.022      -0.010
==============================================================================
Omnibus: 2593.479 Durbin-Watson: 0.948
Prob(Omnibus): 0.000 Jarque-Bera (JB): 153874.664
Skew: 3.917 Prob(JB): 0.00
Kurtosis: 37.577 Cond. No. 20.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Am I allowed to do this? Wouldn't it be cleaner with a dummy? I am unsure because this model produces a large change in Y when X switches sign, and I feel that would not be the case with a dummy. I have also seen that the errors are autocorrelated, so I will have to add variables to the model.
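A quick check I could run to compare the two codings is sketched below. It uses made-up data (not my actual data set) and plain NumPy least squares instead of statsmodels. The names `pos`, `X_sign` and `X_dummy` are only illustrative. It assumes X is never exactly zero, so that the sign equals 2 * dummy - 1.

```python
import numpy as np

# Synthetic data, for illustration only (not the data behind the table above)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.2 - 0.07 * np.sign(x) + 0.01 * np.abs(x) + rng.normal(scale=0.05, size=x.size)

sign_x = np.sign(x)           # +1 / -1 coding
pos = (x > 0).astype(float)   # 1 / 0 dummy coding
abs_x = np.abs(x)
ones = np.ones_like(x)

# Same model structure under the two codings: main effects plus interaction
X_sign = np.column_stack([ones, sign_x, abs_x, sign_x * abs_x])
X_dummy = np.column_stack([ones, pos, abs_x, pos * abs_x])

beta_sign, *_ = np.linalg.lstsq(X_sign, y, rcond=None)
beta_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

# Compare the fitted values produced by the two codings
fitted_sign = X_sign @ beta_sign
fitted_dummy = X_dummy @ beta_dummy
same_fit = np.allclose(fitted_sign, fitted_dummy)
print(same_fit)
print(beta_sign)
print(beta_dummy)
```

If this reasoning is right, both codings span the same column space, so the fitted values should match and only the meaning of the coefficients changes: the dummy coefficient would be twice the sign coefficient, and the dummy intercept would be the fitted level for negative X.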
The equation of the regression line is as follows:
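# Predicted Y for an array of X values x_, using the coefficient names from the summary
y_hat = (
    model.params["Intercept"]
    + model.params["sign_X"] * np.sign(x_)
    + model.params["np.abs(X)"] * np.abs(x_)
    + model.params["sign_X:np.abs(X)"] * np.abs(x_) * np.sign(x_)
)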
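To see what this equation implies on each side of zero, here is a small sketch that plugs the coefficients from the regression table into it for sign +1 and sign -1. The helper `line_for_sign` is a hypothetical name used only for this illustration, and the slopes are expressed with respect to |X|.

```python
# Coefficients copied from the regression table above
b0, b_sign, b_abs, b_inter = 0.2208, -0.0700, 0.0088, -0.0157

def line_for_sign(s):
    """Return (intercept, slope with respect to |X|) for sign s in {+1, -1}."""
    return b0 + b_sign * s, b_abs + b_inter * s

pos_intercept, pos_slope = line_for_sign(+1)
neg_intercept, neg_slope = line_for_sign(-1)

# Change in the fitted value when X crosses zero from negative to positive
jump_at_zero = pos_intercept - neg_intercept  # equals 2 * b_sign

print(round(pos_intercept, 4), round(pos_slope, 4))
print(round(neg_intercept, 4), round(neg_slope, 4))
print(round(jump_at_zero, 4))
```

With these numbers, the fitted line is about 0.1508 - 0.0069 * |X| for positive X and about 0.2908 + 0.0245 * |X| for negative X, so the fitted value changes by about -0.14 at zero, which is twice the sign_X coefficient.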
