
I'd like to start by saying I'm not a statistician: I have statistics education at the Master's level, but no specialization or advanced work experience.

I'm currently trying to regress financial return data on a set of 4 independent variables (the previous 78 quarters, with no missing data), and I'm running into an issue: when I run a multiple regression using all 4 independent variables together, the adjusted $R^2$ is high, but several of the "key" variables (those that fundamentally should be significant) have high p-values (> 0.05).

However, when I run each variable independently as a simple regression against financial returns, every one of them is significant (p < 0.05).

I've looked around on this forum but haven't been able to come up with an answer as to why, or how I should proceed. I calculated VIFs for each X and they are all < 5, so multicollinearity doesn't seem to be an issue, and I've conducted a Breusch-Pagan test, so heteroskedasticity doesn't appear to be an issue either.
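For concreteness, here is a minimal R sketch of those checks; the data frame `dat` and the column names `ret` and `x1`–`x4` are placeholders, not my actual variable names:

```r
# Placeholder names: 'dat' holds the 78 quarterly observations,
# 'ret' is the financial return, 'x1'..'x4' are the predictors.
library(car)     # for vif()
library(lmtest)  # for bptest()

full <- lm(ret ~ x1 + x2 + x3 + x4, data = dat)
summary(full)    # joint model: adjusted R^2 and per-coefficient p-values

# Each predictor on its own in a simple regression
for (v in c("x1", "x2", "x3", "x4")) {
  print(summary(lm(reformulate(v, response = "ret"), data = dat))$coefficients)
}

vif(full)     # variance inflation factors (all < 5 here)
bptest(full)  # Breusch-Pagan test for heteroskedasticity
```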

The purpose of this analysis is to determine importance (what factors have influenced financial return the most), and it is not to forecast financial return.

My questions are:

  1. Should I include the "insignificant" variables in my analysis?
  2. If yes, how can I determine the "level of importance" of each variable (e.g., X1 represents 20% of the total explanatory power of the model)?

EDIT: Here is an image of the regression results and correlation tables:


Thank you!

– Andy
  • Welcome to the site. Nicely posed question. I suggest you make clear what each variable is; think through, and share about, what role each predictor might play relative to the others, in a non-statistical sense; show details on collinearity statistics; study zero-order, partial, and part correlations; and perhaps obtain partial regression plots. Cheers ~ – rolando2 Jan 04 '23 at 20:32
  • Thank you Rolando! I updated the regression table to include the X variables and added a correlation table.

    The variables show some correlation, and the overall model is significant (very low Significance F, which I believe is the p-value on the F-test). However, from my understanding, as long as the VIFs are < 5 for all variables (which they are), multicollinearity shouldn't be impacting the overall regression results.

    Any ideas on if it's safe to proceed with the model as is?

    – Andy Jan 04 '23 at 21:02
  • Proceed with using the model for what? – Dave Jan 04 '23 at 21:23
  • @Dave - proceed with using the model to determine the level of explanatory power all of these variables have on financial return. Basically, do you think this model is reliable in its current form? And if not, why/what should I do to make a reliable one? – Andy Jan 04 '23 at 21:28
  • Financial returns (esp. stocks) usually lead other economic variables, so think about lead/lag relationships. The composition of your financial returns data matters a lot. – Graham Bornholt Jan 04 '23 at 21:36

2 Answers


A few thoughts:

The phenomenon you're experiencing is most likely because the independent variables are correlated with one another, as your correlation table suggests. You may be relying a little too heavily on the VIF < 5 rule of thumb.

If you really want to think about which independent variable is most correlated with your dependent variable, you might simply look at the correlation between each independent variable and the dependent variable, one at a time.
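For example, a quick sketch in R (with `dat`, `ret`, and `x1`–`x4` as placeholder names):

```r
# Correlation of each predictor with the dependent variable, one at a time
sapply(dat[, c("x1", "x2", "x3", "x4")], function(x) cor(x, dat$ret))
```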

This doesn't preclude you from also building a larger regression model.

As mentioned in the comments, even if you are looking at simple correlations, it's useful to look at a lag for these indicators. For example, you could have GDPThisQuarter, GDPPreviousQuarter, and so on.
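A sketch of how such lagged predictors could be constructed, assuming a quarterly variable such as GDP in a placeholder data frame `dat`:

```r
# Shift GDP back by one quarter to create GDPPreviousQuarter
dat$GDP_lag1 <- c(NA, head(dat$GDP, -1))  # base R
# dat$GDP_lag1 <- dplyr::lag(dat$GDP, 1)  # equivalent, with dplyr

# Correlation of the lagged predictor with returns (drop the NA row)
cor(dat$GDP_lag1, dat$ret, use = "complete.obs")
```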

Looking at the simple correlations is often a good place to start. I also advise plotting the bivariate relationships to see their shape.
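For instance (placeholder names again):

```r
# Scatterplot matrix of the response against all predictors
pairs(dat[, c("ret", "x1", "x2", "x3", "x4")])

# Or one relationship at a time, with the fitted simple-regression line
plot(dat$x1, dat$ret)
abline(lm(ret ~ x1, data = dat))
```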

– Sal Mangiafico

Concerning your first question: p-values (which are monotonically decreasing transformations of the absolute t statistic) can be used as a measure of variable importance, provided there is no strong multicollinearity. In fact, the function varImp of the R package caret simply returns the absolute value of the t statistic. According to this measure, variables with low |t| (i.e. high p-value) are "unimportant". It might be, however, that including such a variable still reduces the cross-validated mean squared error (MSE), so I would check this, too. A proxy for the leave-one-out MSE is the Akaike Information Criterion (AIC) (the two are asymptotically equivalent), so you can test this instead. This, however, only measures the effect of dropping each variable alone while keeping all the others; moreover, it does not describe the contribution of that variable to $R^2$.
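A minimal sketch of both checks, again with placeholder names (`dat`, `ret`, `x1`–`x4`) for your data:

```r
library(caret)

fit <- lm(ret ~ x1 + x2 + x3 + x4, data = dat)

# caret's varImp() for an lm object returns the absolute t statistics
varImp(fit)

# AIC when each term is dropped in turn (a proxy for the leave-one-out MSE)
drop1(fit, k = 2)
```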

Other methods are based on decomposing $R^2$ and measuring the contribution of each variable. There are different ways to do this, and each gives different results. For a comprehensible overview and an R package, see

U. Grömping: "Relative Importance for Linear Regression in R: The Package relaimpo." Journal of Statistical Software 17(1), pp. 1-27 (2006)

The problem with decomposing $R^2$ by adding variables sequentially, as in a Type I ANOVA, is that the results depend on the order in which the variables are added. A workaround is to average over all variable orderings, which has a runtime of order $O(p!)$ for $p$ parameters and is therefore not feasible for models with many parameters. With only a few parameters, as in your case, this is no problem, and it might be a method worth trying for measuring variable importance.
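With the relaimpo package cited above, this permutation-averaged decomposition (its "lmg" metric) can be obtained roughly as follows (placeholder names again):

```r
library(relaimpo)

fit <- lm(ret ~ x1 + x2 + x3 + x4, data = dat)

# "lmg" averages the sequential R^2 contributions over all orderings;
# rela = TRUE rescales the shares so that they sum to 100%
calc.relimp(fit, type = "lmg", rela = TRUE)
```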

– cdalitz