I'm looking for a robust way to gradually build up a regression model -- namely, I have a linear base model with a robust set of predictors for which I'm fairly certain I have near-optimal weights, and I want to incrementally add features/predictors without degrading performance. In a sense, I'm trying to devise a good way to keep noisy/uninformative predictors from entering my pipeline.
With a single model like ridge or lasso, the larger predictor set (which includes the original predictors) does not perform as well as the original predictors alone. Ideally I'd like to avoid changing basis (i.e. no PCA) so that I retain some explainability.
What I've looked into
I was looking for ways to combine models and came across @whuber's fantastic response to Question on how to normalize regression coefficient, which shows how a multiple regression can be expressed as a sequence of single-variable regressions. I was able to replicate this, but I noticed that using one multiple regression degrades performance: it forces the model to sort out the cross-term interactions, and I don't have enough data for it to find the correct weights.
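For concreteness, here is roughly the idea I replicated, on toy data (the data and names below are illustrative only, not my actual pipeline): each multiple-regression coefficient can be recovered from a single-variable regression of the target on that predictor after it has been residualized against the others.

```python
# Toy reproduction of the residualization idea: the multiple-regression
# coefficient of each predictor equals a single-variable regression of y on
# that predictor residualized against the other predictors.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=n)

def ols(A, b):
    """Least-squares coefficients of b regressed on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

beta_full = ols(X, y)                 # ordinary multiple regression

beta_seq = np.empty(p)
for j in range(p):
    others = np.delete(X, j, axis=1)
    x_resid = X[:, j] - others @ ols(others, X[:, j])    # residualize x_j
    beta_seq[j] = (x_resid @ y) / (x_resid @ x_resid)    # simple regression slope

print(np.allclose(beta_full, beta_seq))   # True, up to numerical error
```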
I've also looked into step-wise regression, though I've not attempted to implement it as it has some drawbacks.
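By step-wise regression I mean something along the lines of forward selection with cross-validated scoring, e.g. via scikit-learn's SequentialFeatureSelector. This is just a sketch with synthetic data; the estimator choice and feature counts are placeholders.

```python
# Sketch of forward step-wise selection with cross-validated scoring.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=2000, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

selector = SequentialFeatureSelector(
    RidgeCV(alphas=np.logspace(-3, 3, 13)),
    n_features_to_select=10,          # or 'auto' with a tol threshold
    direction="forward",
    scoring="neg_root_mean_squared_error",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the retained predictors
```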
I've also considered stacking, which will probably be my last resort -- ideally I'd like to find a way to figure out optimal weights for those weak/uninformative predictors, if that's a possibility.
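If I do end up stacking, I picture something like giving each linear model its own slice of the predictors and letting a meta-learner combine their out-of-fold predictions. A rough sketch on synthetic data (the column splits and model choices are placeholders):

```python
# Sketch of stacking several linear models that each see a different slice of
# the predictors; a meta-learner combines their out-of-fold predictions.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=3000, n_features=60, n_informative=15,
                       noise=10.0, random_state=0)

def on_columns(cols, model):
    """Restrict a model to a subset of columns."""
    return make_pipeline(ColumnTransformer([("keep", "passthrough", cols)]), model)

stack = StackingRegressor(
    estimators=[
        ("base",   on_columns(list(range(0, 20)),  RidgeCV())),   # 'strong' block
        ("extra1", on_columns(list(range(20, 40)), LassoCV())),   # candidate block
        ("extra2", on_columns(list(range(40, 60)), LassoCV())),
    ],
    final_estimator=RidgeCV(),   # linear combiner; could be non-linear instead
    cv=5,
)
stack.fit(X, y)
```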
Where I need help
I'm currently thinking of fitting multiple linear models that each have their own distinct 'flavour', but I need to find a way to combine them. If I combine them by sequential matching, as in @whuber's answer, I'll just end up with multiple regression, which I've already seen perform worse than my original predictor set.
What I think is happening is that the errors from one model cancel out the correct predictions from other models, and since there are many more weak models than strong ones, the errors overwhelm the stronger models' predictions.
Because I'm working with linear models, one option I've been considering is to introduce some kind of non-linearity when combining them -- but then what kind of non-linearity?
Another thing I've been wondering: if I know my original set of predictors is near-optimally weighted, should I be freezing those coefficients when fitting the other predictors?
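Concretely, by freezing I mean something like the sketch below: keep the base model's coefficients fixed and let a sparse model explain only the leftover residual using the new predictors. The arrays and the split here are synthetic stand-ins for my data.

```python
# Sketch of 'freezing' the base model: keep its coefficients fixed and fit a
# sparse correction on the residuals using the candidate predictors.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 27000
X_base = rng.normal(size=(n, 75))     # original, well-understood predictors
X_new = rng.normal(size=(n, 725))     # candidate predictors, mostly noise
y = X_base @ rng.normal(size=75) + 0.1 * X_new[:, 0] + rng.normal(size=n)

Xb_tr, Xb_te, Xn_tr, Xn_te, y_tr, y_te = train_test_split(
    X_base, X_new, y, test_size=0.1, random_state=0)

# 1) Fit (or load) the base model and freeze it.
base = LinearRegression().fit(Xb_tr, y_tr)

# 2) Fit a sparse correction on the residuals only; lasso should zero out
#    most of the uninformative new predictors.
resid_tr = y_tr - base.predict(Xb_tr)
correction = LassoCV(cv=5, n_jobs=-1).fit(Xn_tr, resid_tr)

# 3) Final prediction = frozen base + residual correction.
pred_te = base.predict(Xb_te) + correction.predict(Xn_te)
print(np.sqrt(np.mean((y_te - pred_te) ** 2)))   # hold-out RMSE
```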
At any rate, I would greatly appreciate any guidance on this.
Edit
Performance is evaluated on hold-out data from a 90/10 split, where the hold-out set has approximately 1,000 samples.
I'm evaluating performance using out-of-sample RMSE, the percentage of prediction errors that exceed a variable threshold, and the correlation between predictions and the realized target. These metrics generally agree on the strength of my predictions.
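In code, the three metrics are roughly as follows (y_true, y_pred, and the threshold are placeholders):

```python
# The three evaluation metrics, roughly.
import numpy as np

def evaluate(y_true, y_pred, threshold):
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))                 # out-of-sample RMSE
    pct_exceed = np.mean(np.abs(err) > threshold)     # share of large errors
    corr = np.corrcoef(y_pred, y_true)[0, 1]          # prediction/target correlation
    return rmse, pct_exceed, corr
```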
Regarding the sample set, the original dataset has ~27,000 samples with 75 predictors. In this experiment, I'm looking to scale up to ~800 predictors (many of which should have zero or near-zero weight) with the same number of samples. There's quite a bit of collinearity and noise in the data, and the predictors are all z-score-like (odd, unit-variance, etc.). I've attempted to orthogonalize these and have gotten mixed results (I'm not sure whether noise is a contributing factor in the orthogonalization).
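For reference, one common way to orthogonalize without changing basis via PCA is a QR/Gram-Schmidt pass over the standardized design matrix, as in the rough sketch below; I'm not wedded to this particular procedure, and X here is just a stand-in for my predictors.

```python
# One reading of 'orthogonalize': QR-decompose the standardized predictor
# matrix so each column is replaced by its part orthogonal to earlier columns.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score-like columns

Q, R = np.linalg.qr(X)                     # columns of Q are orthonormal
X_orth = Q * np.sqrt(X.shape[0])           # rescale back to roughly unit variance
```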