I'm working with this data set and trying to build a model that predicts the variable normexam. I've tried the following models in sklearn, adding dummies for the categorical variables, and got the following training and test scores and MSE:

Linear Regression

Train score: 0.4724095680342516 Test score: 0.47158145208518587 MSE: 0.5086992393971124

Lasso Regression with cross validation

Train score: 0.3371872274722998 Test score: 0.3493746420343967 MSE: 0.6263455853993455

Ridge Regression with cross validation

Train score: 0.47220571167398906 Test score: 0.4705641103119571 MSE: 0.5096786164236958

Least Angle Regression

Train score: 0.4657328365514144 Test score: 0.46021208791002655 MSE: 0.5196443262627322

Random Forest Regression

Train score: 0.5251451214111218 Test score: 0.4124986409581193 MSE: 0.5655772222014314

As you can see, most of the scores are in the 0.40-0.50 range, and I need to improve on this.
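For reference, here is a minimal sketch of the kind of setup described above: dummy-encoding the categorical variables, splitting into train and test sets, and reporting the score and MSE for each model. It runs on synthetic data, so every column name other than normexam is hypothetical, and it assumes the reported scores come from sklearn's default `score` method, which returns R² for regressors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: columns other than normexam are hypothetical.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "prior_score": rng.normal(size=n),              # hypothetical numeric predictor
    "group": rng.choice(["a", "b", "c"], size=n),   # hypothetical categorical predictor
})
df["normexam"] = 0.6 * df["prior_score"] + rng.normal(scale=0.8, size=n)

# Dummy-encode the categorical columns; drop_first avoids a redundant column.
X = pd.get_dummies(df.drop(columns="normexam"), drop_first=True, dtype=float)
y = df["normexam"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso (CV)": LassoCV(cv=5),
    "Ridge (CV)": RidgeCV(alphas=np.logspace(-3, 3, 13)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_r2": model.score(X_train, y_train),  # R^2 on the training set
        "test_r2": model.score(X_test, y_test),     # R^2 on the held-out set
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }
    print(name, results[name])
```

Evaluating every model on the same split keeps the numbers comparable across models.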

I'm not an expert, so I was hoping someone could help me with the following questions:

  1. Why does neither Lasso nor Ridge improve the prediction accuracy? Does this say anything about the data?
  2. I've tried adding all sorts of interaction variables, as well as squares and cubes. What should I try next to improve my model?
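To make the second question concrete, here is a minimal sketch of one way to generate squares, cubes and interaction terms and compare the resulting models by cross-validation. It assumes sklearn's `PolynomialFeatures` in a pipeline with `RidgeCV`. The data is synthetic and the feature matrix is a hypothetical stand-in for the real predictors, so the numbers it prints say nothing about the actual data set.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with one genuine interaction effect (x1 * x2).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=400)

scores = {}
for degree in (1, 2, 3):
    pipe = make_pipeline(
        # Adds all powers and interaction terms up to the given degree.
        PolynomialFeatures(degree=degree, include_bias=False),
        # Puts the expanded features on a common scale before regularizing.
        StandardScaler(),
        RidgeCV(alphas=np.logspace(-3, 3, 13)),
    )
    # Mean 5-fold cross-validated R^2 for this degree.
    scores[degree] = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    print(degree, scores[degree])
```

Putting the expansion inside the pipeline means the scaler and the model are refit on each fold, so the cross-validated score is not contaminated by the held-out rows.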
Melanie