I'm working with this data set and trying to build a model that predicts the variable normexam. I've fitted the following models in sklearn, with dummy variables for the categorical features, and got the following train and test scores and MSE:
Linear Regression
Train score: 0.4724095680342516 Test score: 0.47158145208518587 MSE: 0.5086992393971124
Lasso Regression with cross validation
Train score: 0.3371872274722998 Test score: 0.3493746420343967 MSE: 0.6263455853993455
Ridge Regression with cross validation
Train score: 0.47220571167398906 Test score: 0.4705641103119571 MSE: 0.5096786164236958
Least Angle Regression
Train score: 0.4657328365514144 Test score: 0.46021208791002655 MSE: 0.5196443262627322
Random Forest Regression
Train score: 0.5251451214111218 Test score: 0.4124986409581193 MSE: 0.5655772222014314
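As you can see, most of the test scores fall between 0.4 and 0.5 (these are R² values, which is what sklearn's score method returns for regressors, not accuracies). I need to improve this.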
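A quick sanity check on how the two reported metrics relate, as a minimal sketch. It assumes the scores are R² from sklearn's score method and that the MSE was computed on the same test set. Under that assumption R² = 1 - MSE / Var(y_test), so each (test score, MSE) pair should imply the same target variance. The dictionary keys and the helper name implied_variance are mine, for illustration only.

```python
# (test R^2, MSE) pairs copied from the results above
results = {
    "Linear Regression": (0.47158145208518587, 0.5086992393971124),
    "Lasso (CV)": (0.3493746420343967, 0.6263455853993455),
    "Ridge (CV)": (0.4705641103119571, 0.5096786164236958),
    "Least Angle Regression": (0.46021208791002655, 0.5196443262627322),
    "Random Forest": (0.4124986409581193, 0.5655772222014314),
}


def implied_variance(r2, mse):
    """Invert R^2 = 1 - MSE / Var(y) to get Var(y) = MSE / (1 - R^2)."""
    return mse / (1.0 - r2)


# Hypothetical helper output: one implied variance per model
implied = {name: implied_variance(r2, mse) for name, (r2, mse) in results.items()}

for name, var in implied.items():
    print(f"{name:24s} implied Var(y_test) = {var:.4f}")
```

If the implied variances agree across models, the test score and the MSE are two views of the same quantity, so ranking the models by either one gives the same order.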
I'm not an expert, so I was hoping someone could help me with the following questions:
- Why does neither Lasso nor Ridge improve on plain linear regression? Does this say anything about the data?
- I've tried adding all sorts of interaction variables, as well as squared and cubed terms. What should I do next to improve my model?
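For context, here is a minimal sketch of the kind of comparison described in the second question: a regularized linear model with and without interaction and squared terms, scored by cross-validation instead of a single train/test split. The data is synthetic and purely an assumption (three numeric features with one true interaction); it stands in for the real data set, whose columns are not shown here. All of the calls used (PolynomialFeatures, StandardScaler, RidgeCV, make_pipeline, cross_val_score) are standard sklearn.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): y depends on x0 and on the product x1 * x2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=500)

alphas = np.logspace(-3, 3, 13)  # candidate regularization strengths for RidgeCV

# Baseline: ridge on the raw features only
linear_model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))

# Same model with all degree-2 terms (squares and pairwise interactions) added
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas),
)

# 5-fold cross-validated R^2 for each pipeline
linear_scores = cross_val_score(linear_model, X, y, cv=5, scoring="r2")
poly_scores = cross_val_score(poly_model, X, y, cv=5, scoring="r2")

print("ridge, raw features:      mean R^2 =", linear_scores.mean())
print("ridge, degree-2 features: mean R^2 =", poly_scores.mean())
```

On this synthetic data the degree-2 pipeline should score higher because a real interaction was built in. On the real data, seeing no gain from such terms would suggest that the remaining error is not something extra polynomial terms of the existing features can capture.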