I'm going crazy with this and I hope someone can help me.
I want to perform LASSO regression on the "House Prices: Advanced Regression Techniques" dataset from Kaggle, using R. The dataset is about predicting house prices from a set of features.
The data is messy and has a lot of missing values. To deal with this, I looked at the meaning of the variables and noticed that almost all of the missing values actually indicate the absence of a feature of the house. For example, there is a categorical variable describing the type of alley access to the property; there, NA means "no alley access", so I added it as a proper level of the factor. I did the same with almost all of the affected variables.
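For reference, this is roughly what that recoding looks like for one of those columns (a minimal sketch, assuming the column is named Alley as in the Kaggle data description; the other affected factors are handled the same way):

# turn NA in a factor column into an explicit "None" level
house$Alley <- as.character(house$Alley)
house$Alley[is.na(house$Alley)] <- "None"
house$Alley <- factor(house$Alley)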
At this point I'm using the glmnet library to perform the lasso regression. The glmnet function works with a matrix, so I use model.matrix to turn the categorical variables into dummy variables, then combine the result with the numeric variables and pass everything to glmnet. cv.glmnet selects the best lambda, and it returns an absurd value (2540) that shrinks all coefficients to zero (I know why, due to the penalty factor). Here is my code:
library(glmnet)

# drop GarageYrBlt
house <- house[, -which(names(house) == "GarageYrBlt")]

# numeric predictors, without the response
numeric_house <- house[, sapply(house, is.numeric)]
numeric_house <- numeric_house[, -which(names(numeric_house) == "SalePrice")]

# dummy-code the factor variables (drop the intercept column), then combine with the numeric ones
X <- model.matrix(house$SalePrice ~ ., data = house[, sapply(house, is.factor)])[, -1]
x.lasso <- as.matrix(data.frame(numeric_house, X))
y <- house$SalePrice

# 5-fold cross-validation to select lambda
fit.lassoKCV <- cv.glmnet(x.lasso, y, alpha = 1, nfolds = 5)
(lambda.KCV <- fit.lassoKCV$lambda.min)
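This is how I check what the cross-validation selected (just glmnet's standard plot/coef accessors, shown for completeness):

plot(fit.lassoKCV)                    # CV error versus log(lambda)
coef(fit.lassoKCV, s = "lambda.min")  # coefficients at lambda.min: everything except the intercept is zero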
So I tried other values of lambda:
grid <- seq(1, 100, length = 1000)
fit.lasso <- glmnet(x.lasso, y, alpha = 1, lambda = grid, standardize = TRUE)
but all the coefficients are zero anyway.
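For what it's worth, this is how I inspect the coefficients along that grid (again just the standard glmnet methods):

plot(fit.lasso, xvar = "lambda", label = TRUE)  # coefficient paths against log(lambda)
coef(fit.lasso, s = 1)                          # coefficients at the smallest lambda in the grid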
My response variable (SalePrice) is not in the matrix x.lasso; I've checked that several times.
Please help me, I'm desperate. I don't know whether the problem is in my code or whether it's a theoretical one (multicollinearity?).