I keep running into warnings in RStudio when I run stepwise selection on subsets where p > n. ISLR 6.4.3 mentions that forward stepwise selection can be useful for high-dimensional data, and I'm trying it out for learning purposes. All the examples I have found fit a full model first, but in all of them n > p. Could someone point me in the right direction or tell me what I'm missing? Sample code below.
library(caret) # provides createDataPartition(), trainControl() and train()
df <- read.csv('my_sample_data.csv')
df1 <- df[,1:200] # using a subset of the features to explore stepwise
set.seed(123)
# x <- df1[,-1]
# y <-df1[,1]
# train test split
trainIndex <- createDataPartition(df1$Age, p = .8,
                                  list = FALSE,
                                  times = 1)
training <- df1[trainIndex,]
testing <- df1[-trainIndex,]
dim(training) # 130 200: 130 samples, 200 columns (Age plus 199 features)
# parameter tuning
fitControl <- trainControl(
method = "cv",
number = 5)
set.seed(42)
step.fit <- train(Age ~ ., data = training,
                  method = "glmStepAIC",
                  trControl = fitControl,
                  trace = FALSE)
glmStepAIC is not part of base R, so your code is not reproducible. How are you specifying that it use forward instead of backward stepwise regression? BTW, for learning purposes, applying algorithms to datasets for which they were not intended is usually not productive. – whuber Jan 15 '22 at 18:53
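On the forward-versus-backward point, here is a minimal sketch of how forward selection can be requested explicitly when p > n. It is not the code from the question: it swaps caret's glmStepAIC for base R's `step()`, and it runs on simulated stand-in data. The names `sim`, `null.fit`, `upper` and `fwd.fit` are hypothetical, and the cap of `steps = 10` is an arbitrary choice for illustration. The idea is to start from the intercept-only model, which is always estimable, and pass the candidate predictors as the upper scope, so the full model is never fitted.

```r
# Simulated stand-in data (hypothetical): 130 rows, 199 predictors, so p > n.
set.seed(1)
n <- 130
p <- 199
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
sim <- data.frame(Age = 40 + 3 * X[, 1] - 2 * X[, 2] + rnorm(n), X)

# Start from the intercept-only model instead of the full model.
null.fit <- lm(Age ~ 1, data = sim)

# Upper scope: a formula naming every candidate predictor.
# Only the formula is built here; the full model itself is never fitted.
upper <- reformulate(setdiff(names(sim), "Age"), response = "Age")

# Forward selection by AIC. `steps` caps how many terms can be added,
# which keeps the selected model well below n (10 is an arbitrary cap).
fwd.fit <- step(null.fit,
                scope = list(lower = ~ 1, upper = upper),
                direction = "forward", steps = 10, trace = 0)

# Names of the selected predictors (intercept dropped).
selected <- setdiff(names(coef(fwd.fit)), "(Intercept)")
print(selected)
```

Capping the number of steps matters here because AIC alone tends to keep adding noise variables when there are many candidates, and the fit degenerates as the model size approaches n. In practice the cap would be chosen by cross-validation rather than fixed in advance.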