I am a beginner in R. I am fitting a logistic regression with around 80 independent variables using the glm function in R. The dependent variable is churn, which indicates whether a customer churned or not. I want to know how to identify the right combination of variables to get a good predictive logistic regression model in R, and how to do the same for a good decision tree in R (I am using the ctree function from the party package; a sketch of the kind of ctree call I mean is shown after the glm code below).
So far I have used the drop1 function and anova(LogMdl, test="Chisq"), where LogMdl is my logistic regression model, to drop unwanted variables from the predictive model, but the maximum accuracy I was able to achieve was only 60%.
I am also not sure whether I am using the drop1 and anova functions correctly: with drop1 I removed the variable whose deletion gave the lowest AIC, and with anova I removed variables with p-value > 0.05. A minimal sketch of these calls is just below.
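These are the calls in question (only a sketch; LogMdl is the fitted model given further down):
# AIC for the model with each term dropped in turn; I removed the term
# whose deletion gave the lowest AIC and then refit the model
drop1(LogMdl, test = "Chisq")
# Sequential likelihood-ratio (chi-square) tests for each term;
# I removed terms whose p-value was greater than 0.05
anova(LogMdl, test = "Chisq")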
Kindly help me identify the right set of variables for both the logistic regression and the decision tree model so that I can raise the predictive accuracy to close to 90% or more, if possible.
library(party)
setwd("D:/CIS/Project work")
CellData  <- read.csv("Cell2Cell_SPSS_Data - Orig.csv")
trainData <- subset(CellData, calibrat == "1") # calibration or training data set
testData  <- subset(CellData, calibrat == "0") # validation or test data set
LogMdl <- glm(churn ~ revenue + mou + recchrge + directas + overage + roam +
              changem + changer + dropvce + blckvce + unansvce + custcare +
              threeway + mourec + outcalls + incalls + peakvce + opeakvce +
              dropblk + callfwdv + callwait + months + uniqsubs + actvsubs +
              phones + models + eqpdays + customer + age1 + age2 + children +
              credita + creditaa + creditb + creditc + creditde + creditgy +
              creditz + prizmrur + prizmub + prizmtwn + refurb + webcap +
              truck + rv + occprof + occcler + occcrft + occstud + occhmkr +
              occret + occself + ownrent + marryun + marryyes + marryno +
              mailord + mailres + mailflag + travel + pcown + creditcd +
              retcalls + retaccpt + newcelly + newcelln + refer + incmiss +
              income + mcycle + creditad + setprcm + setprc + retcall,
              data = trainData, family = binomial(link = "logit"),
              control = list(maxit = 50))
ProbMdl <- predict(LogMdl, newdata = testData, type = "response")
testData$churndep <- rep(0, nrow(testData))  # initialise predicted churndep to 0 (not churned) for every test record
testData$churndep[ProbMdl > 0.5] <- 1        # records with predicted probability > 0.5 are treated as churned
table(testData$churndep, testData$churn)     # confusion matrix: predicted vs actual churn
mean(testData$churndep != testData$churn)    # misclassification (error) rate
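For the decision tree part of the question, this is the kind of ctree call I have in mind (only a sketch: the factor conversion, the churnF/TreeMdl/TreePred names, and the handful of predictors shown are illustrative choices, not the final variable set I am asking about):
# Classification tree with party::ctree -- churn is coded 0/1, so it is
# converted to a factor to grow a classification (not regression) tree
trainData$churnF <- factor(trainData$churn)
testData$churnF  <- factor(testData$churn)   # keeps the two data frames parallel
TreeMdl  <- ctree(churnF ~ revenue + mou + overage + changem + months + eqpdays,
                  data = trainData)
TreePred <- predict(TreeMdl, newdata = testData)   # predicted class labels
table(TreePred, testData$churn)                    # predicted vs actual churn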
Link for documentation of variables: https://drive.google.com/file/d/0B9y78DHd3U-DZS05VndFV3A4Ylk/
Link for Dataset (.csv file) : https://drive.google.com/file/d/0B9y78DHd3U-DYm9FOV9zYW15bHM/
I could not post the output of dput since the data set is more than 5 MB, so I have zipped the file and placed it at the link above.
Description of important variables:
* churn is the variable that says whether a customer actually churned or not.
* churndep is the variable that needs to be predicted in the test (validation) data; it has to be compared with the churn variable, which is already populated with the actual churn outcome.
For both churn and churndep, a value of 1 means churned and 0 means not churned.
Comments:
* dput(<your data set>). – mlegge Jan 30 '15 at 19:31
* trainData <- subset(CellData,calibrat=="1") testData <- subset(CellData,calibrat=="0") LogMdl = glm(formula=churn ~ ., data=trainData, family=binomial(link="logit"), control = list(maxit = 50)) ..... Continued in next comment – Jan 31 '15 at 16:38
* x and y? I've seen a lot of people reporting good results for their regressions when they've chosen x and y. – ely Jan 31 '15 at 19:37
* glm(churn ~ ., <etc>) will fit a model using all of the variables in the data set (except the specified response variable) as predictors – Ben Bolker Jan 31 '15 at 20:00
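For clarity, the churn ~ . shorthand mentioned in the last comment would look roughly like this (a sketch only; the keepCols/LogMdlAll names are mine, and the assumption that calibrat and churndep are the only non-predictor columns comes from the question text, not from checking the file):
# Fit on every remaining column; drop bookkeeping columns first, otherwise
# calibrat (and churndep, if present) would themselves enter as predictors
keepCols  <- setdiff(names(trainData), c("calibrat", "churndep"))
LogMdlAll <- glm(churn ~ ., data = trainData[, keepCols],
                 family = binomial(link = "logit"), control = list(maxit = 50))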