I am a beginner in R. I am fitting a logistic regression with around 80 independent variables using the glm function in R. The dependent variable is churn, which indicates whether a customer churned or not. I want to know how to identify the right combination of variables to get a good predictive logistic regression model in R, and how to do the same for a good decision tree in R (I am using the ctree function from the party package; a sketch of the kind of ctree call I mean is shown after the glm code below).
So far I have used the drop1 function and anova(LogMdl, test="Chisq"), where LogMdl is my logistic regression model, to drop unwanted variables from the predictive model, but the maximum accuracy I was able to achieve was only 60%.
I am also not sure whether I am using the drop1 and anova functions correctly: with drop1 I removed the variable whose deletion gave the lowest AIC, and with anova I removed variables with p-value > 0.05. A minimal sketch of these calls is just below.
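These are the calls in question (only a sketch; LogMdl is the fitted model given further down):
# AIC for the model with each term dropped in turn; I removed the term
# whose deletion gave the lowest AIC and then refit the model
drop1(LogMdl, test = "Chisq")
# Sequential likelihood-ratio (chi-square) tests for each term;
# I removed terms whose p-value was greater than 0.05
anova(LogMdl, test = "Chisq")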
Kindly help me identify the right set of variables for both the logistic regression and the decision tree model so that I can raise the predictive accuracy to close to 90% or more, if possible.
library(party)
setwd("D:/CIS/Project work")
CellData  <- read.csv("Cell2Cell_SPSS_Data - Orig.csv")
trainData <- subset(CellData, calibrat == "1") # calibration or training data set
testData  <- subset(CellData, calibrat == "0") # validation or test data set
LogMdl <- glm(churn ~ revenue + mou + recchrge + directas + overage + roam +
              changem + changer + dropvce + blckvce + unansvce + custcare +
              threeway + mourec + outcalls + incalls + peakvce + opeakvce +
              dropblk + callfwdv + callwait + months + uniqsubs + actvsubs +
              phones + models + eqpdays + customer + age1 + age2 + children +
              credita + creditaa + creditb + creditc + creditde + creditgy +
              creditz + prizmrur + prizmub + prizmtwn + refurb + webcap +
              truck + rv + occprof + occcler + occcrft + occstud + occhmkr +
              occret + occself + ownrent + marryun + marryyes + marryno +
              mailord + mailres + mailflag + travel + pcown + creditcd +
              retcalls + retaccpt + newcelly + newcelln + refer + incmiss +
              income + mcycle + creditad + setprcm + setprc + retcall,
              data = trainData, family = binomial(link = "logit"),
              control = list(maxit = 50))
ProbMdl <- predict(LogMdl, newdata = testData, type = "response")
testData$churndep <- rep(0, nrow(testData))  # initialise predicted churndep to 0 (not churned) for every test record
testData$churndep[ProbMdl > 0.5] <- 1        # records with predicted probability > 0.5 are treated as churned
table(testData$churndep, testData$churn)     # confusion matrix: predicted vs actual churn
mean(testData$churndep != testData$churn)    # misclassification (error) rate
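For the decision tree part of the question, this is the kind of ctree call I have in mind (only a sketch: the factor conversion, the churnF/TreeMdl/TreePred names, and the handful of predictors shown are illustrative choices, not the final variable set I am asking about):
# Classification tree with party::ctree -- churn is coded 0/1, so it is
# converted to a factor to grow a classification (not regression) tree
trainData$churnF <- factor(trainData$churn)
testData$churnF  <- factor(testData$churn)   # keeps the two data frames parallel
TreeMdl  <- ctree(churnF ~ revenue + mou + overage + changem + months + eqpdays,
                  data = trainData)
TreePred <- predict(TreeMdl, newdata = testData)   # predicted class labels
table(TreePred, testData$churn)                    # predicted vs actual churn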
Link for documentation of variables: https://drive.google.com/file/d/0B9y78DHd3U-DZS05VndFV3A4Ylk/
Link for Dataset (.csv file) : https://drive.google.com/file/d/0B9y78DHd3U-DYm9FOV9zYW15bHM/
I could not post the output of dput since the data set is more than 5 MB, so I have zipped the file and placed it at the link above.
Description of important variables:
* churn is the variable that says whether a customer actually churned or not.
* churndep is the variable that needs to be predicted in the test (validation) data; it has to be compared with the churn variable, which is already populated with the actual churn outcome.
For both churn and churndep, a value of 1 means churned and 0 means not churned.
Comments:
* dput(<your data set>). – mlegge Jan 30 '15 at 19:31
* trainData <- subset(CellData,calibrat=="1") testData <- subset(CellData,calibrat=="0") LogMdl = glm(formula=churn ~ ., data=trainData, family=binomial(link="logit"), control = list(maxit = 50)) ..... Continued in next comment – Jan 31 '15 at 16:38
* x and y? I've seen a lot of people reporting good results for their regressions when they've chosen x and y. – ely Jan 31 '15 at 19:37
* glm(churn ~ ., <etc>) will fit a model using all of the variables in the data set (except the specified response variable) as predictors – Ben Bolker Jan 31 '15 at 20:00
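For clarity, the churn ~ . shorthand mentioned in the last comment would look roughly like this (a sketch only; the keepCols/LogMdlAll names are mine, and the assumption that calibrat and churndep are the only non-predictor columns comes from the question text, not from checking the file):
# Fit on every remaining column; drop bookkeeping columns first, otherwise
# calibrat (and churndep, if present) would themselves enter as predictors
keepCols  <- setdiff(names(trainData), c("calibrat", "churndep"))
LogMdlAll <- glm(churn ~ ., data = trainData[, keepCols],
                 family = binomial(link = "logit"), control = list(maxit = 50))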