
I am working on telecom data for churn modelling. I have 18 categorical and 2 numeric variables (total charges and monthly charges) in my data set. After handling the missing values, I checked for outliers. From the boxplot and the boxplot statistics, I can see that I have quite a lot of outliers. Please see the output below:

> data=dataset_churn
> boxplot.stats(data$TotalCharges)
$stats
[1]   18.800  403.775 1411.900 3867.800 8684.800

$n
[1] 7043

$conf
[1] 1346.683 1477.117

$out
 [1] 676695 707635 15669 16195 51125 53764 116405 47331 10726 395315 13986 203305 490485 20488 535645 216505 58694 71113 703065 630085 30555 788725 275785
[24] 60299 18131 309465 747585 469065 18605 486935 367315 144365 9187 20127 154035 227185 318295 296405 23754 439125 614585 584465 60815 68975 11086 266675
[47] 40264 31716 561775 16926 10685 75443

> boxplot.stats(data$MonthlyCharges)
$stats
[1] 18.25 35.80 70.50 90.15 118.75

$n
[1] 7043

$conf
[1] 69.47676 71.52324

$out
 [1] 2015 9635 10845 2525 2025 192 7385 443 1051 9385 356 943 644 8315 8215 8485 786 8985 854 10895 1005 8545 4965 10955 8335 8545 2635
[28] 693 6565 243 1079 948 9475 8405 7465 5515 1053 257 6995 1033 209 207 698 8505 10435 9125 5395 998 4455 10645 8985 1965 203 416
[55] 5505 2035 354 686 443 557 988 744 3545 10495

My question is: how should I handle these outliers? I tried sqrt() and log() transformations, but neither worked. So I thought that removing all the outliers, or replacing them with the median of the data, might be an option. (But none of the code I wrote or found worked; I always get the error below.) Would you recommend deleting or replacing the data?

Or do you have any other recommendation for me?

I tried to remove the outliers with the code below, but it does not work and gives me the following error:

> outliers <- boxplot(data$TotalCharges, plot=FALSE)$out

> data$TotalCharges[which(data$TotalCharges %in% outliers),]
Error in data$TotalCharges[which(data$TotalCharges %in% outliers), ] : incorrect number of dimensions

> data$TotalCharges = data$TotalCharges[-which(data$TotalCharges %in% outliers),]
Error in data$TotalCharges[-which(data$TotalCharges %in% outliers), ] : incorrect number of dimensions

I am quite new to data analysis and struggling a lot with my very first assignment. I would really appreciate any help you might provide!
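On the "incorrect number of dimensions" error specifically: data$TotalCharges is a plain vector, so it accepts a single index and no comma; the comma form belongs to the data frame itself. The following is a minimal sketch on simulated data (the values and the name data_clean are made up for illustration, not taken from the real dataset), and it shows only how the indexing works, not whether removing the rows is a good idea.

```r
# Simulated stand-in for the poster's data; all values are hypothetical.
set.seed(1)
data <- data.frame(
  TotalCharges   = c(runif(200, 20, 8000), 676695, 707635),
  MonthlyCharges = runif(202, 18, 118)
)

# Values beyond the boxplot whiskers (1.5 * IQR rule by default).
outliers <- boxplot.stats(data$TotalCharges)$out

# A vector has one dimension: use a single index, no trailing comma.
flagged <- data$TotalCharges[data$TotalCharges %in% outliers]

# To drop whole rows, index the data frame (rows go before the comma).
# A logical mask is safer than -which(...): when nothing matches,
# -which() gives an empty index and silently drops every row.
keep <- !(data$TotalCharges %in% outliers)
data_clean <- data[keep, , drop = FALSE]

print(flagged)
print(dim(data_clean))
```

Assigning a shortened vector back to data$TotalCharges would fail in any case, because a data frame column must keep the same length as the other columns; filtering rows of the whole data frame avoids that.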

  • Why not just leave them in? Do you suspect something about those points is incorrect (measurement error or a typo, for instance)? – Dave Sep 09 '21 at 11:48
  • Hey Dave thank you for your response. I believe one of the assumptions of logistic regression is to remove the outliers. And indeed looking at the rest of the data, I suspect that there is a mistake with the very high figures. That's why I decided to remove or replace them. – newbie-data-student Sep 09 '21 at 11:59
  • Your belief is incorrect: logistic regression makes no assumptions about the presence or absence of "outliers." I use quotation marks here because what constitutes an outlier depends on what you compare it to. It looks like you use univariate methods to identify extreme values. That's OK, but it's practically irrelevant for a regression analysis. You should instead be concerned about high leverage observations and observations with extreme residuals. – whuber Sep 09 '21 at 12:21
  • You did not give us any details of your logistic regression. It might well be unaffected by your univariate "outliers". Give us some details ... but make the regression robust at the outset, spline the continuous predictors, ... See https://stats.stackexchange.com/questions/169348/how-should-i-check-the-assumption-of-linearity-to-the-logit-for-the-continuous-i – kjetil b halvorsen Sep 09 '21 at 12:44
  • Hi all, thanks for the responses. I am trying to model churn using 18 categorical variables (such as gender, age category, whether the customer has phone or internet service, etc.) and 2 numeric variables (total charges and monthly charges). When I check the mean and median values of my continuous variables, and their plots, I see that both of them are highly skewed and have outliers (50-60 out of about 7000 observations). According to some online sources, outliers might create issues and need to be handled before the logistic regression model is built. That's why I wanted to fix the outlier issue. – newbie-data-student Sep 09 '21 at 12:50
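
The checks suggested in the comments (leverage, extreme residuals, splining a continuous predictor) could look roughly like the sketch below. It uses simulated data, not the real churn set: the Contract column, the coefficients in the simulated churn mechanism, and the cutoffs are all assumptions chosen for illustration, and the cutoffs are common rules of thumb rather than fixed rules.

```r
library(splines)  # ships with R; provides ns() for natural splines

# Simulated stand-in data; column names and effect sizes are hypothetical.
set.seed(42)
n <- 500
sim <- data.frame(
  MonthlyCharges = runif(n, 18, 118),
  Contract = factor(sample(c("Monthly", "Yearly"), n, replace = TRUE))
)
p <- plogis(-2 + 0.03 * sim$MonthlyCharges - 1 * (sim$Contract == "Yearly"))
sim$Churn <- rbinom(n, 1, p)

fit <- glm(Churn ~ MonthlyCharges + Contract, data = sim, family = binomial)

lev  <- hatvalues(fit)       # leverage of each observation
res  <- rstandard(fit)       # standardized deviance residuals
cook <- cooks.distance(fit)  # overall influence on the fitted coefficients

# Rule-of-thumb flags (assumed thresholds, adjust to taste).
high_leverage  <- which(lev > 2 * mean(lev))
large_residual <- which(abs(res) > 2)
most_influential <- head(order(cook, decreasing = TRUE), 5)

# Same model with a natural spline for the continuous predictor,
# compared against the linear-in-the-logit fit.
fit_spline <- glm(Churn ~ ns(MonthlyCharges, df = 3) + Contract,
                  data = sim, family = binomial)
print(anova(fit, fit_spline, test = "Chisq"))
```

Observations flagged this way are candidates for a closer look (for example, checking whether a value is a data-entry error), not for automatic deletion.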
