
When I fit a logistic regression in XLstat and then fit the same model in R on the same data (same variables, exactly the same data) using the following (essential) code, I get totally different coefficients. Could somebody explain to me why there is such a difference and how to replicate the results of XLstat in R?

    library(caTools)
    set.seed(88)
    split <- sample.split(train$Recommended, SplitRatio = 0.75)
    dresstrain <- subset(train, split == TRUE)
    dresstest <- subset(train, split == FALSE)
    model <- glm(one ~ two + three + four, data = dresstrain, family = binomial)

XLStat output

R output

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -2.295e+02  8.058e+06       0        1
    Altitude     -1.532e-01  1.033e+03       0        1
    Pool_length  -8.374e+00  8.042e+04       0        1
    Pool_breadth  1.063e+01  2.102e+05       0        1
    Pool_Depth   -4.799e+02  7.066e+06       0        1
    pH            8.422e+00  2.344e+05       0        1
    Conductivity  3.522e-01  3.790e+04       0        1
    TDS          -2.709e-01  7.375e+04       0        1
    Temperature   6.800e+00  2.010e+05       0        1
    Nitrate      -1.041e+03  7.301e+06       0        1
    Phosphate     3.807e+00  9.269e+04       0        1
    Sodium        5.410e+00  1.634e+05       0        1
    Ammonium     -2.277e+02  1.696e+06       0        1
    Potassium    -5.502e+01  1.133e+06       0        1
    Calcium       1.969e+01  3.628e+05       0        1
    Magnesium    -4.456e+01  1.221e+06       0        1
    Fluride       6.257e+00  7.875e+05       0        1
    Chloride      1.982e+01  6.618e+04       0        1
    Bromide      -5.380e+01  5.328e+05       0        1
    Sulphate      4.050e-01  3.086e+04       0        1

Stephan Kolassa
  • 123,354
Girish
  • 101
  • Do the two resulting models include the same variables? This is relevant information that you would want to add to the original question. – Jesper for President Jan 06 '20 at 15:44
  • Yes. They include the same variables. The data is exactly the same. – Girish Jan 06 '20 at 15:45
  • Could you maybe add output from R? – Jesper for President Jan 06 '20 at 15:50
  • Also add output from excel – kjetil b halvorsen Jan 06 '20 at 15:51
  • It could be due to the random splitting (even with a set.seed): can you check whether the models are the same if you estimate them on the whole data? – Vincent Guillemot Jan 06 '20 at 15:53
  • The reason I am asking about variables is that there could be a factor included, and XLstat and R may have different conventions for setting the reference level. – Jesper for President Jan 06 '20 at 15:55
  • @VincentGuillemot R and XLstat both use logit – Girish Jan 06 '20 at 16:08
  • @StopClosingQuestionsFast I don't understand your comment about factors. Kindly explain. – Girish Jan 06 '20 at 16:11
  • OK, look at the standard errors: they are huge. Your variables are doing a very poor job of explaining the dependent variable. When that is the case, the optimum of the underlying objective function (here the likelihood) is poorly identified numerically, if it is identified at all. – Jesper for President Jan 06 '20 at 16:19
  • @StopClosingQuestionFast I forgot to mention that in R there is a warning message: 'glm.fit: fitted probabilities numerically 0 or 1 occurred'. – Girish Jan 06 '20 at 16:28
  • Yes, OK, a warning is not an error. I've seen that message before but don't know when it usually appears; maybe someone else can comment on that. Maybe also check the variation in the dependent variable: it should not be almost all 0 or almost all 1. – Jesper for President Jan 06 '20 at 19:57
  • How large is your dataset? If your dataset does not contain enough information, that could explain the warning. Repeat the model fit using your entire dataset, without splitting off a test set. How do the results compare now? – Dave2e Jan 06 '20 at 20:17
  • @Dave2e There are 64 datapoints. – Girish Jan 07 '20 at 05:14

1 Answer


Could somebody explain to me why there is such a difference

If you have only 64 data points and estimate 20 parameters, your estimates will vary wildly, and the differences could indeed be due to different splits, or just to machine inaccuracies propagating through the fitting routine. Without your data, we can't tell.
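
As a rough illustration of how much the split alone can matter, here is a minimal sketch on simulated data, not on the asker's data. The data frame `dat`, the helper `fit_on_split()`, and the seeds are all made up for this example, and base R's `sample()` stands in for `caTools::sample.split()`. It fits the same 20-parameter model on two different 75% subsets of 64 rows and on all rows, then prints the coefficients side by side.

```r
# Simulated stand-in for the real data (an assumption, not the asker's data):
# 64 rows, 19 pure-noise predictors, and a binary outcome y.
set.seed(88)
n <- 64
p <- 19
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
dat$y <- rbinom(n, size = 1, prob = 0.5)

# Hypothetical helper: fit the 20-parameter model on a random 75% of the rows.
# Warnings (e.g. about fitted probabilities of 0 or 1) are suppressed here
# only to keep the output short.
fit_on_split <- function(seed) {
  set.seed(seed)
  idx <- sample(n, size = round(0.75 * n))
  suppressWarnings(glm(y ~ ., data = dat[idx, ], family = binomial))
}

fit_a <- fit_on_split(1)
fit_b <- fit_on_split(2)
fit_all <- suppressWarnings(glm(y ~ ., data = dat, family = binomial))

# Coefficients from the two splits and from the full data, side by side.
comparison <- cbind(split_a  = coef(fit_a),
                    split_b  = coef(fit_b),
                    all_rows = coef(fit_all))
print(round(comparison, 2))
```

If the columns disagree strongly on your own data as well, the split is a plausible source of the discrepancy. Fitting on all rows in both programs removes that source from the comparison.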

You are overfitting badly, which is also indicated by your very large standard errors. For 64 datapoints, you can't reliably estimate more than three or four parameters. Reduce your model drastically, or collect much more data (on the order of thousands of observations).
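
To make "reduce your model drastically" concrete, here is a hedged sketch, again on simulated data rather than the asker's. The names `dat`, `x1` to `x19`, and `events_per_param` are invented for this example. It counts how many observations of the rarer outcome class are available per estimated parameter, then compares the standard errors of a 20-parameter fit with those of a 4-parameter fit on the same 64 rows.

```r
# Simulated stand-in (an assumption): 64 rows, 19 predictors,
# of which only x1 is actually related to the outcome.
set.seed(42)
n <- 64
dat <- as.data.frame(matrix(rnorm(n * 19), nrow = n))
names(dat) <- paste0("x", 1:19)
dat$y <- rbinom(n, size = 1, prob = plogis(dat$x1))

# Observations in the rarer outcome class per estimated parameter
# (19 slopes plus the intercept).
n_params <- 20
events_per_param <- min(table(dat$y)) / n_params
print(events_per_param)

# Full 20-parameter model versus a drastically reduced one.
full  <- suppressWarnings(glm(y ~ ., data = dat, family = binomial))
small <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

# Standard errors of the three shared slopes under each model.
se <- function(fit) summary(fit)$coefficients[, "Std. Error"]
shared <- c("x1", "x2", "x3")
print(round(cbind(full = se(full)[shared], small = se(small)[shared]), 2))
```

With 64 rows the rarer class has at most 32 observations, so a 20-parameter model has fewer than two per parameter. On data like this, the reduced model will typically show noticeably smaller standard errors, though the exact numbers depend on the simulated draw.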

and how to replicate the results of XLstat in R?

To be honest, XLSTAT is not covering itself in glory (see also here). I would very much question the goal of replicating its results in R, and stick with R in the first place, but that is only my personal opinion.

Stephan Kolassa
  • 123,354