
I am using logistic regression as a classifier to predict the variable Group (old/young) from a variable called Sentence_length. When I plot density functions of Sentence_length by Group, it is clear that the two groups are totally non-separable using Sentence_length:

Density plot of the raw scores

And indeed, the coefficient of Sentence_length in a GLM model that predicts Group from only Sentence_length is completely non-significant (p=0.84).
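For reference, here is a minimal sketch of how that single-predictor model can be fit and inspected (not necessarily my exact call; it assumes data_all is loaded and Group is coded as a factor, as in my code below):

# Fit the single-predictor logistic regression on the full data set and
# inspect the Sentence_length coefficient (assumes data_all is loaded and
# Group is a factor)
model_length <- glm(Group ~ Sentence_length, data = data_all, family = 'binomial')
summary(model_length)$coefficients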

However, when I use logistic regression with leave-one-out cross-validation, I obtain an AUC of 0, which implies complete separation in the logit space (though in the opposite direction: old predicted as young and young as old). Here I plot the density functions of the predicted values (logit scores), so you can see that the groups are indeed totally separable:

Density plot of the predicted values (logit scores)

This makes no sense, because the predicted values in logistic regression are supposed to be a monotonic transformation of the input. Are they not? I.e., higher score on sentence length should increase the probability of one class vs. the other, and hence increase its logit transformation as well. It can't be that from a completely uninformative variable that provides absolutely no separation between the groups, we receive perfectly separable scores. What am I missing here? How can this be?

Here is my code written in R to obtain the predicted values and the AUC:

# Initialize data frame to save the scores and true labels
Real_labels <- dplyr::select(data_all, Subject, Group)
Real_labels <- data.frame(Real_labels, Score_length=NA)

Subjects <- unique(Real_labels$Subject)

# LOO-CV
for (sub in Subjects) {
  train_set <- data_all[data_all$Subject != sub,]

  model_train_length <- glm(data=train_set, Group ~ Sentence_length, family='binomial')

  sub_score_length <- predict.glm(model_train_length, newdata = data_all[data_all$Subject == sub,], type = "link")

  Real_labels$Score_length[Real_labels$Subject==sub] <- sub_score_length
}

# Calculate AUC ROC
roc_length <- pROC::roc(Real_labels$Group, Real_labels$Score_length, auc=T, percent=T, ci=F, plot=F, show.thres=F, quiet=T, direction="<")
roc_length$auc

Please find the .csv file with the relevant data frame here: https://drive.google.com/file/d/1qVbm1ZbQmTsrAapW_pmJWTH6H4XoZ1Ds/view?usp=sharing

Your help is highly appreciated! Thank you!

Galit
  • I’m not so convinced that your groups are so inseparable. Judging from the first KDE plot, a sentence length between 50 (maybe 70) and 100 seems to be a strong signal of the “old” group. – Dave Jun 08 '23 at 21:44
  • But the majority from both classes do overlap... I think this weird situation actually stems from the leave-one-out procedure, because when I use 5-fold cross-validation instead, I get an AUC of 0.45, close to what I would expect from a random classifier. But where is the flaw in LOO-CV? – Galit Jun 09 '23 at 17:00
  • Those distributions look rather different to me, and with sufficient data, a flexible model should be able to give better-than-random performance. – Dave Jun 09 '23 at 18:42
  • Let's say we have two individuals, one belonging to the old group and one belonging to the young group, and suppose they happen to have the exact same Sentence_length score. How can a linear classifier such as logistic regression predict different outputs for them? This is practically what happens here. With such an overlap between the distributions, how can a classification accuracy of 100% or 0% even be possible? – Galit Jun 09 '23 at 19:28
  • 1) A logistic regression can (and, from the graph, looks like it should here) have nonlinear features like polynomials and splines. // 2) I am with you that you probably have a bug in your code. Perhaps consider using LOOCV out of a package instead of coding it yourself. – Dave Jun 09 '23 at 19:33
  • Importantly, to clarify: The task is not to test whether the distributions are different or not, but to classify individuals. With such an overlap in the distributions, even if they are truly different, a linear classifier can never perfectly separate all of the tokens into two homogeneous groups. – Galit Jun 09 '23 at 19:38
  • If you just regress on the sentence length, then your performance will be poor, sure. If you allow for flexibility with, say, splines, then your logistic regression should be able to perform decently. // Testing if two distributions are different and using the values of those variables to predict group membership are extremely related. This answer discusses that, and this question tries to dig deeper. – Dave Jun 09 '23 at 19:42

1 Answer


So after thinking about this for a while, a friend suggested that this could be related to the bi-modality in the data (or rather, the fact that there is more than one "peak" in the density plot), because when he log-transformed the data he did not get AUC=0 (though still not close to 0.5).

To test this hypothesis, I synthesized a dataset drawn from a random bi-modal distribution and assigned 'A' and 'B' labels alternately, so that there is no real group difference:

n <- 100
m <- 0
s <- 1

The degree of bi-modality is determined by delta (distance between peaks):

delta <- 5

d1 <- rnorm(n, m, s)
d2 <- rnorm(n, m+delta, s)

d_all <- c(d1,d2)

df_all <- data.frame(Subject = c(1:length(d_all)), Label = rep(c('A', 'B'), 0.5*length(d_all)), Value = d_all)

library(ggplot2)  # for plotting

ggplot(df_all, aes(x=Value)) + geom_histogram(aes(fill=Label)) + theme_bw()

Distribution of the randomly generated dataset

When I run LOO-CV on this random data, and plot the predicted scores (y-axis) against the original values (x-axis), I get a very weird pattern. While it is clear that there is no separability on the x-axis (original variable), we can draw a line that perfectly separates the two groups on the y-axis (logit scores):
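The plot was produced roughly like this (a sketch, not necessarily my exact code; the LOO loop follows the same logic as in my question, and Label is converted to a factor so glm() accepts it):

# Sketch: LOO link-scale predictions for the synthetic data, plotted against
# the original values (assumes df_all and ggplot2 from the snippet above)
df_all$Label <- as.factor(df_all$Label)   # glm(family='binomial') needs a factor/0-1 response
df_all$prediction <- NA
for (sub in df_all$Subject) {
  train_set <- df_all[df_all$Subject != sub,]
  fit <- glm(Label ~ Value, data=train_set, family='binomial')
  df_all$prediction[df_all$Subject == sub] <- predict(fit, newdata=df_all[df_all$Subject == sub,], type='link')
}

ggplot(df_all, aes(x=Value, y=prediction, colour=Label)) + geom_point() + theme_bw()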

Predicted LOO scores (y-axis) against the original values (x-axis)

And indeed, AUC=0 (or AUC=1, depending on the hypothesis; since my original one resulted in AUC=0, I will stick to that).

The degree of bi-modality is determined by the delta parameter. Here is the probability of perfect separability (AUC=0) depending on delta:

iterations <- 100
delta_values <- c(0:10)
n <- 100
m <- 0
s <- 1

df_AUC <- data.frame(Delta = rep(delta_values, each=iterations), Iteration = rep(c(1:iterations), length(delta_values)), AUC=NA)

create_dist <- function(d1, d2, delta) {
  d_all <- c(d1, d2 + delta)
  df_all <- data.frame(Subject = c(1:length(d_all)),
                       Label = rep(c('old', 'young'), 0.5*length(d_all)),
                       Value = d_all)
  df_all$Label <- as.factor(df_all$Label)
  df_all
}

run_LOOCV <- function(df_all) {
  df_all$prediction <- NA
  for (sub in df_all$Subject) {
    train_set <- df_all[df_all$Subject != sub,]
    model_train <- glm(data=train_set, Label ~ Value, family='binomial')
    sub_score <- predict.glm(model_train, newdata=df_all[df_all$Subject == sub,], type='link')
    df_all$prediction[df_all$Subject==sub] <- sub_score
  }
  roc_model <- pROC::roc(df_all$Label, df_all$prediction, auc=T, percent=T, ci=F, plot=F, quiet=T, direction='<')
  roc_model$auc
}

for (i in c(1:iterations)) {
  d1 <- rnorm(n, m, s)
  d2 <- rnorm(n, m, s)
  for (d in delta_values) {
    df_all <- create_dist(d1, d2, d)
    df_AUC$AUC[df_AUC$Iteration == i & df_AUC$Delta == d] <- run_LOOCV(df_all)
  }
}
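The summary plot below was generated along these lines (a sketch, not necessarily my exact plotting code; it just aggregates df_AUC and assumes ggplot2 is loaded):

# Sketch: for each delta, the proportion of iterations that ended with AUC = 0
prob_sep <- aggregate(AUC ~ Delta, data = df_AUC, FUN = function(a) mean(a == 0))
names(prob_sep)[2] <- 'P_AUC_0'

ggplot(prob_sep, aes(x = Delta, y = P_AUC_0)) +
  geom_point() + geom_line() + theme_bw() +
  ylab('Probability of AUC = 0 (perfect reverse separation)')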

perfect separability increases with larger bimodality

Iterating 100 times on random datasets, I plotted the probability of AUC=0. It does go up with more bimodality! The "danger" is smaller with multiple independent variables, or with a larger n, but it is still there (I tried, though I do not show the results here... Trust me).

What is the explanation of this?

The fitted logistic curve as a seesaw

Logistic curve fitted on all data

The red points are "category 1", and the black points are "category 0". The y-axis is the predicted probability of being in "category 1" (red). The fitted curve is almost flat at 0.5, like a horizontal seesaw lying on a balanced random dataset, because there is really no effect. This is true whether the data are bimodal or not, and regardless of sample size. However, this changes when we remove one point.

What happens in LOO-CV when we exclude one data point?

If we think of the prediction line (gray) as a seesaw, we can clearly see that with bi-modal data each point has high leverage and strongly affects the tilt of the seesaw.

Let's focus on the right side of the seesaw. This side is pushed up if there are more reds than blacks there, and pushed down if there are fewer reds than blacks.

If, in one of the LOO-CV iterations, we remove a red point from the right side, the seesaw is pushed down. So when we predict the score of that left-out red point, it gets a somewhat lower value than it would have from the full-data fit, where the data were more balanced and the seesaw was flatter. You can see the seesaw tilting down after removing one of the red points on the right side (gray line: balanced seesaw; black line: seesaw after leaving one right red point out):

Tilting the seesaw

The same rationale, only mirrored, applies to the left side of the seesaw: removing a red point there changes the red/black balance and pushes that side down, again resulting in a lower score for the left-out red point.

Here is the code that generated this seesaw toy example:

n <- 10
m <- 0
s <- 1

delta <- 10

d1 <- rnorm(n, m, s)
d2 <- rnorm(n, m+delta, s)

d_all <- c(d1,d2)
df_all <- data.frame(Subject = c(1:length(d_all)), Label = rep(c('a', 'b'), 0.5*length(d_all)), Value = d_all)
df_all$Label <- as.factor(df_all$Label)

model_all <- glm(data=df_all, formula=Label~Value, family='binomial')

df_all$prediction <- predict(object=model_all, newdata=df_all, type='response')

All the red points on the right:

red_subs_right <- df_all$Subject[df_all$Value>5 & df_all$Label=='b']

df_sub <- df_all[df_all$Subject!=red_subs_right[1],]

model_sub <- glm(data=df_sub, formula=Label~Value, family='binomial')

df_sub$prediction <- predict(object=model_sub, newdata=df_sub, type='response')

Flat seesaw (all data):

plot(prediction~Value, df_all, col=df_all$Label, ylim=c(0,1))
lines(prediction~Value, df_all, lwd=1, col='gray')

Tilted seesaw (subset of the data):

lines(prediction~Value, df_sub, lwd=1, col='black')
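To see the same effect numerically rather than visually, here is a small sketch (reusing df_all and the full-data predictions from above) that compares each point's full-data predicted probability with its leave-one-out prediction. With this toy data, P('b') tends to drop under LOO for the 'b' points and to rise for the 'a' points, which is exactly what flips the ranking:

# Sketch: full-data vs. leave-one-out predicted probabilities (reusing df_all)
df_all$loo_prediction <- NA
for (sub in df_all$Subject) {
  fit <- glm(Label ~ Value, data = df_all[df_all$Subject != sub,], family = 'binomial')
  df_all$loo_prediction[df_all$Subject == sub] <- predict(fit, newdata = df_all[df_all$Subject == sub,], type = 'response')
}

# Average shift in predicted P(Label == 'b') caused by leaving the point out
df_all$loo_shift <- df_all$loo_prediction - df_all$prediction
aggregate(loo_shift ~ Label, data = df_all, FUN = mean)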

Conclusion

Leave-one-out interferes with the balance of the seesaw, changing its tilt in every LOO iteration in a way that is consistent across data points of the same label, which creates this weird result of inflated or even perfect separability. But of course, using this curve to predict truly new data would be a disaster, because the tilt is fixed and essentially meaningless for data points that take no part in the fitting process.

The danger of bimodality for LOO-CV is mitigated by adding more independent variables (that are themselves perfectly normal) or by a larger n, but it is still there. We might not get perfect separability, but still inflated separability, and the chances of that increase with the degree of bi-modality, regardless of the number of variables in the model or of n. It can even happen with no bimodality at all (delta=0) - try it!
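For example, reusing create_dist() and run_LOOCV() from above (a quick sketch; the seed is only there to make the single draw reproducible):

# A single draw with no bimodality at all (delta = 0)
set.seed(123)   # hypothetical seed, only for reproducibility
n <- 100; m <- 0; s <- 1
d1 <- rnorm(n, m, s)
d2 <- rnorm(n, m, s)
run_LOOCV(create_dist(d1, d2, 0))   # LOO-CV AUC for this draw (in percent, because of percent=TRUE)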

To conclude, when cross-validating with leave-one-out, it is important to check whether the data are really normal (unimodal). If not, the chances increase that you get spurious separability in your predicted values!
