I want to fit a GLM with a dependent variable named "exposure" and independent variables named "counts" and "distance".
# Since the dependent variable I want to use (exposure) is continuous and not in the range 0 to 1, I converted it to 1 (top 25%) and 0 (bottom 75%).
for (i in 1:9) {
  a = data.table::fread(paste0("first_dataset_", i, ".csv"))
  b = quantile(a$exposure, 0.75)  # 75th percentile of exposure
  binary_exposure = ifelse(a$exposure > b, 1, 0)  # 1 = top 25%, 0 = the rest
  a = cbind(a, binary_exposure)
  print(table(a$binary_exposure))  # check the 0/1 split; print() is needed inside a loop
  c = paste0("C:/Users/82109/Documents/week_", i, "_sample.csv")
  write.csv(a, c)
}
I divided the original dataset into a training set (training_set_5) and a test set (test_set_2). I then fitted and validated the model as follows. Please look at the row that thresholds the prediction at 0.5.
# My final goal was to build a confusion matrix, but the output of predict() is a continuous probability between 0 and 1. So I thought that before building the confusion matrix I have to convert it into a binary factor with levels 1 and 0 again.
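training_glm = glm(binary_exposure ~ count + distance, data = training_set_5, family = binomial)
summary(training_glm)
car::vif(training_glm)  # assuming vif() comes from the car package
prediction = predict(training_glm, newdata = test_set_1, type = "response")  # predicted probabilities
prediction = factor(ifelse(prediction > 0.5, "1", "0"), levels = c("0", "1"))  # fixed levels, even if one class is never predicted
test_set_2 = factor(test_set_1$binary_exposure, levels = c("0", "1"))
caret::confusionMatrix(prediction, test_set_2, positive = "1")  # assuming confusionMatrix() comes from caret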
But at this point I became confused. Should I use the same threshold when I first convert the dependent variable and when I convert the output of predict() into a factor? In the first code block I set 0.75 in the quantile() call, so that observations in the top 25% of the dependent variable get 1, which is what I want. In the second code block I set 0.5 in the ifelse() call, as every YouTube tutorial does (none of them changed 0.5 to another value, but none of them showed the entire process I am trying to do either).
When people explain logistic regression, they usually just say "If the model's probability is bigger than 0.5 it goes to true, and if not it goes to false".
At first I thought the two cutoffs were completely separate, but the more I thought about it, the less sure I became.
The first cutoff splits the continuous dependent variable (not a probability, just a number like 15,834) into 1 and 0 so that I can feed it into the model. The second cutoff turns the output of predict() (a probability like 0.46) into 1 and 0. That is the difference.
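To make the two cutoffs concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data frame `sim`, the coefficients used to generate `exposure`, and the sample size are made up and are not my real data. It only shows that the first cutoff lives on the scale of exposure and the second on the scale of predicted probabilities; it does not settle which probability cutoff is appropriate.

```r
# Simulated stand-in for the real data (all values are made up).
set.seed(1)
n <- 1000
sim <- data.frame(count = rpois(n, 5), distance = runif(n, 0, 10))
sim$exposure <- exp(8 + 0.2 * sim$count - 0.1 * sim$distance + rnorm(n, sd = 0.5))

# Cutoff 1: on the scale of exposure (a plain number, not a probability).
cut_exposure <- quantile(sim$exposure, 0.75)
sim$binary_exposure <- ifelse(sim$exposure > cut_exposure, 1, 0)
mean(sim$binary_exposure)  # share of 1s, about 0.25 by construction

# Logistic regression on the binarized outcome.
fit <- glm(binary_exposure ~ count + distance, data = sim, family = binomial)

# Cutoff 2: on the scale of predicted probabilities (between 0 and 1).
p_hat <- predict(fit, type = "response")
range(p_hat)
pred_class <- ifelse(p_hat > 0.5, 1, 0)

# Cross-tabulate predicted against observed classes.
table(predicted = pred_class, observed = sim$binary_exposure)
```

Here `cut_exposure` is a value in the units of exposure, while the 0.5 in `ifelse(p_hat > 0.5, 1, 0)` is compared against probabilities, so the two numbers are not on the same scale.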
What I want the model to answer is: "Given the independent variables, does this observation belong to the top 25%?"
In this situation, is it right to keep the cutoff at 0.5, or should I change it to 0.75?
If exposure is continuous, why do you want to do classification? Why don't you just predict exposure as a continuous variable, using OLS or some other method for continuous data? (I would assume exposure to be nonnegative, so something like a gamma regression or OLS on logged data might be appropriate.) – Stephan Kolassa Aug 01 '22 at 13:53
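A minimal sketch of the two alternatives the comment mentions, assuming a strictly positive exposure. The data are simulated and the names `sim`, `gamma_fit` and `log_ols_fit` are hypothetical; the log link for the gamma model is one common choice, not something stated in the comment.

```r
# Simulated stand-in data with a strictly positive, right-skewed outcome.
set.seed(2)
n <- 500
sim <- data.frame(count = rpois(n, 5), distance = runif(n, 0, 10))
mu <- exp(8 + 0.2 * sim$count - 0.1 * sim$distance)   # made-up mean structure
sim$exposure <- rgamma(n, shape = 2, rate = 2 / mu)    # gamma draws with mean mu

# Option 1: gamma regression (log link chosen here as an assumption).
gamma_fit <- glm(exposure ~ count + distance, data = sim,
                 family = Gamma(link = "log"))

# Option 2: ordinary least squares on the logged outcome.
log_ols_fit <- lm(log(exposure) ~ count + distance, data = sim)

# Compare the estimated coefficients of the two models.
coef(gamma_fit)
coef(log_ols_fit)
```

Both models keep exposure on its original continuous scale, so no cutoff is needed at the modelling stage.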