
I want to fit a GLM with a dependent variable named "exposure" and two independent variables named "counts" and "distance".

# Since the dependent variable I want to use (exposure) is continuous and not in the range 0 to 1, I converted it to 1 (top 25%) and 0 (bottom 75%).

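    library(data.table)  # provides fread()

    for (i in 1:9) {
      a = fread(paste0("first_dataset_", i, ".csv"))
      b = quantile(a$exposure, 0.75)                  # 75th percentile of exposure
      binary_exposure = ifelse(a$exposure > b, 1, 0)  # 1 = top 25%, 0 = the rest
      a = cbind(a, binary_exposure)
      print(table(a$binary_exposure))                 # check the 0/1 split; print() is needed inside a loop
      out_path = paste0("C:/Users/82109/Documents/week_", i, "_sample.csv")
      write.csv(a, out_path)
    }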

I divided the original dataset into a training set (training_set_5) and a test set (test_set_2). Then I fitted and validated the model as follows. Please look at the line that applies the 0.5 cutoff to the predictions.

# My final goal was to build a confusion matrix, but the output of the "predict" function was a continuous probability between 0 and 1. So I thought that before building the confusion matrix I had to convert it into a binary factor of 1 and 0 again.

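    library(car)    # provides vif()
    library(caret)  # provides confusionMatrix()

    training_glm = glm(binary_exposure ~ count + distance, data = training_set_5, family = binomial)
    summary(training_glm)
    vif(training_glm)

    # Predicted probabilities on the test data, then a 0.5 cutoff to get classes.
    # Explicit levels keep both factors aligned even if one class is never predicted.
    prediction = predict(training_glm, newdata = test_set_1, type = "response")
    prediction = factor(ifelse(prediction > 0.5, "1", "0"), levels = c("0", "1"))
    test_set_2 = factor(test_set_1$binary_exposure, levels = c("0", "1"))

    confusionMatrix(prediction, test_set_2, positive = "1")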

But at this point I became confused. Should I use the same threshold when I binarize the dependent variable at the start and when I convert the output of predict into a factor? In the first code block, in the quantile(a$exposure, 0.75) line, I set 0.75 so that observations in the top 25% of the dependent variable get 1, which is what I want. In the second code block, in the ifelse(prediction > 0.5, ...) line, I set 0.5, as every YouTuber does. (None of them changed this 0.5 to another value, but none of them showed the entire process I want to carry out either.)

And when people explain logistic regression, they basically just say: "If the probability from the model is greater than 0.5 it is classified as true, and otherwise as false."

At first I thought the two cutoffs were completely separate, but the more I thought about it, the more I questioned this.

The first cutoff divides the continuous dependent variable (not a probability, just a number such as 15,834) into 1 and 0 so that I can feed it into the modeling procedure. The second cutoff turns the output of "predict" (a probability such as 0.46) into 1 and 0. That is the difference.
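
To make the two scales concrete, here is a minimal sketch on simulated data. The data-generating formula and all numbers are assumptions for illustration only, not the real dataset; only the column names are taken from the code above.

```r
# Simulated stand-in data; the values are made up for illustration.
set.seed(1)
n <- 1000
count <- rpois(n, lambda = 20)
distance <- runif(n, min = 0, max = 10)
exposure <- exp(5 + 0.05 * count - 0.1 * distance + rnorm(n, sd = 0.5))

# Cutoff 1 lives on the exposure scale (same units as exposure).
label_cutoff <- quantile(exposure, 0.75)
binary_exposure <- as.integer(exposure > label_cutoff)

fit <- glm(binary_exposure ~ count + distance, family = binomial)

# Cutoff 2 lives on the probability scale (always between 0 and 1).
prob <- predict(fit, type = "response")
predicted_class <- as.integer(prob > 0.5)

label_cutoff            # a value in exposure units
mean(binary_exposure)   # share of 1 labels, about 0.25 by construction
range(prob)             # predicted probabilities, within [0, 1]
```

The 0.75 passed to quantile() only fixes which exposure value separates the labels; the 0.5 is compared against predicted probabilities, so the two numbers are not measured in the same units.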

What I want to find out, and the purpose of the model, is: "Does this observation belong to the top 25%, given the independent variables?"

In this situation, is it right to set the cutoff to 0.5, or should I change it to 0.75?
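
One way to see what the choice of probability cutoff does is to tabulate the confusion counts at several candidate values. The sketch below uses simulated data (an assumption, not the real files) and a hypothetical helper named confusion_at(); it uses base R's table() rather than caret so that it runs without extra packages. As in the code above, the labels are created before the train/test split.

```r
# Simulated stand-in data; the values are made up for illustration.
set.seed(2)
n <- 1000
count <- rpois(n, lambda = 20)
distance <- runif(n, min = 0, max = 10)
exposure <- exp(5 + 0.05 * count - 0.1 * distance + rnorm(n, sd = 0.5))
binary_exposure <- as.integer(exposure > quantile(exposure, 0.75))
dat <- data.frame(binary_exposure, count, distance)

train <- dat[1:700, ]
test <- dat[701:1000, ]

fit <- glm(binary_exposure ~ count + distance, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Hypothetical helper: confusion counts for one probability cutoff.
confusion_at <- function(prob, truth, cutoff) {
  pred <- factor(as.integer(prob > cutoff), levels = c(0, 1))
  truth <- factor(truth, levels = c(0, 1))
  table(predicted = pred, actual = truth)
}

for (cutoff in c(0.25, 0.5, 0.75)) {
  cat("probability cutoff =", cutoff, "\n")
  print(confusion_at(prob, test$binary_exposure, cutoff))
}
```

Raising the probability cutoff can only reduce the number of observations predicted as 1, so the three tables show the trade-off between false positives and false negatives rather than a single correct setting.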

  • I don't understand why you binarize your dependent variable. Why are you discarding valuable information? – Roland Aug 01 '22 at 05:20
  • Don't bin your original data in the first place. As Roland writes, this just throws away important information. – Stephan Kolassa Aug 01 '22 at 06:41
  • @Roland First of all, I wanted to make a classifier and thought a logit model would be the best fit for my situation. But no material told me how to build a model with a continuous dependent variable. So I'm sorry, but could you tell me exactly what you mean? I wonder what I should do if I must not binarize the dependent variable. – user364400 Aug 01 '22 at 13:28
  • You should build a model that can predict continuous values, i.e., what you have measured. If you believe you must do so, you can then binarize the predicted values. However, the cut-off should not be a quantile of your data but derived from domain knowledge. – Roland Aug 01 '22 at 13:33
  • @StephanKolassa Of course I am not interested in binarizing a continuous variable for its own sake. The fundamental reason I am doing this is to build a logit model. In other words, as far as I know (of course I'm just a novice and still learning), the variable I want to use as the dependent variable is continuous and not within 0 to 1. So when I pass it to R as it is, I get an error message saying the dependent variable should be a value between 0 and 1. I'm not interested in regression, but in classification. So what should I do then? I would very much appreciate your advice. – user364400 Aug 01 '22 at 13:45
  • If your dependent variable is exposure and continuous, why do you want to do classification? Why don't you just predict exposure as a continuous variable, using OLS or some other method for continuous data? (I would assume exposure to be nonnegative, so something like a gamma regression or OLS on logged data might be appropriate.) – Stephan Kolassa Aug 01 '22 at 13:53
  • @StephanKolassa Umm... yes, it is nonnegative, and I got it. So you are saying that if the variable I want to use as the dependent variable of the model is continuous, and I don't have a very compelling reason to binarize it, then it is not appropriate to build a classifier with it? – user364400 Aug 01 '22 at 14:00
  • @Roland A model that can predict continuous values? But do such things exist among classifiers, rather than among regressions? If I understand correctly, is there a model that predicts or classifies the category of new data using a continuous variable? Or are you saying that building a classifier on a continuous variable is not appropriate in the first place unless I have a very compelling reason to use it as the dependent variable? – user364400 Aug 01 '22 at 14:03
  • Exactly. Discretizing data always loses information, and should never be done without a very good reason. It's almost always better to model and predict continuous data, and potentially use thresholds on predictions (see the proposed duplicate). – Stephan Kolassa Aug 01 '22 at 14:05
  • @StephanKolassa OK... I think I have to study more. I appreciate your help. – user364400 Aug 01 '22 at 14:07
  • Let's say you are producing wine. The sugar content of the grapes determines if you will produce premium or just regular wine. So, you build a regression model that predicts the sugar content of the grapes based on weather data and such. The classification then comes after the prediction. If your model predicts a sugar content above a certain value (which is based on research and experience), you'll probably have a premium wine. – Roland Aug 01 '22 at 14:07
  • @Roland So: fit a regression > set a threshold based on domain logic > predict the test (validation) data with the model > classify the values that exceed the preset threshold. This can be a good method, right? – user364400 Aug 01 '22 at 14:33
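
The workflow discussed in the comments (model the continuous exposure, then apply a threshold to the predictions) could be sketched as follows. The data are simulated and the threshold of 400 is a hypothetical placeholder for a value that would come from domain knowledge; both the log-scale OLS and the gamma regression mentioned in the comments are shown.

```r
# Simulated stand-in data; the values are made up for illustration.
set.seed(3)
n <- 1000
count <- rpois(n, lambda = 20)
distance <- runif(n, min = 0, max = 10)
exposure <- exp(5 + 0.05 * count - 0.1 * distance + rnorm(n, sd = 0.5))
dat <- data.frame(exposure, count, distance)

train <- dat[1:700, ]
test <- dat[701:1000, ]

# Option 1: OLS on log(exposure), back-transformed to the original scale.
# The plain exp() back-transform ignores retransformation bias.
fit_log <- lm(log(exposure) ~ count + distance, data = train)
pred_log <- exp(predict(fit_log, newdata = test))

# Option 2: gamma regression with a log link, predicting on the exposure scale.
fit_gamma <- glm(exposure ~ count + distance, data = train,
                 family = Gamma(link = "log"))
pred_gamma <- predict(fit_gamma, newdata = test, type = "response")

# Hypothetical threshold in exposure units; in practice this comes from
# domain knowledge, not from a quantile of the data.
threshold <- 400

# Classification happens only after the continuous prediction.
predicted_high <- factor(pred_gamma > threshold, levels = c(FALSE, TRUE))
actual_high <- factor(test$exposure > threshold, levels = c(FALSE, TRUE))
table(predicted = predicted_high, actual = actual_high)
```

Here the model never sees a binarized outcome; the threshold is applied once, to predictions and observed values alike, in the original units of exposure.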
