
I am conducting a regression analysis. My first problem is that the database has only n = 41 observations (n = 29 in one group and n = 12 in the other).

But I have gone ahead with the logistic regression. For this I used the following code:

    model <- glm(grupo ~ arrayil2 + citometriail2 + citometriail4 + citomtriail5,
                 data = BD,
                 family = binomial(link = "logit"))

The "group" variable is whether the patient is asthmatic or not and the variables arrayil2, cytometriail2, cytometriail4, cytomtriail5 are continuous values.

    TABLA <- logistic.display(model, crude.p.value = TRUE, decimal = 3)
    TABLA <- as.data.frame(TABLA)

This gives me the following result:

[Image: logistic.display output table of odds ratios, confidence intervals and p-values]

I don't quite understand the result, as I usually run logistic regressions with a larger n and with categorical predictors. I would appreciate help interpreting these results and knowing whether or not they are wrong.

Thanks in advance.

  • The minimal sample size for estimating a single proportion is $n=96$. That is equivalent to estimating only an intercept in a logistic model. So this is pretty hopeless. See https://hbiostat.org/rmsc/lrm.html . By all means ignore all point estimates and look only at confidence intervals. – Frank Harrell Aug 04 '22 at 16:49
  • Thank you @FrankHarrell. I have conducted a random forest as an alternative to this analysis because it does not require a minimum sample size. – Adrián P.L. Aug 04 '22 at 16:52
  • https://stats.stackexchange.com/search?q=interpret+logistic+regression – whuber Aug 04 '22 at 17:23
  • No, random forest, because it is nonparametric and puts no structure on the problem, does not benefit from an additivity assumption and requires a 10x larger sample size than logistic regression. See https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-137 where estimates of 200 events per candidate feature are required for RF. – Frank Harrell Aug 04 '22 at 19:06
  • I do not understand, @FrankHarrell. I have seen many articles which explain that RF overcomes the sample size problem. See https://www.sciencedirect.com/science/article/pii/S0165783620300515?casa_token=i28hz0GmV0UAAAAA:pgwG2pMS30ylFoTJbMBIeqzOJBDpTTAQ4b9gx1uB-OUjQkdXN9gQIJrjIOQ609cCJXTMrfLR0A – Adrián P.L. Aug 05 '22 at 16:45
  • That paper did not use a valid accuracy score and should be ignored. RF, like CART (recursive partitioning), needs incredibly large sample sizes. See again the reference I posted above. In addition to other severe problems with RF in your context, your sample size is not 1/10th as large as what's needed to determine the tuning parameters needed for RF. It is a grave mistake to think that RF can get by with smaller samples than LRM. The exact opposite is the case. – Frank Harrell Aug 05 '22 at 17:29
  • @FrankHarrell Typically the one restriction on random forest is that your number of features should be quite big - the first step of RF is to choose 1/3 or sqrt of the number of features to construct a tree (depending on the task, regression/classification). So if I have quite a lot of features, I can use RF even on a small dataset - there is no algorithm that works really well on small datasets, so I do not lose anything. – Adrián P.L. Aug 05 '22 at 17:41
  • What you think "works" in fact doesn't. As expert Rob Tibshirani explained at a statistics meeting, RF is for "tall and thin" data which is not what most practitioners think. RF needs a huge number of observations and not many candidate variables to have satisfactory performance. Of course that is the exact situation where regression works quite well. Most people who think that something "works" do not use correct accuracy scores in the assessment, and do not routinely derive smooth nonlinear calibration curves to assess absolute predictive performance. – Frank Harrell Aug 05 '22 at 17:46
  • Frank Harrell is right that doing this analysis with only 41 observations is likely hopeless and that a random forest approach, no matter how well RF can perform in some situations, is unlikely to save you from your small sample size (fewer assumptions means a greater need for data to figure out what is happening). However, interpreting such a table, had you used (say) 41,000 observations, is a reasonable question. It might help to format the table so it's easier to read and doesn't cut off pieces of text. – Dave Aug 06 '22 at 16:57

1 Answer


As indicated in the comments, your particular model fit to this particular small data set is uninterpretable. The odds-ratio estimates and confidence intervals (as best I can tell from the image of the table) seem to be enormous. I suspect that you have something close to perfect separation, with only 12 cases in the minority class and 4 continuous predictor variables; when the two classes can be separated (almost) perfectly, the maximum-likelihood coefficient estimates and their standard errors become arbitrarily large, which is exactly what produces such huge odds ratios and confidence intervals.
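If you want to check this yourself, here is a minimal sketch using base R on the model object fitted by the code in your question:

    ## Typical symptoms of (near-)perfect separation in a fitted glm:
    ## very large coefficient estimates and standard errors, and fitted
    ## probabilities pushed essentially to 0 or 1.
    summary(model)$coefficients   # look for huge estimates and standard errors
    range(fitted(model))          # values extremely close to 0 or 1 are a warning sign

R will also often emit the warning "fitted probabilities numerically 0 or 1 occurred" when separation is present.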

Otherwise, the interpretation of a coefficient for a continuous, linearly modeled predictor in logistic regression is straightforward: it is the change in the log-odds of the event per one-unit increase in that predictor, with the other predictors held fixed. Exponentiating the coefficient gives the corresponding odds ratio, which is the scale on which logistic.display reports its results.
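For example, a minimal sketch of how you could get those quantities from your fitted model with base R (the logistic.display table you generated presents odds ratios and confidence intervals in a similar way):

    ## Odds ratios with Wald 95% confidence intervals:
    ## exponentiate the coefficients and their confidence limits.
    exp(cbind(OR = coef(model), confint.default(model)))

So a coefficient of, say, 0.7 for arrayil2 would mean the log-odds of being asthmatic increase by 0.7, i.e. the odds are multiplied by exp(0.7) ≈ 2, for each one-unit increase in arrayil2 with the other predictors held fixed. With (near-)perfect separation, those ratios and intervals become absurdly large, which appears to be what has happened here.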

EdM