Disclaimer: I know binning is not a good idea, but I'll stick to your problem definition anyway and assume that you have binned the test outcome into "good" and "bad" health and age into three groups.

With regard to your first bullet point, I don't see any reason for excluding an age group. As for your other questions, I'm going to run a simulation following the context of your problem and show how you can approach it to develop a model with interpretable outputs. First, let's simulate a dataset in R similar to what you described:
```r
## ------ Data simulation ------
set.seed(123)

# A single continuous predictor
x = rnorm(1000, 0, 10)

# 1st group: 5%, 2nd group: 50%, 3rd group: 45%
age_group = c(rep(1, 50), rep(2, 500), rep(3, 450))

# Parameters
alpha = c(10, 5, 6)  # Age-group-specific intercepts
beta = c(5, 4, 2)    # Age-group-specific effects
epsilon = rnorm(1000, 0, 10)

# Blood test outcome
y = alpha[age_group] + beta[age_group] * x + epsilon

# Binary variable for "Good" and "Bad" health
z = ifelse(y > -50 & y < 50, 1, 0)  # Only values between -50 & 50 are considered healthy
```
In this toy example, the data include a single continuous predictor $x$, which has a different effect on a test outcome $y$ (e.g., a blood measurement) depending on the age group (group = 1, 2, 3). We then assume that only individuals with $-50 < y < 50$ are healthy, which creates the binary variable $z$. Let's plot $x$ versus $z$:
```r
plot(x, z)
```

Apparently, healthy people have $x$ roughly between -10 and 10, while going below -10 or above 10 deteriorates their health. So the relationship is not linear, and we need to build a model that captures the non-linearity and accommodates different relationships between $x$ and $z$ across the age groups. For this, we can develop a logistic regression model with age-group-specific effects and a non-linear formula. We can write the model as follows:
$$z \sim \text{Bernoulli}(p)$$
$$\text{logit}(p) = \alpha_{j} + \beta_{j}x + \eta_{j}x^2, \quad j \in \text{Age groups} = \{1, 2, 3\}$$
As you can see, I defined a polynomial inside the linear predictor to make the model more flexible, which allows it to account for the non-linearity. I also let each age group have its own parameters to account for the differences between groups.
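For reference, this group-specific quadratic model also has a frequentist counterpart that can be fitted with `glm` by interacting the age-group factor with the polynomial terms. Here is a quick, self-contained sketch (it re-simulates the data so you can run it on its own; the comparison with a linear-only fit is just to show that the quadratic term is needed):

```r
# Re-simulate the data from above so this chunk runs standalone
set.seed(123)
x <- rnorm(1000, 0, 10)
age_group <- factor(c(rep(1, 50), rep(2, 500), rep(3, 450)))
alpha <- c(10, 5, 6)
beta  <- c(5, 4, 2)
y <- alpha[as.integer(age_group)] + beta[as.integer(age_group)] * x + rnorm(1000, 0, 10)
z <- ifelse(y > -50 & y < 50, 1, 0)

# Linear-only fit versus the group-specific quadratic model
fit_linear <- glm(z ~ x, family = binomial)
fit_quad   <- glm(z ~ 0 + age_group + age_group:x + age_group:I(x^2),
                  family = binomial)

AIC(fit_linear, fit_quad)  # the quadratic model should have a much lower AIC
coef(fit_quad)             # 3 intercepts, 3 linear and 3 quadratic slopes
```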
I am going to fit this model in a Bayesian way using the R2jags package. The Bayesian approach lets you calculate various probabilities of interest; for example, you can calculate the probability that someone in the 2nd age group, whose $x$ value is between -15 and -13, is healthy. Let's fit the model and plot the results:
```r
library(tidyverse)
library(R2jags)

# ----------- Model development in JAGS ---------
model_code <- "
model
{
  # Likelihood
  for (t in 1:length(z)) {
    z[t] ~ dbin(p[t], 1)
    logit(p[t]) <- alpha[age[t]] + beta_1[age[t]] * x_1[t] + beta_2[age[t]] * pow(x_1[t], 2)
  }
  # Priors
  for (i in 1:max(age)){
    alpha[i] ~ dnorm(0.0, 1^-2)
    beta_1[i] ~ dnorm(0.0, 1^-2)
    beta_2[i] ~ dnorm(0.0, 1^-2)
  }
}
"
```
```r
# Model data
model_data = list(z = z, x_1 = x, age = age_group)

# Parameters to save
model_parameters = c('alpha', 'beta_1', 'beta_2', 'p')

# Run the model
model_run <- jags(
  data = model_data,
  parameters.to.save = model_parameters,
  model.file = textConnection(model_code))

# Extracting the fitted probabilities
expected_probs = model_run$BUGSoutput$mean$p
df = data.frame(age_group = as.factor(age_group), x = x, z = z, exp_probs = expected_probs)

# Extracting posterior samples for uncertainty estimation
posterior_probs <- model_run$BUGSoutput$sims.list$p
df_probs = as.data.frame(t(posterior_probs))

# Only keeping a fraction of the samples for faster computation
samp = sample(1:ncol(df_probs), 200)
df_probs = df_probs[, samp]
df_probs = cbind(df_probs, df)

# Transforming the dataset from wide to long for plotting
df_probs_long = df_probs %>%
  pivot_longer(names_to = 'group', values_to = 'probs', -c(age_group:exp_probs))

ggplot(df_probs_long) +
  geom_line(aes(x = x, y = probs, group = group, linetype = "Model uncertainty"), color = 'lightblue') +
  geom_line(aes(x, exp_probs), color = 'red') +
  labs(y = 'Probability of being healthy') +
  facet_wrap(~age_group, 2, 2) +
  theme_bw()
```

From the plot you can see how the probability of being healthy changes as the predictor moves from lower to higher values. The relationship is non-linear, and values in the middle are very likely to be associated with being healthy. You can also see the uncertainty in the estimates as the ensemble of blue lines. The first age group has the highest uncertainty because it has the fewest observations (only 5%). Now, as an example, let's calculate the probability that someone in the 2nd age group, whose $x$ value is between -15 and -13, is healthy:
```r
# ----- Calculating P(healthy | -15 < x < -13, age_group = 2) -----
df_probs_long %>%
  filter(age_group == 2, x > -15, x < -13) %>%
  pull(probs) %>%
  quantile(., probs = c(0.025, 0.5, 0.975))
```

```
     2.5%       50%     97.5%
0.2762582 0.5202037 0.7102697
```
Our expected probability of being healthy is 0.52, with a 95% credible interval of 0.28 to 0.71.
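As a rough sanity check, you can compare this with a frequentist point estimate from a `glm` counterpart of the model. The sketch below re-simulates the data, refits the group-specific quadratic model, and averages the predicted probabilities over the same $x$ range (unlike the Bayesian fit, this gives only a point estimate, with no credible interval):

```r
# Re-simulate the data so this chunk runs standalone
set.seed(123)
x <- rnorm(1000, 0, 10)
age_group <- factor(c(rep(1, 50), rep(2, 500), rep(3, 450)))
alpha <- c(10, 5, 6)
beta  <- c(5, 4, 2)
y <- alpha[as.integer(age_group)] + beta[as.integer(age_group)] * x + rnorm(1000, 0, 10)
z <- ifelse(y > -50 & y < 50, 1, 0)

# Group-specific quadratic logistic regression
fit <- glm(z ~ 0 + age_group + age_group:x + age_group:I(x^2),
           family = binomial)

# Average the predicted probabilities over a grid of x in (-15, -13)
# for age group 2
new_x <- seq(-15, -13, length.out = 50)
pred <- predict(fit,
                newdata = data.frame(x = new_x,
                                     age_group = factor(2, levels = levels(age_group))),
                type = "response")
mean(pred)  # frequentist point estimate of P(healthy | -15 < x < -13, group 2)
```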