One way to do this that doesn't involve customized loss functions is to augment your data with "fake" data that reflects your prior beliefs about those data points. The augmented data consists of rows with the same feature values as the data points for which you have prior beliefs and, in this case, 50% "0" values and 50% "1" values for the response variable. The number of augmented rows depends on the strength of your prior beliefs; for example, a prior belief of a probability of 50% with a standard deviation of 0.05 corresponds to the mean and standard deviation of 100 draws from a Bernoulli distribution with $p=0.5$. For each such data point, we replicate its features 100 times and set the corresponding target values to 50 zeroes and 50 ones.
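To see where the 100 comes from: the standard deviation of the mean of $n$ Bernoulli($p$) draws is $\sqrt{p(1-p)/n}$, so for a prior centered at $p = 0.5$ with standard deviation $\sigma = 0.05$ the equivalent number of pseudo-observations is
$$n = \frac{p(1-p)}{\sigma^2} = \frac{0.25}{0.05^2} = 100.$$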
An example follows:
library(data.table)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
p <- exp(x1 + x2 + x3)/(1 + exp(x1+x2+x3))
y <- rbinom(100, 1, p)
df <- data.table(y=y, x1=x1, x2=x2, x3=x3, p=p)
setkey(df, p) # orders the data frame by the variable p
m1 <- glm(y ~ x1 + x2 + x3, family="binomial", data=df)
df$predict <- predict(m1, type="response")
# Assume data points 45-55 are believed to have probability = 0.5
# with an sd of 0.05, i.e. the proportion of successes in 100 Bernoulli(0.5) draws
df2 <- df
for (i in 45:55) {
  fake_y <- c(rep(0, 50), rep(1, 50))
  fake_data <- data.table(y = fake_y, x1 = df$x1[i],
                          x2 = df$x2[i], x3 = df$x3[i],
                          p = 0.5, predict = NA)
  df2 <- rbind(df2, fake_data)
}
m2 <- glm(y~x1+x2+x3, family="binomial", data=df2)
df2$augmented_predict <- predict(m2, type="response")
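As an aside (not part of the original example, just an equivalent formulation), the replication loop can be collapsed into weighted pseudo-observations: with family = "binomial", glm() treats a response given as a proportion of successes together with a weights value of $n$ as $n$ Bernoulli trials, so a single row with y = 0.5 and weight 100 contributes the same likelihood as the 100 replicated rows above. A sketch:

# Equivalent fit using prior weights instead of 100 replicated rows per point
fake_w <- df[45:55, .(y = 0.5, x1, x2, x3, w = 100)]  # 50 "successes" out of 100 trials
df2_w  <- rbind(df[, .(y, x1, x2, x3, w = 1)], fake_w)
m2_w   <- glm(y ~ x1 + x2 + x3, family = "binomial", data = df2_w, weights = w)
# coef(m2_w) should match coef(m2)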
Comparing the model parameters shows us there's been a substantial change:
> summary(m1) # original model
... stuff ...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3210 0.2711 1.184 0.236499
x1 1.5915 0.4166 3.820 0.000133 ***
x2 1.0751 0.2884 3.728 0.000193 ***
x3 1.1683 0.3407 3.429 0.000606 ***
> summary(m2) # model with augmented data
... stuff ...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.04269 0.06005 0.711 0.477
x1 0.93603 0.19525 4.794 1.64e-06 ***
x2 0.91351 0.19149 4.770 1.84e-06 ***
x3 0.90534 0.19213 4.712 2.45e-06 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
... and the values of the predictions for rows 45 - 55:
> df2[45:55, .(predict, augmented_predict)]
predict augmented_predict
1: 0.3953809 0.4566122
2: 0.4422332 0.4628846
3: 0.5301502 0.4740564
4: 0.5512260 0.4767186
5: 0.5861047 0.4869965
6: 0.7030460 0.5044203
7: 0.5987784 0.5128722
8: 0.6199068 0.5262623
9: 0.6037670 0.5322686
10: 0.4932125 0.5415195
11: 0.6537622 0.5553389
You do have to be careful with this, however. In this case we have 100 "original" data points and 1,100 "fake" data points, so the results are driven mostly by our (very strong in toto) prior beliefs. Is our prior information really equivalent to observing 1,100 data points? In this example it is, by construction, but that's unlikely in the real world; a more modest prior sd of about 0.16, for instance, would correspond to only 10 pseudo-draws per point. A little humility about how much we know goes a long way!
Note that this approach can (not "should") also be taken in other contexts; e.g., ridge regression can be estimated by running OLS on suitably augmented data.
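For instance, the textbook augmentation for ridge regression stacks $\sqrt{\lambda}\,I_p$ under the design matrix and pads the response with $p$ zeroes, after which plain OLS gives the ridge estimates. A minimal sketch (the function name is mine, and the intercept is assumed to have been handled by centering):

# Ridge estimates via data augmentation + OLS
# Assumes X (a matrix) and y are already centered, so no intercept is fit or penalized
ridge_via_augmentation <- function(X, y, lambda) {
  p <- ncol(X)
  X_aug <- rbind(X, sqrt(lambda) * diag(p))  # append sqrt(lambda) * I below X
  y_aug <- c(y, rep(0, p))                   # fake responses are all zero
  lm.fit(X_aug, y_aug)$coefficients          # OLS on augmented data = ridge solution
}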