One way to do this that doesn't involve customized loss functions is to augment your data with "fake" data that reflects your prior beliefs about those data points. The augmented data consists of rows with the same feature values as the data points for which you have prior beliefs and, in this case, 50% "0" values and 50% "1" values for the response variable. The number of augmented rows depends on the strength of your prior beliefs; for example, a prior belief of a probability of 50% with a standard deviation of 0.05 corresponds to the mean and standard deviation of 100 draws from a Bernoulli distribution with $p=0.5$. For each such data point, we replicate its features 100 times and set the corresponding target values to 50 zeroes and 50 ones.
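To see where the 100 comes from: the standard deviation of the mean of $n$ Bernoulli($p$) draws is $\sqrt{p(1-p)/n}$, so for a prior centered at $p = 0.5$ with standard deviation $\sigma = 0.05$ the equivalent number of pseudo-observations is
$$n = \frac{p(1-p)}{\sigma^2} = \frac{0.25}{0.05^2} = 100.$$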
An example follows:
library(data.table)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
p <- exp(x1 + x2 + x3)/(1 + exp(x1+x2+x3))
y <- rbinom(100, 1, p)
df <- data.table(y=y, x1=x1, x2=x2, x3=x3, p=p)
setkey(df, p) # orders the data frame by the variable p
m1 <- glm(y ~ x1 + x2 + x3, family="binomial", data=df)
df$predict <- predict(m1, type="response")
# Assume data points 45-55 are believed to have probability = 0.5
# with an sd of 0.05, i.e. the proportion of successes in 100 Bernoulli(0.5) draws
df2 <- df
for (i in 45:55) {
  fake_y <- c(rep(0, 50), rep(1, 50))
  fake_data <- data.table(y = fake_y, x1 = df$x1[i],
                          x2 = df$x2[i], x3 = df$x3[i],
                          p = 0.5, predict = NA)
  df2 <- rbind(df2, fake_data)
}
m2 <- glm(y~x1+x2+x3, family="binomial", data=df2)
df2$augmented_predict <- predict(m2, type="response")
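As an aside (not part of the original example, just an equivalent formulation), the replication loop can be collapsed into weighted pseudo-observations: with family = "binomial", glm() treats a response given as a proportion of successes together with a weights value of $n$ as $n$ Bernoulli trials, so a single row with y = 0.5 and weight 100 contributes the same likelihood as the 100 replicated rows above. A sketch:

# Equivalent fit using prior weights instead of 100 replicated rows per point
fake_w <- df[45:55, .(y = 0.5, x1, x2, x3, w = 100)]  # 50 "successes" out of 100 trials
df2_w  <- rbind(df[, .(y, x1, x2, x3, w = 1)], fake_w)
m2_w   <- glm(y ~ x1 + x2 + x3, family = "binomial", data = df2_w, weights = w)
# coef(m2_w) should match coef(m2)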
Comparing the model parameters shows us there's been a substantial change:
> summary(m1) # original model
... stuff ...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3210 0.2711 1.184 0.236499
x1 1.5915 0.4166 3.820 0.000133 ***
x2 1.0751 0.2884 3.728 0.000193 ***
x3 1.1683 0.3407 3.429 0.000606 ***
> summary(m2) # model with augmented data
... stuff ...
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.04269 0.06005 0.711 0.477
x1 0.93603 0.19525 4.794 1.64e-06 ***
x2 0.91351 0.19149 4.770 1.84e-06 ***
x3 0.90534 0.19213 4.712 2.45e-06 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
... and the values of the predictions for rows 45 - 55:
> df2[45:55, .(predict, augmented_predict)]
predict augmented_predict
1: 0.3953809 0.4566122
2: 0.4422332 0.4628846
3: 0.5301502 0.4740564
4: 0.5512260 0.4767186
5: 0.5861047 0.4869965
6: 0.7030460 0.5044203
7: 0.5987784 0.5128722
8: 0.6199068 0.5262623
9: 0.6037670 0.5322686
10: 0.4932125 0.5415195
11: 0.6537622 0.5553389
You do have to be careful with this, however. In this case we have 100 "original" data points and 1,100 "fake" data points, so the results are driven mostly by our (very strong in toto) prior beliefs. Is our prior information really equivalent to observing 1,100 data points? In this example it is, by construction, but that's unlikely in the real world; a more modest prior sd of about 0.16, for instance, would correspond to only 10 pseudo-draws per point. A little humility about how much we know goes a long way!
Note that this approach can (not "should") also be taken in other contexts; e.g., ridge regression can be estimated by running OLS on suitably augmented data.
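For instance, the textbook augmentation for ridge regression stacks $\sqrt{\lambda}\,I_p$ under the design matrix and pads the response with $p$ zeroes, after which plain OLS gives the ridge estimates. A minimal sketch (the function name is mine, and the intercept is assumed to have been handled by centering):

# Ridge estimates via data augmentation + OLS
# Assumes X (a matrix) and y are already centered, so no intercept is fit or penalized
ridge_via_augmentation <- function(X, y, lambda) {
  p <- ncol(X)
  X_aug <- rbind(X, sqrt(lambda) * diag(p))  # append sqrt(lambda) * I below X
  y_aug <- c(y, rep(0, p))                   # fake responses are all zero
  lm.fit(X_aug, y_aug)$coefficients          # OLS on augmented data = ridge solution
}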