OVERFIT
You want the model to chase after extreme values, even though extreme predictions might have little credibility. The way to do this is to give the model considerable flexibility, enough to let it chase after those points, at the risk of overfitting (which you seem to be okay with). In other words, incentivize risky predictions by penalizing the model for missing the observed (extreme) values during training (as is typical of regression loss functions), and give it enough flexibility to get close to those values and so avoid the penalty.
When there are extreme observations, this enormous flexibility lets the model chase after them and fit them tightly. However, doing so drags the predictions away from the mainstream of the data, where the true values tend to be. If the features distinguish such extreme points from the bulk of the observations, then this is desirable, and your extreme predictions will be reliable: predictions far above or far below the overall mean or median will tend to occur only when the observed values really are much higher or lower, respectively, than the bulk of the data. If you lack features that distinguish such points, then these extreme predictions will not be reliable: most of the time, when you predict something especially high or especially low, the observation will be fairly mundane.
Let's look at a simulation.
library(nnet)
set.seed(2023)
N <- 100
a <- -10
b <- +10
x <- seq(a, b, (b - a)/(N - 1))        # evenly spaced predictor on [-10, 10]
Ey <- sin(x)                           # true expected value of the outcome
d <- rbinom(N, 1, 0.2)                 # indicator for the heavy-tailed mixture component
e <- (1 - d)*rnorm(N) + d*rt(N, 1.01)  # 80% standard normal, 20% very heavy-tailed t
y <- Ey + e
plot(x, y)
lines(x, Ey)
L <- nnet::nnet(                       # highly flexible single-hidden-layer network
  y ~ x,
  size = 300,
  linout = TRUE,
  maxit = 2500
)
lines(x, predict(L), col = 'red')

The simulation gives an expected value of the outcome $y$ that follows a sine wave. On top of that is an additive error drawn from a mixture of a standard normal and a heavy-tailed $t$-distribution, which produces some extreme values. When we fit a highly flexible neural network that is able to chase after those extreme values, we get a tight fit and a model that is willing to make extreme predictions. However, when we generate new data, we see that these extreme predictions do not correspond to extreme observations. In such a situation, even though the model predicts extreme values, there is little reason to anticipate that an extreme value will be observed.
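Before generating new data, here is a quick sanity check to make the "tight fit" claim concrete (a minimal sketch reusing the objects above; comparing against the raw noise is just one convenient yardstick).
mean((y - predict(L))^2)   # in-sample MSE of the overfit network
mean((y - Ey)^2)           # average squared noise actually present in this training sample
# if the network chases the extreme points as in the plot, the first number
# should be far smaller than the second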
set.seed(2024)
par(mfrow = c(2, 2))
# four fresh draws of the noise, each plotted against the fitted (overfit) curve
d1 <- rbinom(N, 1, 0.2)
e1 <- (1 - d1)*rnorm(N) + d1*rt(N, 1.01)
plot(x, Ey + e1, ylim = c(-25, 15))
lines(x, predict(L), col = 'red')
d2 <- rbinom(N, 1, 0.2)
e2 <- (1 - d2)*rnorm(N) + d2*rt(N, 1.01)
plot(x, Ey + e2, ylim = c(-25, 15))
lines(x, predict(L), col = 'red')
d3 <- rbinom(N, 1, 0.2)
e3 <- (1 - d3)*rnorm(N) + d3*rt(N, 1.01)
plot(x, Ey + e3, ylim = c(-25, 15))
lines(x, predict(L), col = 'red')
d4 <- rbinom(N, 1, 0.2)
e4 <- (1 - d4)*rnorm(N) + d4*rt(N, 1.01)
plot(x, Ey + e4, ylim = c(-25, 15))
lines(x, predict(L), col = 'red')
par(mfrow = c(1, 1))   # reset the plotting grid

In all four panels, the big spikes the fitted curve makes out toward $\pm 10$ are far away from the new observations.
Consequently, if you want the AI system to make extreme ("risky") predictions even though those predictions will not be reliable, overfitting to the training data is a path forward, and these plots show the danger of wanting the system to make such unreliable predictions.
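If you prefer a number to a picture, here is a rough sketch of the same point using the fitted model above (the 1000 replications, the five most extreme grid points, and the tolerance of 1 unit are all arbitrary choices of mine):
set.seed(2025)                             # arbitrary seed for this check
preds_train <- as.numeric(predict(L))
# the five grid points where the fitted curve strays furthest from its own median
extreme_idx <- order(abs(preds_train - median(preds_train)), decreasing = TRUE)[1:5]
hits <- replicate(1000, {
  d_new <- rbinom(N, 1, 0.2)
  e_new <- (1 - d_new)*rnorm(N) + d_new*rt(N, 1.01)
  y_new <- Ey + e_new
  # fraction of those extreme predictions that a fresh observation lands within 1 unit of
  mean(abs(y_new[extreme_idx] - preds_train[extreme_idx]) < 1)
})
mean(hits)   # should be small if the fit really does spike: the extreme predictions are almost never vindicated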
EDIT
A more illustrative simulation might be the one below.
library(nnet)
library(data.table)
set.seed(2023)
N <- 100
a <- -10
b <- +10
x <- seq(a, b, (b - a)/(N - 1))
Ey <- sin(x)
d <- rbinom(N, 1, 0.2)
# e <- (1 - d)*rnorm(N) + d*rt(N, 2.01)
e <- rt(N, 4.1)                        # heavier tails than normal, but milder than before
y <- Ey + e
par(mfrow = c(1, 2))
plot(x, y)
lines(x, Ey)
L <- nnet::nnet(                       # same highly flexible network as before
  y ~ x,
  size = 300,
  linout = TRUE,
  maxit = 2500
)
preds <- predict(L)
lines(x, preds, col = 'red')
# simulate many fresh data sets from the same process and pool them
R <- 250
sims <- list()
sims[[1]] <- data.frame(
  x = x,
  y = y
)
for (i in 1:R){
  d1 <- rbinom(N, 1, 0.2)
  # e1 <- (1 - d1)*rnorm(N) + d1*rt(N, 2.01)
  e1 <- rt(N, 4.1)
  sims[[i + 1]] <- data.frame(
    x = x,
    y = Ey + e1
  )
}
dat <- data.table::rbindlist(sims)
plot(dat)
lines(x, preds, col = 'red')
par(mfrow = c(1, 1))

On the left, where we make those two really extreme predictions beyond $\pm 5$, it looks like we're spot on. However, once we look at a ton of data from the same process (right panel), we see how silly those predictions are. Imagine telling your boss that, when $x = 7.777778$, the prediction is about $8.6$. I would imagine a (warranted, I believe) response of, "Well, n-l-i, then why do we observe $x = 7.777778$ all the time yet basically never get $y \approx 8.6$ when we do?"
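To put numbers on that complaint, here is a small sketch reusing x, preds, and dat from the code above (the index lookup and the tolerance of 1 unit are my own arbitrary choices):
i <- which.min(abs(x - 7.777778))            # the grid point the boss asks about
preds[i]                                     # the extreme prediction there (about 8.6 in the run discussed above)
y_at_i <- dat$y[abs(dat$x - x[i]) < 1e-8]    # all 251 simulated observations at that x
summary(y_at_i)                              # the bulk should sit near sin(x[i]), about 1, nowhere near 8.6
mean(abs(y_at_i - preds[i]) < 1)             # how often an observation lands within 1 unit of the prediction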
EDIT 2
The model already has an incentive to make extreme ("risky") predictions, as the loss is high when those extreme values occur yet only modest predictions are made. Giving the model flexibility lets it fit those extreme values and reduce the loss, which it wants to do, but there is a risk that you will...
...OVERFIT
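If you instead wanted to rein that flexibility back in, nnet's built-in weight decay is the obvious knob; here is a sketch on the most recent simulated x and y (the decay value of 0.1 is an arbitrary choice of mine).
set.seed(2023)
L_decay <- nnet::nnet(
  y ~ x,
  size = 300,
  linout = TRUE,
  decay = 0.1,      # weight-decay penalty: shrinks the weights and tames the wiggles
  maxit = 2500
)
plot(x, y)
lines(x, Ey)
lines(x, predict(L_decay), col = 'blue')   # should stay much closer to sin(x) and largely give up on the extreme points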