Let's say we have the basic MNIST dataset and the same goal of predicting the digit, BUT we replace every label with either RED or BLUE. The label becomes BLUE with probability DIGIT/10. So if the digit is FIVE, the probability of the label being BLUE is 50%, and we get an equal number of RED and BLUE labels. And if the digit is TWO, the probability of BLUE is only 20%, so we get mostly RED.
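For concreteness, here's a minimal sketch of that label-generation process (the function name and the use of NumPy are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_color_label(digit: int) -> str:
    """Return BLUE with probability digit/10, otherwise RED."""
    return "BLUE" if rng.random() < digit / 10 else "RED"

# Example: relabel a small batch of MNIST digits.
digits = np.array([5, 2, 9, 0])
colors = [make_color_label(d) for d in digits]
# A digit 5 gets BLUE ~50% of the time, a digit 2 only ~20%,
# a digit 0 is always RED.
```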
On one hand this seems like an extremely trivial task, but on the other hand I can't see any way of approaching it, despite many years in ML and programming.
My question: what architecture or methodology could be used to approach this kind of model? I understand that a traditional feed-forward neural network won't be able to deal with the probabilistic nature of this problem, and Probabilistic Programming tools don't seem suitable either. I would appreciate links to similar solved problems, or any proposals for how this could be conceptually approached. Right now it doesn't seem that a traditional neural network will do any good on this problem.

…red, 10% it's blue). We're not predicting the training label, but the probability. (Well, that's my idea anyway.) – avloss May 21 '22 at 15:33

First, predict the digit: the idea is that the model doesn't see the labels. If the task is broken up in two, then each part separately is more-or-less trivial, but they require completely different toolsets. We assume the model doesn't know how many labels there are. It should also ideally be resilient to unbalanced classes and have some notion of confidence. (I know I'm asking for too much, but I've been thinking about this for a while now.) – avloss May 21 '22 at 15:37
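One way to make the "predict the probability, not the label" idea concrete (a sketch of my own, not an established answer): a plain classifier trained with binary cross-entropy on the noisy RED/BLUE labels will, in expectation, output an estimate of P(BLUE | image), because cross-entropy is minimized when the model's output equals the true conditional probability. A minimal PyTorch version, assuming images come in as flattened-able 28x28 tensors and labels as 0 (RED) / 1 (BLUE):

```python
import torch
import torch.nn as nn

# Deliberately simple model: logistic regression on flattened 28x28 images.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1))
loss_fn = nn.BCEWithLogitsLoss()  # minimized when sigmoid(logit) = P(BLUE | x)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(images: torch.Tensor, is_blue: torch.Tensor) -> float:
    """One optimization step on a batch of images and 0/1 color labels."""
    logits = model(images).squeeze(1)
    loss = loss_fn(logits, is_blue.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, sigmoid(model(x)) should approximate DIGIT/10
# for the digit shown in x, even though each individual label is noisy.
```

The point of the sketch is that no special probabilistic machinery is needed for this part: the noise in the labels averages out, and the sigmoid output converges toward the label frequency for each digit.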