
I am playing around with a classification model, and I would like to know if there are any known methods to achieve what I am looking to do.

The data looks something like this:

| Class | id  | $x_1$ | $x_2$ | ... | $x_n$ |
|-------|-----|-------|-------|-----|-------|
| 0     | 1   |       |       |     |       |
| 0     | 1   |       |       |     |       |
| 0     | 1   |       |       |     |       |
| ...   | ... |       |       |     |       |
| 1     | 1   |       |       |     |       |
| 2     | 1   |       |       |     |       |
| 3     | 1   |       |       |     |       |
| 0     | 2   |       |       |     |       |
| 0     | 2   |       |       |     |       |
| 0     | 2   |       |       |     |       |
| ...   | ... |       |       |     |       |
| 1     | 2   |       |       |     |       |
| 2     | 2   |       |       |     |       |
| 3     | 2   |       |       |     |       |
| 0     | 3   |       |       |     |       |
| 0     | 3   |       |       |     |       |
| 0     | 3   |       |       |     |       |
| ...   | ... |       |       |     |       |
| 1     | 3   |       |       |     |       |
| 2     | 3   |       |       |     |       |
| 3     | 3   |       |       |     |       |
| ...   | ... |       |       |     |       |

Here $\{x_1,\dots,x_n\}$ are the features and $id$ is a flag I would like to make use of, but not as a "feature".

There are four classes, $\{0,1,2,3\}$, with class 0 representing a bit under 90% of the data. The thing is, within an id group I know there is exactly one occurrence of class 1, one of class 2 and one of class 3, with every other observation being class 0, and I will know this about unclassified data too. The class labels within an id group are therefore not independent, in the sense that if one observation is a 1, then no other observation in that group can be a 1.

So my question is: is there a method I can use that incorporates my knowledge about these id groups, forcing the model to assign exactly one observation to class 1, exactly one to class 2 and exactly one to class 3 within each id group?

Edit:

I should mention that I have tried undersampling the majority class, oversampling the minority classes, and modifying the loss function to penalise misclassification of the minority classes more heavily. These are ideas I have seen used in fraud detection, but in fraud detection you do not know ahead of time how many fraudulent transactions there are in a group.

My question is focused on whether there is a way to incorporate my pre-existing knowledge of the count of each class (inside an id group) into my predictions. Could I perhaps customise the loss function to heavily penalise assigning the wrong number of observations to each class?
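For concreteness, one way I could imagine using these counts is at prediction time rather than in the loss: fit any probabilistic classifier while ignoring the constraint, then within each id group pick the labelling that maximises the total predicted log-probability subject to exactly one observation in each of classes 1, 2 and 3. That is a small assignment problem. Below is a purely illustrative sketch, assuming per-player log-probabilities (e.g. from a scikit-learn classifier's predict_log_proba), at least four players per game, and SciPy's linear_sum_assignment; the name constrained_decode is made up for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_decode(log_proba):
    """Within one id group, assign exactly one observation to each of
    classes 1, 2 and 3 (everyone else gets class 0), maximising the sum
    of predicted log-probabilities.

    log_proba : (n_players, 4) array of log P(class | features) per player,
                columns ordered as classes 0, 1, 2, 3.
    Returns an (n_players,) array of assigned class labels.
    """
    n = log_proba.shape[0]
    # Square cost matrix over "slots": slots 0-2 are classes 1-3 (one each),
    # slots 3..n-1 are interchangeable copies of class 0.
    cost = np.empty((n, n))
    cost[:, :3] = -log_proba[:, 1:4]          # cost of giving player i class 1/2/3
    cost[:, 3:] = -log_proba[:, [0]]          # cost of leaving player i in class 0
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    labels = np.zeros(n, dtype=int)
    special = cols < 3                        # players matched to a class-1/2/3 slot
    labels[rows[special]] = cols[special] + 1
    return labels
```

The same constraint could in principle be pushed into training (e.g. as a structured loss over whole games), but decoding like this on top of an unconstrained classifier would be the simplest baseline to try.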

Edit (2): To provide more context, the id flag represents a game, and all observations with the same id flag are players in that game. The features are the 'stats' for each player in that particular game. After a game is played, a player is awarded 3 points if they are the best player, 2 points if they are the second best player, and 1 point if they are the third best player. All other players are awarded 0 points. These points are what define the classes.

Edit (3): The classes are some sort of transformation of outcomes in the game. For example, in a soccer match, if a player kicks more goals and touches the ball more times than a second player, the first player has most likely played better and will most likely be ranked higher. Also, I don't need to predict those game outcomes; they are known at the time of ranking, which occurs after the game has been played. I would like to predict what the ranking will be, given the outcomes from the game.

TNoms
  • I'm not 100% sure that a standard supervised learning method is the right approach (what's the definition of a class? For most real-world applications, couldn't you check each example individually?). So if there are further problem assumptions you have, it would be helpful to hear them. – chang_trenton Aug 30 '23 at 00:48
  • Circling back to class definition -- maybe this is an outlier detection problem, and you can pick the "top 3 most outlier-ish" (as defined by the method) points and enumerate all $3! = 6$ sets of predictions. If we are given no assumptions, then maybe your "heavily penalize classifying the incorrect number" idea would work as an optimization constraint/regularization term, but I'm not sure a priori. – chang_trenton Aug 30 '23 at 00:49
  • I have edited my question to hopefully provide more context and information about the classes. The outlier detection idea sounds interesting. Do you think there would be merit in a model to detect the top three outliers as you suggest, and then feed in those outliers to another model that was trained on only classes 1, 2, and 3? – TNoms Aug 30 '23 at 02:05
  • The answer to any "would there be merit/does X work" question is usually "depends on your problem," unfortunately -- but details can help! The new context is extremely helpful, and makes me think this is a learning-to-rank problem -- I don't have the bandwidth to put together a full answer now but hopefully, this gives you a starting point to google. – chang_trenton Aug 30 '23 at 02:10
  • Is the class (0, 1, 2, 3) a transformation of some specific outcome in the game (an outcome which you observed)? For example, if players are actively eliminating each other in the game, then no: the class is just the order of elimination (ignoring those outside the top 3). But if players are collecting rewards, then the answer is yes: the class is the rank of rewards. I'm asking b/c if the answer is yes, then the baseline to beat is to predict rewards (assuming they're independent, even if they're not) and then rank based on those predictions. – chicxulub Aug 30 '23 at 04:18
  • I have edited the question to clarify. – TNoms Aug 30 '23 at 04:31
  • After some quick research, following @chang_trenton's comment about learning to rank, it seems like the ListMLE objective is applicable. It's part of a larger class of ranking approaches called the listwise ranking approach. – chicxulub Aug 30 '23 at 07:25
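To make the ListMLE suggestion in the comments concrete: ListMLE treats each player's model score as a Plackett-Luce utility and maximises the likelihood of the observed within-game ranking. A minimal PyTorch-style sketch (not code from this thread; the name listmle_loss is illustrative, and the ordering among the tied class-0 players has to be fixed arbitrarily):

```python
import torch

def listmle_loss(scores, true_order):
    """ListMLE: negative Plackett-Luce log-likelihood of the observed ranking.

    scores     : (n_players,) tensor of model scores for one game
    true_order : (n_players,) long tensor of player indices sorted from best
                 to worst (class 3, then 2, then 1, then the 0s in any order)
    """
    s = scores[true_order]  # scores arranged in true-rank order
    # At step i the denominator is a log-sum-exp over the players not yet
    # ranked (positions i..n-1); compute it with a reversed cumulative logsumexp.
    denom = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (denom - s).sum()
```

At prediction time the three highest-scoring players in a game would be mapped to classes 3, 2 and 1, which automatically respects the one-per-class constraint.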

1 Answer


Firstly, you have to acknowledge that your dataset is heavily imbalanced (class sizes are disproportionate).

Therefore, you have to train differently, by oversampling the minority classes or undersampling the majority class in your training set.

Alternatively, you can modify your loss function (e.g. focal loss) to penalize errors on the minority classes more heavily than errors on the majority class.
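For concreteness, here is a minimal sketch of such a multiclass focal loss, assuming a PyTorch model that outputs raw logits (the name focal_loss is illustrative):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multiclass focal loss: scales cross-entropy by (1 - p_t)^gamma so that
    easy (mostly majority-class) examples contribute less to the total loss.

    logits  : (batch, n_classes) raw model outputs
    targets : (batch,) integer class labels
    gamma   : focusing parameter; gamma = 0 recovers ordinary cross-entropy
    alpha   : optional (n_classes,) tensor of per-class weights
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    loss = -((1.0 - log_pt.exp()) ** gamma) * log_pt
    if alpha is not None:
        loss = loss * alpha[targets]
    return loss.mean()
```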

You also have to do careful exploratory data analysis: identify the features that correlate strongly with the minority classes, identify those that correlate strongly with the majority class, and discard features that correlate strongly with both.

You could also use an indicator variable for each class and id as an input feature. However, this variable depends on (is conditioned on) the class assigned to the previous observation, i.e. the previous output. To do this, you need to create an augmented training set that includes this indicator variable.

  • -1 This seems not to address the problem at hand, preferring to focus on the likely-non-problem of imbalance (see the link), and also gives some poor advice that seems to be rooted in common statistical misconceptions, starting with the idea that class imbalance is an inherent problem in need of remedy, which it almost certainly is not. – Dave Aug 30 '23 at 00:47
  • @Dave, care to give a detailed explanation of the misconceptions? – Jose_Peeterson Aug 31 '23 at 07:42
  • Class imbalance and univariate variable screening are two topics that are discussed extensively on here. – Dave Aug 31 '23 at 11:38