Hi everyone, I'm quite new to the ML field and working on the capstone project for my master's degree. I have done hours of research and testing on how to train and evaluate several models, but I am still getting undesirable results: accuracy is above 99%, while precision and F1 are 0 and AUC is roughly 0.5.

A little background on what I am doing. I have over 200,000 data points of aircraft approaching a runway at 500 feet. The features are airspeed, glideslope deviation, rate of descent, and engine thrust (the CAS, GSD, ROD, and N1 columns below), plus a target variable indicating whether the approach ended in a go-around or not. Naturally, successful landings make up the overwhelming majority class.

Here's my workflow using a random forest: I normalize the features, split the data into train and test sets, apply SMOTE to oversample the minority class in the training data, and then apply a very high class weight multiplier to the minority class. Here are the scores:
- Accuracy: 0.9923627593599265
- Precision: 0.0
- F1 Score: 0.0
- AUC: 0.49779164666346615
Since precision and F1 are both 0, I assume the model never predicts the go-around class on the test set, but I'm not sure what I am doing wrong, as this should be a relatively straightforward project. Thank you! Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.utils import class_weight
# Read the CSV file into a DataFrame
data = pd.read_csv("raw.csv")
# Select the columns of interest
columns_of_interest = ["CAS", "ROD", "GSD", "N1"]
# Apply Min-Max scaling to normalize the data
scaler = MinMaxScaler()
data[columns_of_interest] = scaler.fit_transform(data[columns_of_interest])
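# Save the normalized data to a CSV and read it back in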
data.to_csv("raw_normal.csv", index=False)
data = pd.read_csv("raw_normal.csv")
feature_columns = ["CAS", "ROD", "GSD", "N1"]
target_column = "GA"
X = data[feature_columns]
y = data[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
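# Oversample the minority (go-around) class in the training set only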
smote = SMOTE(random_state=42)
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)
class_weight_multiplier = 999999 # Assign a very high weight to the minority class
class_weights = class_weight.compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train_oversampled)
class_weights[1] *= class_weight_multiplier
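# After SMOTE the training classes are balanced, so the "balanced" weights are ~1.0 each before the multiplier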
clf = RandomForestClassifier(class_weight={0: class_weights[0], 1: class_weights[1]})
# Train the model on the oversampled training data
clf.fit(X_train_oversampled, y_train_oversampled)
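# Predict on the untouched (not oversampled) test set and compute metrics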
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("F1 Score:", f1)
print("AUC:", auc)