Hi everyone, I'm quite new to the ML field and working on the capstone project for my master's degree. I have done hours of research and testing on how to train and evaluate several models, but I am still getting undesirable results: accuracy is above 99%, while precision and F1 are 0 and AUC is roughly 0.5.

A little background on what I am doing. I have over 200,000 data points of aircraft approaching a runway at 500 feet. The features are airspeed, glideslope deviation, rate of descent, and engine thrust (the CAS, GSD, ROD, and N1 columns below), plus a target variable indicating whether the approach ended in a go-around or not. Naturally, successful landings make up the overwhelming majority class.

Here's my workflow using a random forest: I normalize the features, split the data into train and test sets, apply SMOTE to oversample the minority class in the training data, and then apply a very high class weight multiplier to the minority class. Here are the scores:
- Accuracy: 0.9923627593599265
- Precision: 0.0
- F1 Score: 0.0
- AUC: 0.49779164666346615
Since precision and F1 are both 0, I assume the model never predicts the go-around class on the test set, but I'm not sure what I am doing wrong, as this should be a relatively straightforward project. Thank you! Here is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.utils import class_weight
# Read the CSV file into a DataFrame
data = pd.read_csv("raw.csv")
# Select the columns of interest
columns_of_interest = ["CAS", "ROD", "GSD", "N1"]
# Apply Min-Max scaling to normalize the data
scaler = MinMaxScaler()
data[columns_of_interest] = scaler.fit_transform(data[columns_of_interest])
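# Save the normalized data to a CSV and read it back in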
data.to_csv("raw_normal.csv", index=False)
data = pd.read_csv("raw_normal.csv")
feature_columns = ["CAS", "ROD", "GSD", "N1"]
target_column = "GA"
X = data[feature_columns]
y = data[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
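# Oversample the minority (go-around) class in the training set only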
smote = SMOTE(random_state=42)
X_train_oversampled, y_train_oversampled = smote.fit_resample(X_train, y_train)
class_weight_multiplier = 999999 # Assign a very high weight to the minority class
class_weights = class_weight.compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train_oversampled)
class_weights[1] *= class_weight_multiplier
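# After SMOTE the training classes are balanced, so the "balanced" weights are ~1.0 each before the multiplier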
clf = RandomForestClassifier(class_weight={0: class_weights[0], 1: class_weights[1]})
# Train the model on the oversampled training data
clf.fit(X_train_oversampled, y_train_oversampled)
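# Predict on the untouched (not oversampled) test set and compute metrics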
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("F1 Score:", f1)
print("AUC:", auc)