HCV.Egy.Data <- read.csv("~/Stat/Metaheuristic ML/hepatitis+c+virus+hcv+for+egyptian+patients/HCV-Egy-Data.csv")
table(HCV.Egy.Data$Baselinehistological.staging)
HCV.Egy.Data$Baselinehistological.staging <-
as.factor(HCV.Egy.Data$Baselinehistological.staging)
library("randomForest")
library(caTools)
split <- sample.split(HCV.Egy.Data, SplitRatio = 0.7)
split
train <- subset(HCV.Egy.Data, split == "TRUE")
test <- subset(HCV.Egy.Data, split == "FALSE")
set.seed(120) # Setting seed
classifier_RF = randomForest(x = train[, -c(28, 29)],
                             y = train$Baselinehistological.staging,
                             ntree = 500, type = "classification")
classifier_RF$classes
y_pred = predict(classifier_RF, newdata = test[, -c(28,29)])
confusion_mtx = table(test[, 29], y_pred)
confusion_mtx
table(train[, 29], classifier_RF$predicted)
I am getting low accuracy (20-30%) on this dataset with many different ML methods (random forest, XGBoost, AdaBoost, LR, etc.). Here is the confusion matrix from running the above code (rows are the true stage, columns are the predicted stage):
     y_pred
       1  2  3  4
    1 13 22 36 38
    2 21 17 39 33
    3 15 20 38 27
    4 13 16 37 45
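For reference, the overall accuracy implied by this matrix can be worked out directly from its counts. The snippet below is only a sketch: it re-enters the numbers shown above by hand, and the names `accuracy`, `class_counts`, and `majority_share` are illustrative and not part of the original code.

```r
# Confusion matrix re-entered from the output above
# (rows: true stage, columns: predicted stage)
confusion_mtx <- matrix(
  c(13, 22, 36, 38,
    21, 17, 39, 33,
    15, 20, 38, 27,
    13, 16, 37, 45),
  nrow = 4, byrow = TRUE,
  dimnames = list(actual = as.character(1:4), y_pred = as.character(1:4))
)

# Correct predictions sit on the diagonal: 113 of 430 test cases
accuracy <- sum(diag(confusion_mtx)) / sum(confusion_mtx)

# Test cases per true stage, and the accuracy of always guessing the largest stage
class_counts <- rowSums(confusion_mtx)
majority_share <- max(class_counts) / sum(confusion_mtx)

round(c(accuracy = accuracy, majority_share = majority_share), 3)
# accuracy 0.263, majority_share 0.258
```

The row totals (109, 110, 100, 111) suggest the four stages are close to balanced in this test set, and the accuracy is barely above what always guessing one stage would give.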
We even applied SMOTE (8 times) and the discretization steps, but accuracy only increased to 33%. Does anyone know why the papers that have used this dataset are getting accuracies in the 70-95% range? Is something glaringly missing?
Here is the dataset: https://archive.ics.uci.edu/dataset/503/hepatitis+c+virus+hcv+for+egyptian+patients If you know any fixes, it would be much appreciated.
split == "TRUE" is very bad style. – Michael M Jan 11 '24 at 21:32
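To illustrate the style point in the comment, here is a minimal sketch of how the split might be written. It uses a made-up data frame `toy` as a stand-in for `HCV.Egy.Data` (the columns `age`, `bmi`, and `stage` are hypothetical), and it is not a claim about what explains the accuracy gap. It also shows a related detail: `sample.split()` expects the outcome vector, so passing a whole data frame yields one flag per column instead of one per row.

```r
library(caTools)

set.seed(120)  # seed before the split so the partition is reproducible

# Hypothetical stand-in for HCV.Egy.Data: 200 rows, 3 columns
toy <- data.frame(
  age   = sample(30:60, 200, replace = TRUE),
  bmi   = sample(22:35, 200, replace = TRUE),
  stage = factor(sample(1:4, 200, replace = TRUE))
)

# A data frame is a list of columns, so this returns ncol(toy) flags,
# which subset() would then recycle across the rows
flags_per_column <- sample.split(toy, SplitRatio = 0.7)
length(flags_per_column) == ncol(toy)

# Passing the outcome vector returns one flag per row, split within each stage
in_train <- sample.split(toy$stage, SplitRatio = 0.7)

# in_train is already logical, so it indexes rows directly;
# no comparison against the string "TRUE" is needed
train <- toy[in_train, ]
test  <- toy[!in_train, ]
```

With the real data, the same pattern would use `HCV.Egy.Data$Baselinehistological.staging` in place of `toy$stage`.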