HCV.Egy.Data <- read.csv("~/Stat/Metaheuristic ML/hepatitis+c+virus+hcv+for+egyptian+patients/HCV-Egy-Data.csv")

# Class counts for the response, then convert it to a factor so that
# randomForest treats the problem as classification.
table(HCV.Egy.Data$Baselinehistological.staging)
HCV.Egy.Data$Baselinehistological.staging <-
    as.factor(HCV.Egy.Data$Baselinehistological.staging)

library(randomForest)
library(caTools)

# Seed before the split so the 70/30 partition is reproducible, and split on the
# response vector (sample.split expects a vector, not a data frame).
set.seed(120)
split <- sample.split(HCV.Egy.Data$Baselinehistological.staging, SplitRatio = 0.7)
split
train <- subset(HCV.Egy.Data, split == "TRUE")
test  <- subset(HCV.Egy.Data, split == "FALSE")

# Random forest on all predictors except columns 28 and 29 (column 29 is the
# response); classification mode follows from the factor response.
classifier_RF <- randomForest(x = train[, -c(28, 29)],
                              y = train$Baselinehistological.staging,
                              ntree = 500)
classifier_RF$classes

# Predictions and confusion matrix on the held-out test set.
y_pred <- predict(classifier_RF, newdata = test[, -c(28, 29)])
confusion_mtx <- table(test[, 29], y_pred)
confusion_mtx

# Out-of-bag (training) confusion matrix for comparison.
table(train[, 29], classifier_RF$predicted)

I am getting low accuracy (20-30%) on this dataset with many different ML methods (random forest, XGBoost, AdaBoost, logistic regression, etc.). Here is the confusion matrix from running the code above:

 y_pred
     1  2  3  4
  1 13 22 36 38
  2 21 17 39 33
  3 15 20 38 27
  4 13 16 37 45
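
For reference, the overall accuracy implied by that matrix is just the diagonal over the total:

sum(diag(confusion_mtx)) / sum(confusion_mtx)
# (13 + 17 + 38 + 45) / 430, roughly 0.26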

We also applied SMOTE (up to 8 times) and a discretization step, but that only raised the accuracy to about 33%. Does anyone know why the papers that have used this dataset report accuracies in the 70-95% range? Is something glaringly missing?

For example, here's a paper: https://www.researchgate.net/profile/Md-Satu/publication/341987762_Predicting_Infectious_State_of_Hepatitis_C_Virus_Affected_Patient%27s_Applying_Machine_Learning_Methods/links/5edcab8a45851529453fc609/Predicting-Infectious-State-of-Hepatitis-C-Virus-Affected-Patients-Applying-Machine-Learning-Methods.pdf
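
One way I could imagine numbers like that arising (a sketch of a possible pitfall, not something I have verified against that paper) is oversampling before the train/test split: synthetic or duplicated rows derived from the same patients then end up on both sides of the split, so the test set is no longer independent. The snippet below uses plain row duplication as a stand-in for SMOTE, with the same randomForest/caTools calls as above, just to show the effect:

# Oversample (here: naive duplication, standing in for SMOTE) BEFORE splitting.
set.seed(120)
dup <- HCV.Egy.Data[sample(nrow(HCV.Egy.Data), 2 * nrow(HCV.Egy.Data), replace = TRUE), ]
split_dup <- sample.split(dup$Baselinehistological.staging, SplitRatio = 0.7)
train_dup <- dup[split_dup, ]
test_dup  <- dup[!split_dup, ]
rf_leaky <- randomForest(x = train_dup[, -c(28, 29)],
                         y = train_dup$Baselinehistological.staging,
                         ntree = 500)
# This "test" accuracy typically comes out well above ~0.26, but only because many
# test rows are exact copies of training rows. Any oversampling has to be applied
# to the training fold only.
mean(predict(rf_leaky, newdata = test_dup[, -c(28, 29)]) == test_dup$Baselinehistological.staging)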

Here is the dataset: https://archive.ics.uci.edu/dataset/503/hepatitis+c+virus+hcv+for+egyptian+patients
If you know of any fixes, it would be much appreciated.

– Vons
  • I'm surprised that their accuracy increases after applying SMOTE. Typically on imbalanced datasets, your accuracy is very high. For example, if you are training an algorithm to detect terrorists, and 99.9% of people are not terrorists, then your algorithm can always output "not a terrorist" and get 99.9% accuracy. Oversampling the minority class usually reduces accuracy. – Nick ODell Jan 11 '24 at 20:42
  • One possibility I can think of is that they did SMOTE prior to their cross-validation. Perhaps SMOTE created new data examples that are similar to existing data, and those examples ended up in different CV splits. That would be sloppy, but I could see it happening. – Nick ODell Jan 11 '24 at 20:45
  • SMOTE is also kind of a bizarre choice here, because the dataset isn't especially unbalanced. Each of the 4 categories makes up 23-26% of the dataset (see the quick check after these comments). – Nick ODell Jan 11 '24 at 20:51
  • I only skimmed the paper you linked to, but they do not mention whether they calculated accuracy on a holdout sample. If they did so in-sample, things are a lot easier. (And not very useful.) And if they did some kind of out-of-bag error calculation after resampling, then the results are again likely unsurprising. In the end, your best bet is likely to contact the authors and ask whether they can send you their scripts. – Stephan Kolassa Jan 11 '24 at 21:06
  • Not solving the problem, but split == "TRUE" is very bad style. – Michael M Jan 11 '24 at 21:32
  • Are you able to provide a link to the data? I'd be happy to have a quick go myself; the data seems pretty small. Broadly speaking, if you've not made an error and you can't get accuracies in excess of ~30%, then it's exceptionally unlikely that it's possible to get accuracies of 90%. It's totally implausible that you'd get that jump just by using a different model. It's fairly implausible, but I guess possible, that you could get it by doing some complex, domain-knowledge-inspired feature engineering that creates compound features too difficult for an ML model to learn (especially with such small data). – gazza89 Jan 12 '24 at 15:41
  • @gazza89 That would be helpful! If another person gets the same results then this would indicate something strange happening. But there are a bunch of papers that have 90% accuracy rates, so it couldn't be they're all...false? The dataset is in the link at the bottom of the original post--click "Download" in the upper right. Baselinehistological staging is the response with 4 classes. – Vons Jan 13 '24 at 23:47
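
As a quick sanity check on the class-balance point raised in the comments, the class proportions and the majority-class baseline can be computed directly from the data frame loaded in the question (nothing here beyond base R and HCV.Egy.Data as defined above):

# Share of each Baselinehistological.staging class; the four stages each make up roughly a quarter.
prop.table(table(HCV.Egy.Data$Baselinehistological.staging))
# Accuracy of always predicting the single most common stage (majority-class baseline).
max(table(HCV.Egy.Data$Baselinehistological.staging)) / nrow(HCV.Egy.Data)
# Accuracies in the 0.25-0.30 range are therefore close to guessing on this 4-class problem.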

0 Answers