I developed an ML model (XGBoost) to predict a target in my data set. Here are the predictions on my test set:
tibble(
  truth    = factor(c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0)),
  response = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 0),
  prob_0   = c(0.8829266, 0.8831959, 0.7404993, 0.3993190, 0.6625459, 0.3192227, 0.6028344, 0.2362246, 0.4525665, 0.9415646),
  prob_1   = c(0.11707342, 0.11680406, 0.25950068, 0.60068098, 0.33745408, 0.68077731, 0.39716560, 0.76377536, 0.54743350, 0.05843538)
)
I understand that the accuracy here corresponds to this code:
num_total <- nrow(pred_dt_poumon)
# number of predictions that match the ground truth
num_success <- sum(pred_dt_poumon$response == pred_dt_poumon$truth)
accuracy_poumon <- num_success / num_total
But now I'm stuck, because I don't know what to compare in order to get a p-value for this accuracy. Should I compare the response column to the truth column? Or num_success to 0.5 (chance)? Or the predicted probabilities to the ground truth? This is not really a code question but much more a methodology one.
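For context, one common way to formalize the "num_success versus 0.5" option above is an exact binomial test of the observed number of correct predictions against chance. This is only a sketch of that one option, not necessarily the right test here; in particular, with classes as imbalanced as these (9 of 10 observations are class 0), testing against the no-information rate (0.9, the accuracy of always predicting the majority class) is often considered more meaningful than testing against 0.5:

```r
# Sketch: exact binomial test of accuracy against chance (base R, stats package).
# With the 10 test observations above, 7 of the 10 responses match the truth.
num_total <- 10
num_success <- 7

# H0: each prediction is correct with probability 0.5
binom.test(num_success, num_total, p = 0.5)

# The same test against the no-information rate (majority class = 9/10)
binom.test(num_success, num_total, p = 0.9, alternative = "greater")
```

The two-sided p-value of the first test is 2 * P(X >= 7) for X ~ Binomial(10, 0.5), i.e. about 0.34, so with only 10 test cases this accuracy would not be distinguishable from chance.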
Or perhaps you are trying to obtain the probability of each observation being of a given class?
Better yet, why don't you simply tell us what you are trying to accomplish? What is it that you are interested in investigating or understanding statistically about your ML algorithm?
– StatsStudent Jun 12 '23 at 08:11