Usually what I see is that the baseline and the base model have different accuracies, so the goal is then to clean the data, do some feature engineering, etc., and build a model that performs better than the baseline. Everyone, including my professor, says that the goal is to build a model that performs better than the stupid model/baseline.
My baseline accuracy is 95.13%, and my CART model achieves exactly the same performance. In fact, any model I throw at the dataset gives the same accuracy. My target (a binary stroke outcome) is highly imbalanced: 95% outcome 0.0 / 5% outcome 1.0.
When I compute the baseline before feature engineering, the accuracy of my CART model (and of any other model) is 95.13%. After feature engineering, both the baseline and the models are still at 95.13%.
Is it a coincidence that the target imbalance is also 95%? Not a coincidence, right?
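To illustrate what I mean, here is a minimal sketch of the kind of baseline comparison I'm describing. I'm assuming scikit-learn and made-up placeholder data purely for illustration; my actual pipeline and columns are different:

```python
# Minimal sketch (assumed sklearn workflow; features are random placeholders).
# A majority-class "stupid model" scores accuracy equal to the majority
# prevalence, and an unbalanced CART tends to match it on a 95/5 target.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 5110                                   # same size as my data
X = rng.normal(size=(n, 5))                # placeholder features
y = (rng.random(n) < 0.05).astype(int)     # ~5% positive (stroke) outcomes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("baseline accuracy:", accuracy_score(y_te, baseline.predict(X_te)))
print("CART accuracy:    ", accuracy_score(y_te, cart.predict(X_te)))
# Both land near 0.95 -- the majority-class prevalence, not real skill.
```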
When I build models for exploration, such as KNN, logistic regression, C5, CART, and a neural network, they all underperform compared with the baseline of 95.13%. Before feature engineering these models perform in roughly the 70%-75% range, and after feature engineering in the 75%-82% range.
Naturally, I am performing this baseline analysis without any balancing, since the point is to build a stupid model as a benchmark.
So is it OK if my models, even after feature engineering, do not perform as well as the benchmark? How would I explain this?
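In case it helps, this is the kind of side-by-side comparison I have in mind when I ask whether my models "beat" the benchmark, reporting minority-class recall and F1 next to accuracy. Again, this is only a rough sketch assuming scikit-learn and placeholder data, not my actual models:

```python
# Rough sketch (assumed sklearn names; placeholder data as above): compare the
# "stupid" baseline with a balanced model on metrics that see the 5% class.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, f1_score

rng = np.random.default_rng(1)
n = 5110
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.05).astype(int)     # 95/5 imbalance, like my target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic (balanced)": LogisticRegression(class_weight="balanced",
                                              max_iter=1000),
}

for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:26s} acc={accuracy_score(y_te, pred):.3f} "
          f"recall={recall_score(y_te, pred, zero_division=0):.3f} "
          f"f1={f1_score(y_te, pred, zero_division=0):.3f}")
# The baseline wins on accuracy but has recall 0 on the stroke class, which
# is why I'm unsure whether raw accuracy is the right basis for "beating" it.
```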
"...more useful to look at the conditional odds ratios or relative risk relations between the outcome and baseline variables"means issues like multicollinearity? And"...identify high risk regions of the covariate space", means inputs that are not significant? --- plotting a histogram it was flat with everything at 5110 (sample size). image here >> https://i.imgur.com/7Yt7BCh.png – Edison Aug 29 '21 at 06:22