Influence of Large Data Size on Logistic Regression Fit and McFadden's $R^2$

Question

I am working on a logistic regression analysis with approximately 16 million data points, and I would like to understand how such a large dataset influences the fit of the model.

The data I have is highly skewed, with one class being significantly underrepresented compared to the other. Additionally, I have been using McFadden's pseudo-$R^2$ as a measure of goodness of fit for my logistic regression model.
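
For reference, the definition of McFadden's pseudo-$R^2$ I am assuming is the usual one (the symbols $M_{\text{full}}$ and $M_{\text{null}}$ are my own notation):

$$R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}(M_{\text{full}})}{\ln \hat{L}(M_{\text{null}})}$$

Here $\hat{L}(M_{\text{full}})$ is the maximized likelihood of the fitted model and $\hat{L}(M_{\text{null}})$ is that of the intercept-only model. Both log-likelihoods scale roughly with the number of observations, so as far as I understand the ratio need not increase merely because the sample is larger.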

My question is: how does the volume of data, especially when the data are skewed, affect the fit of a logistic regression model? I always believed that more data would improve the model's performance, but my teacher hinted that there might be other considerations.

"Unbalanced" might be a more appropriate description of your data than "skewed", since skewness has a specific statistical definition (the third standardized moment of a distribution). One well-known consequence of large datasets is that p-values tend toward significance, even for very small effects. Given that, feature effect sizes are a better indication of whether something matters in the model. Consider expanding your metrics of model performance to include predictive accuracy. — user78229, Jun 30 '23 at 12:16
There's a lot of confusion around unbalanced classes: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he — mkt, Jun 30 '23 at 12:37
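
The point in the first comment about p-values and effect sizes can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic (an intercept of -4 for a rare positive class, and a weak slope of 0.1 on one standardized feature), the sample sizes are illustrative rather than the 16 million in the question, and the helper names `fit_logit` and `summarize` are hypothetical. The model is fitted with a plain Newton-Raphson loop in NumPy instead of a statistics library.

```python
import math
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    X must already contain an intercept column. Returns the coefficients,
    their Wald standard errors, and the maximized log-likelihood.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])        # observed information
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    loglik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, np.sqrt(np.diag(cov)), loglik

def summarize(n, rng, intercept=-4.0, slope=0.1):
    """Simulate n unbalanced observations (assumed model) and fit them."""
    x = rng.standard_normal(n)
    prob = 1.0 / (1.0 + np.exp(-(intercept + slope * x)))
    y = (rng.random(n) < prob).astype(float)
    X = np.column_stack([np.ones(n), x])
    beta, se, ll_model = fit_logit(X, y)

    # Null model: intercept only, so the fitted probability is the base rate.
    ybar = y.mean()
    ll_null = n * (ybar * math.log(ybar) + (1.0 - ybar) * math.log(1.0 - ybar))

    z = beta[1] / se[1]
    return {
        "n": n,
        "positive_rate": ybar,
        "slope": beta[1],
        "odds_ratio": math.exp(beta[1]),               # effect size
        "p_value": math.erfc(abs(z) / math.sqrt(2.0)), # two-sided Wald test
        "mcfadden_r2": 1.0 - ll_model / ll_null,
    }

rng = np.random.default_rng(0)
small = summarize(2_000, rng)
large = summarize(1_000_000, rng)
for result in (small, large):
    print(result)
```

In a setup like this, the estimated odds ratio stays close to the same weak effect at both sample sizes, while the p-value of the slope becomes extremely small in the large sample and McFadden's $R^2$ remains near zero. That is the sense in which significance alone says little with millions of rows.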

0 Answers