
I am fitting a logistic regression model to approximately 16 million data points, and I would like to understand how a dataset this large influences the fit of the model.

The data I have is highly skewed, with one class being significantly underrepresented compared to the other. Additionally, I have been using McFadden's pseudo-$R^2$ as a measure of goodness of fit for my logistic regression model.
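
For reference, McFadden's pseudo-$R^2$ compares the maximized log-likelihood of the fitted model with that of an intercept-only (null) model. The symbols $\hat{L}_{\text{model}}$ and $\hat{L}_{\text{null}}$ below are notation chosen here for illustration:

$$R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}$$

A value of 0 means the predictors add nothing over the intercept alone, and values closer to 1 mean the fitted log-likelihood is much closer to zero than the null log-likelihood.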

My question is: How does the volume of data, especially in the context of skewed data, impact the fit of a logistic regression model? I always believed that having more data would improve the model's performance, but my teacher hinted that there might be other considerations.

User1865345
  • 2
    Unbalanced might be a more appropriate description of your data than skewed, since skewness has a statistical definition as one of the higher moments of a distribution. One well-known consequence of large datasets is that p-values tend toward significance. Given that, falling back on feature effect sizes is a better way to judge whether something matters in the model. Consider expanding your metrics of model performance to include predictive accuracy. – user78229 Jun 30 '23 at 12:16
  • 1
    There's a lot of confusion around unbalanced classes: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – mkt Jun 30 '23 at 12:37
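
The remark in the first comment about p-values and large samples can be illustrated with a small sketch. It assumes a logistic regression with a single binary predictor, where the fitted coefficient equals the log odds ratio of the 2x2 table and its Wald standard error is the square root of the summed reciprocal cell counts. The event rates (1.0% and 1.1%), the group sizes, and the function name `wald_p_value` are hypothetical choices for illustration, not values from the question, and expected cell counts are used in place of sampled data so the output is deterministic.

```python
import math


def wald_p_value(n_per_group, p0=0.010, p1=0.011):
    """Return (odds ratio, two-sided Wald p-value) for a binary predictor.

    Uses expected cell counts for two groups of equal size, so the effect
    size is held fixed while only the sample size changes.
    p0 and p1 are hypothetical event rates for x = 0 and x = 1.
    """
    a = n_per_group * p1          # events when x = 1
    b = n_per_group * (1 - p1)    # non-events when x = 1
    c = n_per_group * p0          # events when x = 0
    d = n_per_group * (1 - p0)    # non-events when x = 0

    log_or = math.log((a * d) / (b * c))           # logistic coefficient for x
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald standard error
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided normal tail
    return math.exp(log_or), p_value


for n in (1_000, 10_000, 100_000, 1_000_000, 8_000_000):
    odds_ratio, p = wald_p_value(n)
    print(f"n per group = {n:>9,}  odds ratio = {odds_ratio:.3f}  p = {p:.3g}")
```

The odds ratio is identical in every row, while the p-value shrinks as the sample grows. This is why, at sample sizes in the millions, effect sizes and predictive performance are more informative than significance alone.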
