
My dataset consists of 1 million observations.

I am running a linear regression of Y on X (a mean-centered and scaled variable) to demonstrate that there is a curvilinear relationship between the two variables, using a model of the form lm(y ~ x + I(x^2)). This results in a significant coefficient on the squared term. As a sanity check, I started adding polynomial terms of higher degree, I(x^3), I(x^4), and so on, expecting that these would not be significant. To my chagrin, the terms all the way up to I(x^30) were significant.

I assume this is partly because my dataset is so large: the estimates are very precise, so statistical significance is easy to detect. So my question is:

Why is this happening in my dataset? Is there a stricter way to test significance in big datasets? What is the correct way to detect a curvilinear relationship between y and x in big datasets?

Parseltongue
  • What are you comparing significance against? You might find that the higher-degree model predicts better than a null model, but not significantly better than the low-degree model. – Nuclear Hoagie Jan 28 '19 at 14:42
  • Prediction isn't really relevant for our task, but I assume I could use AIC/BIC to compare models? – Parseltongue Jan 28 '19 at 14:55
  • That would work, or run an ANOVA between the models directly. What I'm trying to get at is maybe the coefficients on your higher-order terms are significantly different from 0 due to the large amount of data, but do little to change the model output. If you see no significant difference in the outputs of the higher or lower order models (as measured by ANOVA), the extra terms aren't adding much. – Nuclear Hoagie Jan 28 '19 at 15:06
  • +1 for running that sanity check. – Stephan Kolassa Jan 28 '19 at 15:10
  • You might try a GAM with a smoothing spline, which will penalize more terms and higher degree polynomials. – Estimate the estimators Jan 28 '19 at 15:14
  • What is your task, if prediction isn't? If you are just assessing statistical significance for the sheer joy of it, then you already have your answer: there is a statistically significant curvilinear relationship in your data, and it's even more curvilinear than order 2. It will likely make little practical difference, but the other comments here have already been pointing that out. So: what are you trying to achieve? – Stephan Kolassa Jan 28 '19 at 15:20
  • @StephanKolassa: The task could be explanation, rather than prediction? In other words, understanding and describing the effect of X on Y? But even so, trying to plot Y vs X first via a high-density scatterplot (https://rstudio-pubs-static.s3.amazonaws.com/151690_ac65a180e03641e2adc3cb2ecf6306c3.html) should be the first step in the modeling process. If the scatterplot indicates a smooth, non-linear effect of X on Y, then the bam() function from the mgcv package can be used to estimate it (https://www.rdocumentation.org/packages/mgcv/versions/1.8-26/topics/bam). – Isabella Ghement Jan 28 '19 at 16:13
  • @IsabellaGhement spot on. Great suggestions. Thanks! – Parseltongue Jan 28 '19 at 16:17
  • You're welcome, @Parseltongue. This thread should also come in handy: https://stats.stackexchange.com/questions/22233/how-to-choose-significance-level-for-a-large-data-set. – Isabella Ghement Jan 28 '19 at 16:21
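
The suggestions in the comments above (compare nested models directly, look at how much the extra terms change the fit, and try a penalized smooth) could be sketched roughly as follows. This is only an illustration on simulated data: the sample size, the coefficients, and the names d, m2, m3, and g are assumptions made up for the example and are not taken from the question. It uses poly() instead of raw I(x^k) terms, which is a substitution on my part to avoid collinearity among raw powers.

```r
# Simulated stand-in for the real data (all numbers here are hypothetical).
set.seed(42)
n <- 200000
x <- as.numeric(scale(rnorm(n)))   # mean-centered, unit-variance predictor
y <- 1 + 0.5 * x - 0.3 * x^2 + 0.02 * x^3 + rnorm(n)
d <- data.frame(x = x, y = y)

# Nested polynomial fits with orthogonal polynomial terms.
m2 <- lm(y ~ poly(x, 2), data = d)
m3 <- lm(y ~ poly(x, 3), data = d)

# Statistical significance: F-test for adding the cubic term.
cmp <- anova(m2, m3)
p_cubic <- cmp[["Pr(>F)"]][2]

# Practical relevance: extra share of variance explained by the cubic term.
delta_r2 <- summary(m3)$r.squared - summary(m2)$r.squared

# Information criteria for the two models (BIC's penalty grows with log(n)).
ic <- BIC(m2, m3)

print(cmp)
print(delta_r2)
print(ic)

# Penalized smooth via mgcv::bam(), as suggested in the comments.
# Skipped if mgcv is not installed.
if (requireNamespace("mgcv", quietly = TRUE)) {
  library(mgcv)
  g <- bam(y ~ s(x), data = d)
  print(summary(g)$edf)   # effective degrees of freedom of the smooth
  # plot(g) would draw the estimated curve with its confidence band
}
```

With a sample this large, the F-test can flag the cubic term as highly significant even though delta_r2 is tiny, which is the distinction between statistical and practical significance that the comments point at. The effective degrees of freedom reported for s(x) give a data-driven indication of how wiggly the relationship is, without committing to a specific polynomial degree.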
