11

This is more of a theoretical question. Very large sample sizes will almost always show statistical significance when a $\chi^2$ test is done. Is there any other statistical test of significance (an alternative to $\chi^2$) that is good for testing independence when the sample size is very large?

This is the context of my problem: I have two large datasets of phrases. Set1 corresponds to the Google n-grams set and Set2 is a smaller set corresponding to the phrases found on one single website. Now consider a phrase, say 'Technology', found in Set2. I want to test whether this phrase is specific to this website (it could be, if it is a technology website) or whether it is a general phrase. So I am performing a $\chi^2$ test on the phrase frequencies in the two sets, as follows:

    Phrase              Set1 (Google n-grams)    Set2 (website)
    Not_Technology      2,674,797,869,255        46,168,477
    Technology          1,710,231                1,991
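
For concreteness, here is a minimal sketch in R of the test I am running on these counts (an illustration only, not a claim that this is the right analysis):

    # Phrase counts: rows = Not_Technology / Technology, columns = Set1 / Set2
    counts <- matrix(c(2674797869255, 1710231,   # Set1 (Google n-grams)
                       46168477,      1991),     # Set2 (website)
                     nrow = 2,
                     dimnames = list(c("Not_Technology", "Technology"),
                                     c("Set1", "Set2")))

    chisq.test(counts)   # at this sample size the p-value is essentially zero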

I understand that this might not be the best method to test whether a phrase is a general phrase or not, so if you have any suggestions or criticisms I am happy to listen to them.

  • Explain the context of your problem. Are you looking for independence in a contingency table? – Michael R. Chernick Aug 31 '12 at 20:39
  • In general, any useful test will find significant differences in very large samples because, when comparing two populations, there will always be at least some small difference, and small differences will be detected in very large samples. – Michael R. Chernick Aug 31 '12 at 20:41
  • @MichaelChernick right. So finding a statistical difference does not necessarily mean that the difference is large or important. In the case of a very large sample we will almost always find small differences. So is there a way to measure that the difference is important? I have mentioned the context of my problem in the question above. – tan Aug 31 '12 at 21:00
  • Look up resampling, or bootstrap, or jackknife techniques to assist in finding actual significant differences in large samples. All three are supported in R. – R. Schumacher Aug 31 '12 at 21:03
  • My advice is to stop thinking about statistical significance and start thinking about effect size. In a 2x2 table such as you seem to have, the odds ratio is a good (and easy to calculate) measure of effect size: just take the cross-product ratio. Most software will have tools for confidence intervals around these, if you want them (see the sketch just after these comments). – Peter Flom Aug 31 '12 at 21:12
  • @R.Schumacher How would that help? I don't see how any of those methods help with this particular issue. – Peter Flom Aug 31 '12 at 21:13
  • I like a comment here: "It sometimes seems to me to border on the perverse that while almost everyone will insist on consistency for their tests, so many will complain that something is wrong with hypothesis testing when they actually get it." – Dave Jan 18 '23 at 20:15
  • I cannot make sense of this question. After pointing out how effective a standard test is with large samples, it inquires whether any other tests are effective with large samples. What's the point? What is the actual objective? – whuber Jan 18 '23 at 20:24
  • It's not clear what the purpose of the analysis is. It sounds like it might be to find words & phrases that describe the content of the website. Running multiple chi-square tests seems like an ineffective way to go about it. (None of the approaches outlined by @Dave will come up with a reasonable website description, were that the goal.) At least after the multiple-test correction (for all phrases that appear on the internet), the p-value would not seem so extreme.... – dipetkov Jan 22 '23 at 15:10
  • I'm not sure I get your approach. The universe of possible phrases is so large that, given a long enough phrase, it will almost always be specific to the site you're studying. – Hugues Jan 23 '23 at 17:48
  • how does this question change the answer to this: https://stats.stackexchange.com/questions/323862/given-big-enough-sample-size-a-test-will-always-show-significant-result-unless – Charlie Parker Mar 03 '23 at 22:08
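
A minimal sketch in base R of the cross-product odds ratio and a Wald 95% confidence interval, as suggested in Peter Flom's comment above (the variable names are illustrative only):

    # Odds ratio ("cross-product ratio") and a 95% Wald CI for the table in the question
    tech_web     <- 1991               # Technology,     Set2 (website)
    other_web    <- 46168477           # Not_Technology, Set2
    tech_ngrams  <- 1710231            # Technology,     Set1 (Google n-grams)
    other_ngrams <- 2674797869255      # Not_Technology, Set1

    or     <- (tech_web * other_ngrams) / (other_web * tech_ngrams)   # cross-product ratio
    se_log <- sqrt(1/tech_web + 1/other_web + 1/tech_ngrams + 1/other_ngrams)
    ci     <- exp(log(or) + c(-1, 1) * 1.96 * se_log)

    c(odds_ratio = or, lower = ci[1], upper = ci[2])   # roughly a 67-fold difference in odds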

8 Answers

14

The test is doing what it should be doing. You ask it whether two quantities are equal; in the case of the original question, whether zero is equal to some measure of dependence that is zero when the distributions are independent (e.g., mutual information). Since the test has considerable sensitivity, due to the large sample size, the test correctly tells you that the two quantities are not equal. This is a design feature, not a bug, of hypothesis testing that is related to consistency (power converges to $1$ as the sample size increases).

If you remember the Princess and the Pea fairy tale, you may recall that, no matter how trivial we might perceive a pea under the mattress, the princess was correct about there being a pea. If you want to assert that a pea under the mattress does not matter to you, that's fine, but it is a mistake to call the princess incorrect for noticing the pea when the pea was indeed under the mattress.

Because of the consistency of most tests and the frequent availability of large amounts of data, hypothesis tests certainly can find differences that, while they are there, are not important or interesting, much like most people would not care if there is a pea under the mattress. This gets into the effect size and what kind of effect size is interesting. While statistics can (and does) come up with interesting ways to quantify effect sizes, determining an interesting effect size mostly falls outside the realm of statistics and into the domains to which statistics is applied (the experts in medicine decide that for COVID studies, the experts in economics decide that for unemployment studies, etc.).
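
To make that concrete with the counts from the question (a sketch, not a recommendation of any particular effect-size measure): the $\chi^2$ statistic is enormous, yet the phi coefficient (equivalently Cramér's V or Cohen's $w$ for a 2x2 table) is tiny because the phrase is rare in both sources, while the odds ratio is large. Which of those numbers is the relevant "effect size" is a subject-matter question, not a statistical one.

    counts <- matrix(c(2674797869255, 1710231, 46168477, 1991), nrow = 2,
                     dimnames = list(c("Not_Technology", "Technology"), c("Set1", "Set2")))

    res <- chisq.test(counts)
    res$statistic                                   # enormous chi-squared statistic
    sqrt(as.numeric(res$statistic) / sum(counts))   # phi / Cramer's V / Cohen's w: tiny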

Once you have an effect size of interest, there are a number of statistical tricks related to it. The first is that investigators can calculate the sample size required to detect such an effect size with a certain power and $\alpha$-level. This is not so important in a situation where you already have a ton of data, but it is worth a mention. Examples of this are available in the pwr package in R.
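
For instance, a sketch with the pwr package (assuming it is installed), solving for the sample size needed to detect a "small" effect of $w = 0.1$ in a 2x2 table, and then for the effect size detectable at a huge fixed $N$:

    library(pwr)

    # Sample size needed for Cohen's w = 0.1, df = 1, alpha = 0.05, 80% power
    pwr.chisq.test(w = 0.1, df = 1, sig.level = 0.05, power = 0.80)

    # Effect size detectable with 80% power when N is enormous
    pwr.chisq.test(N = 4.6e7, df = 1, sig.level = 0.05, power = 0.80)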

A second trick is equivalence testing, the easiest example of which to understand is two one-sided tests: TOST. Briefly, TOST does two hypothesis tests in order to bound our estimate of the true effect, rejecting the possibilities that the true effect is above the upper equivalence bound or below the lower one.
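
As an illustration only, here is a hand-rolled TOST for the difference of two proportions using a normal approximation; the equivalence margin eps is something you must choose, and packages such as TOSTER provide ready-made equivalence tests.

    # Two one-sided tests (TOST) for a difference of proportions
    tost_two_prop <- function(x1, n1, x2, n2, eps, alpha = 0.05) {
      p1 <- x1 / n1
      p2 <- x2 / n2
      d  <- p1 - p2
      se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
      p_upper <- pnorm((d - eps) / se)        # H0: d >= +eps  vs  H1: d < +eps
      p_lower <- 1 - pnorm((d + eps) / se)    # H0: d <= -eps  vs  H1: d > -eps
      p_tost  <- max(p_upper, p_lower)        # conclude equivalence only if both reject
      list(estimate = d, se = se, p_value = p_tost, equivalent = (p_tost < alpha))
    }

    # Counts from the question, with an arbitrary margin of 1e-5
    tost_two_prop(x1 = 1991,    n1 = 46168477 + 1991,
                  x2 = 1710231, n2 = 2674797869255 + 1710231,
                  eps = 1e-5)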

A third trick is interval estimation. A frequentist might calculate a confidence interval to put bounds on the effect size, and a large sample size would lead to a relatively narrow confidence interval and correspondingly high precision in the estimate. A Bayesian might calculate a credible interval for the same purpose. All else being equal, the credible interval should be narrower for a larger sample size, with a large sample resulting in a tight estimate of the true effect. Whether you go frequentist or Bayesian, a tight estimate and high precision sound desirable. Once you have a range of plausible parameter values given by the interval estimate, you can assess whether any of them are practically significant by comparing them to your effect size of interest.
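
For example (a sketch with base R's prop.test and the counts from the question), the confidence interval for the difference in the rate of the phrase will be extremely narrow; the question is then whether that whole interval lies inside or outside the region you consider practically negligible:

    # 95% CI for the difference in the rate of "Technology" (website vs n-gram corpus)
    prop.test(x = c(1991, 1710231),
              n = c(46168477 + 1991, 2674797869255 + 1710231))$conf.int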

Depending on what you want to do, one of these three might be reasonable for handling situations where large data sets are available. However, they all need some declaration of an effect size of interest!

Dave
7

This is a general phenomenon in hypothesis testing for a point-null hypothesis

What you are dealing with here is a much larger issue in hypothesis testing than just the chi-squared test of independence. This is a phenomenon that arises in classical hypothesis testing whenever you are testing a point-null hypothesis (i.e., a null hypothesis that stipulates a single point for an unknown parameter). In such cases, the null hypothesis is a choice of a single parameter value, usually over an uncountable set of possible values. That is an extremely specific null hypothesis.

To understand the phenomenon you are referring to, let's have a look at the properties of a consistent hypothesis test. Hypothesis tests are designed to test the specified null hypothesis and reject it (in favour of a specified alternative) if the evidence falsifies the null hypothesis. (More information on the mathematical structure of a hypothesis test is available in this related answer.) Suppose you have data $\mathbf{x}_n$ and an unknown parameter $\theta \in \Theta$ and you pick a null hypothesis space $\Theta_0 \subset \Theta$. Let $\alpha$ denote the significance level for the test and let $\beta_n$ denote the resulting power function. A consistent hypothesis test will have the following limiting property for its power function:

$$\lim_{n \rightarrow \infty} \beta_n(\theta) = 1 \quad \quad \quad \quad \quad \text{for all }\theta \in \Theta-\Theta_0 \text{ and } 0< \alpha <1,$$

which implies the following limiting property for its p-value function:

$$\underset{n \rightarrow \infty}{\text{plim}} \ p(\mathbf{x}_n) = 0 \quad \quad \quad \quad \quad \text{for all }\theta \in \Theta-\Theta_0. \quad \quad \quad \quad \quad \quad \quad $$


Consistency under the point-null hypothesis: Hypothesis tests are designed to test the truth or falsity of the hypotheses you actually give them, so if you use an extremely specific null hypothesis, and that hypothesis is even slightly false, the test is designed to correctly infer that the null hypothesis is false. In particular, if the point-null value is $\theta_0$ then for any parameter value $\theta \neq \theta_0$ you will have $\text{plim}_{n \rightarrow \infty} p(\mathbf{x}_n) = 0$ (i.e., the p-value will converge stochastically to zero).

One of the problems that arises in hypothesis testing occurs when we set a point-null hypothesis in a circumstance where that specific hypothesis is almost certainly false, but what we really want to know is something a bit broader --- e.g., whether the specified point-null is almost correct. A common case occurs when the parameter can be considered to be a continuous random variable, such that it will be equal to the stipulated point-null value with zero probability.$^\dagger$ In this case, the null hypothesis is false with probability one and so with a large amount of data the test gives us a tiny p-value which tells us that the null is false. Some users view this as a deficiency of the hypothesis test, but it is actually a case where the test is doing exactly what you ask it to. By specifying a point-null hypothesis you are asking the test to be very specific about the null hypothesis under consideration, and the test is complying with this instruction.
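
A quick simulation (a sketch with arbitrary proportions 0.500 vs 0.502) shows the p-value for the point null $p_1 = p_2$ drifting towards zero as $n$ grows, exactly as the consistency property predicts; the specific values vary from run to run, but the trend does not.

    set.seed(1)   # arbitrary seed, for reproducibility
    # The point null p1 = p2 is slightly false: true proportions are 0.500 and 0.502
    sapply(c(1e3, 1e4, 1e5, 1e6, 1e7), function(n) {
      x1 <- rbinom(1, n, 0.500)
      x2 <- rbinom(1, n, 0.502)
      prop.test(c(x1, x2), c(n, n))$p.value
    })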

So, what can you do to deal with this "problem"? Firstly, you ought to recognise that you need to test the hypothesis you are actually interested in, not a hypothesis that is mathematically close to this but much more specific. Typically you can do this by setting some "tolerance" $\epsilon>0$ on your stipulated point-null value and testing the composite null hypothesis $\theta_0 - \epsilon \leqslant \theta \leqslant \theta_0 + \epsilon$. You can view the tolerance value as a measure of "practical significance", meaning that if the true parameter value is within the stipulated tolerance of the point-null value then it is "practically" equivalent to the point-null value. In this manner you can separate "statistical significance" from "practical significance" and ensure that the consistency property of the hypothesis test does not lead the p-value to converge to zero in cases where you don't want it to.
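
To give a flavour of this in the simplest setting (a sketch only, for a single difference-of-proportions parameter under a normal approximation, not the chi-squared independence case discussed below): the p-value of the composite test is the supremum of the pointwise p-values over the interval null, which is attained at the boundary nearest the estimate.

    # Test of the composite null |p1 - p2| <= eps against |p1 - p2| > eps,
    # using a normal approximation; the p-value is evaluated at the boundary
    # of the interval null, where the supremum over the null is attained.
    min_effect_test <- function(x1, n1, x2, n2, eps) {
      p1 <- x1 / n1
      p2 <- x2 / n2
      d  <- p1 - p2
      se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
      p_value <- pnorm(-(abs(d) - eps) / se) + pnorm(-(abs(d) + eps) / se)
      list(estimate = d, p_value = min(p_value, 1))
    }

    # With eps = 0 this reduces to the usual two-sided test of the point null;
    # with a nonzero tolerance (here an arbitrary 5e-5) the p-value no longer
    # collapses to zero unless the difference is practically large as well.
    min_effect_test(x1 = 1991,    n1 = 46168477 + 1991,
                    x2 = 1710231, n2 = 2674797869255 + 1710231,
                    eps = 5e-5)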

Implementation of a "tolerance" in the null hypothesis for the chi-squared test of independence is quite complicated and so outside the scope of the present post (but feel free to ask a separate question for how to do this). In general you can alter existing tests to include a tolerance on a point-null hypothesis but you need to re-derive the test as a composite test to determine how the composite hypothesis affects the p-value function. This is a complicated exercise in general, but it can be automated into customised p-value functions once derived.


$^\dagger$ You will sometimes see statistical commentators make a broader assertion that a point-null hypothesis is always false. That is not true --- a point-null hypothesis can be true. Moreover, even in the case where the parameter is viewed as a random variable, it can be equal to a specific value with positive probability. It is only if we are willing to stipulate that the parameter is a continuous random variable that it has probability zero of being equal to any specific value.

Ben
6

Take a look at this paper by the late Jack Good: http://fitelson.org/probability/good_bnbc.pdf. In Section 4.3, his "Bayes/Non-Bayes Compromise" leads to the definition of a "standardized" p-value which tries to address the "Huge $n$ $\Rightarrow$ Highly Probable to Reject Null, Whatever Data" effect.
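
If I recall Good's proposal correctly (treat the exact formula here as an assumption and check the linked paper), the standardized p-value rescales the observed p-value to a reference sample size of 100, $p_{\text{stan}} = \min\{1/2,\ p\sqrt{n/100}\}$, which counteracts the way p-values shrink mechanically with $n$:

    # Good's "standardized" p-value, as I recall it -- verify against the paper
    standardized_p <- function(p, n) pmin(0.5, p * sqrt(n / 100))

    standardized_p(p = 1e-6, n = 1e8)   # a tiny p from a huge sample is heavily discounted
    standardized_p(p = 1e-6, n = 100)   # unchanged at the reference sample size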

Zen
5

If your sample is large enough then it seems to me that a statistical test is not needed. You have characterised the effect. Is the effect that you have characterised large enough to be interesting? If so, then make a reasoned and principled argument about the observations without recourse to a testing procedure.

Michael Lew
  • Testing could still be relevant if the estimated difference is close to what is defined as a meaningful difference. Testing is a formal way of assessing how sure you are of the result. – Michael R. Chernick Sep 01 '12 at 11:36
  • @MichaelChernick Most people, and probably all scientists, are interested in the evidence. The idea of a 'meaningful difference' in the context of testing comes from the Neyman-Pearson approach where it stands as the alternative hypothesis. That approach eschews any evidential basis of interpretation in favour of 'behavioural inference'. Not as useful as looking at the data as evidence, in my opinion. – Michael Lew Sep 02 '12 at 22:52
  • Neyman-Pearson hypothesis testing doesn't define a meaningful difference. That is determined by the investigator and really determines which alternative hypotheses we really care about. I don't share your pessimistic view about hypothesis testing. – Michael R. Chernick Sep 03 '12 at 00:56
5

Congratulations! You have a big enough sample size that you don't need to bother with significance testing! So don't worry about it. Now you just need to decide if the effect you see is "big enough" to care about, which is an entirely different problem that has nothing to do with significance testing.

A statistical significance test, like a Chi square test, is attempting to answer a very specific problem: "how likely is it that a difference I observe in a random sample is just an artifact of sampling error (the error that arises when we try to make generalizations about a population using only a random sample of that population)?" That's it. The fact that a test is significant doesn't tell us anything about whether the effect is "big" or "meaningful" in some substantive sense, or even that it's actually "real" (it might be due to some measurement error, or confounding with other variables).

Now, as sample size increases, the likelihood of an observed difference of a given size being an artifact of sampling error goes down, and so significance tests will tend to be significant basically all of the time. This is just because the problem they are trying to help you solve has already been (largely) solved by the large sample size.
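
A small illustration (a sketch with made-up counts): the same observed proportions are nowhere near significant in a modest sample, yet overwhelmingly significant when every count is multiplied by 10,000, even though the "effect" is identical.

    small_tab <- matrix(c(52, 48, 50, 50), nrow = 2)   # proportions 0.52 vs 0.50, n = 200
    big_tab   <- small_tab * 10000                     # same proportions, n = 2,000,000

    chisq.test(small_tab)$p.value   # not significant
    chisq.test(big_tab)$p.value     # essentially zero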

So, in your case, sampling error is not a particularly big problem, and a significance test is not very helpful. Rather, what you need to decide is whether the relationship you are looking at is "big enough" or not. That's not a question that can be answered with a statistical test. You need to use your knowledge of the subject matter to decide if the relationship is large enough to "make a difference in the real world." No statistical test can answer that question for you, and neither can anyone here, unless they also happen to know a lot about Google n-grams and the specific research question you are asking.

  • Fantastic answer! I loved the fact that you concisely summarized that a significance test tries to disambiguate whether the difference you see is just due to random sampling error or whether the difference is real, even for large sample sizes. – Charlie Parker Jan 24 '23 at 19:23
  • In addition, I'd like to know one thing. Can't I use the effect size to see if the effect size is significant? e.g. if it's in the usual ranges, e.g. ~0.2 small, ~0.5 medium, ~0.8 large? – Charlie Parker Jan 24 '23 at 19:26
  • Also, I do have expert knowledge on what a "significant difference" is. In my application it's usually eps = 1% or 2%. My hunch is to compute eps/pooled_std(group1, group2) and compare it with the effect size to see if it's actually a significant difference. – Charlie Parker Jan 24 '23 at 19:28
  • In Stats you need to be very careful to distinguish between "statistical significance" (what a sig test is testing for) and the normal English term "significant" (meaning "important" or "worth talking about"). We sometimes call the second thing "substantive significance" but it's better to just use another word, to avoid some of the confusion reflected in your comments. In particular.... – Graham Wright Jan 25 '23 at 01:01
  • You can't look at effect size alone to tell if an effect is STATISTICALLY significant (and the "usual ranges" you talk about are not as standard or usual as you might have been told). A "big" effect might be non-significant in a small dataset (or the variance is large). That's what significance tests are there to tell you. – Graham Wright Jan 25 '23 at 01:05
  • Expert knowledge can tell you what an "important" or "substantively meaningful" difference is, but it can't tell you whether an effect is statistically significant. But in your giant dataset, statistical significance is not an issue, so if your expert knowledge says a difference of 1% is worth talking about, then that's the guideline you should follow. – Graham Wright Jan 25 '23 at 01:05
  • Agreed. So what is the technical statistical term to use when N is large and we still want to draw conclusions about difference or no difference, if it's not statistical significance? Is it practical significance? – Charlie Parker Jan 25 '23 at 02:19
  • On a related note, if we have the confidence intervals (CIs) of two means, is it correct to claim that the difference is significant if the intervals do not intersect/overlap? As a follow-up, if N is large, the intervals will shrink. But if we take eps = 1% or 2% (arbitrary numbers based on the application) into account (by making the intervals 1% longer on each side) -- would that be practical significance? (analogously to the discussion we had about effect size)? – Charlie Parker Jan 25 '23 at 02:25