11

This is more of a theoretical question. Very large sample sizes will almost always show statistical significance when a $\chi^2$ test is done. Is there any other statistical test of significance (an alternative to $\chi^2$) that is good for testing independence when the sample size is very large?

This is the context of my problem: I have two large datasets of phrases. Set1 corresponds to the Google n-grams set and Set2 is a smaller set corresponding to the phrases found on one single website. Now consider a phrase, say 'Technology', found in Set2. I want to test whether this phrase is specific to this website (it could be, if it is a technology website) or whether it is a general phrase. So I am performing a $\chi^2$ test on the phrase frequencies in the two sets, as follows:

    Phrase              Set1 (Google n-grams)    Set2 (website)
    Not_Technology      2,674,797,869,255        46,168,477
    Technology          1,710,231                1,991
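
For concreteness, here is a minimal sketch in R of the test I am running on these counts (an illustration only, not a claim that this is the right analysis):

    # Phrase counts: rows = Not_Technology / Technology, columns = Set1 / Set2
    counts <- matrix(c(2674797869255, 1710231,   # Set1 (Google n-grams)
                       46168477,      1991),     # Set2 (website)
                     nrow = 2,
                     dimnames = list(c("Not_Technology", "Technology"),
                                     c("Set1", "Set2")))

    chisq.test(counts)   # at this sample size the p-value is essentially zero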

I understand that this might not be the best method to test whether a phrase is a general phrase or not, so if you have any suggestions or criticisms I am happy to listen to them.

  • Explain the context of your problem. Are you looking for independence in a contingency table? – Michael R. Chernick Aug 31 '12 at 20:39
  • In general, any useful test will find significant differences in very large samples because, when comparing two populations, there will always be at least some small difference, and small differences will be detected in very large samples. – Michael R. Chernick Aug 31 '12 at 20:41
  • @MichaelChernick right. So finding a statistical difference does not necessarily mean that the difference is large or important. In the case of a very large sample we will almost always find small differences. So is there a way to measure that the difference is important? I have mentioned the context of my problem in the question above. – tan Aug 31 '12 at 21:00
  • Look up resampling, or bootstrap, or jackknife techniques to assist in finding actual significant differences in large samples. All three are supported in R. – R. Schumacher Aug 31 '12 at 21:03
  • My advice is to stop thinking about statistical significance and start thinking about effect size. In a 2x2 table such as you seem to have, the odds ratio is a good (and easy to calculate) measure of effect size: just take the cross-product ratio. Most software will have tools for confidence intervals around these, if you want them (see the sketch just after these comments). – Peter Flom Aug 31 '12 at 21:12
  • @R.Schumacher How would that help? I don't see how any of those methods help with this particular issue. – Peter Flom Aug 31 '12 at 21:13
  • I like a comment here: "It sometimes seems to me to border on the perverse that while almost everyone will insist on consistency for their tests, so many will complain that something is wrong with hypothesis testing when they actually get it." – Dave Jan 18 '23 at 20:15
  • I cannot make sense of this question. After pointing out how effective a standard test is with large samples, it inquires whether any other tests are effective with large samples. What's the point? What is the actual objective? – whuber Jan 18 '23 at 20:24
  • It's not clear what the purpose of the analysis is. It sounds like it might be to find words & phrases that describe the content of the website. Running multiple chi-square tests seems like an ineffective way to go about it. (None of the approaches outlined by @Dave will come up with a reasonable website description, were that the goal.) At least after the multiple-test correction (for all phrases that appear on the internet), the p-value would not seem so extreme.... – dipetkov Jan 22 '23 at 15:10
  • I'm not sure I get your approach. The universe of possible phrases is so large that, given a long enough phrase, it will almost always be specific to the site you're studying. – Hugues Jan 23 '23 at 17:48
  • how does this question change the answer to this: https://stats.stackexchange.com/questions/323862/given-big-enough-sample-size-a-test-will-always-show-significant-result-unless – Charlie Parker Mar 03 '23 at 22:08
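
A minimal sketch in base R of the cross-product odds ratio and a Wald 95% confidence interval, as suggested in Peter Flom's comment above (the variable names are illustrative only):

    # Odds ratio ("cross-product ratio") and a 95% Wald CI for the table in the question
    tech_web     <- 1991               # Technology,     Set2 (website)
    other_web    <- 46168477           # Not_Technology, Set2
    tech_ngrams  <- 1710231            # Technology,     Set1 (Google n-grams)
    other_ngrams <- 2674797869255      # Not_Technology, Set1

    or     <- (tech_web * other_ngrams) / (other_web * tech_ngrams)   # cross-product ratio
    se_log <- sqrt(1/tech_web + 1/other_web + 1/tech_ngrams + 1/other_ngrams)
    ci     <- exp(log(or) + c(-1, 1) * 1.96 * se_log)

    c(odds_ratio = or, lower = ci[1], upper = ci[2])   # roughly a 67-fold difference in odds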

8 Answers

14

The test is doing what it should be doing. You ask it whether two quantities are equal; in the case of the original question, whether zero is equal to some measure of dependence that is zero when the distributions are independent (e.g., mutual information). Since the test has considerable sensitivity, due to the large sample size, the test correctly tells you that the two quantities are not equal. This is a design feature, not a bug, of hypothesis testing that is related to consistency (power converges to $1$ as the sample size increases).

If you remember the Princess and the Pea fairy tale, you may recall that, no matter how trivial we might perceive a pea under the mattress, the princess was correct about there being a pea. If you want to assert that a pea under the mattress does not matter to you, that's fine, but it is a mistake to call the princess incorrect for noticing the pea when the pea was indeed under the mattress.

Because of the consistency of most tests and the frequent availability of large amounts of data, hypothesis tests certainly can find differences that, while they are there, are not important or interesting, much like most people would not care if there is a pea under the mattress. This gets into the effect size and what kind of effect size is interesting. While statistics can (and does) come up with interesting ways to quantify effect sizes, determining an interesting effect size mostly falls outside the realm of statistics and into the domains to which statistics is applied (the experts in medicine decide that for COVID studies, the experts in economics decide that for unemployment studies, etc.).
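
To make that concrete with the counts from the question (a sketch, not a recommendation of any particular effect-size measure): the $\chi^2$ statistic is enormous, yet the phi coefficient (equivalently Cramér's V or Cohen's $w$ for a 2x2 table) is tiny because the phrase is rare in both sources, while the odds ratio is large. Which of those numbers is the relevant "effect size" is a subject-matter question, not a statistical one.

    counts <- matrix(c(2674797869255, 1710231, 46168477, 1991), nrow = 2,
                     dimnames = list(c("Not_Technology", "Technology"), c("Set1", "Set2")))

    res <- chisq.test(counts)
    res$statistic                                   # enormous chi-squared statistic
    sqrt(as.numeric(res$statistic) / sum(counts))   # phi / Cramer's V / Cohen's w: tiny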

Once you have an effect size of interest, there are a number of statistical tricks related to it. The first is that investigators can calculate the sample size required to detect such an effect size with a certain power and $\alpha$-level. This is not so important in a situation where you already have a ton of data, but it is worth a mention. Examples of this are available in the pwr package in R.
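
For instance, a sketch with the pwr package (assuming it is installed), solving for the sample size needed to detect a "small" effect of $w = 0.1$ in a 2x2 table, and then for the effect size detectable at a huge fixed $N$:

    library(pwr)

    # Sample size needed for Cohen's w = 0.1, df = 1, alpha = 0.05, 80% power
    pwr.chisq.test(w = 0.1, df = 1, sig.level = 0.05, power = 0.80)

    # Effect size detectable with 80% power when N is enormous
    pwr.chisq.test(N = 4.6e7, df = 1, sig.level = 0.05, power = 0.80)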

A second trick is equivalence testing, the easiest example of which to understand is two one-sided tests: TOST. Briefly, TOST does two hypothesis tests in order to bound our estimate of the true effect, rejecting the possibilities that the true effect is above the upper equivalence bound or below the lower one.
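
As an illustration only, here is a hand-rolled TOST for the difference of two proportions using a normal approximation; the equivalence margin eps is something you must choose, and packages such as TOSTER provide ready-made equivalence tests.

    # Two one-sided tests (TOST) for a difference of proportions
    tost_two_prop <- function(x1, n1, x2, n2, eps, alpha = 0.05) {
      p1 <- x1 / n1
      p2 <- x2 / n2
      d  <- p1 - p2
      se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
      p_upper <- pnorm((d - eps) / se)        # H0: d >= +eps  vs  H1: d < +eps
      p_lower <- 1 - pnorm((d + eps) / se)    # H0: d <= -eps  vs  H1: d > -eps
      p_tost  <- max(p_upper, p_lower)        # conclude equivalence only if both reject
      list(estimate = d, se = se, p_value = p_tost, equivalent = (p_tost < alpha))
    }

    # Counts from the question, with an arbitrary margin of 1e-5
    tost_two_prop(x1 = 1991,    n1 = 46168477 + 1991,
                  x2 = 1710231, n2 = 2674797869255 + 1710231,
                  eps = 1e-5)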

A third trick is interval estimation. A frequentist might calculate a confidence interval to put bounds on the effect size, and a large sample size would lead to a relatively narrow confidence interval and correspondingly high precision in the estimate. A Bayesian might calculate a credible interval for the same purpose. All else being equal, the credible interval should be narrower for a larger sample size, with a large sample resulting in a tight estimate of the true effect. Whether you go frequentist or Bayesian, a tight estimate and high precision sound desirable. Once you have a range of plausible parameter values given by the interval estimate, you can assess whether any of them are practically significant by comparing them to your effect size of interest.
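
For example (a sketch with base R's prop.test and the counts from the question), the confidence interval for the difference in the rate of the phrase will be extremely narrow; the question is then whether that whole interval lies inside or outside the region you consider practically negligible:

    # 95% CI for the difference in the rate of "Technology" (website vs n-gram corpus)
    prop.test(x = c(1991, 1710231),
              n = c(46168477 + 1991, 2674797869255 + 1710231))$conf.int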

Depending on what you want to do, one of these three might be reasonable for handling situations where large data sets are available. However, they all need some declaration of an effect size of interest!

Dave
7

This is a general phenomenon in hypothesis testing for a point-null hypothesis

What you are dealing with here is a much larger issue in hypothesis testing than just the chi-squared test of independence. This is a phenomenon that arises in classical hypothesis testing whenever you are testing a point-null hypothesis (i.e., a null hypothesis that stipulates a single point for an unknown parameter). In such cases, the null hypothesis is a choice of a single parameter value, usually over an uncountable set of possible values. That is an extremely specific null hypothesis.

To understand the phenomenon you are referring to, let's have a look at the properties of a consistent hypothesis test. Hypothesis tests are designed to test the specified null hypothesis and reject it (in favour of a specified alternative) if the evidence falsifies the null hypothesis. (More information on the mathematical structure of a hypothesis test is available in this related answer.) Suppose you have data $\mathbf{x}_n$ and an unknown parameter $\theta \in \Theta$ and you pick a null hypothesis space $\Theta_0 \subset \Theta$. Let $\alpha$ denote the significance level for the test and let $\beta_n$ denote the resulting power function. A consistent hypothesis test will have the following limiting property for its power function:

$$\lim_{n \rightarrow \infty} \beta_n(\theta) = 1 \quad \quad \quad \quad \quad \text{for all }\theta \in \Theta-\Theta_0 \text{ and } 0< \alpha <1,$$

which implies the following limiting property for its p-value function:

$$\underset{n \rightarrow \infty}{\text{plim}} \ p(\mathbf{x}_n) = 0 \quad \quad \quad \quad \quad \text{for all }\theta \in \Theta-\Theta_0. \quad \quad \quad \quad \quad \quad \quad $$


Consistency under the point-null hypothesis: Hypothesis tests are designed to test the truth or falsity of the hypotheses you actually give them, so if you use an extremely specific null hypothesis, and that hypothesis is even slightly false, the test is designed to correctly infer that the null hypothesis is false. In particular, if the point-null value is $\theta_0$ then for any parameter value $\theta \neq \theta_0$ you will have $\text{plim}_{n \rightarrow \infty} p(\mathbf{x}_n) = 0$ (i.e., the p-value will converge stochastically to zero).

One of the problems that arises in hypothesis testing occurs when we set a point-null hypothesis in a circumstance where that specific hypothesis is almost certainly false, but what we really want to know is something a bit broader --- e.g., whether the specified point-null is almost correct. A common case occurs when the parameter can be considered to be a continuous random variable, such that it will be equal to the stipulated point-null value with zero probability.$^\dagger$ In this case, the null hypothesis is false with probability one and so with a large amount of data the test gives us a tiny p-value which tells us that the null is false. Some users view this as a deficiency of the hypothesis test, but it is actually a case where the test is doing exactly what you ask it to. By specifying a point-null hypothesis you are asking the test to be very specific about the null hypothesis under consideration, and the test is complying with this instruction.
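
A quick simulation (a sketch with arbitrary proportions 0.500 vs 0.502) shows the p-value for the point null $p_1 = p_2$ drifting towards zero as $n$ grows, exactly as the consistency property predicts; the specific values vary from run to run, but the trend does not.

    set.seed(1)   # arbitrary seed, for reproducibility
    # The point null p1 = p2 is slightly false: true proportions are 0.500 and 0.502
    sapply(c(1e3, 1e4, 1e5, 1e6, 1e7), function(n) {
      x1 <- rbinom(1, n, 0.500)
      x2 <- rbinom(1, n, 0.502)
      prop.test(c(x1, x2), c(n, n))$p.value
    })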

So, what can you do to deal with this "problem"? Firstly, you ought to recognise that you need to test the hypothesis you are actually interested in, not a hypothesis that is mathematically close to this but much more specific. Typically you can do this by setting some "tolerance" $\epsilon>0$ on your stipulated point-null value and testing the composite null hypothesis $\theta_0 - \epsilon \leqslant \theta \leqslant \theta_0 + \epsilon$. You can view the tolerance value as a measure of "practical significance", meaning that if the true parameter value is within the stipulated tolerance of the point-null value then it is "practically" equivalent to the point-null value. In this manner you can separate "statistical significance" from "practical significance" and ensure that the consistency property of the hypothesis test does not lead the p-value to converge to zero in cases where you don't want it to.
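
To give a flavour of this in the simplest setting (a sketch only, for a single difference-of-proportions parameter under a normal approximation, not the chi-squared independence case discussed below): the p-value of the composite test is the supremum of the pointwise p-values over the interval null, which is attained at the boundary nearest the estimate.

    # Test of the composite null |p1 - p2| <= eps against |p1 - p2| > eps,
    # using a normal approximation; the p-value is evaluated at the boundary
    # of the interval null, where the supremum over the null is attained.
    min_effect_test <- function(x1, n1, x2, n2, eps) {
      p1 <- x1 / n1
      p2 <- x2 / n2
      d  <- p1 - p2
      se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
      p_value <- pnorm(-(abs(d) - eps) / se) + pnorm(-(abs(d) + eps) / se)
      list(estimate = d, p_value = min(p_value, 1))
    }

    # With eps = 0 this reduces to the usual two-sided test of the point null;
    # with a nonzero tolerance (here an arbitrary 5e-5) the p-value no longer
    # collapses to zero unless the difference is practically large as well.
    min_effect_test(x1 = 1991,    n1 = 46168477 + 1991,
                    x2 = 1710231, n2 = 2674797869255 + 1710231,
                    eps = 5e-5)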

Implementation of a "tolerance" in the null hypothesis for the chi-squared test of independence is quite complicated and so outside the scope of the present post (but feel free to ask a separate question for how to do this). In general you can alter existing tests to include a tolerance on a point-null hypothesis but you need to re-derive the test as a composite test to determine how the composite hypothesis affects the p-value function. This is a complicated exercise in general, but it can be automated into customised p-value functions once derived.


$^\dagger$ You will sometimes see statistical commentators make a broader assertion that a point-null hypothesis is always false. That is not true --- a point-null hypothesis can be true. Moreover, even in the case where the parameter is viewed as a random variable, it can be equal to a specific value with positive probability. It is only if we are willing to stipulate that the parameter is a continuous random variable that it has probability zero of being equal to any specific value.

Ben
6

Take a look at this paper by the late Jack Good: http://fitelson.org/probability/good_bnbc.pdf. In Section 4.3, his "Bayes/Non-Bayes Compromise" leads to the definition of a "standardized" p-value which tries to address the "Huge $n$ $\Rightarrow$ Highly Probable to Reject Null, Whatever Data" effect.
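
If I recall Good's proposal correctly (treat the exact formula here as an assumption and check the linked paper), the standardized p-value rescales the observed p-value to a reference sample size of 100, $p_{\text{stan}} = \min\{1/2,\ p\sqrt{n/100}\}$, which counteracts the way p-values shrink mechanically with $n$:

    # Good's "standardized" p-value, as I recall it -- verify against the paper
    standardized_p <- function(p, n) pmin(0.5, p * sqrt(n / 100))

    standardized_p(p = 1e-6, n = 1e8)   # a tiny p from a huge sample is heavily discounted
    standardized_p(p = 1e-6, n = 100)   # unchanged at the reference sample size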

Zen
5

If your sample is large enough then it seems to me that a statistical test is not needed. You have characterised the effect. Is the effect that you have characterised large enough to be interesting? If so, then make a reasoned and principled argument about the observations without recourse to a testing procedure.

Michael Lew
  • Testing could still be relevant if the estimated difference is close to what is defined as a meaningful difference. Testing is a formal way of assessing how sure you are of the result. – Michael R. Chernick Sep 01 '12 at 11:36
  • @MichaelChernick Most people, and probably all scientists, are interested in the evidence. The idea of a 'meaningful difference' in the context of testing comes from the Neyman-Pearson approach where it stands as the alternative hypothesis. That approach eschews any evidential basis of interpretation in favour of 'behavioural inference'. Not as useful as looking at the data as evidence, in my opinion. – Michael Lew Sep 02 '12 at 22:52
  • Neyman-Pearson hypothesis testing doesn't define a meaningful difference. That is determined by the investigator and really determines which alternative hypotheses we really care about. I don't share your pessimistic view about hypothesis testing. – Michael R. Chernick Sep 03 '12 at 00:56
5

Congratulations! You have a big enough sample size that you don't need to bother with significance testing! So don't worry about it. Now you just need to decide if the effect you see is "big enough" to care about, which is an entirely different problem that has nothing to do with significance testing.

A statistical significance test, like a Chi square test, is attempting to answer a very specific problem: "how likely is it that a difference I observe in a random sample is just an artifact of sampling error (the error that arises when we try to make generalizations about a population using only a random sample of that population)?" That's it. The fact that a test is significant doesn't tell us anything about whether the effect is "big" or "meaningful" in some substantive sense, or even that it's actually "real" (it might be due to some measurement error, or confounding with other variables).

Now, as sample size increases, the likelihood of an observed difference of a given size being an artifact of sampling error goes down, and so significance tests will tend to be significant basically all of the time. This is just because the problem they are trying to help you solve has already been (largely) solved by the large sample size.
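
A small illustration (a sketch with made-up counts): the same observed proportions are nowhere near significant in a modest sample, yet overwhelmingly significant when every count is multiplied by 10,000, even though the "effect" is identical.

    small_tab <- matrix(c(52, 48, 50, 50), nrow = 2)   # proportions 0.52 vs 0.50, n = 200
    big_tab   <- small_tab * 10000                     # same proportions, n = 2,000,000

    chisq.test(small_tab)$p.value   # not significant
    chisq.test(big_tab)$p.value     # essentially zero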

So, in your case, sampling error is not a particularly big problem, and a significance test is not very helpful. Rather, what you need to decide is whether the relationship you are looking at is "big enough" or not. That's not a question that can be answered with a statistical test. You need to use your knowledge of the subject matter to decide if the relationship is large enough to "make a difference in the real world." No statistical test can answer that question for you, and neither can anyone here, unless they also happen to know a lot about Google n-grams and the specific research question you are asking.

  • Fantastic answer! I loved the fact that you concisely summarized that a significance test tries to disambiguate whether the difference you see is just due to random sampling error or whether the difference is real, even for large sample sizes. – Charlie Parker Jan 24 '23 at 19:23
  • In addition, I'd like to know one thing. Can't I use the effect size to see if the effect size is significant? e.g. if it's in the usual ranges, e.g. ~0.2 small, ~0.5 medium, ~0.8 large? – Charlie Parker Jan 24 '23 at 19:26
  • Also, I do have expert knowledge on what a "significant difference" is. In my application it's usually eps = 1% or 2%. My hunch is to compute eps/pooled_std(group1, group2) and compare it with the effect size to see if it's actually a significant difference. – Charlie Parker Jan 24 '23 at 19:28
  • In Stats you need to be very careful to distinguish between "statistical significance" (what a sig test is testing for) and the normal English term "significant" (meaning "important" or "worth talking about"). We sometimes call the second thing "substantive significance" but it's better to just use another word, to avoid some of the confusion reflected in your comments. In particular.... – Graham Wright Jan 25 '23 at 01:01
  • You can't look at effect size alone to tell if an effect is STATISTICALLY significant (and the "usual ranges" you talk about are not as standard or usual as you might have been told). A "big" effect might be non-significant in a small dataset (or the variance is large). That's what significance tests are there to tell you. – Graham Wright Jan 25 '23 at 01:05
  • Expert knowledge can tell you what an "important" or "substantively meaningful" difference is, but it can't tell you whether an effect is statistically significant. But in your giant dataset, statistical significance is not an issue, so if your expert knowledge says a difference of 1% is worth talking about, then that's the guideline you should follow. – Graham Wright Jan 25 '23 at 01:05
  • Agreed. So what is the technical statistical term to use when N is large and we still want to draw conclusions about difference or no difference, if it's not statistical significance? Is it practical significance? – Charlie Parker Jan 25 '23 at 02:19
  • On a related note, if we have the confidence intervals (CIs) of two means, is it correct to claim that the difference is significant if the intervals do not intersect/overlap? As a follow-up, if N is large, the intervals will shrink. But if we take eps = 1% or 2% (arbitrary numbers based on the application) into account (by making the intervals 1% longer on each side) -- would that be practical significance? (analogously to the discussion we had about effect size)? – Charlie Parker Jan 25 '23 at 02:25