
I am having a hard time understanding what a sufficient statistic actually helps us do.

It says that

Given $X_1, X_2, ..., X_n$ from some distribution, a statistic $T(X)$ is sufficient for a parameter $\theta$ if

$P(X_1, X_2, ..., X_n|T(X), \theta) = P(X_1, X_2, ..., X_n|T(X))$.

This means that if we know $T(X)$, then we cannot gain any more information about the parameter $\theta$ by considering other functions of the data $X_1, X_2, ..., X_n$.

I have two questions:

  1. It seems to me that the purpose of $T(X)$ is to make it so that we can calculate the pdf of a distribution more easily. If calculating the pdf yields a probability measure, then why is it said that we cannot "gain any more information about the parameter $\theta$"? In other words, why are we focused on $T(X)$ telling us something about $\theta$ when the pdf spits out a probability measure, which isn't $\theta$?

  2. When it says "we cannot gain any more information about the parameter $\theta$ by considering other functions of the data $X_1,X_2,...,X_n$," what other functions are they talking about? Is this akin to saying that if I randomly draw $n$ samples and find $T(X)$, then any other set of $n$ samples I draw will give the same $T(X)$?

user123276

1 Answer


I think the best way to understand sufficiency is to consider familiar examples. Suppose we flip a (not necessarily fair) coin, where the probability of obtaining heads is some unknown parameter $p$. Then individual trials are IID ${\rm Bernoulli}(p)$ random variables, and we can think of the outcome of $n$ trials as a vector $\boldsymbol X = (X_1, X_2, \ldots, X_n)$. Our intuition tells us that for a large number of trials, a "good" estimate of the parameter $p$ is the statistic $$\bar X = \frac{1}{n} \sum_{i=1}^n X_i.$$ Now think about a situation where I perform such an experiment. Could you estimate $p$ equally well if I told you only $\bar X$, rather than the full vector $\boldsymbol X$? Sure. This is what sufficiency does for us: the statistic $T(\boldsymbol X) = \bar X$ is sufficient for $p$ because it preserves all the information about $p$ contained in the original sample $\boldsymbol X$. (Proving this claim rigorously takes a little more work; a sketch follows.)
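
Here is that sketch (nothing beyond the Bernoulli model above is needed). Write $T = \sum_{i=1}^n X_i$, which carries the same information as $\bar X$ since $n$ is known. For any particular sequence of heads and tails $\boldsymbol x$ with $\sum_i x_i = t$, $$P(\boldsymbol X = \boldsymbol x \mid T = t, p) = \frac{P(\boldsymbol X = \boldsymbol x \mid p)}{P(T = t \mid p)} = \frac{p^t (1-p)^{n-t}}{\binom{n}{t} p^t (1-p)^{n-t}} = \binom{n}{t}^{-1},$$ which does not involve $p$ at all. That is exactly the defining property quoted in the question: once you know the total number of heads, the particular arrangement of heads and tails tells you nothing further about $p$.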

Here is a less trivial example. Suppose I have $n$ IID observations taken from a ${\rm Uniform}(0,\theta)$ distribution, where $\theta$ is the unknown parameter. What is a sufficient statistic for $\theta$? For instance, suppose I take $n = 5$ samples and I obtain $\boldsymbol X = (3, 1, 4, 5, 4)$. Your estimate for $\theta$ clearly must be at least $5$, since you were able to observe such a value. But that is the most knowledge you can extract from knowing the actual sample $\boldsymbol X$. The other observations convey no additional information about $\theta$ once you have observed $X_4 = 5$. So, we would intuitively expect that the statistic $$T(\boldsymbol X) = X_{(n)} = \max \boldsymbol X$$ is sufficient for $\theta$. Indeed, to prove this, we would write the joint density for $\boldsymbol X$ conditioned on $\theta$ and use the Factorization Theorem; a brief sketch follows, though I will keep the discussion informal.
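
Here is that informal sketch. Writing $x_{(1)} = \min \boldsymbol x$ and $x_{(n)} = \max \boldsymbol x$, the joint density conditional on $\theta$ factors as $$f(\boldsymbol x \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\, \mathbf{1}\{0 \le x_i \le \theta\} = \underbrace{\theta^{-n}\, \mathbf{1}\{x_{(n)} \le \theta\}}_{g(T(\boldsymbol x) \mid \theta)} \cdot \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(\boldsymbol x)}.$$ The only place the data enter the factor involving $\theta$ is through $x_{(n)}$, so the Factorization Theorem gives sufficiency of the maximum.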

Note that a sufficient statistic is not necessarily scalar-valued; it may not be possible to reduce the complete sample to a single scalar. This commonly arises when we want sufficiency for multiple parameters (which we can equivalently regard as a single vector-valued parameter). For example, a sufficient statistic for a Normal distribution with unknown mean $\mu$ and standard deviation $\sigma$ is $$\boldsymbol T(\boldsymbol X) = (\bar X, S) = \left( \frac{1}{n} \sum_{i=1}^n X_i, \ \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2} \right),$$ the sample mean and sample standard deviation. (The first coordinate is an unbiased estimator of $\mu$, and the square of the second coordinate, the sample variance, is an unbiased estimator of $\sigma^2$; the sample standard deviation $S$ itself is slightly biased for $\sigma$.) We can show that this is the maximum data reduction that can be achieved.
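
To see, informally, why a pair of this kind works: conditional on $(\mu, \sigma)$, the joint density is $$f(\boldsymbol x \mid \mu, \sigma) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{\sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2}{2\sigma^2} \right),$$ which involves the sample only through the pair $\left( \sum_i x_i, \sum_i x_i^2 \right)$. Since $(\bar X, S)$ can be computed from those two sums, and vice versa (as spelled out below), it is sufficient as well.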

Note also that a sufficient statistic is not unique. In the coin toss example, if I give you $\bar X$, that will let you estimate $p$; but if I gave you $\sum_{i=1}^n X_i$ instead, you could still estimate $p$. In fact, any one-to-one function $g$ of a sufficient statistic $T(\boldsymbol X)$ is also sufficient, since you can invert $g$ to recover $T$. So for the normal example with unknown mean and standard deviation, I could also have claimed that $\left( \sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2 \right)$, i.e., the sum and the sum of squares of the observations, is sufficient for $(\mu, \sigma)$. Indeed, the non-uniqueness of sufficient statistics is even more obvious than that, for $\boldsymbol T(\boldsymbol X) = \boldsymbol X$ is always sufficient for any parameter(s): the original sample always contains as much information about the parameters as we can gather.
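
For the normal example, that invertibility is a quick check, using $S$ from $\boldsymbol T(\boldsymbol X)$ above: $$\sum_{i=1}^n X_i = n \bar X, \qquad \sum_{i=1}^n X_i^2 = (n-1)S^2 + n \bar X^2,$$ so each pair can be computed from the other (recall that $n$ is a known constant), and knowing $(\bar X, S)$ is the same as knowing the sum and the sum of squares.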

In summary, sufficiency is a desirable property of a statistic because it allows us to formally show that a statistic achieves data reduction without discarding any information about the parameter(s). A sufficient statistic that achieves the maximum possible data reduction is called a minimal sufficient statistic.

heropup
  • What would be the general relation between $T(X)$ and our parameter $p$ or $\theta$? Does $T(X)$ always have to be related to the parameter? Also, intuitively, am I correct to say that the factorization theorem works because once we separate the pdf into the product of a factor involving the parameter and the sufficient statistic and some function of $x$, we can take logs and thus obtain an MLE? thanks! – user123276 Feb 03 '14 at 05:22
  • A sufficient statistic is not necessarily an estimate of the parameter(s); e.g., the original sample doesn't estimate anything. You have to do something to it to get an estimate. The only requirement is that a sufficient statistic doesn't discard any information about the parameter(s) that was in the original sample. The factorization theorem shows sufficiency because it expresses the joint PDF conditioned on the parameter in such a way that the part involving the parameter depends on the data only through the sufficient statistic. – heropup Feb 03 '14 at 05:29
  • To continue, in that sense, when you factor the PDF $f(\boldsymbol x \mid \theta) = g(T(\boldsymbol x) \mid \theta) h(\boldsymbol x)$, the factor that gives you "information" about the parameter is the conditional part $g(T(\boldsymbol x) \mid \theta)$. The factor $h(\boldsymbol x)$ is not conditional on $\theta$, so it doesn't furnish information about it. Thus, all you need to know is $T(\boldsymbol X)$, and not anything else. – heropup Feb 03 '14 at 05:31
  • So when they say that "$T(X)$ is sufficient for $\theta$", it means that I can use the conditional part $g(T(X)|\theta)$ to find an estimate of $\theta$? – user123276 Feb 03 '14 at 05:35
  • Okay, so parameter estimation is a concept distinct from, but related to, sufficiency. The next step after sufficiency is to think about the joint density as a likelihood function; i.e., given the sample $\boldsymbol X = \boldsymbol x$, what is the value of the parameter(s) that maximizes the likelihood $L(\boldsymbol \theta \mid \boldsymbol x) = f(\boldsymbol x \mid \boldsymbol \theta)$? Ignoring the unconditional component of the joint density is then a natural consequence of maximum likelihood estimation. – heropup Feb 03 '14 at 05:38
  • ah ok, thanks, I just have one more question regarding the factorization theorem, where it says the factor $g(T(X)|\theta)$ depends on the data only through $T(X)$.

    If my example for $g(T(X)|\lambda)$ is

    $\mathrm{e}^{-n\lambda}\cdot\lambda^{\sum{x_i}}$, what does it mean for this factor to depend on the data only through $\sum x_i$? How does $\lambda$ raised to the power of $\sum x_i$ express that dependence? Thank you!!

    – user123276 Feb 03 '14 at 05:44
  • Notice that the only place where the sample appears in $g$ is when it is expressed as the sum $T(\boldsymbol x) = \sum x_i$, so that is our sufficient statistic. Now, hypothetically, if we were only able to obtain a factor of the form $$g(T(\boldsymbol X) \mid \lambda) = e^{-n \lambda \prod x_i} \lambda^{\sum x_i},$$ then our sufficient statistic would be vector-valued: $\boldsymbol T (\boldsymbol x) = (\sum x_i, \prod x_i)$. (The full Poisson factorization is sketched after these comments.) – heropup Feb 03 '14 at 05:52
  • Ok, so basically the so-called "interaction" is $\lambda$ multiplied by something? – user123276 Feb 03 '14 at 06:23
  • Sometimes; sometimes not. You have to look at the algebraic expression. Any expression that contains a parameter must be included in $g$. Of those parts, any factor containing a function of the sample that cannot be separated from the parameter must also be included in $g$. So for instance, we can't separate/factor $\exp(-\lambda \sum x_i)$, but we can separate $\exp(-\lambda + \sum x_i) = e^{-\lambda} e^{\sum x_i}$. But if I write $\lambda + \sum x_i$, I can't separate this out: $g$ has to be the entire expression including $\sum x_i$, because the factorization has to be a product. – heropup Feb 03 '14 at 17:03
  • @heropup In your examples of uniqueness, why don't we need to know N as well? – dimitriy Nov 13 '14 at 15:40
  • @DimitriyV.Masterov We do need to know the sample size. That is assumed to be a known, fixed value that is neither a parameter nor a random variable, because it pertains to the amount of data we observe. That said, don't confuse the sample size with (for example) a parameter for a binomial distribution; e.g., we could try to find a sufficient statistic for $X_i \sim \operatorname{Binomial}(n,0.5)$ where $n$ is the unknown parameter; but in this case, our observations might be $X_1, X_2, \ldots, X_m$ and $m$ is the sample size. – heropup Nov 13 '14 at 15:59
  • Could you provide some references with proofs of sufficiency for the given examples? Intuitively your explanations are cool. – hans Jul 31 '19 at 08:02
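
To round out the Poisson example raised in the comments above (assuming, as the expression $e^{-n\lambda}\lambda^{\sum x_i}$ suggests, that $X_1, \ldots, X_n$ are IID ${\rm Poisson}(\lambda)$), a sketch of the full factorization is $$f(\boldsymbol x \mid \lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \underbrace{e^{-n\lambda}\, \lambda^{\sum_i x_i}}_{g(T(\boldsymbol x) \mid \lambda)} \cdot \underbrace{\left( \prod_{i=1}^n x_i! \right)^{-1}}_{h(\boldsymbol x)},$$ so the only place the sample enters the $\lambda$-dependent factor is through $T(\boldsymbol x) = \sum_i x_i$, which is therefore sufficient for $\lambda$.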