
From Newey and McFadden's chapter in the Handbook of Econometrics (Chapter 36).

This definition of identification (the bracketed part) is confusing to me because (based on my obvious misunderstanding) it fails for probit:

Probit with 2 covariates: $f=\Phi(X_1\theta_1+X_2\theta_2)^y\cdot(1-\Phi(X_1\theta_1+X_2\theta_2))^{1-y}$, where $\Phi$ is the standard normal CDF. If we fix the data (which might be the mistake) at $y=1$, $X_1=1$, $X_2=1$ and suppose $\theta_0=(1,1)$, then it is easy to find $\theta'=(2,0)$ such that $f(z|\theta_0)=f(z|\theta')$.

What am I not getting here? Does it need to hold for all z in the data generating process?

3 Answers


Yes, it needs to hold for all $z$ (subject to some technical caveats)

The idea of non-identifiability is that two different parameter values give the same sampling distribution, making them impossible to distinguish based on the data. Subject to some technical caveats which you can usually ignore,$^\dagger$ this means that non-identifiability occurs when the sampling density is the same for all observable data values and identifiability occurs when the sampling density is different for at least one observable data value.

Consequently, in the definition you cite in your question, I think they mean to say that identifiability occurs when $f(z|\theta) \neq f(z|\theta_0)$ for some observable value $z$. Your attempted counter-example gives one case where $f(z|\theta) = f(z|\theta_0)$ but it does not show that this holds for all $z$, so it does not establish non-identifiability.
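
To make this concrete, here is a small numerical sketch (my own illustration, not part of the cited definition; the helper function probit_density is just shorthand for the density in your question). The two parameter vectors agree at the single point $z=(y=1, X_1=1, X_2=1)$ that you fixed, but they disagree at other observable values of $z$, so they do not define the same density function:

```python
# Illustrative check (not from the original post): compare the two parameter
# vectors from the question at the fixed point and at another observable value.
from scipy.stats import norm

def probit_density(y, x1, x2, theta):
    """f(y | x, theta) = Phi(x1*theta1 + x2*theta2)^y * (1 - Phi(...))^(1-y)."""
    p = norm.cdf(x1 * theta[0] + x2 * theta[1])
    return p if y == 1 else 1 - p

theta_0, theta_alt = (1, 1), (2, 0)

# Equal at the single point the question fixes: both equal Phi(2) ~ 0.977.
print(probit_density(1, 1, 1, theta_0), probit_density(1, 1, 1, theta_alt))

# Not equal at another observable value, e.g. x = (1, 0): Phi(1) ~ 0.841 vs Phi(2) ~ 0.977.
print(probit_density(1, 1, 0, theta_0), probit_density(1, 1, 0, theta_alt))
```

Since the densities differ at some observable $z$, the two parameter values can be distinguished by data, which is exactly what the definition requires.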

Incidentally, a reasonable way to view identifiability is in terms of the concept of minimal sufficient parameters (see e.g., O'Neill 2005). Just as you can derive a minimal sufficient statistic from the likelihood function, you can similarly derive a "minimal sufficient parameter" by the same essential method. The minimal sufficient parameter is what can be "identified" from data from that sampling distribution, so any parameter vector that is not a function of the minimal sufficient parameter is not fully identifiable.


$^\dagger$ A slight complication to identifiability occurs because density functions are not generally unique representations of probability distributions. For instance, for continuous random variables it is possible to alter the value of a density function at an arbitrary countable set of points and it still represents the same distribution. This means that when you are assessing identifiability based on a parameterised class of sampling density functions, strictly speaking, identifiability occurs when $f(z|\theta) \neq f(z|\theta_0)$ over a set of values of $z$ that has positive probability under at least one of those densities. If you form the sampling densities so that they are all continuous then this is enough to allow you to simplify things and say that identifiability occurs when $f(z|\theta) \neq f(z|\theta_0)$ for some $z$. For the reasons discussed here, identifiability is generally not defined in terms of density functions.

Ben
  • Thank you for the answer: to confirm, you are saying they should have written: $\theta_0$ is identified $\iff$ $\exists z$ such that $f(z|\theta) \not= f(z|\theta_0)$. Do you have a possible citation for such a definition (the link you provided is cool but a different framework)? – Panel Noob Oct 04 '22 at 03:26
  • If you would like literature beyond the linked paper, I recommend looking at some of the papers cited in that paper. – Ben Oct 04 '22 at 08:18
  • Your lead-in seems to contradict the rest of your post. I am having difficulties determining what you're trying to say and reconciling it with the standard definition of identifiability (namely, that each $\theta\in\Theta$ determines a unique distribution). Indeed, it seems you make some additional implicit assumptions in referring to "a set of outcomes with probability one": probability with respect to what distribution? – whuber Feb 14 '23 at 22:10
  • @whuber: I've rewritten the answer to try to deal with this. Essentially, I am trying to get across the "essence" of identifiability without labouring the technical caveat that applies due to non-uniqueness of density functions. Usually identifiability would not be defined in terms of density functions (because of this problem), but since the OP is dealing with a case where it is, I would like to give him an answer that side-steps this complication for the most part. – Ben Feb 15 '23 at 03:43

I see no problem with how identification is defined. The bracketed part is saying that $\theta_0$ is identified if whenever $\theta \neq \theta_0$ and $\theta \in \Theta$, then it must be that $f(z|\theta) \neq f(z|\theta_0)$.

Here, the $\neq$ is for two functions. Two functions are not equal if there is some $z$ for which they are not equal. So I see no ambiguity in the definition.

If you wanted, you could certainly add "if there exists some $z$ such that...", but I presume the author was trying to keep things concise.

Relating all this to your example, identification deals with population distributions, not finite draws of data as in your example (I write more about this here: What is the difference between Consistency and Identification?). So if you are going to "fix" the data, then identification will be about the corresponding finite sample distribution (which is a bit strange, but can be done). It could certainly be the case that for certain finite sample DGPs, something like the coefficients for a probit are actually not identified.
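
To illustrate that last point, here is a rough sketch (my own simulated example, not from the original post): if every observation shares the same covariate vector $x=(1,1)$, the probit log-likelihood depends on $\theta$ only through $\theta_1+\theta_2$, so distinct coefficient vectors on that line cannot be distinguished.

```python
# Sketch (illustrative only): with every observation sharing x = (1, 1), the probit
# log-likelihood depends on theta only through theta_1 + theta_2, so coefficient
# vectors on the line theta_1 + theta_2 = 2 are indistinguishable from the data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.binomial(1, norm.cdf(2.0), size=500)  # simulated outcomes with true index theta_1 + theta_2 = 2

def loglik(theta, y):
    p = norm.cdf(theta[0] * 1.0 + theta[1] * 1.0)  # x = (1, 1) for every observation
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

for theta in [(1.0, 1.0), (2.0, 0.0), (-3.0, 5.0)]:
    print(theta, loglik(theta, y))  # identical log-likelihood for all three
```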

doubled
  • Typically when we refer to an entire function, we avoid specifying the argument variable. Usually when the argument variable is specified that is a reference to the output of that function at the specified value. – Ben Oct 04 '22 at 08:19
  • @Ben not sure who "we" is here... I certainly agree it could be clearer to emphasize the "for some $z$" but I've certainly seen this kind of notation, especially given that the author's notation also involves $|\theta$, so that writing $f( | \theta) \neq f( |\theta_0)$ could feel weird. But these are stylistic choices that certainly depend on the author, so I'm a bit surprised you mention this so strongly. But I agree with you it's clearer :)! – doubled Oct 05 '22 at 04:24

Below I quote the parameter identifiability definition from Section 4.6 of Statistical Models by A. C. Davison:

There must be a 1-1 mapping between models and elements of the parameter space, otherwise there may be no unique value of $\theta$ for $\hat{\theta}$ to converge to. A model in which each $\theta$ generates a different distribution is called identifiable.

This model identifiability definition is just a more heuristic way of paraphrasing the definition you cited in the question. The first sentence in the above quotation requires that the mapping $\theta \mapsto f_\theta$ from the parameter space $\Theta$ to the model space $\mathscr{P}_\Theta$ be a bijection (since the mapping is surjective by construction, it is only necessary to verify that it is injective). Therefore, the identifiability definition does not rely on specific observations of data and can be discussed from a purely probabilistic perspective, as the other answers have already pointed out.

A few concrete examples will elucidate this concept. Consider the probit model in your question: for parameter $\theta = (\theta_1, \theta_2)' \in \Theta$, the distribution of the response variable $y$ given the explanatory variable $x = (x_1, x_2)'$ is \begin{align} f_\theta(y|x) = \Phi(\theta'x)^y(1 - \Phi(\theta'x))^{1 - y}. \tag{1} \end{align}

Hence $f_{\theta}(y|x) = f_{\tilde{\theta}}(y|x)$ for all $(y, x)$ requires, in particular, $f_{\theta}(1|x) = f_{\tilde{\theta}}(1|x)$ for all $x$, i.e., $\Phi(\theta'x) = \Phi(\tilde{\theta}'x)$ for all $x$. Since $\Phi$ is strictly increasing, this implies $\theta'x = \tilde{\theta}'x$ for all $x$, which can only hold when $\theta = \tilde{\theta}$ (take $x = (1, 0)'$ and $x = (0, 1)'$ to recover each coordinate). This shows that the mapping is injective, hence the probit model $(1)$ is identifiable.
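
As a small numerical companion to this argument (my own sketch; the helper recover_theta is purely illustrative), evaluating the map $x \mapsto \Phi(\theta'x)$ at the unit vectors recovers each coefficient, so two distinct parameter vectors always produce different densities at some $x$:

```python
# Sketch (illustrative only): evaluating Phi(theta' x) at the unit vectors
# recovers each coefficient, which is the injectivity argument in miniature.
import numpy as np
from scipy.stats import norm

def recover_theta(index_fn):
    """Recover theta from the map x -> Phi(theta' x) by inverting at the unit vectors."""
    e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    return norm.ppf(index_fn(e1)), norm.ppf(index_fn(e2))

theta = np.array([0.7, -1.3])
print(recover_theta(lambda x: norm.cdf(theta @ x)))  # returns (0.7, -1.3)
```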

Now consider a model that is non-identifiable. Mixture models are generally non-identifiable under the definition quoted above; this is known as the label-switching problem. The two-component mixture model below (Exercise 4.6.1 from the same reference) is a very simple example:

Data arise from a mixture of two exponential populations, one with probability $\pi$ and parameter $\lambda_1$, and the other with probability $1 - \pi$ and parameter $\lambda_2$. The exponential parameters are both positive real numbers and $\pi$ lies in the range $[0, 1]$, so $\Theta = [0, 1] \times \mathbb{R}_+^2$, and \begin{align} f(y; \pi, \lambda_1, \lambda_2) = \pi\lambda_1e^{-\lambda_1y} + (1 - \pi)\lambda_2e^{-\lambda_2y}, \quad y > 0, 0 \leq \pi \leq 1, \lambda_1, \lambda_2 > 0. \tag{2} \end{align}

There are many ways to show $(2)$ is non-identifiable. One way is to note that whenever $\lambda_1 = \lambda_2$, the model degenerates to a single exponential distribution regardless of the value of $\pi$: for example, $(\pi, \lambda_1, \lambda_2) = (0.5, 1, 1)$ and $(0.2, 1, 1)$ give the same density $e^{-y}$. The other way corresponds to label switching, meaning that permuting the component labels (and replacing $\pi$ with $1 - \pi$) gives the same density: for example, $(0.2, 1, 2)$ and $(0.8, 2, 1)$ yield the same density $0.2e^{-y} + 0.8 \times 2e^{-2y}$.
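
Both failure modes are easy to verify numerically; here is a short sketch (my own illustration) checking that each pair of parameter values gives the same density over a grid of $y$ values:

```python
# Sketch (illustrative only): both pairs of parameter values for model (2)
# give the same mixture density over a grid of y values.
import numpy as np

def mixture_density(y, pi, lam1, lam2):
    return pi * lam1 * np.exp(-lam1 * y) + (1 - pi) * lam2 * np.exp(-lam2 * y)

y = np.linspace(0.01, 10, 1000)

# Degenerate case: lambda_1 = lambda_2, so the value of pi is irrelevant.
print(np.allclose(mixture_density(y, 0.5, 1, 1), mixture_density(y, 0.2, 1, 1)))  # True

# Label switching: swap the components and replace pi with 1 - pi.
print(np.allclose(mixture_density(y, 0.2, 1, 2), mixture_density(y, 0.8, 2, 1)))  # True
```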

For more examples and discussions on this topic, you can look into the referenced section.

Zhanxiong