Here is my understanding of the likelihood function, the maximum likelihood estimator (MLE), and the consistency and efficiency of the MLE.
(Notes: the Comments are not the main parts of this post and can be skipped. The Questions are the most important parts; the rest is context that explains why I ask them. I'm not expecting anyone to answer all of these questions, and any comment is very welcome. Hints for Questions 3 and 4 are what I'm hoping for most.)
I. Likelihood function
The likelihood function, though written as $f(\mathbf{x}|\theta)=f(x_1,\dots, x_n|\theta)$, is defined for both discrete random variables (via the pmf) and continuous random variables (via the pdf). Note that it is not the pmf or pdf of a single random variable, but that of the joint distribution of the sample. That's why, when the $X_i$'s are iid, it naturally equals $\prod_i f(x_i|\theta)$.
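For concreteness, here is a minimal sketch (my own illustration, not from Casella) of the likelihood of an iid $\mathrm{N}(\theta, 1)$ sample computed as the product of the individual densities; the model and the data are just hypothetical choices.

```python
# Likelihood of an iid N(theta, 1) sample as the product of individual densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10)   # hypothetical iid sample, true theta = 2

def likelihood(theta, x):
    """L(theta | x) = prod_i f(x_i | theta) for the N(theta, 1) model."""
    return np.prod(norm.pdf(x, loc=theta, scale=1.0))

print(likelihood(2.0, x))   # likelihood near the true theta
print(likelihood(0.0, x))   # much smaller: theta = 0 explains the data poorly
```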
Comment 1:
(1) To make it look more familiar to someone used to analysis notation, we can also write the likelihood function $f(\mathbf{x}|\theta)$ for discrete variables as $P(X_1=x_1, \dots, X_n=x_n |\theta)=P_{\text{given }\theta}\big(X_1^{-1}(x_1)\cap \dots\cap X_n^{-1}(x_n)\big)$ (where $x_i\in \mathbb{R}$), since a random variable is indeed a function from the sample space to $\mathbb{R}$.
(2) We can also write the likelihood function $f(\mathbf{x}|\theta)$ for continuous variables as $$f_{X_1, \dots, X_n}(x_1, \dots, x_n)=\frac{\partial}{\partial x_1}\cdots\frac{\partial}{\partial x_n}P(X_1<x_1, \dots, X_n<x_n)=\frac{\partial}{\partial x_1}\cdots\frac{\partial}{\partial x_n}P\big(X_1^{-1}(-\infty, x_1)\cap \dots\cap X_n^{-1}(-\infty, x_n)\big).$$
(I guess $\frac{\partial}{\partial x_1}\cdots\frac{\partial}{\partial x_n}$ here might not be an accurate way of saying that the pdf of the joint distribution is the mixed partial derivative of the joint cdf with respect to the values $x_i\in \mathbb{R}$ of the random variables.)
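As a sanity check on (2) (again my own illustration, not from the book), here is a small symbolic computation for two independent Exponential($\lambda$) variables, verifying that the mixed partial derivative of the joint cdf recovers the joint pdf:

```python
# Verify d^2/(dx1 dx2) of the joint cdf = product of the marginal pdfs
# for two independent Exponential(lam) random variables.
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', positive=True)
F = (1 - sp.exp(-lam * x1)) * (1 - sp.exp(-lam * x2))        # joint cdf (independence)
joint_pdf = sp.diff(F, x1, x2)                               # mixed partial derivative
product_of_pdfs = lam * sp.exp(-lam * x1) * lam * sp.exp(-lam * x2)
print(sp.simplify(joint_pdf - product_of_pdfs))              # prints 0
```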
Question 1: In Def 6.3.1 of Casella, the likelihood function doesn't require the random variables $X_1,\dots, X_n$ (of the sample) to be i.i.d., while the MLE (in Def 7.2.4) does require it. Why is that?
II. MLE
The MLE is the value of the parameter that maximizes the likelihood function. We don't always need to maximize the log of the likelihood function instead, but sometimes doing so makes the calculation easier.
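As an illustration (my own example, using the Exponential($\lambda$) model and made-up data), the MLE can be found numerically by minimizing the negative log-likelihood, and it agrees with the closed-form answer $\hat\lambda = 1/\bar x$:

```python
# Numerical MLE for the Exponential(lambda) model via the (negative) log-likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1/2.0, size=200)     # hypothetical iid sample, true lambda = 2

def neg_log_likelihood(lam, x):
    # log L(lambda | x) = n*log(lambda) - lambda*sum(x)
    return -(len(x) * np.log(lam) - lam * np.sum(x))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50), args=(x,), method='bounded')
print(res.x, 1/np.mean(x))   # numerical MLE vs closed-form MLE 1/xbar
```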
III. Consistency and efficiency of MLE
Theorems 10.1.6 and 10.1.12 concern the consistency and (asymptotic) efficiency of the MLE.
III-1. Consistency
Consistency means that the MLE $\hat \theta$ of $\theta$, as well as a function $\tau(\hat \theta)$ of $\hat \theta$, can 'pin down $\tau( \theta)$' within an arbitrarily narrow range (give a tiny interval estimate of $\tau( \theta)$ with extremely high (almost 100%) confidence level) when $n$ is sufficiently large, i.e. $$P(\tau (\hat \theta)-\epsilon < \tau( \theta) < \tau (\hat \theta)+\epsilon ) \to 1, \text{ as } n\to\infty.$$ (A small simulation of this statement is sketched after Comment 2 below.)
Comment 2: Here the probability is a probability about $\hat \theta$ (which is a function of the random variables $X_i$ and therefore a random variable itself), not about $\theta$, which is regarded as fixed (since the population and its distribution don't change). We can make this more explicit by writing $$P(\tau (\hat \theta)-\epsilon < \tau( \theta) < \tau (\hat \theta)+\epsilon )=P\big(\tau (\hat \theta) < \tau( \theta)+\epsilon,\ \tau( \theta)-\epsilon < \tau (\hat \theta) \big)=P\Big((\tau \circ \hat \theta)^{-1}(-\infty, \tau( \theta)+\epsilon)\cap (\tau \circ \hat \theta)^{-1}(\tau( \theta)-\epsilon, \infty)\Big)=P\Big( \hat \theta^{-1}\big(\tau^{-1}(-\infty, \tau( \theta)+\epsilon)\big)\cap \hat \theta^{-1}\big(\tau^{-1}(\tau( \theta)-\epsilon, \infty)\big)\Big),$$
so it is, overall, a probability about the random variable $\hat \theta$, i.e. the probability of an event where $\hat \theta$ satisfies certain conditions.
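Here is the small simulation mentioned above (my own sketch, not from the book): I use the Exponential($\lambda$) model, whose MLE is $\hat\lambda = 1/\bar x$, and estimate $P(|\hat\lambda-\lambda|<\epsilon)$ by Monte Carlo for increasing $n$; $\lambda$, $\epsilon$ and the number of replications are arbitrary choices.

```python
# Monte Carlo estimate of P(|lambda_hat - lambda| < eps) for growing n.
import numpy as np

rng = np.random.default_rng(2)
true_lam, eps, reps = 2.0, 0.2, 2000

for n in [10, 100, 1000, 10000]:
    samples = rng.exponential(scale=1/true_lam, size=(reps, n))
    lam_hat = 1/samples.mean(axis=1)                        # MLE in each replication
    coverage = np.mean(np.abs(lam_hat - true_lam) < eps)    # estimates the probability above
    print(n, coverage)
```

The printed proportions should climb toward 1, which is exactly the consistency statement.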
III-2. Asymptotic efficiency
Asymptotic efficiency means that the MLE $\hat \theta$ of $\theta$, as well as a function $\tau(\hat \theta)$ of $\hat \theta$, besides converging to $\theta$, $\tau(\theta)$ (asymptotic unbiasedness), has variance that converges to the smallest possible variance of an unbiased estimator, namely the Cramer-Rao Lower Bound $v(\theta)$ divided by $n$, i.e. $\sqrt n\,[\tau(\hat \theta) -\tau( \theta)]\to \mathrm{N}[0, v(\theta)]$ in distribution.
Question 2: It seems that $\frac n {v(\theta)}$ is the Fisher information $I_n(\theta)$ (*), so the larger the Fisher information, the smaller the lower bound on the variance. Is that correct? (This is related to the Delta method and Fisher information: if (*) is correct, then it is reasonable to use $\frac 1{I_n(\theta)}$ to describe variance$/n$ in the asymptotic setting, i.e. when $n$ is large.)
(Update: this question is partly discussed in the quoted post. My guess is close, but some details are still unclear and very confusing to me.)
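To check the guess in Question 2 numerically (my own sketch, using the Exponential($\lambda$) model, where the per-observation Fisher information is $I_1(\lambda)=1/\lambda^2$, so $v(\lambda)=1/I_1(\lambda)=\lambda^2$), the empirical variance of $\sqrt n\,(\hat\lambda-\lambda)$ should be close to $v(\lambda)$:

```python
# Empirical variance of sqrt(n)*(lambda_hat - lambda) vs v(lambda) = lambda^2.
import numpy as np

rng = np.random.default_rng(3)
true_lam, n, reps = 2.0, 2000, 5000

samples = rng.exponential(scale=1/true_lam, size=(reps, n))
lam_hat = 1/samples.mean(axis=1)                 # MLE in each replication
z = np.sqrt(n) * (lam_hat - true_lam)            # sqrt(n) * (theta_hat - theta)
print(z.var(), true_lam**2)                      # empirical variance vs v(lambda) = 4
```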
III-3. Limiting variance, asymptotic variance, and CR Lower Bound
So overall, 'consistent' means the estimator is (approximately) unbiased and concentrates around the true value when $n$, the number of iid observations, is large; 'efficient' means the estimator has the smallest possible variance an (unbiased/consistent) estimator can have, i.e. its limiting variance ($\lim_{n\to\infty} k_n^2\mathrm{Var}(\hat\theta)$) or asymptotic variance (the $\sigma^2$ in $k_n[\hat\theta-\theta]\to \mathrm{N}[0,\sigma^2]$) equals the CR Lower Bound. ($k_n$ seems to be allowed to be any sequence of constants depending on $n$, e.g. $n^2$, $\log n$, $\frac{n^2+3}{-3n-1}, \dots$; for the MLE, $k_n=\sqrt n$.)
Question 3: Both variances are variances of (more precisely, of a linear function of) the estimator; the only difference is that in one of them we subtract from the estimator the parameter it estimates (such a subtraction does not change the variance, since $\mathrm{Var}(aX+b)=a^2\mathrm{Var}(X)$). So why do we need to define two variances (of an estimator based on a sample with large $n$)?
Question 4: Besides, why do we need $k_n$ to define these variances, and why is $k_n=\sqrt n$ for the MLE? (My guess is that the MLE behaves like the mean $\frac{\sum_i X_i}n$ (though I don't know why; this is mentioned in another post, The two estimators of mean of Gamma distribution and the estimators' variances), so multiplying by $\sqrt n$, whose square is $n$, balances the effect of the $\sum_i$ in the numerator (which amplifies the variance by a factor of $n$) against the effect of the $n$ in the denominator (which shrinks the variance by a factor of $\frac1 {n^2}$).)
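To illustrate my guess in Question 4 (a sketch using the sample mean of standard normal variables as the estimator, with true mean $0$): $\mathrm{Var}(\hat\theta_n)$ shrinks like $1/n$, so $k_n=\sqrt n$ is exactly the scaling that makes the variance of $k_n[\hat\theta_n-\theta]$ settle at a nonzero constant, whereas $k_n=n$ would make it blow up.

```python
# Variance of the sample mean under different scalings k_n.
import numpy as np

rng = np.random.default_rng(4)
reps = 2000

for n in [100, 1000, 10000]:
    xbar = rng.normal(loc=0.0, scale=1.0, size=(reps, n)).mean(axis=1)   # estimator, true mean = 0
    print(n,
          np.var(xbar),                # ~ 1/n, shrinks to 0
          np.var(np.sqrt(n) * xbar),   # ~ 1, stabilizes: this is why k_n = sqrt(n)
          np.var(n * xbar))            # ~ n, blows up
```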