Let's handle the simplest case to try to provide the most intuition. Let $X_1, X_2, \ldots, X_n$ be an iid sample from a discrete distribution with $k$ outcomes. Let $\pi_1,\ldots,\pi_k$ be the probabilities of each particular outcome, and let $S_i$ denote the number of sample points equal to the $i$th outcome. We are interested in the (asymptotic) distribution of the chi-squared statistic
(asymptotic) distribution of the chi-squared statistic
$$
X^2 = \sum_{i=1}^k \frac{(S_i - n \pi_i)^2}{n\pi_i} \> .
$$
Here $n \pi_i$ is the expected number of counts of the $i$th outcome.
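As a concrete illustration, here is a minimal sketch in Python of computing $X^2$ from observed counts; the fair-die probabilities and counts below are made up for the example.

```python
import numpy as np

# Hypothetical example: a fair six-sided die rolled n = 600 times.
pi = np.full(6, 1 / 6)                           # null probabilities pi_1, ..., pi_k
counts = np.array([90, 110, 95, 105, 100, 100])  # observed counts S_1, ..., S_k
n = counts.sum()

expected = n * pi                                # expected counts n * pi_i
X2 = np.sum((counts - expected) ** 2 / expected)
print(X2)                                        # the chi-squared statistic (2.5 here)
```

(For reference, `scipy.stats.chisquare(counts, expected)` computes the same statistic together with a p-value from the $\chi^2_{k-1}$ reference distribution.)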
A suggestive heuristic
Define $U_i = (S_i - n\pi_i) / \sqrt{n \pi_i}$, so that $X^2 = \sum_i
U_i^2 = \newcommand{\U}{\mathbf{U}}\|\U\|^2_2$ where $\U =
(U_1,\ldots,U_k)$.
Since $S_i$ is $\mathrm{Bin}(n,\pi_i)$, by the Central Limit Theorem,
$$
\newcommand{\convd}{\xrightarrow{d}}\newcommand{\N}{\mathcal{N}}
T_i = \frac{U_i}{\sqrt{1-\pi_i}} = \frac{S_i - n \pi_i}{\sqrt{ n\pi_i(1-\pi_i)}} \convd \N(0, 1) \>,
$$
hence we also have $U_i \convd \N(0, 1-\pi_i)$.
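To spell out the variance calculation behind the last display,
$$
\mathrm{Var}(U_i) = \frac{\mathrm{Var}(S_i)}{n\pi_i} = \frac{n\pi_i(1-\pi_i)}{n\pi_i} = 1 - \pi_i \>,
$$
which is exactly why dividing $U_i$ by $\sqrt{1-\pi_i}$ standardizes it.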
Now, if the $T_i$ were (asymptotically) independent (which they aren't), then we could argue that
$\sum_i T_i^2$ is asymptotically $\chi_k^2$ distributed. But note that, since the counts satisfy $\sum_i S_i = n$, the quantity $T_k$ is a deterministic function of $(T_1,\ldots,T_{k-1})$, and so the $T_i$ variables can't possibly be independent.
Hence, we must take the covariance between them into account somehow. It turns out that the "correct" way to do this is to use the $U_i$ instead, and the covariance between the components of $\U$ also changes the asymptotic distribution from what we might have thought was $\chi_{k}^2$ to what is, in fact, a $\chi_{k-1}^2$.
Some details on this follow.
A more rigorous treatment
It is not hard to check that, in fact,
$\newcommand{\Cov}{\mathrm{Cov}}\Cov(U_i, U_j) = - \sqrt{\pi_i
\pi_j}$ for $i \neq j$.
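To see where this comes from, note that the counts $(S_1, \ldots, S_k)$ are jointly multinomial, so $\Cov(S_i, S_j) = -n \pi_i \pi_j$ for $i \neq j$, and hence
$$
\Cov(U_i, U_j) = \frac{\Cov(S_i, S_j)}{\sqrt{n\pi_i}\sqrt{n\pi_j}} = \frac{-n\pi_i\pi_j}{n\sqrt{\pi_i\pi_j}} = -\sqrt{\pi_i \pi_j} \>.
$$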
So, the covariance matrix of $\U$ is
$$
\newcommand{\sqpi}{\sqrt{\boldsymbol{\pi}}}
\newcommand{\A}{\mathbf{A}}
\A = \mathbf{I} - \sqpi \sqpi^T \>,
$$
where $\sqpi = (\sqrt{\pi_1}, \ldots, \sqrt{\pi_k})$. Note that
$\A$ is symmetric and idempotent, i.e., $\A = \A^2 =
\A^T$. So, in particular, if $\newcommand{\Z}{\mathbf{Z}}\Z =
(Z_1, \ldots, Z_k)$ has iid standard normal components, then $\A
\Z \sim \N(0, \A)$. (NB The multivariate normal distribution in this case is degenerate.)
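The idempotence is a one-line computation using $\sqpi^T \sqpi = \sum_i \pi_i = 1$:
$$
\A^2 = (\mathbf{I} - \sqpi\sqpi^T)(\mathbf{I} - \sqpi\sqpi^T) = \mathbf{I} - 2\sqpi\sqpi^T + \sqpi(\sqpi^T\sqpi)\sqpi^T = \mathbf{I} - \sqpi\sqpi^T = \A \>,
$$
and the distributional claim follows since $\Cov(\A\Z) = \A \,\Cov(\Z)\, \A^T = \A \A^T = \A$.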
Now, writing the count vector $(S_1,\ldots,S_k)$ as a sum of $n$ iid indicator vectors and applying the Multivariate Central Limit Theorem, the vector $\U$ has an asymptotic multivariate normal distribution with mean $0$ and covariance $\A$.
So, $\U$ has the same asymptotic distribution as $\A \Z$; hence, by the continuous mapping theorem, the asymptotic distribution of $X^2 = \U^T \U$ is the same as the distribution of $\Z^T \A^T \A \Z = \Z^T \A \Z$.
But, $\A$ is symmetric and idempotent, so (a) it has orthogonal
eigenvectors, (b) all of its eigenvalues are 0 or 1, and (c)
the multiplicity of the eigenvalue 1 is $\mathrm{rank}(\A)$. This means that $\A$ can be decomposed as $\A = \mathbf{Q D Q}^T$ where $\mathbf{Q}$ is orthogonal and $\mathbf{D}$ is a diagonal matrix with $\mathrm{rank}(\A)$ ones on the diagonal and the remaining diagonal entries being zero.
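Writing $\mathbf{W} = \mathbf{Q}^T \Z$, which is again $\N(0, \mathbf{I})$ because $\mathbf{Q}$ is orthogonal, the quadratic form becomes
$$
\Z^T \A \Z = \Z^T \mathbf{Q D Q}^T \Z = \mathbf{W}^T \mathbf{D} \mathbf{W} = \sum_{i \,:\, D_{ii} = 1} W_i^2 \>,
$$
a sum of $\mathrm{rank}(\A)$ independent squared standard normals.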
Thus, $\Z^T \A \Z$ must be $\chi^2_{k-1}$ distributed, since in our case $\A$ has rank $k-1$: because $\sqpi^T \sqpi = \sum_i \pi_i = 1$, the matrix $\A$ is the orthogonal projection onto the $(k-1)$-dimensional subspace orthogonal to $\sqpi$.
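A quick simulation makes the $\chi^2_{k-1}$ limit visible; this is a minimal sketch, with arbitrary choices of $\boldsymbol{\pi}$, $n$, and the number of replications.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # arbitrary null probabilities, k = 4
n, reps = 500, 20000                  # sample size and number of replications

# Draw the count vectors (S_1, ..., S_k) as multinomials and compute
# X^2 = sum_i (S_i - n pi_i)^2 / (n pi_i) for each replication.
counts = rng.multinomial(n, pi, size=reps)
expected = n * pi
X2 = ((counts - expected) ** 2 / expected).sum(axis=1)

# Compare empirical quantiles of X^2 with chi^2 quantiles on k - 1 = 3 df.
qs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(X2, qs))                  # empirical quantiles
print(stats.chi2.ppf(qs, df=len(pi) - 1))   # chi^2_{k-1} quantiles (should be close)
```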
Other connections
The chi-squared statistic is also closely related to likelihood ratio
statistics. Indeed, it is a Rao score statistic and can be viewed as a
Taylor-series approximation of the likelihood ratio statistic.
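To make the connection explicit: the likelihood ratio statistic for the same null hypothesis is $G^2 = 2 \sum_{i=1}^k S_i \log\big(S_i / (n\pi_i)\big)$, and a second-order Taylor expansion of the logarithm together with $\sum_i (S_i - n\pi_i) = 0$ gives
$$
G^2 = \sum_{i=1}^k \frac{(S_i - n\pi_i)^2}{n\pi_i} + o_P(1) = X^2 + o_P(1)
$$
under the null, so the two statistics share the same $\chi^2_{k-1}$ limit.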
References
This is my own development based on experience, but obviously influenced by classical texts. Good places to look to learn more are
- G. A. F. Seber and A. J. Lee (2003), Linear Regression Analysis, 2nd ed., Wiley.
- E. Lehmann and J. Romano (2005), Testing Statistical Hypotheses, 3rd ed., Springer. Section 14.3 in particular.
- D. R. Cox and D. V. Hinkley (1979), Theoretical Statistics, Chapman and Hall.