I am reading Tutz & Schmid, "Modeling Discrete Time-to-Event Data" (2016), Chapter 4 "Evaluation and Model Choice", Section 4.2 "Residuals and Goodness-of-Fit". A goodness-of-fit statistic called the deviance is defined as $$ D = 2 \sum_{i=1}^N n_i \sum_{t=1}^k p_{it} \log\left( \frac{p_{it}}{\hat\pi_{it}} \right) $$ where $\hat\pi_{it}$ is the estimated probability that a person belonging to group $i$ (of size $n_i$, among $N$ groups in total) experiences an event in time period $t$ (from 1 to $k$), and $p_{it}$ is the corresponding observed proportion of persons in group $i$ who actually experienced the event in time period $t$.
A well-fitting model will produce values $\hat\pi_{it}$ that are close to the observed proportions $p_{it}$, though there will be discrepancies due to the sampling variability inherent in $p_{it}$. This will yield $\frac{p_{it}}{\hat\pi_{it}}\approx 1$ and thus $\log\left( \frac{p_{it}}{\hat\pi_{it}} \right)\approx 0$ and thus $D\approx 0$. (Correct me if I am wrong.)
On the other hand, there is a later statement on the same page that asymptotically $D\sim\chi^2(N(k-1)-p)$. Now, I know $\chi^2(N(k-1)-p)$ can be obtained as a sum of squares of $N(k-1)-p$ independent N(0,1) random variables or as a sum of squares of a different number of dependent ones. However, I do not get the intuition why $p_{it} \log\left( \frac{p_{it}}{\hat\pi_{it}} \right)$ should behave like a square of N(0,1). Perhaps it should not? More generally, how is the asymptotic $\chi^2(N(k-1)-p)$ distribution obtained? (Intuition is welcome.)
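To make the question concrete, here is a heuristic sketch of where squares might come from. It rests on an assumption that I have not verified against the book: within each group, both the observed proportions and the fitted probabilities sum to one over $t$, i.e. $\sum_{t} p_{it} = \sum_{t} \hat\pi_{it} = 1$. Writing $\delta_{it} = p_{it} - \hat\pi_{it}$ (my notation) and expanding the logarithm to second order in $\delta_{it}$:

$$ p_{it} \log\left( \frac{p_{it}}{\hat\pi_{it}} \right) = (\hat\pi_{it} + \delta_{it}) \log\left( 1 + \frac{\delta_{it}}{\hat\pi_{it}} \right) = \delta_{it} + \frac{\delta_{it}^2}{2\hat\pi_{it}} + O(\delta_{it}^3) $$

Under the assumption above, $\sum_t \delta_{it} = 0$, so the linear terms cancel within each group and

$$ D \approx \sum_{i=1}^N n_i \sum_{t=1}^k \frac{(p_{it} - \hat\pi_{it})^2}{\hat\pi_{it}} $$

which has the form of a Pearson chi-square statistic. If this is right, it would be the sum over $t$, rather than each individual term $p_{it} \log\left( \frac{p_{it}}{\hat\pi_{it}} \right)$, that behaves like a sum of squares.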
Moreover, it seems to me that $\log\left( \frac{p_{it}}{\hat\pi_{it}} \right)$, and thus $p_{it} \log\left( \frac{p_{it}}{\hat\pi_{it}} \right)$, can be either less than or greater than zero, and so $D$ could end up being negative. That is incompatible with a $\chi^2(N(k-1)-p)$ distribution. Am I getting this wrong?
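A small numerical check of the sign question, as a sketch only. The group sizes, proportions and fitted probabilities below are made-up illustrative numbers, not taken from the book, and the helper names `deviance_terms` and `deviance` are my own. I assume that each row of `p` and of `pi_hat` sums to one over $t$.

```python
import numpy as np

def deviance_terms(p, pi_hat):
    """Return the individual terms p_it * log(p_it / pi_hat_it).

    Cells with p_it == 0 contribute 0 (the usual 0 * log 0 = 0 convention).
    """
    p = np.asarray(p, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    terms = np.zeros_like(p)
    mask = p > 0
    terms[mask] = p[mask] * np.log(p[mask] / pi_hat[mask])
    return terms

def deviance(n, p, pi_hat):
    """D = 2 * sum_i n_i * sum_t p_it * log(p_it / pi_hat_it)."""
    group_sums = deviance_terms(p, pi_hat).sum(axis=1)
    return 2.0 * np.sum(np.asarray(n, dtype=float) * group_sums)

# Hypothetical data: N = 2 groups, k = 3 periods; every row sums to 1 (assumption).
n = [50, 80]
p = [[0.20, 0.30, 0.50],
     [0.10, 0.60, 0.30]]
pi_hat = [[0.25, 0.25, 0.50],
          [0.15, 0.50, 0.35]]

terms = deviance_terms(p, pi_hat)
group_sums = terms.sum(axis=1)
D = deviance(n, p, pi_hat)

print("individual terms:\n", terms)
print("per-group sums:", group_sums)
print("D =", D)
```

In this example some individual terms are negative, yet each per-group sum over $t$ is nonnegative, and so is $D$. This is what one would expect if the inner sum is a Kullback-Leibler divergence between two probability vectors, which relies on the sum-to-one assumption stated above.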
