chi square GLM inference

Question

Suppose at $m$ different positions on a line $a_1,....,a_m$, we sample from a i.i.d normal distribution $N(\mu_i,\sigma_i^2)$, $n_i$ times for each of the $1\le i\le m$ different points. Here of course $\mu_i=\mu_i(a_i)$ and $\sigma_i=\sigma_i(a_i)$. Thus we produce a triangular array of normal variables given by $X_{ij}$ for $1\le i\le m$ and $1\le j\le n_i$, where for each $i$, $X_{i1},....X_{{in_i}}$ are i.i.d $N(\mu_i,\sigma^2_i)$. Setting, $\bar{X}_{i\cdot}=\frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}$ we see that $Y_i=\sum_{j=1}^{n_i} (X_{ij}-X_{i\cdot})^2$ is of course a $\sigma_i^2\chi_{n-1,\lambda}$ random variable where $\lambda=\frac{n\mu_i^2}{\sigma_i^2}$ for each $1\le i\le n_i$.

I am trying to see how a GLM could be applied here to see how the variances $\sigma_1^2...,\sigma_m^2$ depend on the positions $a_1,...a_m$. Can we also consider the standard link function for this model? would we have any issues from this.

Actually $Y_i$ follows an ordinary $\chi^2_{n-1}$ distribution so $\lambda=1$. — Gordon Smyth, Dec 14 '20 at 03:27
here we have different variances. I think that only works if we assume that $X_{I,j}$ are standard normals? — user593295, Dec 14 '20 at 03:38
Yes, the variances are different but $\lambda=1$: $Y_i\sim \sigma^2_i\chi_{n-1}$. The distribution of $Y_i$ does not depend on $\mu_i$. — Gordon Smyth, Dec 14 '20 at 08:13

Gordon Smyth · Accepted Answer · 2021-02-19T22:13:39.400

Yes, can you use a standard generalized linear model (GLM) approach to model the relationship between the positions $a_i$ and the variances $\sigma^2_i$. The nicest way is to compute the sample variance at each position: $$s^2_i=\frac{1}{n_i-1}Y_i=\frac{1}{n_i-1}\sum_{j=1}^{n_i}\left(X_{ij}-\bar X_i\right)^2$$

Then the $s^2_i$ are Gamma distributed. They follow a GLM response distribution with quadratic variance function, means $\nu_i=\sigma^2_i$, prior weights $w_i=n_i-1$ and dispersion $\phi=2$. In other words: $$E(s^2_i)=\nu_i=\sigma^2_i$$ and $${\rm var}(s^2_i)=\frac{\phi}{w_i}\nu^2_i=\frac{2\sigma^4_i}{n_i-1}$$

I have given the theoretical calculations for a gamma GLM as part of another answer on this site but the above summary is all you really need. Let's suppose you are using R, the sample variances are stored in a vector s2, the positions are in a vector called a and the sample sizes are in n. You can estimate a log-linear relationship between the positions and the variances by:

w <- n-1
fit <- glm(s2 ~ a, family=Gamma(link="log"), weights=w)
summary(fit, dispersion=2)

You can test significance of the trend by

anova(fit, dispersion=2, test="Chisq")

If you want to estimate a more general smooth trend you could use regression splines. Here I fit a regression spline with 3 parameters, which is often enough:

fit <- glm(s2 ~ ns(a, df=3), family=Gamma(link="log"), weights=w)
anova(fit, dispersion=2, test="Chisq")

In my experience, there is seldom any reason not to use the log-link for a Gamma GLM. While the inverse-link is the default in R, the log-link has the advantage of transforming the positive reals (the range of $\nu_i$) to the whole real line, meaning that the coefficients of the GLM linear model are unconstrained. The log-link has two important advantages:

It avoids numerical problems with negative means and
It models relative changes in the variances, which makes intuitive sense.

The inverse link is canonical for the Gamma GLM family but is seldom very compelling in practice and, frankly, just causes unnecessary problems. In my opinion, the inverse link should only be used when there is a theoretical mathematical model justifying the inverse link. Such well-defined models are very rare and are obviously not present in your application. If you did want to try the canonical link however it is done by:

fit <- glm(s2 ~ a, family=Gamma(link="inverse"), weights=w)

Yeah I ended up getting the GLM fitting part, that makes sense as Y_i is given in a form to read off the sample variances. Do you have any idea about the using canonical link function? (this is if we write the chi-square density as a GLM form). — user593295, Dec 19 '20 at 03:48
@user593295 I have answered your question about the canonical link as part of my answer. Using the canonical link is easy but also pointless. The GLM calculations are exactly the same no matter which link function is used, but the log-link is the most natural and the most frequently used. — Gordon Smyth, Dec 19 '20 at 04:20
The default link in Gamma() is not the "log" link but the canonical link, "inverse": https://stat.ethz.ch/R-manual/R-patched/library/stats/html/family.html — Gavin Simpson, Feb 19 '21 at 02:46
@GavinSimpson Thanks for correction and edit. I've made another edit myself. — Gordon Smyth, Feb 19 '21 at 22:17

chi square GLM inference

1 Answers1

Linked