What is $\mu_i$ in a GLM / link function

Question

Is it so that:

$y_i$ is not a single fixed value, but a random variable with a probability distribution
Which means that for the same predictor value(s), $y_i$ could take different values
In linear regression this distribution can only be normal
In a GLM, this distribution can be any distribution from the exponential family
The distribution of a single $y_i$ has nothing to do with the distribution of all the $y$ values taken together
$\mu_i$ is the expected value of $y_i$
In practical use, $\mu_i$ is the predicted value of $y_i$, especially if the dataset has only one $y$ for the given predictor value(s)

Are the above correct? Where am I wrong?

Based on the above I've tried simulating glm with lm in R, and it kinda works:

library(boot)
download.file("https://dl.dropbox.com/u/7710864/data/ravensData.rda", 
              destfile="./ravensData.rda",method="curl")
load("./ravensData.rda")
# download manually and load here if the above fails
# load("/yourpath/ravensData.rda")

# calling logit(ravensData$ravenWinNum) results in 
# [1]  Inf  Inf  Inf  Inf  Inf -Inf  Inf  Inf  Inf  Inf -Inf  Inf  Inf  Inf  Inf -Inf
# [17] -Inf -Inf  Inf -Inf
# that's way too much, as inv.logit goes to 1 at 20
# so we'll write our own dummy "logit" routine
# this will give us 5 when winNum=1 and -5 when it's zero
win <- ravensData$ravenWinNum*10-5

# now we can do a simple lm
fit <- lm(win~ravensData$ravenScore)

# and get probability of win using inv.logit
fitwin <- inv.logit(fit$fitted.values)
plot(ravensData$ravenScore, fitwin)

# now glm
fitglm <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family="binomial")
plot(ravensData$ravenScore,fitglm$fitted)

That's a very odd R formula; the right hand side x^2 really isn't doing what you want at all. First, if x is continuous and you mean x-squared, you need to insulate the ^ operator from the formula parsing code. Also, you generally want x and x^2, not just x^2 in the formula. Hence, if x is continuous, you want y ~ x + I(x^2) to get a 2nd order polynomial. — Gavin Simpson, Oct 01 '15 at 19:48
The error is about y, and it is clear that y is not a proportion between 0 and 1. When I run your code, y has range (1, 101). The problem is that you aren't making the "mean" of y depend on x; instead you are computing some value from x^2 and adding a random Bernoulli observation (0 or 1) to it. This data isn't suitable for the binomial GLM, for which you want Bernoulli data (0s and 1s) or binomial counts, i.e. the number of successes from M trials. — Gavin Simpson, Oct 01 '15 at 19:52
OK, I removed that example, which made no sense anyway. I'm just trying to understand what $\mu_i$ is, and how it's computed — n_mu_sigma, Oct 01 '15 at 21:18
It may help you to read my answer here: Difference between logit and probit models. — gung - Reinstate Monica, Oct 01 '15 at 22:20

jlimahaverford · Answer 1 · 2015-10-01T21:33:06.630

5

In GLM, your dependent variable is drawn from a distribution that depends on your independent variables. More specifically, the mean of the distribution from which the DV is drawn is related to a linear combination of your independent variables by the link function (actually the inverse link, $\mu_i = g^{-1}(\theta^Tx_i)$).

In ordinary least squares regression, the DV is modeled as a normal distribution whose mean is related to a linear combination of the independent variables by the identity function.

$$ y_i \sim \mathcal{N}(\theta^T x_i, \sigma^2). $$
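
As a rough check of this correspondence, here is a minimal sketch on simulated data (the coefficients 1 and 2, the noise level, and the sample size are all made-up assumptions). It fits the same model with `lm` and with `glm` using a Gaussian family and identity link, and compares the results.

```r
set.seed(1)

# Simulated data (hypothetical values): y_i ~ N(1 + 2 * x_i, 0.5^2)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

# Ordinary least squares
fit_lm <- lm(y ~ x)

# The same model written as a GLM: normal family, identity link
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

# Coefficients and fitted means should agree up to numerical tolerance
coef_match   <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm), tolerance = 1e-6))
fitted_match <- isTRUE(all.equal(fitted(fit_lm), fitted(fit_glm), tolerance = 1e-6))
```

With the identity link, the fitted values are the estimated means $\mu_i$ themselves, so nothing has to be transformed back.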

In logistic regression, the DV is modeled as being drawn from a Bernoulli distribution whose mean is related to a linear combination of the independent variables by the inverse logit (expit) function.

$$ y_i \sim \mathcal{Ber}(g^{-1}(\theta^T x_i)), $$

where $g$ is the logit function.
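
To make the roles of $\mu_i$ and $y_i$ concrete, here is a small sketch on simulated data. The "true" coefficients in `theta` are invented for illustration. Each $\mu_i$ is a probability strictly between 0 and 1, each $y_i$ is a single 0/1 draw with that probability, and the fitted values from `glm` are the inverse logit (`plogis` in base R) of the fitted linear predictor.

```r
set.seed(2)

n     <- 500
x     <- rnorm(n)
theta <- c(-0.5, 1.5)              # hypothetical true coefficients

eta <- theta[1] + theta[2] * x     # linear predictor theta^T x_i
mu  <- plogis(eta)                 # mu_i = g^{-1}(eta_i), inverse logit
y   <- rbinom(n, size = 1, prob = mu)  # y_i ~ Bernoulli(mu_i), always 0 or 1

fit <- glm(y ~ x, family = binomial(link = "logit"))

eta_hat <- predict(fit, type = "link")  # estimated linear predictor
mu_hat  <- fitted(fit)                  # estimated mean of each y_i

# The fitted means are the inverse link applied to the linear predictor
link_ok <- isTRUE(all.equal(unname(mu_hat), unname(plogis(eta_hat))))
```

Note that `y` only ever contains 0s and 1s, while `mu` and `mu_hat` take values in between.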

In your case I think the issue is that you're trying to draw large numbers from a Bernoulli distribution with an unrealistic mean. If $$ y = x^2 + \epsilon, \epsilon \sim \mathcal{Ber}(0.5), $$ this does not mean that $y$ is being drawn from a Bernoulli distribution with mean $x^2$. It means that $y-x^2$ is being drawn from a Bernoulli distribution with mean 0.5.
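
A quick sketch of this point, assuming $x$ runs over 1 to 10 (the original example was removed from the question, so these values are a guess at its shape):

```r
set.seed(3)

x   <- 1:10                                       # assumed predictor values
eps <- rbinom(length(x), size = 1, prob = 0.5)    # Bernoulli(0.5) noise
y   <- x^2 + eps

y_range <- range(y)      # far outside [0, 1], so y itself is not Bernoulli
shifted <- y - x^2       # this is the part that is Bernoulli: only 0s and 1s
```

Here `shifted` recovers the 0/1 noise exactly, while `y` ranges up to about 100, which is why a binomial family cannot be fitted to `y` directly.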

Edit: Responding to comment

It is a bit more complex than that. GLMs generalize linear models in two ways. The distribution that the data come from can be any distribution in the exponential family, and the relationship between the mean of that distribution and the IVs can be non-linear. These two generalizations are not entirely disconnected, in that certain types of distributions are intrinsically connected to certain link functions. This resource handles the subject very well, http://data.princeton.edu/wws509/notes/a2.pdf.

You have a bunch of data $\{(x_i, y_i) : 1 \leq i \leq N \}$. Our assumption is that there is a family of distributions with various means, we'll call them $D_{\mu}$ from which the $y_i$ were drawn. We believe that the mean was different for the different $y_i$, so we say specifically, $y_i \sim D_{\mu_i}$. What's more we believe that $\mu_i$ has a deterministic relationship with $x_i$, so $\mu_i = f(x_i)$. What's even more, we believe that this function lives in a family that we have parameterized by $\theta$, so for some $\theta^{*}$, $\mu_i = f_{\theta^{*}}(x_i)$.

OUR GOAL: Estimate $\hat{\theta} \approx \theta^{*}$, so that for future $x$ we can predict a corresponding $\hat{y} = f_{\hat{\theta}}(x)$, the mean of the distribution that the corresponding $y$ will be drawn from.
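
A sketch of that goal in R, on simulated data with an invented $\theta^{*}$ (stored in `theta_star`) and a logit link as the assumed family of functions. The model is fitted with a `data` argument so that `predict` can be given new $x$ values.

```r
set.seed(4)

n          <- 1000
x          <- runif(n, -3, 3)
theta_star <- c(0.3, -1.2)   # hypothetical "true" parameters
y          <- rbinom(n, size = 1,
                     prob = plogis(theta_star[1] + theta_star[2] * x))
dat        <- data.frame(x = x, y = y)

# Estimate theta_hat from the observed (x_i, y_i) pairs
fit       <- glm(y ~ x, family = binomial, data = dat)
theta_hat <- coef(fit)

# Predicted mean for future x: y_hat = f_{theta_hat}(x)
new_x <- data.frame(x = c(-1, 0, 1))
y_hat <- predict(fit, newdata = new_x, type = "response")

# The same thing by hand: inverse logit of the estimated linear predictor
manual <- plogis(theta_hat[1] + theta_hat[2] * new_x$x)
```

`type = "response"` returns predictions on the scale of the mean, i.e. probabilities here, rather than on the scale of the linear predictor.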

edited Oct 01 '15 at 21:33

answered Oct 01 '15 at 19:01

jlimahaverford

3,615

Hm, ok... So just thinking very practically (without going into how the coefficients are computed etc.), GLM is basically LM, but the result of LM is plugged into the inverse link function to get the mu?
And all the fancy math, like the link function needing to be continuously differentiable, is important only to understand how GLM can compute the coefficients (betas)?
– n_mu_sigma Oct 01 '15 at 19:27
I still have a hard time understanding what $\mu_i$ is. How is it different from $y_i$?
Regarding lm/glm similarity, I added some R code in the original post, which indicates it's kinda, sorta similar. Thanks.
– n_mu_sigma Oct 01 '15 at 21:17
@n_mu_sigma updated my post. Read the bottom, starting with "You have a bunch." Then look at that paper. It may not be the easiest read but it's very good. – jlimahaverford Oct 01 '15 at 21:31
@n_mu_sigma Consider rbinom(21, 1, p = 0.5). This gives you a vector of 1s and 0s. These values are $y_i$. The mean of a binomial distribution is $np$ where $n$ is number of trials, and $p$ is the per trial probability of success (getting a 1). As $n = 1$ here the mean is $p$, which is 0.5. This mean of the distribution is $\mu_i$. The $y_i$ observations are drawn from a binomial distribution with mean $\mu_i$. Assuming the number of trials $n$ is constant over all observations, then we can let $p$ depend on $x$ such that $p$ is not 0.5 for all observations but varies as a function of $x$ – Gavin Simpson Oct 01 '15 at 22:26
@GavinSimpson ok, I kinda get it. What I'm not sure I get correctly is this: if I have Y={0,1,1,1,0,1,0}, then $Y_1$=0, $Y_2$=1, $Y_3$=1... but how is $Y_i$ different from $\mu_i$? Y is not a matrix, e.g. Y={{1,0,1},{1,1,1}..}. If for each i there were multiple values I would understand the mu. But it's a single value, so how is it different from $Y_i$? – n_mu_sigma Oct 02 '15 at 16:47
@n_mu_sigma. Forget about GLM for a minute. Imagine I have a bunch of coins, and the probability that coin $i$ lands on heads is $\mu_i$. Then I flip each coin once, and if the $i^{th}$ coin is heads I record $y_i = 1$, otherwise $y_i = 0$. Can you understand the relationship? $y_i$ was drawn from a distribution with mean $\mu_i$. – jlimahaverford Oct 02 '15 at 17:10
@n_mu_sigma Note that $y_i$ takes values 0 or 1, only. But $\mu_i$ is some value, 0.5 say for a fair coin. $\mu_i$ could be any value in range 0-1, but the $y_i$ values will only ever be 0 or 1 - a head or a tail. We aim to recover the parameters of the distribution from which the $y_i$ were observed. Hence we model $\mu_i$, which for the binomial happens to be $np$. – Gavin Simpson Oct 02 '15 at 18:31
@GavinSimpson ok, I think I do finally get it! $\mu_i$ is the expected value of $y_i$ if multiple tests were performed under the same conditions (predictors). So if we have $\mu_i$=0.2, it simply means that if we perform the test 10 times, we expect on average to get two 1s and eight 0s. So yes, $y_i$ can only be 0 or 1, but there is a probability of how likely 1 is, and that's $\mu_i$, right? – n_mu_sigma Oct 03 '15 at 10:34
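
The coin picture from the comments can be sketched in R as follows. The three heads probabilities in `mu` are made up for illustration. A single flip per coin gives a $y_i$ that is 0 or 1, while the average over many repeated flips of the same coin approaches $\mu_i$.

```r
set.seed(5)

mu <- c(0.2, 0.5, 0.9)   # hypothetical heads probability for each coin

# One flip per coin: this is what a single observed y_i looks like
y_once <- rbinom(length(mu), size = 1, prob = mu)

# Many flips of each coin: the sample mean settles near mu_i
flips <- 10000
y_bar <- sapply(mu, function(p) mean(rbinom(flips, size = 1, prob = p)))

# Side-by-side view of the true means and the long-run averages
comparison <- round(cbind(mu = mu, y_bar = y_bar), 3)
```

In a real dataset only something like `y_once` is observed, usually once per predictor value; the GLM recovers the $\mu_i$ by assuming they vary smoothly with the predictors.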
