When fitting a GAM using, say, mgcv in R, one must specify the distribution and link function; the defaults are Gaussian and identity, respectively. I understand the link function (e.g., g in the first equation here), as the model definition explicitly includes it: I can point to it and say, "That's the link function". The distribution I find harder to understand.
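For concreteness, the form I have in mind is roughly the following. This is my own notation and may differ from the linked definition: $Y$ is the response, $x_1, \dots, x_m$ are the covariates, and the $f_j$ are the smooth functions.

```latex
g\big(\mathrm{E}(Y)\big) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m)
```

The $g$ on the left is what I can identify as the link; with the identity link it simply drops out. Nothing in this line names the distribution, which is part of why I find it harder to pin down.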
The equation defining a GAM here suggests to me that the distribution describes the distribution of the errors. I've seen that equation elsewhere, too. Nevertheless, a CV answer posted here takes exception to that, clearly stating:
You don't specify the "error" distribution, you specify the conditional distribution of the response.
Needless to say, I'm confused. So, my questions:
- What does the distribution actually refer to? Errors or conditional distribution of the response? (If the latter, what does the conditional part refer to, specifically?)
- How is the distribution used by a GAM? Is it an assumption that is used when fitting the GAM to data?
Any insights would be greatly appreciated.
By error, I meant exactly what you say (i.e., the difference between observed and true values), whereas I thought of the distribution of the response as representing the values that a variable can take. For example, if I'm measuring the heights of people, I wouldn't consider the distribution of heights to be the error distribution. But I am not clear on the correct terminology. #ImNotAStatistician
– Dan Sep 25 '17 at 14:44