6

So this has really been bothering me and I was hoping for a (simple!) explanation if possible.

Suppose I've specified a linear regression model: $$ Y = \beta_0 + \beta_1 X + \epsilon $$ And an alternative: $$ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon $$ And I'm trying to estimate the $\beta$s, say through OLS (I don't think the exact estimation method is relevant here).

My question is: what is the exact interpretation of the $\beta$s I am trying to estimate?

The confusion arises from the fact that the population values of $\beta_1$ under the two specifications are presumably different, and this doesn't square with my understanding of the population coefficients.

I had always interpreted the $\beta$s as the partial derivative of $Y$ with respect to $X$ 'in reality': that is, the change in the expected value of $Y$ if you were to change $X$ while holding the other regressors constant. By specifying a better and better model, one ensured that the estimate of $\beta_1$ became more and more accurate (by separating out correlated variables from the error term).

This was important to my understanding: $\beta_1$ was not contingent on the specification of my model (it remained an invariant feature of the population); rather, the estimator we had for $\beta_1$, call it $b_1$, changed and became more or less accurate depending on the model.

All well and good, but this interpretation doesn't quite work in the example above. Suppose that the relationship between $X$ and $Y$ is curvilinear. If you were restricted to including only $X$ and no higher-order terms, then presumably the $\beta_1$ that best describes the change in $E[Y]$ given a change in $X$ would be different from the one you would get if you allowed for higher-order polynomials (as in specification 2).

So say, for argument's sake, the DGP was $$ E[Y] = 1 + 10 X - 2 X^2 $$ where $0<X<2$ to keep the quadratic term from dominating. In this case, should the true value of $\beta_1$ in specification 1 be 10? Or, to best fit that DGP when $X^2$ is not included, should it be ~6?

It seems that if it is the latter, my understanding that the population coefficients do not depend on the specification goes up in smoke! Please help!
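
To make this concrete, here is a minimal simulation sketch of the two fits (assumptions: $X$ uniform on $(0,2)$, standard normal noise, and an arbitrary seed and sample size):

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed
n = 100_000
x = rng.uniform(0, 2, n)                          # X ~ Uniform(0, 2), per the DGP restriction
y = 1 + 10 * x - 2 * x**2 + rng.normal(size=n)    # the DGP plus standard normal noise

# Specification 1: regress Y on [1, X]
b1, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)

# Specification 2: regress Y on [1, X, X^2]
b2, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x, x**2]), y, rcond=None)

print(b1)   # the slope on X comes out near 6, not 10
print(b2)   # the coefficients come out near (1, 10, -2)
```

For $X$ uniform on $(0,2)$, the population least-squares slope under specification 1 works out to exactly 6, which is what the simulation recovers.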

  • You might want to investigate the use of orthogonal polynomials. – Glen_b Aug 21 '14 at 07:54
  • I don't think the rest of the message is showing: So say, for argument's sake, the DGP was $E[Y] = 1 + 10X - 2X^2$ with $0<X<2$ to ensure the polynomial doesn't influence too heavily. In this case should the true value of $\beta_1$ in specification 1 be 10? Or, to fit that DGP when $X^2$ is not specified, should it be ~6? It seems if it is the latter my understanding that the population coefficients do not depend on the specification goes up in smoke! Please help! – Sue Doh Nimh Aug 22 '14 at 13:50
  • I've edited to make the rest of the question show up. – conjugateprior Aug 22 '14 at 15:06

2 Answers

5

The problem is with this:

I had always interpreted the $\beta$s as the partial derivative of $Y$ with respect to $X$ 'in reality'

That's not always true in a model with interactions or various other forms of complexity.

Take a simpler example. Assume your model is $$ E[Y] = \beta_0 + \beta_1 X + \beta_2 Z + \beta_{12} XZ $$ Here the partial derivative of $E[Y]$ with respect to $X$ is $\beta_1 + \beta_{12} Z$. Put another way, $\beta_1$ is the partial derivative of $E[Y]$ with respect to $X$ only when $Z = 0$. Your model is a special case of this one (take $Z = X$, so that $X^2$ is the interaction of $X$ with itself).
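
If it helps to see it symbolically, here is a minimal sketch using sympy (the symbol names are just placeholders):

```python
import sympy as sp

# Symbols for the regressors and the coefficients of the interaction model
x, z, b0, b1, b2, b12 = sp.symbols("x z beta_0 beta_1 beta_2 beta_12")
EY = b0 + b1 * x + b2 * z + b12 * x * z

# Differentiate E[Y] with respect to x: the result depends on z
print(sp.diff(EY, x))   # beta_1 + beta_12*z
```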

The population marginal effect of $X$ (the partial derivative you're talking about) is indeed one of the things you're interested in modeling with this regression. But think of it as just a happy coincidence when this quantity corresponds to a particular model parameter. Generally speaking, it won't.
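
A quick simulation sketch makes the gap concrete (the coefficient values, distributions, and seed below are arbitrary assumptions, not taken from the question):

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
n = 50_000
x = rng.normal(size=n)
z = rng.normal(loc=2.0, size=n)          # Z deliberately centered away from 0
y = 1 + 3 * x + 0.5 * z + 2 * x * z + rng.normal(size=n)

design = np.column_stack([np.ones(n), x, z, x * z])
b, *_ = np.linalg.lstsq(design, y, rcond=None)

print(b[1])                    # ~3: the partial derivative of E[Y] wrt X, but only at Z = 0
print(b[1] + b[3] * z.mean())  # ~7: the average marginal effect of X in this population
```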

  • In that case would I be correct in saying it is the parameters rather than the model specification that are prior? (I.e. The parameters don't depend on the specification for their actual value?) – Sue Doh Nimh Aug 22 '14 at 13:42
  • More pragmatically, what is the correct value of $\beta_1$ in the example given? Is it 10 (as per the DGP) or ~6? (In case the rest of the post is not viewable, I'm referring to the example above: the DGP $E[Y] = 1 + 10X - 2X^2$ with $0<X<2$, and whether the true $\beta_1$ in specification 1 is 10 or ~6.) – Sue Doh Nimh Aug 22 '14 at 13:52
  • I'm not sure I understand the question, but perhaps this helps: $\beta_1$ in eqn 1 is the marginal effect on Y of a unit increase in X. $\beta_1$ in eqn 2 isn't. These identically named parameters are different things. Now, when the population is eqn 2 and the model is eqn 1, OLS will try to find the $\beta_1$ that gives the least squares best fit to the true marginal effect, but will not ever quite succeed. When the population is eqn 1 and the model is eqn 2, $\beta_1$ will eventually be estimated correctly and $\beta_2$ will be estimated to be zero. – conjugateprior Aug 22 '14 at 15:28
  • Notice also the following (unproblematic but nonintuitive) ambiguity with parameter identification. Assume $E[Y]$ is, in the population, quadratically related to $X$. There are a lot of ways to express that; here are three: a) eqn 2, b) an orthogonal polynomial over $X$ (a transformation of $X$ that is not just $[X, X^2]$), or c) a spline. The population may generate $X$ and $Y$ in one of these ways, but a model fitted in any of these ways will end up getting the right partial derivative / marginal effect relationship, because observation pairs cannot distinguish between them. – conjugateprior Aug 22 '14 at 15:34
  • I think I'm starting to see. The main thing I'm trying to grasp is how model specification alters what the real parameters we are 'looking for' actually are. I.e., we have Reality -> Model -> Data, and the question is where the parameters fall into place. So, for example, is this accurate: if I run a regression specifying a linear dependence, then the real parameter I'm trying to find (via whatever preferred estimation method) is the average marginal effect of a linear increase in $X$? Whereas if I specify $X, X^2$, I'm trying to find the average marginal effect of a non-linear increase in $X$? – Sue Doh Nimh Aug 23 '14 at 16:23
  • (So, broadly, are the values of the parameters set at the stage of Reality, or of the Model? And do we use Data to approximate Reality, or the Model?) – Sue Doh Nimh Aug 23 '14 at 16:28
  • On your first comment: you have to decide what you want your regression to do. Maybe you want a model of P(Y | X). Maybe you just want a model of E[Y] for various Xs. Maybe $\beta_1$ refers to some physical quantity you want to estimate using Y and X. Some of these things assume that there are $\beta$s in a population. Some don't. – conjugateprior Aug 23 '14 at 17:10
  • On your second comment: I try never to make pronouncements about Reality with a capital R. Least of all in stackexchange comments. – conjugateprior Aug 23 '14 at 17:12
4

Your understanding is correct--provided we look at the model in the right way.

Because the question concerns interpreting a predictive model, we may focus on its predictions and ignore the error term. The example is sufficiently general that we might as well address it directly, so consider a model of the form

$$Y = \beta_0 + \beta_1 X + \beta_2 X^2.$$

This can be viewed as the composition of two functions, $Y = g(f(X)),$ where

$$f:\mathbb{R}\to \mathbb{R}^3,\quad f(x) = (1, x, x^2)$$

and

$$g:\mathbb{R}^3\to \mathbb{R},\quad g((x,y,z)) = \beta_0 x + \beta_1 y + \beta_2 z = (\beta_0,\beta_1,\beta_2)(x,y,z)^\prime.$$

[Figure: the fitted plane $1 + 10y - 2z$, hypothetical data points, and the embedded curve $x \to (x, x^2)$]

This figure (which suppresses the unvarying first coordinate) depicts the graph of $1 + 10y - 2z$ as a blue planar surface, shows hypothetical data as red points, and plots the graph of $x\to (x, x^2)$ as a black curve. The points all lie along this curve and the planar surface, which is fit to the points, contains the curve. The following discussion distinguishes between moving about in the plane (which is described by the partial derivatives of $g$) and motion constrained to the curve (which is described by the partial derivatives of the composite function $g\circ f$.)

It is indeed the case that the betas are the partial derivatives of $g$ with respect to its arguments:

$$\beta_0 = \frac{\partial g}{\partial x},\ \beta_1 = \frac{\partial g}{\partial y},\ \beta_2 = \frac{\partial g}{\partial z},$$

all of which are constant (because $g$ is a linear transformation). In this sense, it is indeed correct to understand the betas as partial derivatives.

However, the partial derivatives of $Y$ with respect to $X$ are obtained via the Chain Rule from those of $g$ and those of $f$:

$$\frac{\partial Y}{\partial X}(X) = Dg(f(X))\,Df(X) = (\beta_0, \beta_1, \beta_2) (0,1,2X)^\prime = \beta_1 + 2\beta_2 X.$$
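
A numerical check of this computation, as a minimal sketch (the betas are taken from the example DGP; the evaluation point is arbitrary):

```python
import numpy as np

beta = np.array([1.0, 10.0, -2.0])    # (beta_0, beta_1, beta_2), from the example DGP

def f(x):
    """f: R -> R^3, the nonlinear part of the model."""
    return np.array([1.0, x, x**2])

def g(v):
    """g: R^3 -> R, the linear part whose coefficients regression estimates."""
    return beta @ v

x0 = 0.5                              # arbitrary evaluation point
h = 1e-6
numeric = (g(f(x0 + h)) - g(f(x0 - h))) / (2 * h)   # central-difference derivative of g∘f
analytic = beta[1] + 2 * beta[2] * x0               # beta_1 + 2*beta_2*X from the Chain Rule
print(numeric, analytic)              # both ~8.0
```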

The function $f$ captures the fact that the three variables in the model--the constant, $X$, and $X^2$--are not functionally independent: the third is determined by the second. This lack of independence means that $X$ and $X^2$ cannot be changed separately, the way unrelated variables $X$ and $Z$ could be changed in a model of the form $Y = \beta_0 + \beta_1 X + \beta_2 Z$. In general, this is exactly what it means for any model to be "curvilinear."

In practice, $f$ is realized by the dataset itself: a separate column of values $X^2$ has to be created (either explicitly by the user or internally in response to a nonlinear model formula) out of other data columns, in this case that of $X$. The function $g$--specifically, its coefficients $(\beta_0,\beta_1,\beta_2)$--is what least squares regression estimates. By separating the nonlinear behavior ($f$) from the linear behavior ($g$) in this fashion, least squares techniques can fit nonlinear functional forms.
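
As a sketch with made-up data points, the two steps look like this in code:

```python
import numpy as np

x = np.array([0.2, 0.5, 1.0, 1.5, 1.8])    # hypothetical observed values of X
y = 1 + 10 * x - 2 * x**2                   # noiseless responses from the example

# Step f: the dataset realizes x -> (1, x, x^2) as explicit columns.
design = np.column_stack([np.ones_like(x), x, x**2])

# Step g: least squares estimates only the coefficients of the linear map.
betas, *_ = np.linalg.lstsq(design, y, rcond=None)
print(betas)   # ~(1, 10, -2): the quadratic is recovered by a purely linear solver
```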

Only by considering these two aspects of the model--$f$ and $g$--can the coefficients be properly and fully interpreted.

– whuber