
This question arose from another question.

Does anyone have worked examples of an OLS problem where the observations are not a linear function of $x_i$? E.g. $y_i = \alpha + \sin(x_i) + \epsilon_i$

I tried to find the least squares estimator for the coefficient of $x$ by differentiating $(y_i - \alpha - \sin(x_i))^2$ with respect to $x$ (I'm not even sure that is the correct thing to do) and ended up with a complicated expression involving $\sin$ and $\cos$. So I thought I would ask how to do such questions before proceeding further.

I am interested in an answer because I want to see that $E(y_i - \hat{y}_i)$ is not always equal to 0 when $y_i$ is not a linear function of $x_i$. Also, is there a more general way to show that this statement is true, i.e. without actually finding the OLS estimators?

EDIT 1:

Forgot to include an error term in the example.

EDIT 2:

The model I am trying to fit is $y_i = \sin(x_i) + \epsilon_i$.

For a certain data set, R gives $\hat{y} = 0.60330 + 0.01797x$:

    set.seed(1234)
    n <- 5
    df <- data.frame(x=runif(n, 1, 10))
    df$mean.y.given.x <- sin(df$x)
    df$y <- df$mean.y.given.x + rnorm(n)
    model <- lm(y ~ x, data=df)
    summary(model)

 Call:
 lm(formula = y ~ x, data = df)

 Residuals:
     1       2       3       4       5 
 0.6190 -1.1402 -0.4852 -0.2877  1.2941 

 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
 (Intercept)  0.60330    1.45534   0.415    0.706
 x            0.01797    0.22460   0.080    0.941

 Residual standard error: 1.107 on 3 degrees of freedom
 Multiple R-squared:  0.00213,  Adjusted R-squared:  -0.3305 
 F-statistic: 0.006404 on 1 and 3 DF,  p-value: 0.9413

I would now like to obtain $\hat{y}$ by hand. However, all the questions I have done thus far are of the form $y_i = \alpha + \beta x_i$, where $\alpha$ and $\beta$ are constants, so that $y_i$ is a linear function of $x_i$. Therefore, I am unsure how to proceed when $y_i$ is not a linear function of $x_i$, hence my request for worked examples of these kinds of questions.
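For reference, the straight-line fit above is ordinary OLS on $x$ regardless of the true mean function, so the "by hand" version is just the usual closed-form formulas. A sketch, reusing `df` from the code above (the variable names here are illustrative):

    # Sketch: closed-form OLS estimates for lm(y ~ x), computed by hand.
    # The n-1 denominators in cov() and var() cancel in the ratio.
    beta.hat  <- cov(df$x, df$y) / var(df$x)         # slope; should match 0.01797
    alpha.hat <- mean(df$y) - beta.hat * mean(df$x)  # intercept; should match 0.60330
    y.hat     <- alpha.hat + beta.hat * df$x         # fitted values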

After that, I expect to be able to show that $E(y_i-\hat{y}_i) \neq 0$, because the plot of the residuals against the predicted values (for large $n$) looks like this:

    set.seed(1234)
    n <- 1000
    df <- data.frame(x=runif(n, 1, 10))
    df$mean.y.given.x <- sin(df$x)
    df$y <- df$mean.y.given.x + rnorm(n)
    model <- lm(y ~ x, data=df)
    plot(predict(model, newdata=df), residuals(model))
    abline(a=0, b=0, col='blue')

[Plot: residuals against predicted values, with a horizontal reference line at zero.]

Based on the above plot, $E(y_i-\hat{y}_i) \neq 0$ must be true in this case, right?
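A quick numerical check of what the plot suggests (a sketch, reusing `model` and `df` from the code above): the overall mean of the residuals is exactly zero by construction, since the fit includes an intercept, but the mean residual within bins of $x$ is clearly not.

    # Sketch: overall mean residual vs. mean residual within bins of x.
    mean(residuals(model))                        # essentially zero
    tapply(residuals(model), cut(df$x, 5), mean)  # visibly non-zero per bin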

mauna
  • Could it be that you actually want to fit the model $y = \alpha + \beta \sin(x) + \epsilon$? If so, note that the model is still linear in the parameters, i.e. there are only $\alpha$ and $\beta$, and for instance no weird function of $\beta$. The non-linearity in $x$ is not really a problem. Also note that, strictly speaking, the model is linear in $\sin(x)$. – coffeinjunky Jun 01 '14 at 21:07

1 Answer


You don't really believe that your data ($y_i$) are exactly equal to $\alpha+\sin(x_i)$, do you? Your equation is missing an important term (the error), and how you write it matters for whether you can do least squares or not.

If your model is $y_i = \alpha + \sin(x_i)+\epsilon_i$, then it would certainly make sense to estimate $\alpha$ by ordinary least squares. The important thing for least squares is that the model is linear in the parameters, which it is.

Let $y^*_i = y_i-\sin(x_i)$; then your model becomes $y^*_i = \alpha+\epsilon_i$, and the LS estimate is

$$\hat{\alpha}= \bar{y^*} = \frac{1}{n}\sum_{i=1}^n (y_i-\sin(x_i))$$

and so

$$\hat{y}_i=\hat{\alpha}+\sin(x_i)$$

That $E(y_i-\hat{y}_i) =0$ is pretty straightforward from there.
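A quick numerical check of this (a sketch with simulated data; the true $\alpha = 2$ is made up for illustration, and `lm` can fit the same model by treating $\sin(x)$ as a fixed offset):

    # Sketch: estimate alpha by hand as the mean of y_i - sin(x_i),
    # then confirm against lm() with sin(x) as an offset (coefficient fixed at 1).
    set.seed(1234)
    n <- 1000
    x <- runif(n, 1, 10)
    y <- 2 + sin(x) + rnorm(n)                # true alpha = 2 (made up)

    alpha.hat <- mean(y - sin(x))             # LS estimate of alpha
    y.hat     <- alpha.hat + sin(x)           # fitted values
    mean(y - y.hat)                           # exactly 0 in-sample

    fit <- lm(y ~ 1, offset = sin(x))
    all.equal(unname(coef(fit)), alpha.hat)   # TRUE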

Glen_b
  • It seems that you are regressing $y$ against $\sin x$. Could you please also show how to regress $y$ against $x$? – mauna May 31 '14 at 14:39
  • I am not regressing $y$ against $\sin(x)$. In your model, the term $\sin(x)$ has coefficient 1, so it's an offset, not a regressor. Can you clarify what model you want to fit and what difficulty you have? – Glen_b Jun 01 '14 at 00:05
  • I've made an edit to my question to clarify the difficulty I am facing. – mauna Jun 01 '14 at 20:46
  • The model that you fitted in R isn't the model that you specified in LaTeX – Glen_b Jun 01 '14 at 23:56
  • If that's the case, does this mean that the example provided in this answer: http://stats.stackexchange.com/questions/100597/help-clarify-the-implication-of-linearity-in-an-ordinary-least-squares-ols-reg/100605#100605 is incorrect too (the example where the linearity assumption is violated)? – mauna Jun 02 '14 at 07:18
  • No, that's okay, if less clear than it could be. The example there is making a point about fitting the wrong model, so the example is correctly illustrating fitting the wrong model. The fitted model is linear-in-x, and the assumption that the actual relationship is linear-in-x is violated. Linearity in parameters is equivalent to linearity in the predictor columns included in the model. Consider instead $y=\alpha+\beta \sin(x)+\epsilon$; it's linear in $\sin(x)$ but not linear in $x$. If you regress on $x$, the assumption that the model is of the form $y=\alpha+\beta x+\epsilon$ is violated... – Glen_b Jun 02 '14 at 09:05
  • (ctd) ... and you'll see that in a residual plot. – Glen_b Jun 02 '14 at 09:08
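To illustrate the point in the last two comments, a minimal sketch (simulated data along the lines of the question's setup; the coefficients 1 and 2 are made up): regressing on $x$ leaves an obvious pattern in the residuals, while regressing on $\sin(x)$ does not.

    # Sketch: residuals from the misspecified fit lm(y ~ x) versus the
    # correctly specified (linear-in-parameters) fit lm(y ~ sin(x)).
    set.seed(1234)
    n <- 1000
    x <- runif(n, 1, 10)
    y <- 1 + 2 * sin(x) + rnorm(n)   # true model: y = alpha + beta*sin(x) + e

    fit.wrong <- lm(y ~ x)           # assumes linearity in x: violated
    fit.right <- lm(y ~ sin(x))      # linear in sin(x): correctly specified

    par(mfrow = c(1, 2))
    plot(fitted(fit.wrong), residuals(fit.wrong), main = "y ~ x")
    abline(h = 0, col = "blue")
    plot(fitted(fit.right), residuals(fit.right), main = "y ~ sin(x)")
    abline(h = 0, col = "blue")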