
This question arose from another question.

Does anyone have worked examples of an OLS problem where the observations are not a linear function of $x_i$? E.g. $y_i = \alpha + \sin(x_i) + \epsilon_i$

I tried to find the least squares estimator for the coefficient of $x$ by differentiating $(y_i - \alpha - \sin(x_i))^2$ with respect to $x$ (I'm not even sure that is the correct thing to do) and ended up with a complicated expression involving $\sin$ and $\cos$. So I thought I would ask how to do such questions before proceeding further.

I am interested in an answer because I want to see that $E(y_i - \hat{y}_i)$ is not always equal to 0 when $y_i$ is not a linear function of $x_i$. Also, is there a more general way to show that this statement is true, i.e. without actually finding the OLS estimators?

EDIT 1:

Forgot to include an error term in the example.

EDIT 2:

The model I am trying to fit is $y_i = \sin(x_i) + \epsilon_i$.

For a certain data set, R gives $\hat{y} = 0.60330 + 0.01797x$:

    set.seed(1234)
    n <- 5
    df <- data.frame(x=runif(n, 1, 10))
    df$mean.y.given.x <- sin(df$x)
    df$y <- df$mean.y.given.x + rnorm(n)
    model <- lm(y ~ x, data=df)
    summary(model)

 Call:
 lm(formula = y ~ x, data = df)

 Residuals:
     1       2       3       4       5 
 0.6190 -1.1402 -0.4852 -0.2877  1.2941 

 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
 (Intercept)  0.60330    1.45534   0.415    0.706
 x            0.01797    0.22460   0.080    0.941

 Residual standard error: 1.107 on 3 degrees of freedom
 Multiple R-squared:  0.00213,  Adjusted R-squared:  -0.3305 
 F-statistic: 0.006404 on 1 and 3 DF,  p-value: 0.9413

I would now like to obtain $\hat{y}$ by hand. However, all the questions I have done thus far are of the form $y_i = \alpha + \beta x_i$, where $\alpha$ and $\beta$ are constants, so that $y_i$ is a linear function of $x_i$. Therefore, I am unsure how to proceed when $y_i$ is not a linear function of $x_i$, hence my request for worked examples of these kinds of questions.
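For reference, the straight-line fit above is ordinary OLS on $x$ regardless of the true mean function, so the "by hand" version is just the usual closed-form formulas. A sketch, reusing `df` from the code above (the variable names here are illustrative):

    # Sketch: closed-form OLS estimates for lm(y ~ x), computed by hand.
    # The n-1 denominators in cov() and var() cancel in the ratio.
    beta.hat  <- cov(df$x, df$y) / var(df$x)         # slope; should match 0.01797
    alpha.hat <- mean(df$y) - beta.hat * mean(df$x)  # intercept; should match 0.60330
    y.hat     <- alpha.hat + beta.hat * df$x         # fitted values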

After that, I expect to be able to show that $E(y_i-\hat{y}_i) \neq 0$, because the plot of the residuals against the predicted values (for large $n$) looks like this:

    set.seed(1234)
    n <- 1000
    df <- data.frame(x=runif(n, 1, 10))
    df$mean.y.given.x <- sin(df$x)
    df$y <- df$mean.y.given.x + rnorm(n)
    model <- lm(y ~ x, data=df)
    plot(predict(model, newdata=df), residuals(model))
    abline(a=0, b=0, col='blue')

[Plot: residuals against predicted values, with a horizontal reference line at zero.]

Based on the above plot, $E(y_i-\hat{y}_i) \neq 0$ must be true in this case, right?
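A quick numerical check of what the plot suggests (a sketch, reusing `model` and `df` from the code above): the overall mean of the residuals is exactly zero by construction, since the fit includes an intercept, but the mean residual within bins of $x$ is clearly not.

    # Sketch: overall mean residual vs. mean residual within bins of x.
    mean(residuals(model))                        # essentially zero
    tapply(residuals(model), cut(df$x, 5), mean)  # visibly non-zero per bin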

mauna
  • Could it be that you actually want to fit the model $y = \alpha + \beta \sin(x) + \epsilon$? If so, note that the model is still linear in the parameters, i.e. there are only $\alpha$ and $\beta$, and for instance no weird function of $\beta$. The non-linearity in $x$ is not really a problem. Also note that, strictly speaking, the model is linear in $\sin(x)$. – coffeinjunky Jun 01 '14 at 21:07

1 Answer


You don't really believe that your data ($y_i$) are exactly equal to $\alpha+\sin(x_i)$, do you? Your equation is missing an important term (the error), and how you write it matters for whether you can do least squares or not.

If your model is $y_i = \alpha + \sin(x_i)+\epsilon_i$, then it would certainly make sense to estimate $\alpha$ by ordinary least squares. The important thing for least squares is that the model is linear in the parameters, which it is.

Let $y^*_i = y_i-\sin(x_i)$; then your model becomes $y^*_i = \alpha+\epsilon_i$, and the LS estimate is

$$\hat{\alpha}= \bar{y^*} = \frac{1}{n}\sum_{i=1}^n (y_i-\sin(x_i))$$

and so

$$\hat{y}_i=\hat{\alpha}+\sin(x_i)$$

That $E(y_i-\hat{y}_i) =0$ is pretty straightforward from there.
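A quick numerical check of this (a sketch with simulated data; the true $\alpha = 2$ is made up for illustration, and `lm` can fit the same model by treating $\sin(x)$ as a fixed offset):

    # Sketch: estimate alpha by hand as the mean of y_i - sin(x_i),
    # then confirm against lm() with sin(x) as an offset (coefficient fixed at 1).
    set.seed(1234)
    n <- 1000
    x <- runif(n, 1, 10)
    y <- 2 + sin(x) + rnorm(n)                # true alpha = 2 (made up)

    alpha.hat <- mean(y - sin(x))             # LS estimate of alpha
    y.hat     <- alpha.hat + sin(x)           # fitted values
    mean(y - y.hat)                           # exactly 0 in-sample

    fit <- lm(y ~ 1, offset = sin(x))
    all.equal(unname(coef(fit)), alpha.hat)   # TRUE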

Glen_b
  • It seems that you are regressing $y$ against $\sin x$. Could you please also show how to regress $y$ against $x$? – mauna May 31 '14 at 14:39
  • I am not regressing $y$ against $\sin(x)$. In your model, the term $\sin(x)$ has coefficient 1, so it's an offset, not a regressor. Can you clarify what model you want to fit and what difficulty you have? – Glen_b Jun 01 '14 at 00:05
  • I've made an edit to my question to clarify the difficulty I am facing. – mauna Jun 01 '14 at 20:46
  • The model that you fitted in R isn't the model that you specified in LaTeX – Glen_b Jun 01 '14 at 23:56
  • If that's the case, does this mean that the example provided in this answer: http://stats.stackexchange.com/questions/100597/help-clarify-the-implication-of-linearity-in-an-ordinary-least-squares-ols-reg/100605#100605 is incorrect too (the example where the linearity assumption is violated)? – mauna Jun 02 '14 at 07:18
  • No, that's okay, if less clear than it could be. The example there is making a point about fitting the wrong model, so the example is correctly illustrating fitting the wrong model. The fitted model is linear-in-x, and the assumption that the actual relationship is linear-in-x is violated. Linearity in parameters is equivalent to linearity in the predictor columns included in the model. Consider instead $y=\alpha+\beta \sin(x)+\epsilon$; it's linear in $\sin(x)$ but not linear in $x$. If you regress on $x$, the assumption that the model is of the form $y=\alpha+\beta x+\epsilon$ is violated... – Glen_b Jun 02 '14 at 09:05
  • (ctd) ... and you'll see that in a residual plot. – Glen_b Jun 02 '14 at 09:08
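To illustrate the point in the last two comments, a minimal sketch (simulated data along the lines of the question's setup; the coefficients 1 and 2 are made up): regressing on $x$ leaves an obvious pattern in the residuals, while regressing on $\sin(x)$ does not.

    # Sketch: residuals from the misspecified fit lm(y ~ x) versus the
    # correctly specified (linear-in-parameters) fit lm(y ~ sin(x)).
    set.seed(1234)
    n <- 1000
    x <- runif(n, 1, 10)
    y <- 1 + 2 * sin(x) + rnorm(n)   # true model: y = alpha + beta*sin(x) + e

    fit.wrong <- lm(y ~ x)           # assumes linearity in x: violated
    fit.right <- lm(y ~ sin(x))      # linear in sin(x): correctly specified

    par(mfrow = c(1, 2))
    plot(fitted(fit.wrong), residuals(fit.wrong), main = "y ~ x")
    abline(h = 0, col = "blue")
    plot(fitted(fit.right), residuals(fit.right), main = "y ~ sin(x)")
    abline(h = 0, col = "blue")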