It comes down to how you judge the quality of your model. The approach most would agree on is that a good prediction model minimizes the unexplained portion, i.e. the errors (observed minus predicted values). You could define a model that minimizes these errors overall, or you could define a model that minimizes the sum of squared errors ($\hat{\epsilon}_i$) overall: $\sum_{i=1}^N\hat{\epsilon}_i^2 \rightarrow \text{min}$. This last version is the least squares method and, if all assumptions are met, it will give you the best linear unbiased estimator (instead of, e.g., your means ratio).
Basically, taking the average of house price per square meter will not minimize your prediction error, as it cannot accommodate large departures from the average price per square meter. Only least squares, i.e. the minimal sum of all squared deviations of the predicted from the observed values, yields the line that fits your data cloud best.
For a minimal example in R, consider this:
hp = c(500, 750, 800, 900, 1000, 1000, 1100)
sm = c(100, 120, 130, 130, 150, 160, 165)
with house prices (hp) and square meters (sm).
When plotting, you obtain a figure in which increasing sm goes hand in hand with increasing hp:
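For instance, with base-R graphics (a minimal sketch; the axis labels are my own choice):
plot(sm, hp, xlab = "square meters (sm)", ylab = "house price (hp)")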

Now, you could do what you suggested:
apsm = mean(hp/sm)
That is, you divide each hp by its sm and take the average to obtain the average price per square meter (apsm).
To predict the house prices, you could obtain a vector of predicted values pred ($\hat{hp}$):
pred = apsm*sm
Your predicted line now looks like this:
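One way to draw this line yourself (a sketch; the means-ratio line passes through the origin with slope apsm):
plot(sm, hp, xlab = "square meters (sm)", ylab = "house price (hp)")
abline(a = 0, b = apsm)  # prediction line implied by the mean price per square meter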

The problem with this line is that it does not minimize the error (error = hp - pred). Or, to be more precise, it does not minimize the sum of all your squared errors.
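You can compute this quantity directly (sse_ratio is just a name chosen for this sketch):
error = hp - pred         # deviations of observed from predicted prices
sse_ratio = sum(error^2)  # sum of squared errors of the means-ratio line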
If you were to run a linear model, e.g.
lm(hp ~ sm)
your fitted line (red) would be different and, under the standard assumptions, efficient and unbiased:
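To see the difference in numbers, you could fit the model, add the red line to the plot, and compare the two sums of squared errors (fit and sse_ols are names I made up for this sketch):
fit = lm(hp ~ sm)                # least squares fit
abline(fit, col = "red")         # add the fitted (red) line to the existing plot
sse_ols = sum(residuals(fit)^2)  # sum of squared errors of the OLS fit
c(ratio = sse_ratio, ols = sse_ols)  # the OLS sum is never larger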
