How Do Researchers Measure "Non-Linear Effects" in Statistics?

Question

I am an MBA Student taking courses in Statistics.

In our courses, we are learning about Regression Models and how to interpret the meaning of the parameters/coefficients in a Regression Model.

For example, when studying "vanilla" Regression Models (i.e. Simple Linear Regression/ Multiple Linear Regression), we are told that the regression coefficients represent a "unit change effect in the response variable". For example, if you build a Regression Model where the independent variables are "age and weight", and the dependent variable is "salary" - the regression coefficients (provided they are "Statistically Significant") could answer questions such as "the effect of every additional kilogram on salary". It would appear that these effects are always "linear" - the effect of weight on salary keeps getting bigger as the weight of an individual increases.

Given this, I had the following question. Suppose in some imaginary universe, very skinny people and very fat people earn a lot of money - but people with average weights do not earn a lot of money. Conceptually, this would appear to be a "non-linear" effect (e.g. imagine a graph between weight and salary - it would appear like a "U shape"). If I were to fit a Regression Model to data from this universe, the coefficients of this Regression Model would tell me that on average: as the weight of an individual increases by one unit, the average effect on the salary of an individual is 0.83 (for example).

But this is not the case - we know that in this example, there is a non-linear relationship between the dependent variable and the independent variable. But it seems to me that the Regression Model would still "insist" that there is a linear relationship.

Are there any types of Regression Models that can address this issue? For example, could something like a Polynomial Regression (e.g. https://en.wikipedia.org/wiki/Polynomial_regression) address this issue? What kind of Regression Models could help me specifically recover the non-linear effect of an independent variable (e.g. weight) on the dependent variable (e.g. salary)? It seems to me that a Polynomial Regression Model might fit such a dataset better, but I am not sure if it would address my question on the interpretation of regression coefficients. Is there some type of Regression Model in which the coefficient for weight would tell me that:

between 0 to 100 lbs, the effect of weight on salary is 0.91
between 101 to 200 lbs, the effect of weight on salary is 0.34
between 201 lbs to 250 lbs, the effect of weight on salary is 0.86

I think I could split the data into 3 groups of people based on arbitrary weight ranges and fit a Regression Model to each one of these groups - and "hope" for linear relationship effects between weight and salary within each group. But this would result in a "discrete and chunky" interpretation (how many groups should I make: 3? 4? 5? 6?), whereas I was hoping for a "continuous and smooth relationship" interpretation (e.g. a mathematical function that shows a smooth, continuous relationship between a unit increase in the independent variable and the dependent variable). I was wondering - is this possible to do within a single model?

Do you expect your stated function at the end of your question to be continuous, but have a 'bend' at/just after 100 and 200 pounds? — Glen_b, Oct 14 '22 at 02:37
I would use GAMs and see the shape of the associated $\beta(x_i)$. — usεr11852, Oct 14 '22 at 02:41
You might benefit from reading "Do statisticians assume one can't over-water a plant, or am I just using the wrong search terms for curvilinear regression?" — kjetil b halvorsen, Oct 14 '22 at 12:03

Dave · Answer 1 · 2022-10-14T12:40:34.090

Are there any types of Regression Models that can address this issue?

Absolutely! This is possible without even venturing outside of linear regression (remember that linear regression refers to linearity in the parameters).

A relationship like you describe kind of looks like a cubic function. Consequently, you might want to fit a regression model like $y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\beta_3x_i^3+\epsilon_i$. A more sophisticated approach would use orthogonal polynomials (which can be more numerically stable). An arguably more sophisticated approach still would model this relationship with splines that discover its shape, rather than forcing the relationship to be cubic.
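To make the "linear in the parameters" point concrete, here is a minimal sketch of fitting that cubic with ordinary least squares. The data, the coefficient values, and the variable names are all made up for illustration; only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from a known cubic plus a little noise.
x = np.linspace(-2.0, 2.0, 200)
true_beta = np.array([1.0, -2.0, 0.5, 1.5])  # beta_0 .. beta_3 (made up)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2 + true_beta[3] * x**3
y = y + rng.normal(scale=0.1, size=x.size)

# Design matrix with columns 1, x, x^2, x^3.
# The model is nonlinear in x but linear in the betas.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares, exactly as in a "vanilla" linear regression.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The only change from a straight-line fit is the extra columns in the design matrix; the estimation step is untouched.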

The interpretation is harder, however. In a linear regression with just linear variables, such as $\hat y_i=\hat\beta_0+\hat\beta_1x_i$, the change in $y$ for a one-unit change in $x$ is just the derivative, so $\dfrac{d\hat y_i}{dx} = \hat\beta_1$.

When you apply the same logic to a model with nonlinear features, the interpretation gets murkier: for $\hat y_i=\hat\beta_0+\hat\beta_1x_i+\hat\beta_2x_i^2$, $\dfrac{d\hat y_i}{dx} = \hat\beta_1+2\hat\beta_2x_i$. That is, the derivative depends on the value from which you begin. The change in $\hat y_i$ is not the same moving from $x=0$ to $x=1$ as it is moving from $x=1$ to $x=2$.
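A small numeric sketch of that derivative, using made-up coefficients (the values of `b0`, `b1`, `b2` below are purely illustrative, not estimates from any data):

```python
def predict(x, b0, b1, b2):
    """Fitted value of the quadratic model at x."""
    return b0 + b1 * x + b2 * x**2


def marginal_effect(x, b1, b2):
    """Derivative of the fitted quadratic with respect to x."""
    return b1 + 2.0 * b2 * x


# Hypothetical coefficients, chosen so the curve is U-shaped with its minimum at x = 100.
b0, b1, b2 = 500.0, -4.0, 0.02

for x in (50.0, 100.0, 150.0):
    print(x, marginal_effect(x, b1, b2))

# The change for a one-unit step also depends on where the step starts.
step_0_to_1 = predict(1.0, b0, b1, b2) - predict(0.0, b0, b1, b2)
step_1_to_2 = predict(2.0, b0, b1, b2) - predict(1.0, b0, b1, b2)
print(step_0_to_1, step_1_to_2)
```

With these numbers the slope is negative at $x=50$, essentially zero at $x=100$, and positive at $x=150$, which is the kind of continuous, position-dependent "effect" the question asks for.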

Reality sometimes works this way, and a concern of mine is that people force their data into models with linear features because they want the easy interpretation, ignoring the fact that the phenomenon under study might not work that way. For instance, if you were studying a phenomenon that behaved as you described in the question and forced it into a model with only linear features, it would miss the dynamics.

Regarding cubics and splines not being lines, the “MathematicalMonk” YouTube channel (by Jeffrey Miller) has a great video about why linear regression is more than just lines and planes.
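As one illustration of a spline fitted by plain least squares, here is a sketch of a linear spline (a continuous line with bends at chosen knots). The knots at 100 and 200, the hinge-function basis, and the coefficient values are assumptions made for this example, and real splines are usually smoother (e.g. cubic).

```python
import numpy as np


def hinge_basis(x, knots):
    """Columns: 1, x, and max(x - k, 0) for each knot k."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)


knots = (100.0, 200.0)  # assumed bend points
x = np.linspace(0.0, 250.0, 251)
X = hinge_basis(x, knots)

# Made-up truth: slope 0.9 below 100, 0.3 between 100 and 200, 0.8 above 200.
true_coef = np.array([10.0, 0.9, -0.6, 0.5])
y = X @ true_coef  # noiseless, to keep the illustration exact

coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Each hinge coefficient is the change in slope at its knot,
# so cumulative sums give the slope within each segment.
segment_slopes = np.cumsum(coef_hat[1:])
print(segment_slopes)
```

This is still a single linear regression: one model and one set of coefficients, but a different slope in each weight range and no jump at the knots.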

score 2 · Answer 2 · answered Oct 14 '22 at 20:01

Dave's answer addresses your final question well. I wanted to write a few words on your first bold sentence:

But it seems to me that the Regression Model would still "insist" that there is a linear relationship.

This is not a good way to think about what a Regression Model actually does. It is rather the other way around. You tell the model that there is a linear relationship, and the computation takes that as a given and tries to find the best possible linear fit. Even if the underlying relationship is very non-linear, there is still some linear relationship that is the least bad model for your data, and that is what you will find. It will be a very bad approximation of your data, but that is what you gave the model as input.
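A quick sketch of this "least bad line" on U-shaped data. The data are made up (a noiseless parabola centred at 150), and `np.polyfit` is used only as a convenient least-squares routine.

```python
import numpy as np

# Hypothetical U-shaped relationship: high y at both extremes of x, low in the middle.
x = np.linspace(100.0, 200.0, 101)
y = (x - 150.0) ** 2

# Ask for a straight line; least squares returns the best one available.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

Because the made-up data are symmetric around 150, the fitted slope is essentially zero and the line sits at the mean of $y$. The fit reports "no effect of $x$" even though $y$ depends entirely on $x$.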

You first need to decide what kind of relationship you expect and then fit a suitable regression model, maybe a linear one, maybe a polynomial, or whatever else you think best fits the relationship you are trying to understand. Then do the computation to get the coefficients. Then use plots or measures like $R^2$ to find out whether your initial expectation was any good.
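A minimal sketch of that last checking step, comparing $R^2$ for a straight line and a quadratic on the same made-up U-shaped data (the data, the noise level, and the helper `r_squared` are assumptions for illustration):

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(1)

# Hypothetical noisy U-shaped data.
x = np.linspace(100.0, 200.0, 101)
y = (x - 150.0) ** 2 + rng.normal(scale=50.0, size=x.size)

r2 = {}
for degree in (1, 2):
    coefs = np.polyfit(x, y, deg=degree)
    r2[degree] = r_squared(y, np.polyval(coefs, x))

print(r2)
```

Here the straight line explains almost none of the variation while the quadratic explains nearly all of it. In-sample $R^2$ never decreases when terms are added, so a residual plot or held-out data is a sensible companion check.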
