4

I'm running some regressions with a set of somewhat correlated predictors. Let's call these predictors $x$, $y$ and $z$, and my dependent variable $d$.

I'm focused on the effect of $x$ on $d$.

I first calculate $R^2_x$ by running a regression using only $x$ as a predictor.

I can also calculate $R^2_{full}$ by running a regression using $x$, $y$ and $z$ as predictors.

I calculate $R^2_{reduced}$ by running a regression using only $y$ and $z$ as predictors.

I calculate $\Delta R^2$ as $R^2_{full} - R^2_{reduced}$.
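For concreteness, here is a minimal sketch of the computation I'm describing, written in Python with plain least squares; the data-generating process below is just a made-up placeholder, not my actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up correlated predictors and response (placeholder for my real data)
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
z = 0.5 * x + 0.5 * y + rng.normal(size=n)
d = x - y + rng.normal(size=n)

def r2(dep, *predictors):
    """R^2 of an OLS regression of dep on the given predictors plus an intercept."""
    X = np.column_stack([np.ones_like(dep), *predictors])
    coef, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ coef
    dc = dep - dep.mean()
    return 1 - (resid @ resid) / (dc @ dc)

R2_x = r2(d, x)             # x alone
R2_full = r2(d, x, y, z)    # full model
R2_reduced = r2(d, y, z)    # y and z only
delta_R2 = R2_full - R2_reduced
print(R2_x, delta_R2)
```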

My question is: shouldn't $R^2_x$ always be greater than $\Delta R^2$? I would think so, since in the full model the predictive power of $x$ is reduced by the inclusion of the correlated predictors $y$ and $z$. However, I am observing many instances of $\Delta R^2 > R^2_x$. Is this to be expected? Thanks!

  • I've given an answer for the simple 2-explanatory-variable case here: https://stats.stackexchange.com/q/597361/341520. Ben has given a better answer, but maybe mine is more approachable. – Lukas Lohse Feb 10 '23 at 20:29

3 Answers

3

Not always --- the contrary phenomenon is called "enhancement"

Firstly, you are right that there is a certain natural intuition to the supposition that the overall coefficient of determination for a set of explanatory variables should be no greater than the sum of its parts. However, it turns out that this is not true --- there are situations in regression analysis where the coefficient of determination of a joint model can be greater than the sum of the coefficients of determination for the component models. This occurs due to a regression phenomenon called "enhancement".

This phenomenon is discussed in Cuadras (1993) and is analysed further within a broader geometric analysis of regression in O'Neill (2021) (see esp. pp. 14-16). Following the analysis in these papers, suppose you have a linear regression model with $m$ explanatory variables and coefficient of determination $R^2$, and suppose that $R_1^2,...,R_m^2$ are the corresponding coefficients of determination for the individual models. Furthermore, let $S_1,...,S_m$ denote the sample correlations between the principal components (of the explanatory vectors) and the response vector. Then the difference between the coefficient of determination for the joint model and the sum of the coefficients of determination for the component models can be written as:

$$\text{Difference} = R^2 - \sum R_k^2 = \sum_{k=1}^m (1-\lambda_k) S_k^2,$$

where $\lambda_1,...,\lambda_m$ are the eigenvalues of the $m \times m$ correlation matrix of the explanatory variables. Now, in the case where a small eigenvalue $0 < \lambda_k < 1$ is coupled with a high absolute correlation $|S_k|$ between the corresponding principal component and the response vector, you can see that the corresponding term in the sum is positive and can be large (a phenomenon called "enhancement"). If this occurs with sufficient magnitude then it is possible to get a situation where $R^2 - \sum R_k^2 > 0$, contrary to what you might intuitively imagine.
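As a sanity check of this decomposition (this is just an illustrative sketch, not part of the cited papers), the identity can be verified numerically. The data-generating process below, two nearly collinear predictors whose difference drives the response, is an assumed toy example chosen to produce a large enhancement term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Assumed toy example: two nearly collinear predictors whose difference drives the response
u = rng.normal(size=n)
x1 = u + 0.1 * rng.normal(size=n)
x2 = u + 0.1 * rng.normal(size=n)
d = (x1 - x2) + 0.1 * rng.normal(size=n)

X = np.column_stack([x1, x2])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
ds = (d - d.mean()) / d.std()               # standardized response

C = (Xs.T @ Xs) / n          # m x m correlation matrix of the predictors
r = (Xs.T @ ds) / n          # correlations between each predictor and the response

R2_joint = r @ np.linalg.solve(C, r)   # R^2 of the joint model
R2_indiv = r ** 2                      # R_k^2 of each single-predictor model

lam, V = np.linalg.eigh(C)             # eigenvalues / eigenvectors of C
S = (V.T @ r) / np.sqrt(lam)           # correlations of the principal components with the response

print("R^2 (joint)           :", R2_joint)
print("sum of individual R^2 :", R2_indiv.sum())
print("difference            :", R2_joint - R2_indiv.sum())
print("sum (1 - lam_k) S_k^2 :", np.sum((1 - lam) * S ** 2))   # matches the difference
```

With this setup the joint $R^2$ comes out near $2/3$ while each individual $R_k^2$ is close to zero, so the difference is strongly positive and is matched by the eigenvalue expression above.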

This is an interesting and counter-intuitive phenomenon in regression, and it is not very well-known. If you are observing many instances of it then you must be in a strange situation where several principal components of the explanatory variables are highly correlated with the response (relative to the other components) but have small corresponding eigenvalues in the correlation matrix of the explanatory variables. This is unusual, but it can occur.

Ben
  • I like this answer, but don't $S_k, \lambda_k$ correspond to the eigenvectors of the correlation matrix (principal components) rather than to the explanatory variables themselves? It's not obvious how an additional explanatory variable would impact those. – Lukas Lohse Feb 10 '23 at 20:47
  • The value $\lambda_k$ is an eigenvalue of the correlation matrix for the explanatory variables, which is what is used when forming the principal components. It's not clear to me if that is what you're suggesting. – Ben Feb 10 '23 at 22:25
  • My point is that saying "$S_1,...,S_m$ denote the sample correlations between the explanatory vectors and the response vector" is misleading, since $S_k$ is defined by the eigenvector $v_k$, which in general isn't any more related to $x_k$ than to any other $x_{k' \neq k}$. – Lukas Lohse Feb 10 '23 at 22:54
  • This follows both from your 2021 paper and because otherwise pairing them up with the eigenvalues doesn't make sense. – Lukas Lohse Feb 10 '23 at 22:56
  • @LukasLohse: Ah yes, I see now. I've reworded to clarify. – Ben Feb 11 '23 at 03:25
2

In this answer, I derive a closed-form expression for $R_x^2 - \Delta R^2$ to analyze its sign. Since the calculations below rely heavily on geometry/linear algebra, let's first fix some notation:

  1. $x, y, z, d$ are all viewed as column vectors in $\mathbb{R}^n$. By convention, we use $e$ to denote the column vector consisting of all ones in $\mathbb{R}^n$ (i.e., the intercept).
  2. For a vector $x_0 \in \mathbb{R}^n$, we use $\|x_0\|$ to denote its Euclidean norm.
  3. For any two vectors $x_1, x_2 \in \mathbb{R}^n$, we use $x_1'x_2$ to denote their inner product.
  4. The subspace of $\mathbb{R}^n$ spanned by vectors $x_1, x_2, \ldots, x_p$ is denoted by $[x_1, x_2, \ldots, x_p]$. The orthogonal complement space of a subspace $S$ is denoted by $S^\perp$.
  5. We use $P_M x_0$ to denote the orthogonal projection of $x_0$ onto the subspace $M$. In particular, if $\{q_1, \ldots, q_m\}$ is an orthogonal basis of $M$, then \begin{align} P_Mx_0 = \frac{x_0'q_1}{q_1'q_1}q_1 + \cdots + \frac{x_0'q_m}{q_m'q_m}q_m. \end{align}

With these notations, various R-squares in the question can be expressed by definition as \begin{align} & R_{\text{full}}^2 = \frac{\|P_{[e, x, y, z]}d - P_{[e]}d\|^2}{\|P_{[e]^\perp}d\|^2}, \\ & R_{\text{reduced}}^2 = \frac{\|P_{[e, y, z]}d - P_{[e]}d\|^2}{\|P_{[e]^\perp}d\|^2}, \\ & R_{x}^2 = \frac{\|P_{[e, x]}d - P_{[e]}d\|^2}{\|P_{[e]^\perp}d\|^2}, \\ & \Delta R^2 = R_{\text{full}}^2 - R_{\text{reduced}}^2 = \frac{\|P_{[e, x, y, z]}d\|^2 - \|P_{[e, y, z]}d\|^2}{\|P_{[e]^\perp}d\|^2}. \end{align}

Therefore, requiring $R_x^2 \geq \Delta R^2$ is equivalent to requiring
\begin{align} \|P_{[e, x]}d - P_{[e]}d\|^2 \geq \|P_{[e, x, y, z]}d\|^2 - \|P_{[e, y, z]}d\|^2. \tag{1} \end{align}

Let $\{q_1, q_2, q_3\}$ and $\{q_1, q_2, q_3, q_4\}$ be an orthonormal basis of the space $[e, y, z]$ and an orthonormal basis of the space $[e, y, z, x]$ respectively, which can be obtained by the Gram-Schmidt orthogonalization procedure (see, e.g., Algorithm 3.1 in The Elements of Statistical Learning) or, equivalently, by performing a QR decomposition of the corresponding matrices. It can be shown that the right-hand side of $(1)$ equals $(d'q_4)^2$. More specifically, letting \begin{align} & z_1 = e, \; q_1 = z_1/\|z_1\|, \\ & z_2 = y - \frac{y'z_1}{z_1'z_1}z_1, \; q_2 = z_2/\|z_2\|, \\ & z_3 = z - \frac{z'z_1}{z_1'z_1}z_1 - \frac{z'z_2}{z_2'z_2}z_2, \; q_3 = z_3/\|z_3\|, \\ & z_4 = x - \frac{x'z_1}{z_1'z_1}z_1 - \frac{x'z_2}{z_2'z_2}z_2 - \frac{x'z_3}{z_3'z_3}z_3, \; q_4 = z_4/\|z_4\|. \tag{G1} \end{align} we have \begin{align} \|P_{[e, x, y, z]}d\|^2 - \|P_{[e, y, z]}d\|^2 = d'(q_1q_1' + \cdots + q_4q_4')d - d'(q_1q_1' + \cdots + q_3q_3')d = (d'q_4)^2. \end{align}
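As a quick numerical sanity check (just a sketch with arbitrary random data; none of these particular values are assumed in the derivation), the identity can be confirmed by taking the orthonormal basis from a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
e = np.ones(n)
x, y, z, d = (rng.normal(size=n) for _ in range(4))

# Orthonormal basis of [e, y, z, x] via QR; the last column is (up to sign)
# q4 from (G1): the normalized part of x orthogonal to [e, y, z].
Q, _ = np.linalg.qr(np.column_stack([e, y, z, x]))
q4 = Q[:, 3]

def sq_norm_proj(v, cols):
    """Squared norm of the orthogonal projection of v onto span(cols)."""
    Qc, _ = np.linalg.qr(cols)
    return np.sum((Qc.T @ v) ** 2)

lhs = (sq_norm_proj(d, np.column_stack([e, x, y, z]))
       - sq_norm_proj(d, np.column_stack([e, y, z])))
rhs = (d @ q4) ** 2
print(np.isclose(lhs, rhs))   # True
```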

Treating the left hand side of $(1)$ similarly, it can be shown that \begin{align} \|P_{[e, x]}d - P_{[e]}d\|^2 = (d'\tilde{q}_2)^2, \end{align} where \begin{align} & v_1 = e, \; \tilde{q}_1 = v_1/\|v_1\|, \\ & v_2 = x - \frac{x'v_1}{v_1'v_1}v_1, \; \tilde{q}_2 = v_2/\|v_2\|. \tag{G2} \end{align}

Therefore, \begin{align} \|P_{[e, x]}d - P_{[e]}d\|^2 - (\|P_{[e, x, y, z]}d\|^2 - \|P_{[e, y, z]}d\|^2) = (d'\tilde{q}_2)^2 - (d'q_4)^2. \tag{2} \end{align}

In view of $(2)$, the sign of $R_x^2 - \Delta R^2$ depends on the lengths of the projections of $d$ onto the vector $\tilde{q}_2$ and the vector $q_4$ respectively, hence is in general indefinite. In particular, if $d$ is perpendicular to $\tilde{q}_2$ but is correlated with $q_4$, then $R_x^2 - \Delta R^2 = 0 - (d'q_4)^2 < 0$. One concrete example can be constructed as follows: let $y, z$ be arbitrary $n$-vectors such that $e, y, z$ are linearly independent. Let $z_1, z_2, z_3$ then be calculated by (G1) and define $x = z_1 + z_2 + z_3 + \epsilon$ and $d = -z_2 - z_3 + \epsilon$, where $\epsilon \perp [z_1, z_2, z_3]$ and $\|\epsilon\|^2 = \|z_2\|^2 + \|z_3\|^2$. By (G1) and (G2), it follows that \begin{align} & \tilde{q}_2 = \frac{1}{\|z_2 + z_3 + \epsilon\|}(z_2 + z_3 + \epsilon), \\ & q_4 = \frac{1}{\|\epsilon\|}\epsilon. \end{align} Hence \begin{align} d'\tilde{q}_2 = 0 < d'q_4 = \|\epsilon\|. \end{align}
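Below is a small numerical sketch of this construction (the particular random $y$, $z$ and the sample size $n = 50$ are arbitrary choices); it produces $R_x^2 \approx 0$ and $\Delta R^2 \approx 1/2$, as the algebra above predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

e = np.ones(n)
y = rng.normal(size=n)   # arbitrary y, z with e, y, z linearly independent
z = rng.normal(size=n)

def proj(v, cols):
    """Orthogonal projection of v onto the column space of cols."""
    coef, *_ = np.linalg.lstsq(cols, v, rcond=None)
    return cols @ coef

# Gram-Schmidt steps (G1) applied to e, y, z
z1 = e
z2 = y - proj(y, np.column_stack([z1]))
z3 = z - proj(z, np.column_stack([z1, z2]))

# epsilon orthogonal to [z1, z2, z3], scaled so ||eps||^2 = ||z2||^2 + ||z3||^2
eps = rng.normal(size=n)
eps = eps - proj(eps, np.column_stack([z1, z2, z3]))
eps *= np.sqrt(z2 @ z2 + z3 @ z3) / np.linalg.norm(eps)

x = z1 + z2 + z3 + eps
d = -z2 - z3 + eps

def r2(dep, *predictors):
    """R^2 from an OLS fit of dep on the given predictors plus an intercept."""
    X = np.column_stack([np.ones_like(dep), *predictors])
    resid = dep - proj(dep, X)
    dc = dep - dep.mean()
    return 1 - (resid @ resid) / (dc @ dc)

R2_x = r2(d, x)
delta_R2 = r2(d, x, y, z) - r2(d, y, z)
print(f"R_x^2     = {R2_x:.6f}")      # ~ 0
print(f"Delta R^2 = {delta_R2:.6f}")  # ~ 0.5
```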

Zhanxiong
  • Should it be $<$ or $\le?$ – Dave Feb 09 '23 at 06:43
  • The last one? By construction, it is $<$. – Zhanxiong Feb 09 '23 at 06:45
  • I mean the last one, yes. Shouldn’t equality be allowed when the features are independent? – Dave Feb 09 '23 at 06:47
  • It is $<$ because $\|z_2\| > 0$ and $\epsilon$ is chosen so that $2\|\epsilon\|^2 > \|z_2\|^2$. The point of the last inequality is to construct a specific counterexample based on $(2)$ such that $R_x^2 < \Delta R^2$. – Zhanxiong Feb 09 '23 at 06:50
1

The short answer is that $x$, $y$ and $z$ can be correlated with each other in ways that are not correlated with $d$. The delta can be greater than $R^2_x$ because adding $x$ to the model can help clean up noise in $y$ and $z$, helping the model make sense of more of $y$ and $z$ and ultimately increasing $R^2_{\text{full}}$. When this concept is taught in classes, it is often illustrated with neat overlapping circles, but in reality the circles are more like amorphous blobs that overlap in odd ways, both inside and outside the variance circle for $d$.
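To make the "cleaning up noise" idea concrete, here is a toy sketch (the data-generating process is purely an assumption for illustration) in which $x$ by itself says essentially nothing about $d$, yet adding it to the model lets the signal in $y$ through, so $\Delta R^2$ far exceeds $R^2_x$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

signal = rng.normal(size=n)
noise = rng.normal(size=n)

x = noise                # on its own, x carries no information about d ...
y = signal + noise       # ... but it is exactly the noise contaminating y
z = rng.normal(size=n)   # an irrelevant extra predictor
d = signal

def r2(dep, *predictors):
    """R^2 from an OLS fit of dep on the given predictors plus an intercept."""
    X = np.column_stack([np.ones_like(dep), *predictors])
    coef, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ coef
    dc = dep - dep.mean()
    return 1 - (resid @ resid) / (dc @ dc)

print("R^2_x     :", r2(d, x))                      # ~ 0
print("Delta R^2 :", r2(d, x, y, z) - r2(d, y, z))  # ~ 0.5, much larger than R^2_x
```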

User1865345