10

I am fitting a simple linear regression (with intercept) that ends up producing a negative $R^2$, which should not be possible (cf. comment 2 at the end).


Reproducible examples of the issue:

Minimal sklearn reproducible code:

import numpy as np; print(np.__version__)   # 1.23.5
import scipy; print(scipy.__version__)      # 1.10.0
import sklearn as sk; print(sk.__version__) # 1.2.1

from sklearn.linear_model import LinearRegression
import pandas as pd

np.random.seed(8)
s = pd.Series(np.random.normal(10, 1, size=1_000))

l_com = np.arange(100)
df_Xy = pd.concat([s.ewm(com=com).mean() for com in l_com], axis=1)
df_Xy['y'] = s.shift(-1)
df_Xy.dropna(inplace=True)

X = df_Xy[l_com]
y_true = df_Xy.y

model = LinearRegression(fit_intercept=True)  # fit_intercept=True by default anyway
model.fit(X, y_true)
print(model.score(X, y_true))
# -0.15802176533843926 = NEGATIVE R2 on VM 1
# -0.05854780689129546 on VM 2 (? dependent on CPU ?)
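For reference, score() here is the coefficient of determination $R^2 = 1 - RSS/TSS$, so the negative value can be cross-checked by hand (a small sketch reusing model, X and y_true from the snippet above):

# Recompute R2 = 1 - RSS/TSS from the fitted model; should match model.score(X, y_true)
pred = model.predict(X)
RSS = np.sum((y_true - pred) ** 2)
TSS = np.sum((y_true - y_true.mean()) ** 2)
print(1 - RSS / TSS)  # negative here: the fit is worse than predicting the mean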


Minimal scipy reproducible code:

import numpy as np; print(np.__version__) # 1.23.5
import scipy; print(scipy.__version__)    # 1.10.0
import pandas as pd

Parameters:

(seed, N_obs, N_feat, mu_x, sigma_x, sigma_y) = (0, 100, 1000, 100, 10, 1)

Building very weird X, y arrays (high collinearity):

np.random.seed(seed)
s = pd.Series(np.random.normal(mu_x, sigma_x, N_obs))
X_raw = np.ascontiguousarray(np.stack([s.ewm(com=com).mean() for com in np.arange(N_feat)]).T)
y_raw = np.random.normal(0, sigma_y, N_obs)

Center both arrays to zero:

X_offset = X_raw.mean(axis=0)
y_offset = y_raw.mean()
X = X_raw - X_offset
y = y_raw - y_offset

OLS: finding the parameters that minimise the squared residuals:

p, _, _, _ = scipy.linalg.lstsq(X, y)  # <-- This is silently failing! (resulting parameters are worse than the zero vector)
pred = np.matmul(X, p)
RSS = np.sum(np.power(y - pred, 2))       # 108.3406316733817
TSS = np.sum(np.power(y - np.mean(y), 2)) # 107.05357955882408
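From these two sums of squares the implied coefficient of determination follows directly (reusing RSS and TSS from above):

print(1 - RSS / TSS)  # ≈ -0.0120, i.e. the lstsq solution does worse than the zero vector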


  • Comment 1: Yes, the X matrix is computed in a very specific way (exponential moving averages of the target). The problem seems to arise particularly readily in this case. I'm currently trying to find an example without this "complexity".
  • Comment 2: If you are a beginner/intermediate data scientist, please refrain from commenting something like "R2 can sometimes be negative": this is plain OLS with an intercept, where the sum of squared residuals is minimised by definition (see the derivation just below).
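To spell out the argument in comment 2: with an intercept, the constant fit $\hat{\beta}_0 = \bar{y},\ \hat{\beta} = 0$ is always a feasible candidate, so the minimised residual sum of squares can never exceed the total sum of squares:

$$
RSS \;=\; \min_{\beta_0,\,\beta}\ \sum_i \bigl(y_i - \beta_0 - x_i^\top \beta\bigr)^2
\;\le\; \sum_i \bigl(y_i - \bar{y}\bigr)^2 \;=\; TSS
\qquad\Longrightarrow\qquad
R^2 \;=\; 1 - \frac{RSS}{TSS} \;\ge\; 0 .
$$

A negative in-sample $R^2$ therefore means the solver did not return the actual least-squares minimiser.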
  • I cannot reproduce this, I get an $R^2$ of 10.8%. – usεr11852 Jun 01 '23 at 21:14
  • 3
    Correction: Actually I can reproduce this with sklearn version 1.2.1 but not with 1.2.2. – usεr11852 Jun 01 '23 at 21:21
  • 3
    @usεr11852 And I get $5.8\%$ running it in Jupyter online! – Dave Jun 01 '23 at 21:23
  • 1
    @Dave: What version of sklearn is it using in your case? It seems to me like sklearn just has a horrible time optimising this. :) – usεr11852 Jun 01 '23 at 21:25
  • @usεr11852 I can’t tell on mobile, but it’s the Python tool here. – Dave Jun 01 '23 at 21:27
  • 4
    my Scientific Wild Butt Guess (SWAG): collinearity due to the X variables all being very similar leads to numerical issues. – John Madden Jun 01 '23 at 21:29
  • 1
    You monster, I never reply on mobile. (Otherwise, my life would be fully gone.) It has 1.2.2 but a different (newer) numpy than mine (1.24.2 vs 1.22.4). Yeah @JohnMadden is probably right, we are just looking at different ways scipy.linalg.lstsq can fail. – usεr11852 Jun 01 '23 at 21:32
  • @usεr11852 I also got 10.8% but can reproduce it with seed value 10 – Sextus Empiricus Jun 01 '23 at 21:33
  • @SextusEmpiricus: I like your versioning. (As it matches mine. :D ) – usεr11852 Jun 01 '23 at 21:35
  • Amazingly, if I change my EC2 VM instance type (different CPUs, exact same version of all packages), I get different results. I use sklearn version 1.2.1 – Jean Lescut Jun 01 '23 at 21:41
  • @JohnMadden It was my guess too. Considering how common EWMA are in time-series forecasting, it is amazing that this problem is not better documented :( – Jean Lescut Jun 01 '23 at 21:44
  • @JeanLescut regularisation can solve the problem of highly correlated variables, and is well documented. – Sextus Empiricus Jun 01 '23 at 23:56
  • What I am also wondering is whether the coefficients in this forecasting are typically non-zero? (I am asking because I do not know about EWMA and time series forecasting.) That constraint will also reduce the effect of the correlations between the features. – Sextus Empiricus Jun 01 '23 at 23:58

2 Answers

11

Detailed explanation of the problem:

In the case of X being near-singular (high collinearity/covariance between features), different issues were coming from both scipy.linalg.lstsq() and sklearn.linear_model.LinearRegression().

Source of error 1: As @SextusEmpiricus explained, the matrix being near-singular leads to rounding errors that enormously impact the final predictions. In this sense, scipy.linalg.lstsq() is silently failing WITHOUT raising any warning or error.
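One way to quantify how close to singular the design matrix is, is its condition number (a small sketch reusing the X from the question; the threshold is only illustrative):

# Condition number = ratio of the largest to the smallest singular value of X.
# For float64, values approaching 1/eps (~4.5e15) mean the least-squares problem
# is numerically rank-deficient and the computed coefficients cannot be trusted.
cond = np.linalg.cond(X)
print(f"cond(X) = {cond:.3e}")
if cond > 1e12:  # illustrative threshold, not an official cut-off
    print("X is effectively singular; expect large rounding errors in lstsq")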

Source of error 2: The matrix coming from pandas was F-contiguous. sklearn converts it to C-contiguous before calling scipy.linalg.lstsq(), but predict() then does the matrix multiplication directly on the F-contiguous array. This leads to another layer of rounding errors. I opened another question about this on Stack Overflow.
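Checking and forcing the memory layout is straightforward (a small sketch; df_Xy and l_com are from the question, and whether this fully removes the discrepancy is exactly the open question above):

X_arr = df_Xy[l_com].values        # pandas typically hands back an F-contiguous block here
print(X_arr.flags['C_CONTIGUOUS'], X_arr.flags['F_CONTIGUOUS'])

X_c = np.ascontiguousarray(X_arr)  # force a C-contiguous copy before fit()/predict()
print(X_c.flags['C_CONTIGUOUS'])   # True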

Source of error 3: The first thing LinearRegression() does is center the data. This goes badly in my case; I still struggle to understand exactly why.

Note: These rounding errors also depend on the CPU and hardware, which makes it even harder to achieve reproducibility.


(Partial) Work-Around:

To work around the sklearn problems, one can:

  • Ensure the input matrices/arrays are C-contiguous
  • Stop relying on LinearRegression's fit_intercept=True and instead center the data manually first:
for seed in range(1000):
    np.random.seed(seed)
    s = pd.Series(np.random.normal(10, 1, size=1_000))

    l_com = np.arange(100)
    df_Xy = pd.concat([s.ewm(com=com).mean() for com in l_com], axis=1)
    df_Xy['y'] = s.shift(-1)
    df_Xy.dropna(inplace=True)

    X = np.ascontiguousarray(df_Xy[l_com].values)
    y = np.ascontiguousarray(df_Xy.y.values)

    X_offset = X.mean(axis=0)
    y_offset = y.mean()

    X_centered = X - X_offset
    y_centered = y - y_offset

    model = LinearRegression(fit_intercept=False)  # we don't rely on sklearn's fit_intercept anymore
    model.fit(X_centered, y_centered)
    assert model.score(X_centered, y_centered) > 0  # ALL GOOD


Moving forward / Long-term Solution:

  1. I opened an issue on the scipy GitHub to raise a warning in scipy.linalg.lstsq when the X matrix is near-singular.

  2. I opened an issue on the sklearn GitHub about the inconsistency between C-contiguous and F-contiguous arrays.

7

I can reproduce it with np.random.seed(15) and have dug a bit deeper. It looks a lot like a computational round-off error due to the high collinearity.

The steps I took

  • I manually added a column with ones to the matrix X

    # X = df_Xy[l_com]  # original design matrix, without an intercept column
    X = pd.concat([pd.DataFrame(np.repeat(1, 999)), df_Xy[l_com]], axis=1)  # prepend a column of ones
    
  • and used the function lstsq(X, y_true) directly (LinearRegression is essentially a wrapper around it), and manually computed the sums of squares

    from scipy.linalg import lstsq

    p, res, rnk, sin = lstsq(X, y_true)
    pred = np.matmul(X, p)
    RSS  = np.sum(np.power(y_true - model.predict(X=X), 2))  ## 1007.6190
    RSS2 = np.sum(np.power(y_true - pred, 2))                ## 1007.6190
    TSS  = np.sum(np.power(y_true - np.mean(y_true), 2))     ## 995.24937
    

The R-squared is still negative: $R^2 = 1 - RSS/TSS = 1 - 1007.619/995.249 \approx -0.0124$.

(Interestingly, the above is for fit_intercept=False; when I set it to True the result becomes even worse, but I can't yet figure out what the source code does with that boolean parameter.)

A reason for this behavior might be that the fitted parameters are very large: you generated the features with exponential weighting (I do not know enough about Python to easily figure out what you are doing there, and you might provide more comments on your code) and created highly correlated features. lstsq reports a very small effective rank, and the coefficients are on the order of $\pm 10^{10}$.
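Both symptoms can be read directly off the lstsq output (a small sketch reusing p, rnk and sin from the snippet above):

print("effective rank:", rnk, "out of", X.shape[1], "columns")
print("largest |coefficient|:", np.max(np.abs(p)))   # on the order of 1e10 here
print("singular value spread:", sin[0] / sin[-1])    # huge ratio = nearly singular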

At the moment this is as far as I can get. I find the code in Python libraries difficult to decipher. LinearRegression from sklearn calls the function lstsq from scipy.linalg, and that function in turn calls the 'gelsd', 'gelsy', 'gelss' routines from LAPACK, which is Fortran code living somewhere; through all the layers of wrappers and imported packages it is difficult to figure out what happens in the black box between the input and the output of the linear model.


The influence of the processor

print(model.score(X, y_true))
# -0.15802176533843926 = NEGATIVE R2 on VM 1
# -0.05854780689129546 on VM 2 (? dependent on CPU ?)

In the answer to this question on Stack Overflow it is explained that the algorithm might take slightly different steps on different CPU architectures, and as a consequence there can be small differences in results across CPUs: Is it reasonable to expect identical results for LAPACK routines on two different processor architectures?

Example: in floating-point arithmetic we can have $(7+8) \times (1/9) \neq (7/9+8/9)$. Demonstration with R code:

options(digits=22)
(7/9) + (8/9) # 1.666666666666666518637
15/9          # 1.666666666666666740682

An algorithm might make such subtle changes in the computation to optimize the calculation speed for a processor.

In the case of inverting a nearly singular matrix, these small errors can get amplified due to the sensitivity of the computation to small differences.
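Loosely speaking (this rule of thumb is an addition, not from the original answer), relative perturbations in the inputs get magnified by roughly the condition number of the design matrix,

$$
\frac{\lVert \delta \hat{\beta} \rVert}{\lVert \hat{\beta} \rVert}
\;\lesssim\;
\kappa(X)\,\frac{\lVert \delta X \rVert}{\lVert X \rVert},
\qquad
\kappa(X) = \frac{\sigma_{\max}(X)}{\sigma_{\min}(X)},
$$

and for least-squares problems with a non-negligible residual the factor can grow to roughly $\kappa(X)^2$. With $\kappa(X)$ anywhere near $10^{10}$, input differences at the level of machine precision ($\approx 2\times 10^{-16}$ for float64) can already change the fitted coefficients, and hence the reported $R^2$, noticeably.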

(I don't have the ability to verify this with different processors, but I will check whether changing the number of cores can have a similar influence.)

  • Amazing, we're getting closer! My hope here is to try to implement a check in lstsq() and issue a warning when needed. It's understandable that the function fails, but not that it fails silently... – Jean Lescut Jun 02 '23 at 06:56
  • @JeanLescut I know that lm in R would provide an error telling that the matrix is singular. I recently encountered a similar question for MATLAB that had different behaviours with a similar issue of matrix inversion. https://stats.stackexchange.com/questions/616591/ – Sextus Empiricus Jun 02 '23 at 07:25
  • You are absolutely right about the CPU thing. According to my experiments, scipy.lstsq() doesn't lead to a negative R2 though. But the LinearRegression() code of sklearn converts X into an F-contiguous array before multiplying it with the parameters. This induces several small "errors" that actually become crazily bad in my case. I will investigate deeper and maybe propose a change in the sklearn code to keep arrays C-contiguous all along – Jean Lescut Jun 02 '23 at 09:01
  • @JeanLescut I also had slightly different behaviour with lstsq() and changed the random seed. In your case does it also 'work' when you use np.random.seed(15)? And even when it works, the cause of the problem (not giving an error for problematic user input) has not been resolved. Ideally the function should give a warning when results are computationally unstable. – Sextus Empiricus Jun 02 '23 at 10:37
  • Absolutely agree with you. As I wrote below, my next step would be to ask for said warning to be implemented. Despite all my trials, I couldn't make lstsq() fail on my side. I'll keep investigating. Thank you so much for your help (for me and the community!) – Jean Lescut Jun 02 '23 at 11:12
  • @JeanLescut scipy.linalg.lstsq() has the same problems; it just performs better and makes the 'error' less easily. I could make it fail with an online compiler https://onecompiler.com/python/3zadf7u39 (I believe that the link will take you to my scribble, although I am not sure whether it can change) – Sextus Empiricus Jun 02 '23 at 11:19