I am trying to replicate sklearn's linear regression coefficients and $R^2$ score with an online calculation, so that the fit updates with each additional point of data. Here is my starting code:
import numpy as np


class SimpleLinearRegressor:
    def __init__(self):
        # running sums: [n, sum(x), sum(y), sum(x*x), sum(x*y)]
        self.dots = np.zeros(5)
        self.intercept = None
        self.slope = None
        self.tss = 0.0  # total sum of squares, accumulated incrementally
        self.rss = 0.0  # residual sum of squares, accumulated incrementally
        self.r2 = 0.0
        self.count = 0
        self.y_sum = 0.0
        self.y_avg = 0.0

    def update(self, x: np.ndarray, y: np.ndarray):
        # accumulate the sufficient statistics for the normal equations
        self.dots += np.array(
            [
                x.shape[0],
                x.sum(),
                y.sum(),
                np.dot(x, x),
                np.dot(x, y),
            ]
        )
        size, sum_x, sum_y, sum_xx, sum_xy = self.dots
        det = size * sum_xx - sum_x ** 2
        self.count += x.shape[0]  # total number of points seen so far
        self.y_sum += y.sum()
        self.y_avg = self.y_sum / self.count
        if det > 1e-10:  # determinant is zero until x has two distinct values
            self.intercept = (sum_xx * sum_y - sum_xy * sum_x) / det
            self.slope = (sum_xy * size - sum_x * sum_y) / det
            # each new TSS term is measured against the *running* mean of y
            self.tss += ((y - self.y_avg) ** 2).sum()
            resid = y - (self.intercept + x * self.slope)
            self.rss += (resid ** 2).sum()
            self.r2 = 1 - (self.rss / self.tss)
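For reference, the update implements the standard closed-form least-squares solution from the accumulated sums (the shared denominator is the det variable in the code), together with the usual definition of $R^2$:

$$
\hat{\beta}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
\hat{\beta}_0 = \frac{\sum x_i^2 \sum y_i - \sum x_i y_i \sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}.
$$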
So far the coefficients are spot on. However, my online $R^2$ is consistently higher than sklearn's r2_score. Here is a comparison, computed at each new point of the data (a sketch of the comparison loop follows the column legend below):
[comparison table not reproduced here] The columns are:
0: line with some slight noise (the raw data)
score: sklearn r2 score
coef: sklearn coef
coef_incr: online coef
score_incr: online r2
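Since the table itself didn't come through, here is a minimal sketch of the comparison loop, using the SimpleLinearRegressor class above; the synthetic data (a noisy line), the seed, and the sample size are illustrative assumptions, not my exact data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)  # illustrative seed, not the original data
n = 50
x = np.arange(n, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=n)  # line with slight noise

model = SimpleLinearRegressor()
for i in range(n):
    model.update(x[i:i + 1], y[i:i + 1])  # feed one new point at a time
    if model.slope is None:
        continue  # no fit yet: need two distinct x values
    # batch fit on all points seen so far, for comparison
    X = x[:i + 1].reshape(-1, 1)
    lr = LinearRegression().fit(X, y[:i + 1])
    score = r2_score(y[:i + 1], lr.predict(X))
    print(f"n={i + 1}: coef={lr.coef_[0]:.4f} coef_incr={model.slope:.4f} "
          f"score={score:.4f} score_incr={model.r2:.4f}")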
Thank you



sklearn not quite agreeing on what $R^2$ should be. I’ll see what sense I can make out of this, and I hope to post an answer. (But if you figure it out, please do post a self-answer!) – Dave Mar 07 '23 at 17:54