I am trying to replicate sklearn's linear regression coefficients and r2 score with an online calculation (one that updates with each additional data point). I am starting with this code:

import numpy as np


class SimpleLinearRegressor():
    def __init__(self):
        # running totals: n, sum(x), sum(y), sum(x*x), sum(x*y)
        self.dots = np.zeros(5)
        self.intercept = None
        self.slope = None

        self.tss = 0
        self.rss = 0
        self.r2 = 0

        self.count = 0
        self.y_sum = 0
        self.y_avg = 0

    def update(self, x: np.ndarray, y: np.ndarray):
        self.dots += np.array(
            [
                x.shape[0],
                x.sum(),
                y.sum(),
                np.dot(x, x),
                np.dot(x, y),
            ]
        )
        size, sum_x, sum_y, sum_xx, sum_xy = self.dots
        det = size * sum_xx - sum_x ** 2

        # note: count tracks calls to update(), not the number of points
        self.count += 1
        self.y_sum += y
        self.y_avg = self.y_sum / self.count

        if det > 1e-10:  # determinant may be zero initially
            self.intercept = (sum_xx * sum_y - sum_xy * sum_x) / det
            self.slope = (sum_xy * size - sum_x * sum_y) / det

            # tss uses the running mean at the time each point arrives
            self.tss += ((y - self.y_avg) ** 2).sum()

            # rss uses the coefficients at the time each point arrives
            resid = y - (self.intercept + (x * self.slope))
            self.rss += (resid ** 2).sum()
            self.r2 = 1 - (self.rss / self.tss)


So far the coefficients are spot on. However, the r2 calculation is consistently higher than sklearn's r2 calculation. Here is a comparison (calculating at each new point of the data):

[image: r2 scores not matching]

Here's the original data and table:

[image: line with slight noise]

[image: data]

0: line with some slight noise
score: sklearn r2 score
coef: sklearn coef
coef_incr: online coef
score_incr: online r2
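
A possible cause of the mismatch, offered as a sketch rather than a confirmed diagnosis: the code above accumulates `tss` against the mean as it stood when each point arrived, and `rss` against the coefficients as they stood at that moment, whereas a batch r2 evaluates every point against the final mean and the final coefficients. One way around this is to track `sum(y*y)` as a sixth running total and recompute both sums of squares from the totals on every update. The class name `OnlineR2Regressor` and the attribute `stats` are hypothetical, and the identity `rss = sum_yy - intercept * sum_y - slope * sum_xy` assumes an ordinary least squares fit with an intercept.

```python
import numpy as np


class OnlineR2Regressor:
    """Hypothetical variant: r2 is rebuilt from running totals on each update."""

    def __init__(self):
        # running totals: n, sum(x), sum(y), sum(x*x), sum(x*y), sum(y*y)
        self.stats = np.zeros(6)
        self.intercept = None
        self.slope = None
        self.r2 = None

    def update(self, x, y):
        # accept scalars or arrays; each call may carry one or many points
        x = np.atleast_1d(np.asarray(x, dtype=float))
        y = np.atleast_1d(np.asarray(y, dtype=float))
        self.stats += np.array(
            [x.shape[0], x.sum(), y.sum(), np.dot(x, x), np.dot(x, y), np.dot(y, y)]
        )
        n, sum_x, sum_y, sum_xx, sum_xy, sum_yy = self.stats
        det = n * sum_xx - sum_x ** 2

        if det > 1e-10:  # needs at least two distinct x values
            self.intercept = (sum_xx * sum_y - sum_xy * sum_x) / det
            self.slope = (sum_xy * n - sum_x * sum_y) / det

            # total sum of squares around the current mean of all y seen so far
            tss = sum_yy - sum_y ** 2 / n
            # residual sum of squares for the current coefficients over all points
            rss = sum_yy - self.intercept * sum_y - self.slope * sum_xy
            if tss > 0:
                self.r2 = 1 - rss / tss


# usage: feed points one at a time, then compare against a batch computation
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=20)

model = OnlineR2Regressor()
for xi, yi in zip(x, y):
    model.update(xi, yi)

batch_slope, batch_intercept = np.polyfit(x, y, 1)
pred = batch_intercept + batch_slope * x
batch_r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(model.r2, batch_r2)
```

The difference-of-sums form can lose precision when the values are large relative to their spread, so a Welford-style centered update may be preferable for such data.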

Thank you