tldr: Why would sklearn LinearRegression give a different result than gradient descent?
My understanding is that LinearRegression computes the closed-form solution for linear regression (described well in "Why use gradient descent for linear regression, when a closed-form math solution is available?"). LinearRegression does not scale well to large data sets, in which case stochastic gradient descent is used instead. My data set is small, and I wanted to implement Batch Gradient Descent myself as an intermediate step, for my own edification.
I get different regression weights using LinearRegression and Batch Gradient Descent. Shouldn't the solution be unique?
Code:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

data = pd.read_csv(r'')  # data set attached below
X = data[['Size', 'Floor', 'Broadband Rate']]
y = data['Rental Price']

# Sklearn linear regression (closed-form OLS)
ols = linear_model.LinearRegression(fit_intercept=True, normalize=False)
LR = ols.fit(X, y)
Res_LR = y.values - LR.predict(X)  # residuals
print('Intercept', LR.intercept_, 'Weights', LR.coef_)

# Batch Gradient Descent
def error_delta(x, y, p, wn):
    """Sum of residual * feature over all rows for weight wn (wn == 0 is the intercept)."""
    total = 0
    row, column = np.shape(x)
    for i in range(row):
        resid = y[i] - (p[0] + np.dot(p[1:], x[i, :]))
        total += resid * x[i, wn - 1] if wn != 0 else resid
    return total

def weight_update(x, y, p, alpha):
    old = p
    new = np.zeros(len(p))
    for i in range(len(p)):
        new[i] = old[i] + alpha * error_delta(x, y, old, i)
    return new

weight = [-.146, .185, -.044, .119]  # arbitrary starting weights
alpha = .00000002                    # learning rate
for i in range(500):                 # seems to have converged by 100
    weight = weight_update(X.values, y.values, weight, alpha)

Res_BGD = np.zeros(len(X.values))
for i in range(len(X.values)):
    Res_BGD[i] = y.values[i] - (weight[0] + np.dot(weight[1:], X.values[i, :]))
print('Intercept', weight[0], 'Weights', weight[1:])

plt.plot(np.arange(len(X.values)), Res_LR, color='b')
plt.plot(np.arange(len(X.values)), Res_BGD, color='g')
plt.legend(['Res LR', 'Res BGD'])
plt.show()
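On the programming-feedback point: the per-sample, per-weight loops can be replaced with NumPy matrix operations. A sketch of the same batch update, vectorized (the data arrays are the question's 10 rows, hard-coded so the snippet runs without the CSV; `batch_gd` is an illustrative helper name, not a library function):

```python
import numpy as np

# The question's 10-row data set, hard-coded so this runs without the CSV.
X = np.array([[500, 4, 8], [550, 7, 50], [620, 9, 7], [630, 5, 24],
              [665, 8, 100], [700, 4, 8], [770, 10, 7], [880, 12, 50],
              [920, 14, 8], [1000, 9, 24]], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

def batch_gd(X, y, alpha, iters, w0):
    """Batch gradient descent on the least-squares cost, fully vectorized."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # column of ones for the intercept
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        resid = y - Xb @ w                     # all residuals at once
        w = w + alpha * (Xb.T @ resid)         # same update as the per-weight loops
    return w

weight = batch_gd(X, y, alpha=2e-8, iters=500, w0=[-.146, .185, -.044, .119])
print('Intercept', weight[0], 'Weights', weight[1:])
```

Each call to `error_delta` in the loop version recomputes every prediction; here one matrix-vector product does the whole pass.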
The data set is below (10 points):
Size,Floor,Broadband Rate,Energy Rating,Rental Price
" 500 "," 4 "," 8 "," C "," 320 "
" 550 "," 7 "," 50 "," A "," 380 "
" 620 "," 9 "," 7 "," A "," 400 "
" 630 "," 5 "," 24 "," B "," 390 "
" 665 "," 8 "," 100 "," C "," 385 "
" 700 "," 4 "," 8 "," B "," 410 "
" 770 "," 10 "," 7 "," B "," 480 "
" 880 "," 12 "," 50 "," A "," 600 "
" 920 "," 14 "," 8 "," C "," 570 "
" 1000 "," 9 "," 24 "," B "," 620 "
When you plot the residuals, the performance is comparable despite very different weights:
Sklearn LR: Intercept 19.5615588974 Weights [ 0.54873985 4.96354677 -0.06209515]
BGD: Intercept -0.145402077197 Weights [ 0.62549182 -0.0344091 0.11473203]
Thoughts? Also, if there's any programming feedback, I'm open to more efficient ways to code this as well.
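On the uniqueness question: with a full-column-rank design matrix the OLS minimizer is indeed unique, so one sanity check is to compute the closed-form solution directly and compare it with sklearn's output. A minimal sketch using np.linalg.lstsq on the question's 10 rows, hard-coded:

```python
import numpy as np

# The question's 10-row data set, hard-coded.
X = np.array([[500, 4, 8], [550, 7, 50], [620, 9, 7], [630, 5, 24],
              [665, 8, 100], [700, 4, 8], [770, 10, 7], [880, 12, 50],
              [920, 14, 8], [1000, 9, 24]], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# Closed-form OLS: minimize ||Xb w - y||^2 with an explicit intercept column.
Xb = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print('Intercept', w[0], 'Weights', w[1:])
```

If this agrees with sklearn but not with the gradient-descent weights, the gradient descent run has not actually reached the minimum.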
I'm looking at the sklearn documentation for LinearRegression and it says it's Ordinary Least Squares. There are no convergence-related parameters that I can see. The four constructor inputs are (fit_intercept, normalize, copy_X, n_jobs); the n_jobs description says "The number of jobs to use for the computation. If -1 all CPUs are used". How would I check convergence?
For batch gradient descent, I'll double-check the sum of squares, but I believe it converges since I get (within rounding) the same result as the book.
– bradm707 Apr 06 '18 at 20:08
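On checking convergence: sklearn's LinearRegression solves the least-squares problem directly, so there is nothing iterative to converge and no tolerance parameter to expose. For the hand-written batch gradient descent, one way to double-check the sum of squares is to record the SSE after every update and compare the final value against the closed-form optimum. A sketch under those assumptions (`gd_sse_history` is an illustrative helper name; data is the question's 10 rows, hard-coded):

```python
import numpy as np

# The question's 10-row data set, hard-coded so this runs without the CSV.
X = np.array([[500, 4, 8], [550, 7, 50], [620, 9, 7], [630, 5, 24],
              [665, 8, 100], [700, 4, 8], [770, 10, 7], [880, 12, 50],
              [920, 14, 8], [1000, 9, 24]], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

def gd_sse_history(X, y, alpha, iters, w0):
    """Run batch gradient descent and record the SSE after every update."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.asarray(w0, dtype=float)
    history = []
    for _ in range(iters):
        resid = y - Xb @ w
        w = w + alpha * (Xb.T @ resid)
        history.append(np.sum((y - Xb @ w) ** 2))
    return w, history

w, history = gd_sse_history(X, y, alpha=2e-8, iters=500,
                            w0=[-.146, .185, -.044, .119])

# Closed-form SSE for comparison.
Xb = np.hstack([np.ones((len(X), 1)), X])
w_opt, *_ = np.linalg.lstsq(Xb, y, rcond=None)
sse_opt = np.sum((y - Xb @ w_opt) ** 2)
print('final GD SSE:', history[-1], 'closed-form SSE:', sse_opt)
```

If the final GD SSE is still well above the closed-form SSE, the run only looks converged because progress per iteration has become tiny (a common symptom when features on very different scales are used with a small learning rate).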