tldr: Why would sklearn LinearRegression give a different result than gradient descent?
My understanding is that LinearRegression computes the closed-form solution for linear regression (described well in "Why use gradient descent for linear regression, when a closed-form math solution is available?"). LinearRegression does not scale well to large data sets, in which case stochastic gradient descent is used instead. My data set is small, and I wanted to implement Batch Gradient Descent myself as an intermediate step, for my own edification.
I get different regression weights using LinearRegression and Batch Gradient Descent. Shouldn't the solution be unique?
Code:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

data = pd.read_csv(r'')  # data set attached below
X = data[['Size', 'Floor', 'Broadband Rate']]
y = data['Rental Price']

# Sklearn linear regression (closed-form OLS)
ols = linear_model.LinearRegression(fit_intercept=True, normalize=False)
LR = ols.fit(X, y)
Res_LR = y.values - LR.predict(X)  # residuals
print('Intercept', LR.intercept_, 'Weights', LR.coef_)

# Batch Gradient Descent
def error_delta(x, y, p, wn):
    """Sum of residual * feature over all rows for weight wn (wn == 0 is the intercept)."""
    total = 0
    row, column = np.shape(x)
    for i in range(row):
        resid = y[i] - (p[0] + np.dot(p[1:], x[i, :]))
        total += resid * x[i, wn - 1] if wn != 0 else resid
    return total

def weight_update(x, y, p, alpha):
    old = p
    new = np.zeros(len(p))
    for i in range(len(p)):
        new[i] = old[i] + alpha * error_delta(x, y, old, i)
    return new

weight = [-.146, .185, -.044, .119]  # arbitrary starting weights
alpha = .00000002                    # learning rate
for i in range(500):                 # seems to have converged by 100
    weight = weight_update(X.values, y.values, weight, alpha)

Res_BGD = np.zeros(len(X.values))
for i in range(len(X.values)):
    Res_BGD[i] = y.values[i] - (weight[0] + np.dot(weight[1:], X.values[i, :]))
print('Intercept', weight[0], 'Weights', weight[1:])

plt.plot(np.arange(len(X.values)), Res_LR, color='b')
plt.plot(np.arange(len(X.values)), Res_BGD, color='g')
plt.legend(['Res LR', 'Res BGD'])
plt.show()
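On the programming-feedback point: the per-sample, per-weight loops can be replaced with NumPy matrix operations. A sketch of the same batch update, vectorized (the data arrays are the question's 10 rows, hard-coded so the snippet runs without the CSV; `batch_gd` is an illustrative helper name, not a library function):

```python
import numpy as np

# The question's 10-row data set, hard-coded so this runs without the CSV.
X = np.array([[500, 4, 8], [550, 7, 50], [620, 9, 7], [630, 5, 24],
              [665, 8, 100], [700, 4, 8], [770, 10, 7], [880, 12, 50],
              [920, 14, 8], [1000, 9, 24]], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

def batch_gd(X, y, alpha, iters, w0):
    """Batch gradient descent on the least-squares cost, fully vectorized."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # column of ones for the intercept
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        resid = y - Xb @ w                     # all residuals at once
        w = w + alpha * (Xb.T @ resid)         # same update as the per-weight loops
    return w

weight = batch_gd(X, y, alpha=2e-8, iters=500, w0=[-.146, .185, -.044, .119])
print('Intercept', weight[0], 'Weights', weight[1:])
```

Each call to `error_delta` in the loop version recomputes every prediction; here one matrix-vector product does the whole pass.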
The data set is below (10 points):
Size,Floor,Broadband Rate,Energy Rating,Rental Price
" 500 "," 4 "," 8 "," C "," 320 "
" 550 "," 7 "," 50 "," A "," 380 "
" 620 "," 9 "," 7 "," A "," 400 "
" 630 "," 5 "," 24 "," B "," 390 "
" 665 "," 8 "," 100 "," C "," 385 "
" 700 "," 4 "," 8 "," B "," 410 "
" 770 "," 10 "," 7 "," B "," 480 "
" 880 "," 12 "," 50 "," A "," 600 "
" 920 "," 14 "," 8 "," C "," 570 "
" 1000 "," 9 "," 24 "," B "," 620 "
When you plot the residuals, the performance is comparable despite very different weights:
Sklearn LR: Intercept 19.5615588974 Weights [ 0.54873985 4.96354677 -0.06209515]
BGD: Intercept -0.145402077197 Weights [ 0.62549182 -0.0344091 0.11473203]
Thoughts? Also, if there's any programming feedback, I'm open to more efficient ways to code this as well.
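On the uniqueness question: with a full-column-rank design matrix the OLS minimizer is indeed unique, so one sanity check is to compute the closed-form solution directly and compare it with sklearn's output. A minimal sketch using np.linalg.lstsq on the question's 10 rows, hard-coded:

```python
import numpy as np

# The question's 10-row data set, hard-coded.
X = np.array([[500, 4, 8], [550, 7, 50], [620, 9, 7], [630, 5, 24],
              [665, 8, 100], [700, 4, 8], [770, 10, 7], [880, 12, 50],
              [920, 14, 8], [1000, 9, 24]], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

# Closed-form OLS: minimize ||Xb w - y||^2 with an explicit intercept column.
Xb = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print('Intercept', w[0], 'Weights', w[1:])
```

If this agrees with sklearn but not with the gradient-descent weights, the gradient descent run has not actually reached the minimum.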
I'm looking at the sklearn documentation for LinearRegression and it says it's Ordinary Least Squares. There are no convergence-related parameters that I can see. The four constructor inputs are (fit_intercept, normalize, copy_X, n_jobs); the n_jobs description says "The number of jobs to use for the computation. If -1 all CPUs are used". How would I check convergence?
For batch gradient descent, I'll double-check the sum of squares, but I believe it converges since I get (within rounding) the same result as the book.
– bradm707 Apr 06 '18 at 20:08
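On checking convergence: sklearn's LinearRegression solves the least-squares problem directly, so there is nothing iterative to converge and no tolerance parameter to expose. For the hand-written batch gradient descent, one way to double-check the sum of squares is to record the SSE after every update and compare the final value against the closed-form optimum. A sketch under those assumptions (`gd_sse_history` is an illustrative helper name; data is the question's 10 rows, hard-coded):

```python
import numpy as np

# The question's 10-row data set, hard-coded so this runs without the CSV.
X = np.array([[500, 4, 8], [550, 7, 50], [620, 9, 7], [630, 5, 24],
              [665, 8, 100], [700, 4, 8], [770, 10, 7], [880, 12, 50],
              [920, 14, 8], [1000, 9, 24]], dtype=float)
y = np.array([320, 380, 400, 390, 385, 410, 480, 600, 570, 620], dtype=float)

def gd_sse_history(X, y, alpha, iters, w0):
    """Run batch gradient descent and record the SSE after every update."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.asarray(w0, dtype=float)
    history = []
    for _ in range(iters):
        resid = y - Xb @ w
        w = w + alpha * (Xb.T @ resid)
        history.append(np.sum((y - Xb @ w) ** 2))
    return w, history

w, history = gd_sse_history(X, y, alpha=2e-8, iters=500,
                            w0=[-.146, .185, -.044, .119])

# Closed-form SSE for comparison.
Xb = np.hstack([np.ones((len(X), 1)), X])
w_opt, *_ = np.linalg.lstsq(Xb, y, rcond=None)
sse_opt = np.sum((y - Xb @ w_opt) ** 2)
print('final GD SSE:', history[-1], 'closed-form SSE:', sse_opt)
```

If the final GD SSE is still well above the closed-form SSE, the run only looks converged because progress per iteration has become tiny (a common symptom when features on very different scales are used with a small learning rate).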