I have a model of some financial data that achieves an R2 of ~0.01 (1%) using RidgeCV, which is about what I expect. I'm exploring building the equivalent model with SGDRegressor so I can use partial_fit to train incrementally on larger-than-memory data sets. Unfortunately, my SGDRegressor does not converge to the same R2 as RidgeCV, even with the same L2 penalty, squared-error loss, and alpha value. I've tried adjusting many parameters, but it's not clear how to choose them.
Any advice on getting SGDRegressor to converge similarly to RidgeCV? E.g., if I have a 4M-row data set, is it better to run tens of thousands of epochs over many small batches (32 rows) with partial_fit, or to use large batches (500,000 rows)? Here is my current code:
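from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

EPOCH_COUNT = 10000
CHUNK_SIZE = 500000

# Initialize SGDRegressor and StandardScaler
sgd = SGDRegressor(
    verbose=0,
    tol=1e-4,
    loss='squared_error',
    penalty='l2',
    alpha=0.1,
    learning_rate='constant',
    eta0=0.01,
    fit_intercept=True,  # play with this
    shuffle=True,
)
scaler = StandardScaler()
regmodel = make_pipeline(StandardScaler(), sgd)  # pipeline variant; not used in the loop below

r2s = []      # train R2 after each epoch
r2_test = []  # test R2 after each epoch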
# Load data in chunks
for i in range(0, len(df_targets), CHUNK_SIZE):
    df_chunk = df_targets.iloc[i:i + CHUNK_SIZE]

    # Skip the final, partial chunk
    if len(df_chunk) != CHUNK_SIZE:
        print('skipping chunk {} as len(df_chunk) = {} != CHUNK_SIZE = {}'.format(i, len(df_chunk), CHUNK_SIZE))
        break

    # Split df_chunk into train and test sets using sklearn
    chunk, chunk_test = train_test_split(df_chunk, test_size=0.1, shuffle=False)
    y = chunk[hyper['target_name']].values
    X = chunk[features].values
    X_test = chunk_test[features].values
    y_test = chunk_test[hyper['target_name']].values

    # Hack
    if i == 0:
        # Fit the scaler on the first chunk only
        print('fitting scaler on first chunk')
        scaler = scaler.fit(X)

    X = scaler.transform(X, copy=False)
    X_test = scaler.transform(X_test, copy=False)

    # Each partial_fit call makes one pass over the current chunk
    for j in range(0, EPOCH_COUNT):
        sgd.partial_fit(X, y)
        level_2_r2_score = sgd.score(X, y)
        r2s.append(level_2_r2_score)
        r2_score_test = sgd.score(X_test, y_test)
        r2_test.append(r2_score_test)
        print('chunk {} - EPOCH {} - === > r2_train = {} - r2_test = {} '.format(i, j, level_2_r2_score, r2_score_test))