I'm trying to get the same linear SVM model using Scikit-Learn's SVC, LinearSVC and SGDClassifier classes. I managed to do so (see the code below), but only by manually tweaking the alpha hyperparameter of the SGDClassifier class.
Both SVC and LinearSVC have the regularization hyperparameter C, but SGDClassifier uses the regularization hyperparameter alpha instead. The documentation says that C = n_samples / alpha, so I set alpha = n_samples / C, but with this value the SGDClassifier ends up being a very different model from the SVC and LinearSVC models. If I tweak the value of alpha manually, I can get all three models to be approximately the same, but there should be a simple equation to find alpha given C. What is it?
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

C = 5
alpha = len(X) / C  # alpha == 20, following the documented formula

# max_iter replaces the old n_iter parameter (removed in Scikit-Learn 0.21)
sgd_clf1 = SGDClassifier(loss="hinge", alpha=alpha, max_iter=10000, random_state=42)
sgd_clf2 = SGDClassifier(loss="hinge", alpha=0.0007, max_iter=10000, random_state=42)  # alpha tweaked by hand
svm_clf = SVC(kernel="linear", C=C)
lin_clf = LinearSVC(loss="hinge", C=C)

# Center the inputs so that LinearSVC matches SVC (see the note below)
X_scaled = StandardScaler().fit_transform(X)
sgd_clf1.fit(X_scaled, y)
sgd_clf2.fit(X_scaled, y)
svm_clf.fit(X_scaled, y)
lin_clf.fit(X_scaled, y)

print("SGDClassifier(alpha=20): ", sgd_clf1.intercept_, sgd_clf1.coef_)
print("SGDClassifier(alpha=0.0007): ", sgd_clf2.intercept_, sgd_clf2.coef_)
print("SVC: ", svm_clf.intercept_, svm_clf.coef_)
print("LinearSVC: ", lin_clf.intercept_, lin_clf.coef_)
This code outputs:
SGDClassifier(alpha=20): [-0.46597258] [[ 0.0283698 -0.03634389]]
SGDClassifier(alpha=0.0007): [ 0.0422716] [[ 0.79608868 -1.48847539]]
SVC: [ 0.04569242] [[ 0.79788013 -1.48716383]]
LinearSVC: [ 0.04556911] [[ 0.79762806 -1.4866854 ]]
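A side note on how I'm comparing these: since the decision boundary w·x + b = 0 is unchanged when (w, b) is rescaled, the boundary's slope and intercept are easier to compare than the raw weights. Here is a minimal sketch reusing the fitted models above (the loop and labels are just mine, for illustration):

# A linear boundary w1*x1 + w2*x2 + b = 0 can be rewritten as
# x2 = -(w1/w2) * x1 - b/w2, which is invariant to rescaling (w, b).
for name, clf in [("SGDClassifier(alpha=20)", sgd_clf1),
                  ("SGDClassifier(alpha=0.0007)", sgd_clf2),
                  ("SVC", svm_clf),
                  ("LinearSVC", lin_clf)]:
    w1, w2 = clf.coef_[0]
    b = clf.intercept_[0]
    print(f"{name}: slope={-w1 / w2:.4f}, intercept={-b / w2:.4f}")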
Note: to make the LinearSVC class output the same result as the SVC class, you have to center the inputs (e.g., using the StandardScaler), since it regularizes the bias term (weird). You also need to set loss="hinge", since the default is "squared_hinge" (weird again).
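If I read the LinearSVC docstring correctly, the bias regularization happens because LinearSVC appends a synthetic constant feature equal to intercept_scaling (default 1) to every sample and learns the bias as that feature's weight, so the bias gets penalized like any other weight. Increasing intercept_scaling should lessen the effect; a quick sketch of that workaround (my own experiment, not something the code above relies on):

# The intercept is learned as intercept_scaling * w_synthetic, so a larger
# intercept_scaling reduces the regularization pressure on the bias term.
lin_clf2 = LinearSVC(loss="hinge", C=C, intercept_scaling=10)
lin_clf2.fit(X_scaled, y)
print("LinearSVC(intercept_scaling=10): ", lin_clf2.intercept_, lin_clf2.coef_)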
So my question is: how does alpha really relate to C in Scikit-Learn? Looking at the equations (written out below, as I understand them), the documentation should be right, but in practice it is not. What's going on?
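For reference, here are the two objectives as I understand them (my reading of the docs, so corrections are welcome; I'm ignoring LinearSVC's extra regularization of the bias term mentioned in the note above). SVC and LinearSVC with loss="hinge" solve

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\max\left(0,\ 1 - y_i(w^\top x_i + b)\right)$$

while SGDClassifier with loss="hinge" and penalty="l2" solves

$$\min_{w,b}\ \frac{\alpha}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n}\max\left(0,\ 1 - y_i(w^\top x_i + b)\right)$$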