Why are my weights and and bias incorrect after scaling?

Question

I am trying to create my own linear regression model in Python. I have a working model, but when I try to add a preprocessing function that scales the feature vectors I get incorrect weights.

Below is my code:

import numpy as np
import pandas as pd
df = pd.read_csv("test_data.csv")
class LinearRegression:
def gradient_decent(self, X, y, w, b, alpha):
    &quot;&quot;&quot;
    :param X: A pandas dataframe, such that each column represents an example and the rows represent the features
    :param y: The labels. Each label corresponds to a column in the data (i.e. len(y) == len(X.cols))
    :param w: An array of the weight of the model (i.e. len(w) == len(X.rows))
    :param b: The bias. A constant.
    :param alpha: The learning rate - aka the step size
    :return:
    &quot;&quot;&quot;

    # Define N as the number of training examples
    N = len(y)

    # Initialise delta E / delta W as a vector of zeros - number of rows in the dataframe
    delta_w = np.zeros(len(X[0]))

    # Initialise delta E / delta b as 0
    delta_b = 0

    for i in range(N):

        wx = sum(w * X[i])

        delta_w += 2 * X[i] * (wx + b - y[i])
        delta_b += wx + b - y[i]

    w = w - ((2 / N) * alpha * delta_w)
    b = b - ((2 / N) * alpha * delta_b)

    return w, b

@staticmethod
def mean_square_error(X, y, w, b):
    res = 0
    N = len(y)

    for i in range(N):

        wx = sum(w * X[i])
        res += (wx + b - y[i]) ** 2

    return res / N

# ================== ERROR HERE ==================

@staticmethod
def feature_scaling(X, y):
    &quot;&quot;&quot;
    This function converts floating-point features from their natural range (e.g. from 100 to 900) into a standard
    range (e.g. -3 to 3) by calculating the z-score.
    :param X: The feature vectors (a Pandas DataFrame)
    :param y: The labels (a Pandas DataFrame)
    :return: The scaled data
    &quot;&quot;&quot;
    for col in X:
        mean = X[col].mean()
        std = X[col].std()
        X[col] = X[col].apply(lambda x: (x - mean) / std)

    y = y.apply(lambda x: (x - y.mean()) / y.std())

    return X, y

# ================================================

def preprocessing(self, X, y):
    X = X.transpose()
    X, y = self.feature_scaling(X, y)
    X = X.to_numpy()
    y = y.to_numpy()
    return X, y

def train(self, X, y, alpha, epochs, w=None, b=0):
    X, y = self.preprocessing(X, y)
    if w is None:
        w = np.zeros(len(X[0]))
    for e in range(epochs):
        w, b = self.gradient_decent(X, y, w, b, alpha)

        if e % 5000 == 0:
            print('epochs:', e, 'loss:', self.mean_square_error(X, y, w, b), 'w:', w, 'b:', b)

    print('---- FINAL ---')
    return w, b, &quot;Loss:&quot;, self.mean_square_error(X, y, w, b)



X = pd.DataFrame([df['x1'], df['x2']])
print(LinearRegression().train(X=X, y=df['y'], alpha=0.1, epochs=50000))

Without feature vector scaling my LinearRegression().train() returns:

w: [11.66585762 3.33397344], b: 23.331022792787955

Which are roughly the same values I would get if I used sklearn.linear_model.

However when I run the same code with feature_scaling(), my model returns:

w: [0.72353595 0.27347088], b: 8.326672684688619e-17

I am not too sure what the issue is: whether it's an issue with the code or whether its an issue with my maths, but any help at all on the matter is hugely appreciated!

Thanks in advance.

You'll get different w and b after feature scaling. Why do you think that they're wrong? — gunes, May 14 '21 at 13:07
Oh ok, I wan't sure if I would get different weights or not. Is there any way to express my 'scaled' weights and bias as if they had been fit to the original data, i.e. have w = [11.6, 3.33] and not w = [0.72, 0.23] — jda5, May 14 '21 at 13:10

Tim · Accepted Answer · 2021-05-14T16:14:41.517

This is by design. Parameters of linear regression are scaled along with the data. If you scale the features, the parameters would get scalled the opposite way. I'll give you an example in R, because the code would be shorter like this.

> y <- mtcars[,1]
> X <- as.matrix(mtcars[,2:ncol(mtcars)])
> lm(y~X)

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)         Xcyl        Xdisp          Xhp        Xdrat          Xwt  
   12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
      Xqsec          Xvs          Xam        Xgear        Xcarb  
    0.82104      0.31776      2.52023      0.65541     -0.19942  

> X[,1] <- X[,1] * 100
> lm(y~X)

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)         Xcyl        Xdisp          Xhp        Xdrat          Xwt  
  12.303374    -0.001114     0.013335    -0.021482     0.787111    -3.715304  
      Xqsec          Xvs          Xam        Xgear        Xcarb  
   0.821041     0.317763     2.520227     0.655413    -0.199419

As you can see, scaling Xcyl by 100, also scaled the relevant parameter.

If you want to know why to look again at the definition of the linear regression model

$$ y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon $$

For the results to be unchanged, if you multiply $X_1c$, you need to scale the parameter as well $\beta_1/c$. The same applies to all the other transformations of the features, they will also lead to changes in the parameters, though the change of the parameters may be harder to derive in other cases.

You ask about normalizing the variables in general. The normalization is

$$ Z_i = \frac{X_i - \bar x_i}{ s_i } $$

where $X_i$ is the value of the $i$-th feature, $m_i$ and $s_i$ are the mean and standard deviation respectively. So if you want to back-transform the parameters $\tilde \beta_i$ calculated on transformed data

$$ \tilde \beta_i Z_i = \tilde \beta_i \frac{X_i - \bar x_i}{ s_i } = \frac{\tilde \beta_i X_i}{ s_i } - \frac{\tilde \beta_i \bar x_i}{ s_i } = (\tilde \beta_i / s_i) X_i - (\tilde \beta_i / s_i) m_i $$

Notice that $m_i$ and $s_i$ are known constants, also $(\tilde \beta_i / s_i) X_i$ should be familiar to you, this is just a scaled parameter. So you need to collect all the constants into the intercept $\beta_0 = \tilde \beta_0 - (\tilde \beta_1 / s_1) m_1 - \dots$ and transform the parameter $\beta_i = \tilde \beta_i / s_i$.

Code example

> y <- mtcars[,1]
> X <- as.matrix(mtcars[,2:3])
> lm(y~X)
Call:
lm(formula = y ~ X)
Coefficients:
(Intercept)         Xcyl        Xdisp

   34.66099     -1.58728     -0.02058

Now with normalized features

> m <- colMeans(X)
> s <- apply(X, 2, sd)
> Z <- t(apply(X, 1, function(x) (x - m) / s))
> lm(y~Z)
Call:
lm(formula = y ~ Z)
Coefficients:
(Intercept)         Zcyl        Zdisp

     20.091       -2.835       -2.551

And here how you back-transform the parameters

> beta <- coef(lm(y~Z))
> beta[1] - sum(beta[2:3]/s * m)
[1] 34.6609947413328
> beta[2:3]/s
[1] -1.58727680900718 -0.0205836333707016

Thanks for your answer Tim. This has certainly cleared some things up. Is there any way to express my 'scaled' weights and bias as if they had been fit to the original data, i.e. have w = [11.6, 3.33] and not w = [0.72, 0.23] — jda5, May 14 '21 at 13:16
@jda5 you can, scaling is just $(x - m)/s$, so you can use algebra to revert it. Why you need to back-transform them? — Tim, May 14 '21 at 13:20
Well at the moment the user submits some data, my code scales it (to optimise the gradient decent algorithm), then fits a linear model to it. I just think it would make more sense to the user if the software returned the model weights in form relative to the data that the user submitted. — jda5, May 14 '21 at 13:36
@jda5 you could use least squares for estimation, in such case, there is no need for scaling. SGD is not the best optimization algorithm for the problem. — Tim, May 14 '21 at 14:11
Thanks so much for your help Tim. Though I am still struggling a lot with re-scaling my weights and my bias. Any chance you can help me out? How would I do this? — jda5, May 14 '21 at 15:22

Why are my weights and and bias incorrect after scaling?

1 Answers1

Linked