
I'm following this guide on detecting anomalies using autoencoders. The section titled "Normalising & Standardising" seems to describe normalization as scaling and shifting the features so that they are centered around 0 with a standard deviation of 1. But the implementation in the pipeline uses sklearn.preprocessing.Normalizer, which, if I understand correctly, uses the L2 vector norm to scale your feature vectors to have an L2 norm of 1.

Are these two somehow the same? Are there different "normalization" methods? If so, what should be used when?


1 Answer


The Kaggle post describes a different procedure than the code carries out. What the author is trying (but failing) to say is that preconditioning can improve gradient-based optimization. Here's an answer that explains this more precisely: https://stats.stackexchange.com/a/437848/22311

Suppose we have some matrix $X$ where the rows store observations (examples) and the columns store features (the measurements you collect for each example).

sklearn.preprocessing.Normalizer rescales the feature vector for each observation. So if an observation $i$ has feature vector $x_i$, then after applying sklearn.preprocessing.Normalizer, we have $\| x_i \|=1 ~ \forall i$. In other words, all of the rows of $X$ have the same length. This is why all of the data points fall along a clean curve in the sklearn plot: all of the plotted points are the same distance from the origin.
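
As a quick check (a minimal sketch with made-up data, not the code from the guide), you can verify that Normalizer leaves every row with unit L2 norm:

import numpy as np
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))  # 10 observations (rows), 5 features (columns)

# Normalizer rescales each row by its L2 norm
X_rows = Normalizer().fit_transform(X)

# every row of the transformed matrix has length 1
print(np.linalg.norm(X_rows, axis=1))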

[Plot: after applying Normalizer, every point lies at unit distance from the origin, so the data fall along a clean curve.]

But sklearn.preprocessing.Normalizer is different from the "normalization" and "standardization" that OP describes. Indeed, most usages of "normalizing" and "standardizing" are consistent with what OP describes in their question. Usually, "normalizing" and "standardizing" refer to rescaling the features themselves. In other words, these are operations that scale and shift the columns of $X$, as described in What's the difference between Normalization and Standardization? This question has answers about when to use these methods: When to Normalization and Standardization?
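
To make the column-wise behaviour concrete, here is a small sketch (synthetic data and variable names of my own, not from the post) showing that StandardScaler leaves each column with mean 0 and standard deviation 1:

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# three columns with very different means and spreads
X = rng.normal(loc=[-2.0, 0.0, 5.0], scale=[0.5, 1.0, 3.0], size=(100, 3))

Z = StandardScaler().fit_transform(X)  # z-scores computed column by column

print(Z.mean(axis=0))  # approximately [0, 0, 0]
print(Z.std(axis=0))   # approximately [1, 1, 1]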

Intuitively, we would not expect composing L2 row scaling with min/max scaling to be the same, in general, as scaling the columns to have mean 0 and unit variance. This is because L2 row scaling makes the values in each row depend on all of the other values in that row. On the other hand, $z$-scores are computed from the columns alone.

A direct way to demonstrate this is to apply the two transformations to the same data and compare the results.

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

prng = np.random.default_rng(12345)
X = prng.multivariate_normal(
    mean=[-2, -1, 0, 1, 2],
    cov=np.eye(5) * np.array([0.1, 0.3, 0.5, 0.7, 0.9]),
    size=10,
)

# Pipeline 1: row-wise L2 normalization, then column-wise min/max scaling
pipe1 = Pipeline([("normalizer", Normalizer()), ("scaler", MinMaxScaler())])

# Pipeline 2: column-wise standardization (zero mean, unit standard deviation)
pipe2 = Pipeline([("standard", StandardScaler())])

X1 = pipe1.fit_transform(X)
X2 = pipe2.fit_transform(X)

print(X1 - X2)
assert np.allclose(X1, X2)

The code raises an exception because the transformations are not identical: X1 is different from X2, and the size of the differences can be very large!


Why does the author of the Kaggle post make this error?

The Kaggle text quotes Giorgos Myrianthous's answer to this Stack Overflow question that describes centering and scaling the data, which is close to what the StandardScaler does. For some reason, the Stack Overflow post uses Normalizer instead of StandardScaler. Apparently, neither Giorgos Myrianthous nor the Kaggle author bothered to read the documentation to determine which function applies centering by the mean and scaling by the standard deviation.

Also, Giorgos Myrianthous's answer describes rescaling by the variance for some reason. That doesn't make much sense because the variance is measured in units squared; StandardScaler rescales by the standard deviation, which is measured in the same units as the data. Moreover, scaling a non-constant random variable by its standard deviation gives a random variable with variance of 1. Rescaling by the variance does not do this, unless the variance is already 1.
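
A quick numerical illustration of this point (a minimal sketch with synthetic data, not taken from either post): dividing by the standard deviation yields unit variance, while dividing by the variance does not.

import numpy as np

prng = np.random.default_rng(0)
x = prng.normal(loc=0.0, scale=3.0, size=100_000)  # variance is roughly 9

# dividing by the standard deviation gives variance approximately 1
print(np.var(x / np.std(x)))

# dividing by the variance gives variance approximately 1/9, not 1
print(np.var(x / np.var(x)))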

I've demonstrated several ways in which Giorgos Myrianthous's answer is misleading in https://stackoverflow.com/a/71887356/2482661.

  • As I expected, the post is just wrong. I'm guessing that the L2 norm doesn't have practical applications for general feature sets unless your feature space holds Euclidean relations? – bli00 Mar 30 '22 at 16:07
  • I'm trying to think of cases where I've seen people use row-level L2 normalization. The only thing that comes to mind is NLP tasks where you need to compare documents with dramatically different lengths (e.g. 100 words versus 10,000 words). In those scenarios, the longer documents will tend to look similar to many other documents, simply because they contain more words and therefore share more words with other documents. L2 norm can mitigate that. I'm sure there are other examples. Also, applying the L2 norm as a first step simplifies cosine similarity to just a dot product (see the short sketch after these comments). – Sycorax Mar 30 '22 at 16:09
  • Thank you for the clarification – bli00 Mar 30 '22 at 16:10
  • If you really want to pull on this thread, you might search Github for instances where people import or use the L2 normalizer class. It will probably turn up some interesting applications. – Sycorax Mar 30 '22 at 16:12
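
To illustrate the cosine-similarity point from the comments (a small sketch with made-up vectors, not from any of the linked posts): once the rows have been scaled to unit L2 norm, cosine similarity reduces to a plain dot product.

import numpy as np
from sklearn.preprocessing import Normalizer

a = np.array([[3.0, 4.0, 0.0]])
b = np.array([[1.0, 2.0, 2.0]])

# cosine similarity computed directly from the raw vectors
cosine = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))

# after L2 normalization, the plain dot product gives the same value
a_unit = Normalizer().fit_transform(a)
b_unit = Normalizer().fit_transform(b)
dot = a_unit @ b_unit.T

print(cosine, dot)
assert np.allclose(cosine, dot)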