I have seen the min-max normalization formula, but it normalizes values to the range [0, 1]. How would I normalize my data between -1 and 1? My data matrix contains both negative and positive values.
2 Answers
With: $$ x' = \frac{x - \min{x}}{\max{x} - \min{x}} $$ you normalize your feature $x$ to $[0,1]$.
To normalize in $[-1,1]$ you can use:
$$ x'' = 2\frac{x - \min{x}}{\max{x} - \min{x}} - 1 $$
In general, you can always get a new variable $x'''$ in $[a,b]$:
$$ x''' = (b-a)\frac{x - \min{x}}{\max{x} - \min{x}} + a $$
And if you want to bring a variable back to its original values, you can, because these are linear transformations and thus invertible. For example:
$$ x = (x''' - a)\frac{(\max{x} - \min{x})}{b-a} + \min{x} $$
An example in Python:
import numpy as np

x = np.array([1, 3, 4, 5, -1, -7])

# goal: range [0, 1]
x1 = (x - x.min()) / (x.max() - x.min())
print(x1)
# [0.66666667 0.83333333 0.91666667 1.         0.5        0.        ]
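The same recipe extends to $[-1,1]$, to a general $[a,b]$, and to the inverse transformation. A minimal sketch following the formulas above (the particular $a$ and $b$ are just for illustration):

# goal: range [-1, 1]
x2 = 2 * (x - x.min()) / (x.max() - x.min()) - 1
print(x2)
# [ 0.33333333  0.66666667  0.83333333  1.          0.         -1.        ]

# goal: general range [a, b]
a, b = -5.0, 5.0
x3 = (b - a) * (x - x.min()) / (x.max() - x.min()) + a

# invert, using the min and max of the *original* data
x_back = (x3 - a) * (x.max() - x.min()) / (b - a) + x.min()
print(x_back)
# [ 1.  3.  4.  5. -1. -7.]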
Honestly I don't have citations for this. It is just a linear transformation of a random variable. Have a look at the effect of linear transformations on the support of a random variable. – Simone Oct 19 '17 at 04:40
@Simone: Is there a way to renormalize all the values? i.e. bring them back to their original values? – Srivatsan Jun 26 '20 at 16:56
@ThePredator this is a linear transformation of a random variable, so it is invertible. But you need to know the original $\max{x}$ and $\min{x}$. If you have $x''$ (as in the formula above) in $[-1,1]$ you can get back to $x$ with $(\max{x} - \min{x})\frac{x''+1}{2} + \min{x}$. – Simone Jun 28 '20 at 09:49
or in general: $x=\frac{(x'''-a)(\max{x}-\min{x})}{b-a}+\min{x}$. I advise keeping your original and normalised datasets; you can then find $\max{x}$ and $\min{x}$ easily by just looking at your original dataset again. – GMSL Oct 11 '21 at 11:22
Just a remark: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html performs the same normalization as in $x'''$ – Simone Feb 15 '22 at 16:02
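For reference, a minimal sketch of that scikit-learn call (MinMaxScaler expects a 2-D array, so the single feature is reshaped into a column; the variable names are mine):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([1, 3, 4, 5, -1, -7]).reshape(-1, 1)  # one feature as a column
scaler = MinMaxScaler(feature_range=(-1, 1))       # target range [a, b]
x_scaled = scaler.fit_transform(x)                 # same result as x'' above
x_original = scaler.inverse_transform(x_scaled)    # undoes the transformation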
I tested on randomly generated data, and
\begin{equation} X_{out} = (b-a)\frac{X_{in} - \min{X_{in}}}{\max{X_{in}} - \min{X_{in}}} + a \end{equation}
does not preserve the shape of the distribution. I would really like to see a proper derivation of this using functions of random variables.
The approach that did preserve the shape for me was using:
\begin{equation} X_{out} = \frac{X_{in} - \mu_{in}}{\sigma_{in}} \cdot \sigma_{out} + \mu_{out} \end{equation}
where
\begin{equation} \sigma_{out} = \frac{b-a}{6} \end{equation}
(I admit that using 6 is a bit dirty) and
\begin{equation} \mu_{out} = \frac{b+a}{2} \end{equation}
and
$a$ and $b$ are the bounds of the desired range; per the original question, $a=-1$ and $b=1$.
I arrived at this result by equating the standardized variables:
\begin{equation} Z_{out} = Z_{in} \end{equation}
\begin{equation} \frac{X_{out} - \mu_{out}}{\sigma_{out}} = \frac{X_{in} - \mu_{in}}{\sigma_{in}} \end{equation}
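A minimal numpy sketch of this standardize-and-rescale approach, as I understand it from the formulas above (variable names are mine; note that, unlike min-max scaling, this does not guarantee the output lies inside $[a,b]$, as the comments below discuss):

import numpy as np

rng = np.random.default_rng(42)
x_in = rng.normal(loc=10.0, scale=3.0, size=1000)  # example input data

a, b = -1.0, 1.0                  # desired range, as in the original question
mu_out = (b + a) / 2              # target mean
sigma_out = (b - a) / 6           # target standard deviation (the "6" heuristic)

x_out = (x_in - x_in.mean()) / x_in.std() * sigma_out + mu_out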
Are you sure that this guarantees the transformed data will lie within the bounds? In R, try: set.seed(1); scale(rnorm(1000))*.333. I get a max of 1.230871. Your method seems to be just a tweak on standardizing data, rather than normalizing them as requested. Note that the question does not ask for a method that preserves the shape of the distribution (which would be a strange requirement for normalization). – gung - Reinstate Monica Jul 17 '19 at 17:01
I'm not sure how the original transformation could fail to preserve the shape of the data. It's equivalent to subtracting a constant and then dividing by a constant, which is what your proposal does, and which doesn't change the shape of the data. Your proposal assumes all the data falls within three standard deviations of the mean, which may be somewhat reasonable with small, approximately normally distributed samples, but not with big or non-normal samples. – Noah Jul 17 '19 at 17:01
@Noah It's not equivalent to subtracting and dividing by constants, because the min and max of the data are random variables. Indeed, for most underlying distributions they are pretty variable--more variable than the rest of the data--whence using them for any form of standardization is usually not a good idea. In this answer it's unclear what $a$ and $b$ mean or how they might be related to the data. – whuber Jul 17 '19 at 17:15
@whuber true, but I meant that in a given dataset (i.e., treating the data as fixed), they are constants, in the same way the sample mean and sample standard deviation function as constants when standardizing a dataset. My impression was that OP wanted to normalize a dataset, not a distribution. – Noah Jul 17 '19 at 17:57
@Noah I had the same impression, but I believe the present post may be responding to a different interpretation. – whuber Jul 17 '19 at 19:57
See also convertRange, shared by Giuseppe Canale. – Galen Oct 16 '22 at 15:50