
I am trying to pre-process data following a statement in a paper.

They said:

"For the normalization, each dataset is normalized on a per channel basis with a sample range from -1 to 1 and a mean value of zero."

$X_{rescaled} = a + (b-a)\frac{X - \min(X)}{\max(X) - \min(X)}$. Following this formula, I can rescale the data $X$ into an arbitrary interval $[a, b]$, but it cannot guarantee that the mean is zero.
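
For example, a quick check in R with a made-up toy vector shows that this rescaling fixes the range but not the mean:

    # Min-max rescale a made-up vector to [a, b] = [-1, 1]
    x <- c(1, 2, 2, 3, 10)
    a <- -1; b <- 1
    x_rescaled <- a + (b - a) * (x - min(x)) / (max(x) - min(x))
    range(x_rescaled)  # -1 1, as intended
    mean(x_rescaled)   # about -0.42, not zero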

Probably I didn't get the statement right; does anyone have any idea what they mean here?

  • You would have to perform the normalization in two parts, separately for the below-average and above-average values, but this will change the relationships in the data. – user2974951 Nov 08 '22 at 11:14
  • Maybe e-mail the authors? – Tim Nov 08 '22 at 11:19
  • I e-mailed them a month ago with a few other questions and got no response. They did reply to me once; maybe they got tired of my endless questions :) – Margie Shi Nov 08 '22 at 11:25
  • Does the quote mean that the new values are mapped to $[-1,1]$ or is this interval only the range of possible values? – user2974951 Nov 08 '22 at 11:27
  • @user2974951 thx! Could you please explain a bit more? Do you mean first centring the data to zero mean and then rescaling the two parts to [-1, 0] and [0, 1] respectively? – Margie Shi Nov 08 '22 at 11:28
  • I think the question is, *What do you want to do? And maybe, why do you want to do this?* Is it just a matter of figuring out what the authors of the paper did? Or is there some point to transforming data in this way? ... It seems to me that if you took a vector of data and both rescaled it to (-1, 1) and forced the mean to zero, most of the information would have been squeezed out of this set of values. – Sal Mangiafico Nov 09 '22 at 15:45
  • Could you include a link to the paper? It's possible the authors got it wrong and didn't understand what their normalization was doing. – Noah Nov 10 '22 at 15:59
  • @Noah, here is the arXiv link: https://arxiv.org/abs/2202.01208. You can find the quote in section 3.2.2, data preparation. – Margie Shi Nov 11 '22 at 21:07
  • @SalMangiafico, yes, I agree. The authors mentioned it was for data preparation (before training the neural network). I assume it could help get better initializations, because I always get a higher training loss than the paper reported. – Margie Shi Nov 11 '22 at 21:09
  • There's not much helpful in the description in the paper: "Normalization: Each dataset is normalized on a per channel basis with a sample range from -1 to 1 and a mean value of zero." ... Unfortunately, "normalize" could mean a few different things. ... This isn't my field, so I have no intuition as to what transformation would be common. If I had to guess, I'd guess that they converted each to a normal distribution with mean of 0 and sd of 1, as by normal scores inverse transformation. But that's just a guess, and isn't what the paper says. – Sal Mangiafico Nov 12 '22 at 01:21
  • @MargieShi I updated my answer to include one more reference that may help you interpret what the authors of the paper you're looking into meant. Good luck! – bmasri Nov 14 '22 at 13:55

2 Answers


I believe the intended application of this preprocessing technique is to first standardize the data to zero mean and then change its range to $[-1,1]$ for neural network input. To quote this reference:

You will get better initializations if the data are centered near zero and if most of the data are distributed over an interval of roughly [-1,1]

However, in all cases, first applying a standardization technique to get zero mean and then applying a min-max normalization to map your features to a certain range is meaningless... it is equivalent to normalizing $X$ alone, because standardization is just a shift and a positive rescaling, and the min-max formula is unaffected by such transformations (check this question).
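
A quick sanity check in R (a made-up toy vector, just to illustrate the equivalence):

    # Min-max scaling gives the same result with or without prior standardization
    x  <- c(1, 2, 2, 3, 10)
    mm <- function(v, a = -1, b = 1) a + (b - a) * (v - min(v)) / (max(v) - min(v))
    z  <- (x - mean(x)) / sd(x)   # standardize first
    all.equal(mm(x), mm(z))       # TRUE: the standardization step changes nothing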

What I would suggest is standardization alone, to zero mean and unit variance, like so:

$$ X_{i,\text{scaled}} = \frac{X_{i} - \mu_{X_i}}{\sigma_{X_i}} $$

where $\mu_{X_i}$ and $\sigma_{X_i}$ are computed for every feature $i$. Then, if you really want a range of $[-1,1]$ and your data is Gaussian, you can simply filter out the values that fall outside that interval, which for a Gaussian distribution leaves you with about $68\%$ of your data (the values within one standard deviation). In the reference that I quote, they say 'most of the data', which in this case means $68\%$ if it is Gaussian. But this seems like overkill to me, since you lose too much data.
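
A minimal R sketch of that idea, with a made-up Gaussian feature (standardize, then keep only the values inside $[-1,1]$):

    # Standardize a made-up Gaussian feature, then keep values inside [-1, 1]
    set.seed(1)
    x <- rnorm(1000, mean = 5, sd = 2)   # made-up feature
    z <- (x - mean(x)) / sd(x)           # zero mean, unit variance
    z_kept <- z[abs(z) <= 1]             # filter to [-1, 1]
    length(z_kept) / length(z)           # roughly 0.68 of the data survives
    mean(z_kept)                         # still close to zero (symmetric distribution)

Clipping with pmax(pmin(z, 1), -1) instead of filtering keeps every sample but piles the tails onto the boundaries; either way the mean stays close to zero for a roughly symmetric distribution.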

On the other hand, if your data is not Gaussian, then clipping values will result in an even greater loss of data (and information). So one should choose a better mapping (e.g. logarithmic, Box-Cox, and so on) to make the data Gaussian-like first, then standardize to zero mean and unit variance, then clip to $[-1,1]$ if necessary. An example using a simple logarithmic transformation for non-Gaussian features:

$$ X_i = \log(X_i + \epsilon) $$ then $$ X_i = \frac{X_{i} - \mu_{X_i}}{\sigma_{X_i}} $$
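
In R, that pipeline might look like the sketch below (the feature and the $\epsilon$ value are made up for illustration):

    # Log-transform a made-up skewed, non-negative feature, then standardize
    set.seed(42)
    eps   <- 1e-6                                # small constant to avoid log(0)
    x     <- rexp(1000, rate = 1)                # made-up skewed feature
    x_log <- log(x + eps)
    x_std <- (x_log - mean(x_log)) / sd(x_log)   # zero mean, unit variance
    c(mean(x_std), sd(x_std))                    # approximately 0 and 1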

See scikit-learn's implementation of standard scaler here and their implementation of non-linear data transformers here.

Update: I found this scikit-learn page that discusses tips to tackle deep learning data preprocessing very useful. They do indeed mention that it is recommended to either scale data to $[-1,1]$ or standardize it to have zero mean and unit variance. Applying both is meaningless as shown in this answer that @Tim so kindly provided.

bmasri
  • This doesn't work, the mean is not 0. – user2974951 Nov 08 '22 at 13:06
  • In fact you're right, I realize that applying a standardization first and then a normalization is equivalent to the normalization alone. So I updated my answer with a more correct way to get what you are looking for. – bmasri Nov 08 '22 at 13:45
  • @bmasri thanks for your informative answers! I am thinking about this because my training loss is much higher (5-10 times) than what the paper reported at the beginning, which is probably related to the initializations. Unfortunately, the data is not Gaussian; it is raw ultrasound data, actually. I am not sure taking the logarithm makes sense here. – Margie Shi Nov 09 '22 at 13:56
  • @bmasri, do you think it makes sense to first change the range to $[-1, 1]$ and then subtract the mean for each feature? – Margie Shi Nov 09 '22 at 14:00
  • Not really, since subtracting the mean will definitely mess with your $[-1,1]$ scale... in fact it will become $[-1 - \mu, 1 - \mu]$, which is a weird way to preprocess data... I think it is important to test normalization and standardization independently to see which one works best for your model. But if you are trying to reproduce a paper's results with a zero mean and $[-1,1]$, then I think standardization followed by clipping to $[-1,1]$ or $[-2,2]$ (if it is Gaussian) works best, in my experience. – bmasri Nov 09 '22 at 14:19
  • As for your raw data, of course it may not make real sense to transform it into a Gaussian distribution. In fact, neural networks do not make any normality assumptions about your input data. However, transforming your data into a known Gaussian distribution may help you remove outliers based on standard-deviation rules, and it helps the convergence of your training process when training is split into mini-batches, since the batches will then have a more consistent distribution. This may reduce your loss. Also, taking a larger batch may help too. – bmasri Nov 09 '22 at 14:35
  • To put in perspective how little logical sense neural network inputs need to make, YouTube's recommender system uses as input a feature $x$, its square root $\sqrt{x}$ and its square $x^2$, after the feature $x$ has been standardized... This gives the network more non-linear power, but providing such features really makes no sense from a logical point of view, does it? :) – bmasri Nov 09 '22 at 14:37
  • Thank you for providing the link to the question that proves it mathematically. +1 @Tim – bmasri Nov 14 '22 at 16:04

The following R code was my attempt to solve the problem of finding an algorithm that converts a vector of values into a new vector of values with a mean of 0 and a range of [-1, 1].

After one pass, it doesn't quite work. (The result will have the specified range, but the mean will be off zero, though usually not by much.)

However, it appears that if it's applied iteratively (feeding the results back through the algorithm), it will reach the desired result within a certain tolerance.

I doubt this transformation is practically useful. And one could probably come up with a simpler process to arrive at a similar result.

Briefly, it centers A on a mean of 0, splits the centered vector A_center into the values below and above 0 (A1 and A2), prepends a value of exactly 0 to each, and then applies a linear transformation to each of these to fit the ranges [-1, 0] and [0, 1], respectively.

    ### RUN THIS ONCE, WITH A AS THE INPUT VECTOR ###

    A = c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,8,8,8,8,8,8,8,8,8,8,8)
    A_trans = A

    ### RUN THIS ITERATIVELY UNTIL THE RESULTS ARE WITHIN TOLERANCE ###

    A = A_trans
    hist(A)

    # Center A on a mean of zero
    A_center = A - mean(A)

    # Split the centered values into the negative and positive parts,
    # adding a value of exactly 0 to each
    A1 = c(0, A_center[A_center < 0])
    A2 = c(0, A_center[A_center > 0])

    # Linearly rescale the two parts to [-1, 0] and [0, 1], respectively
    A1_scaled = (A1 - min(A1)) * (0 - (-1)) / (max(A1) - min(A1)) + (-1)
    A2_scaled = (A2 - min(A2)) * (1 - 0) / (max(A2) - min(A2)) + 0

    # Recombine, dropping the two added zeros
    A_trans = c(A1_scaled[2:length(A1_scaled)], A2_scaled[2:length(A2_scaled)])
    hist(A_trans)

    # Summarize the original, centered, and transformed vectors
    Sum = data.frame(
      N             = c(length(A), length(A_center), length(A_trans)),
      Mean          = c(round(mean(A), 2), round(mean(A_center), 2), round(mean(A_trans), 2)),
      Min           = c(round(min(A), 2), round(min(A_center), 2), round(min(A_trans), 2)),
      Max           = c(round(max(A), 2), round(max(A_center), 2), round(max(A_trans), 2)),
      CountLess0    = c(sum(A < 0), sum(A_center < 0), sum(A_trans < 0)),
      CountGreater0 = c(sum(A > 0), sum(A_center > 0), sum(A_trans > 0)),
      row.names = c("A", "A_center", "A_trans")
    )
    Sum
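
If you want to automate the "run until within tolerance" step, one possible wrapper is sketched below (the 0.001 tolerance and the 100-iteration cap are arbitrary choices; since b - a = 1 for each half, the scaling factor drops out of the formulas):

    # Wrap one pass of the transformation in a function and iterate to tolerance
    rescale_pass = function(A){
      A_center = A - mean(A)
      A1 = c(0, A_center[A_center < 0])
      A2 = c(0, A_center[A_center > 0])
      A1_scaled = (A1 - min(A1)) / (max(A1) - min(A1)) - 1   # map to [-1, 0]
      A2_scaled = (A2 - min(A2)) / (max(A2) - min(A2))       # map to [0, 1]
      c(A1_scaled[-1], A2_scaled[-1])
    }

    A_trans = A
    for (i in 1:100) {
      if (abs(mean(A_trans)) <= 0.001) break
      A_trans = rescale_pass(A_trans)
    }
    round(c(mean = mean(A_trans), min = min(A_trans), max = max(A_trans)), 3)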

Sal Mangiafico
  • Although clever, your algorithm does not seem to be applicable in a production environment, where the transformation parameters fitted on the training data need to be stored so they can be re-used on new, previously unseen data... But you do mention that it may not be practically useful. Anyway, it is a clever way to achieve said result. – bmasri Nov 14 '22 at 13:48