For centering and scaling a variable (e.g. prior to a regression, or to a visualization), the standard procedure, of course, is to subtract the mean then divide by the standard deviation.
But is it considered preferable to use the population standard deviation (i.e. dividing by n) or the sample standard deviation (dividing by n-1)? Does it depend on the use case?
Interestingly, the standard R and Python functions make different choices here: Python's sklearn.preprocessing.scale() uses the population standard deviation (dividing by n), while R's scale() uses the sample standard deviation (dividing by n-1).
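To make the difference concrete, here is a minimal NumPy sketch (the data are illustrative; NumPy's `ddof` argument selects the divisor, with `ddof=0` giving the population SD and `ddof=1` the sample SD):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Population SD: divide the sum of squared deviations by n (ddof=0).
# This matches sklearn.preprocessing.scale().
pop_sd = np.std(x, ddof=0)    # sqrt(2) ~= 1.4142

# Sample SD: divide by n - 1 (ddof=1).
# This matches R's sd(), and therefore R's scale().
samp_sd = np.std(x, ddof=1)   # sqrt(2.5) ~= 1.5811

# The two standardizations differ only by the constant factor sqrt((n-1)/n).
z_pop = (x - x.mean()) / pop_sd
z_samp = (x - x.mean()) / samp_sd
```

Note that since the two z-scores differ only by a common constant factor, any downstream method that is invariant to rescaling a variable (e.g. correlations, or the fit of an OLS regression) is unaffected by the choice.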
(NOTE: there's a prior question here, but it pertains to a very specific psychological method, and the one answer isn't actually substantiated by anything.)
Don't confuse the population's variance, with df=N, with the sample's variance, with df=n. Neither is an estimate of the parameter: the 1st is the parameter itself, the 2nd is a pure statistic. We may use the sample's variance as the estimate of the population one, but it is biased; it is called the maximum likelihood estimate of the variance. – ttnphns Nov 28 '16 at 08:16
The unbiased estimate of the population variance has df=n-1. That "sample" or unbiased estimate shouldn't be confused with the above sample's variance. – ttnphns Nov 28 '16 at 08:16
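The bias the comments describe can be checked with a small simulation sketch (the normal distribution, true variance, seed, and sample size below are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0   # assumed true population variance
n = 5          # small sample size, where the bias is most visible
reps = 100_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

# MLE of the variance divides by n (ddof=0); its expectation is
# sigma2 * (n-1)/n, i.e. it is biased low.
mle = samples.var(axis=1, ddof=0).mean()

# The unbiased estimator divides by n - 1 (ddof=1); its expectation
# is sigma2.
unbiased = samples.var(axis=1, ddof=1).mean()

# mle averages near sigma2 * (n-1)/n = 3.2; unbiased averages near 4.0
```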