Can we represent a population by its mean?
Is the answer simply "yes, unless there are outliers or the data is skewed"?
What do you mean by "represent"? If you want a one-number summary $x$ that minimizes the sum of squared differences to $x$, then setting $x$ to the mean will achieve that (regardless of outliers or skew). If you want to minimize the sum of absolute differences, then you should set $x$ to the median. – Stephan Kolassa Sep 02 '22 at 15:37
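The claim in this comment can be checked numerically. The following is a minimal sketch, assuming a made-up exponential sample (the seed, sample size, and the helper names `sse` and `sad` are illustrative, not from the thread); it searches for the single value that minimizes each loss and compares it with the mean and the median.

```r
# Hypothetical right-skewed sample; odd n so the median is a single data point
set.seed(1)
x <- rexp(n = 101, rate = 1)

sse <- function(m) sum((x - m)^2)   # sum of squared differences to m
sad <- function(m) sum(abs(x - m))  # sum of absolute differences to m

# One-dimensional numerical minimization over the range of the data
best_sse <- optimize(sse, interval = range(x))$minimum
best_sad <- optimize(sad, interval = range(x))$minimum

c(best_sse = best_sse, mean = mean(x))      # the two should agree closely
c(best_sad = best_sad, median = median(x))  # the two should agree closely
```

Both agreements hold even though the sample is skewed, which is the point of the comment: skew changes which loss you might care about, not which statistic minimizes it.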
You can represent any population by any number or statistic you please. The issue is how well does this procedure work? In order to determine that, you need to tell us more about the population, what characteristic(s) of it you wish to represent, and how accurately you want to represent them. The mean (called the expectation) has many important theoretical properties -- but altering it in the face of qualitative, perhaps subjective assessments like "[has] outliers" or "is skewed" can ruin all those properties. – whuber Sep 02 '22 at 18:22
3 Answers
If I tell you that the average height of my friends is 173 centimeters, does it tell you anything about my friends? It tells you what the average height is, but doesn't tell you almost anything about the distribution of their heights, etc. You cannot make any reasonable guesses about the heights of my friends. The number "represents" the population only if you want to limit the representation to a single number with all the limitations of such representation.
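As a small illustration, here is a sketch with two made-up groups of friends (the numbers are invented for this example). Both have a mean height of 173 cm, yet they look nothing alike.

```r
# Two hypothetical groups of friends, heights in cm (invented numbers)
group_a <- c(171, 172, 173, 174, 175)
group_b <- c(150, 160, 173, 186, 196)

mean(group_a)   # 173
mean(group_b)   # 173

# The spread is what the mean alone does not tell you
sd(group_a)
sd(group_b)
range(group_a)
range(group_b)
```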
As for outliers, it's not that simple. In fact, the sensitivity of the mean is a feature, not a bug, as you can learn from the If mean is so sensitive, why use it in the first place? thread. So the choice of the point estimate for the central tendency would be more complicated than deciding if there are outliers.
Wouldn’t the mean (or median) be the guess that is less wrong than any other guess about the heights of your friends? – Todd D Sep 03 '22 at 00:38
I really like your answer. The thing that is confusing to me is that the statistics courses I took online say that you should take the mean as a representation of your data unless there are outliers or the data is skewed. – floyd Sep 05 '22 at 01:11
@Tim thank you so much for the reply. But if representing an entire population with one number like the mean is not a good idea, then why do some big organizations do that? For example, on the Coursera website, they only show the median salary for a particular job. – floyd Sep 06 '22 at 21:36
Can we represent a population by its mean?
Theoretically
Yes, unless the variance is non-zero.
Populations can be described by moments, cumulants, and other types of statistics. In the special case that the variance is zero, all central moments and cumulants of order two and higher are zero, and the population is entirely determined by the mean.
This special case, when the variance is zero, is a degenerate distribution. In most cases the variance is non-zero and this theoretical case is not applicable.
Practically
Yes, unless other properties are important.
For many practical cases it is only the mean that is important, or mostly important.
However, this is a bit of a pitfall. Example: say you design seats for an airplane. You should not use the average size of the population, but instead some quantile, such as the size that only the largest x% of people exceed. You want to design the seats such that only x% may have problems with them. You don't care about the mean in such a case, yet the mean is often used as the default statistic to represent a population.
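A rough sketch of the seat example, assuming hip widths are normally distributed with invented parameters (the variable names, the 38 cm mean, and the 3 cm standard deviation are all hypothetical):

```r
# Hypothetical seated hip widths in cm; distribution and parameters are assumptions
set.seed(42)
hip_width <- rnorm(n = 10000, mean = 38, sd = 3)

seat_mean <- mean(hip_width)                    # seat sized for the average person
seat_q95  <- quantile(hip_width, probs = 0.95)  # seat sized for the 95% quantile

# Share of people who would not fit each design
mean(hip_width > seat_mean)  # roughly half
mean(hip_width > seat_q95)   # roughly 5%
```

Sizing to the mean leaves about half of the passengers with a problem, while sizing to the quantile leaves only the chosen x%.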
Easiest Example
It may help to look at a visual example, which is always good practice anyway when trying to understand where your data is centrally distributed. The easiest example is simulating a normal distribution. Here I have done so, including a mean:
x <- rnorm(n = 1000, mean = 20, sd = 10)
The mean should be obvious and plotting it makes it easy to see why.
plot(density(x), main = "Density of X Variable")
From here I will plot the mean in red and the median in blue. Here you can see they are nearly identical, which makes sense because the data is symmetric and mostly concentrated around the center of the graph.
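The plotting step for this first example could look like the sketch below. It mirrors the `abline` calls used for the skewed case later in this answer; the seed is an assumption added only so the numbers are reproducible.

```r
# Re-create the sample (the seed is an assumption added for reproducibility)
set.seed(123)
x <- rnorm(n = 1000, mean = 20, sd = 10)

plot(density(x), main = "Density of X Variable")
abline(v = mean(x), col = "red", lwd = 3)     # mean in red
abline(v = median(x), col = "blue", lwd = 3)  # median in blue

# The gap between the two is small relative to sd = 10
mean(x) - median(x)
```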
When Means Become Less Helpful
Now the most obvious example of when means become less helpful for understanding your data is when the data is heavily skewed. Here I have plotted a heavily right-skewed variable (normal draws raised to the fourth power, so not a true chi-square distribution) along with its mean and median.
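# Raise a normal sample to the 4th power to get a heavily right-skewed variable
z <- rnorm(n = 1000, mean = 20, sd = 10)^4
plot(density(z), main = "Density of Z Variable")
mean(z)
median(z)
# Mark the mean (red) and the median (blue) on the density plot
abline(v = mean(z), col = "red", lwd = 3)
abline(v = median(z), col = "blue", lwd = 3)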
You can see that the median in blue is a much better approximation of where the center of the distribution is, as the mean is being dragged away by the data points on the right.
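To put a number on how far the mean is dragged, one option is to check what share of the observations lies below it. This is only a sketch; the seed is an assumption for reproducibility.

```r
# Same construction as above, with a seed (an assumption) for reproducibility
set.seed(123)
z <- rnorm(n = 1000, mean = 20, sd = 10)^4

mean(z) / median(z)   # a ratio well above 1 means the mean sits far to the right
mean(z < mean(z))     # share of observations below the mean
mean(z < median(z))   # about one half, as expected for the median
```

With this kind of skew, well over half of the data points fall below the mean, so the mean is not a typical value here.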
So to answer your question, the mean is useful when it serves your practical purpose. If the mean does not describe the population well, then, forgive the pun, the mean may not mean so much.


