Say you take a sample of people's weights in pounds, for example [100, 220, 230, 240]. Many people say that a sample like this comes from a normally distributed population.

But my question is: given only a sample, how do you tell that it belongs to a certain population?

BruceET
  • You can't. People who claim to know that this comes from a normally distributed population don't have much of a clue (unless you're misrepresenting what they actually say). – Christian Hennig Jun 14 '21 at 21:43
  • But don't you then need to assume that your sample comes from a normal distribution in order to build a confidence interval? – Fernando Martinez Jun 14 '21 at 21:53
  • I think that the role of model assumptions is routinely misinterpreted in statistics. The theory requires model assumptions. In practice these are never precisely fulfilled. An important question is whether there is anything in the data or any knowledge of how they were collected that gives an indication that the confidence interval will be invalid. This is what we can know, we can't know more. (Meaning that there's always a chance that there is a problem that we miss.) – Christian Hennig Jun 14 '21 at 22:03
  • Some answers of mine related to this: https://stats.stackexchange.com/questions/518183/how-to-answer-critiques-about-the-inapplicability-of-the-framework-of-frequentis/518454#518454 https://stats.stackexchange.com/questions/162738/confidence-interval-from-a-non-probability-sample/503112#503112 – Christian Hennig Jun 14 '21 at 22:03
  • Weight may just be an arbitrary example, but weight distributions for large groups tend to be positively skewed. As is well known, very low and very high weights both come with health risks, including higher death rates, but the situation isn't even symmetrical. – Nick Cox Jun 15 '21 at 07:49
  • If you have no further information, there's no way. You'd need an infinitely-sized sample for that. However if you are working with a limited set of options (for example, if you know your data follows some normal distribution), then you can use your data to narrow the options down – David Jun 15 '21 at 07:49

1 Answer

Suppose you have little information, from the sample or the context in which it was sampled, about the nature of the population. Then you might want to try one of several kinds of nonparametric bootstrap confidence intervals.

Suppose you have $n = 20$ observations in a vector x in R with summary statistics and boxplot as shown below.

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   13.50   25.50   30.00   51.25   60.00 
boxplot(x, col="skyblue2", horizontal = TRUE)

[Figure: horizontal boxplot of x]

The data are moderately right-skewed. One indication is that the median is closer to the minimum than to the maximum. Maybe the skewness is a little more than we would expect from normal data. Also, points in a normal probability plot seem more curved than linear.

qqnorm(x);  qqline(x, col="blue")

[Figure: normal Q-Q plot of x with reference line]
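As a rough numeric companion to the two plots, one could also compute a sample skewness coefficient. The sketch below assumes x is generated as in the note on data at the end of this answer. The helper name skew is made up for illustration, and the moment-based formula it uses is only one of several conventions.

```r
# Data as in the note at the end of the answer
set.seed(2021)
x = round(rgamma(20, 3, 1/10))

# Hypothetical helper: moment-based sample skewness (one convention of several)
skew = function(v) {
  m = mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew(x)               # positive values suggest right skew
median(x) - min(x)    # distance from the median down to the minimum
max(x) - median(x)    # distance from the median up to the maximum
```

With only 20 observations such a coefficient is quite variable, so it is best read alongside the plots rather than as a test of normality.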

Assume normal data, use a t confidence interval. If we are willing to guess that the data were sampled from a normal population, then a 95% t confidence interval for the population mean $\mu$ [from the procedure t.test in R] is $(20.4, 39.6).$

t.test(x)$conf.int
[1] 20.41577 39.58423
attr(,"conf.level")
[1] 0.95
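For readers who want to see what t.test is doing, the interval is $\bar x \pm t^* s/\sqrt{n},$ where $t^*$ cuts probability 0.025 from the upper tail of Student's t distribution with $n-1$ degrees of freedom. A minimal sketch, assuming x is generated as in the note on data at the end of this answer (the names se, tq and ci are just for illustration):

```r
set.seed(2021)
x = round(rgamma(20, 3, 1/10))      # data as in the note at the end

n  = length(x)
se = sd(x) / sqrt(n)                # estimated standard error of the mean
tq = qt(0.975, df = n - 1)          # 97.5th percentile of t with n-1 df
ci = mean(x) + c(-1, 1) * tq * se   # lower and upper 95% limits
ci
```

This should agree with t.test(x)$conf.int up to rounding.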

Nonparametric bootstrap confidence intervals. This is not the place for a detailed explanation of bootstrapping. But the general idea is to take a moderately large number of 're-samples' of size $n=20$ with replacement from x in order to assess the variability of the mean of x and thus to make a 95% nonparametric bootstrap CI. (Here, the word nonparametric means that we have made no assumptions about the shape of the population distribution. We do assume that the population distribution has a mean $\mu.)$ One simple style of a bootstrap CI (shown below) gives the interval $(21.5, 38.6).$

For many practical purposes it would not make much difference which 95% CI is used. [Because re-sampling is a random process, results of the bootstrap CI differ slightly from one run to the next, unless the same seed is set. Two additional runs gave $(21.4, 38.2)$ and $(21.1, 38.5)$.]

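set.seed(1234)
# 2000 re-samples of size 20, drawn with replacement; keep each re-sample mean
a = replicate(2000, mean(sample(x, 20, replace = TRUE)))
quantile(a, c(.025, .975))   # percentile bootstrap 95% CI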
    2.5%    97.5% 
21.49875 38.55375 
hdr = "Bootstrap Dist'n of Sample Means"
hist(a, probability = TRUE, col = "skyblue2", main = hdr)
abline(v = c(21.50, 38.55), col = "red",
       lwd = 2, lty = "dotted")   # mark the 95% CI limits

[Figure: histogram of the bootstrap distribution of sample means, with dotted red vertical lines at the 95% CI limits]

Note: There are better styles of bootstrap CIs, which matter especially if the bootstrap distribution shown above were markedly skewed. Another style of bootstrap results from re-sampling deviations from the sample mean, as shown below. The result is $(21.4, 38.9);$ again not very different from the previous CIs.

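a.obs = mean(x)
set.seed(614)
# re-sampled deviations of the re-sample mean from the observed mean
d = replicate(2000, mean(sample(x, 20, replace = TRUE)) - a.obs)
LU = quantile(d, c(.975, .025))   # upper quantile first, so the result is (lower, upper)
a.obs - LU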
   97.5%     2.5% 
21.45000 38.90125 
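One of the "better styles" often mentioned is the bias-corrected and accelerated (BCa) interval. As a sketch only, and not the method used above, it can be obtained from the boot package, which is one of R's recommended packages. The helper name mean.fun is made up for illustration. Because boot() draws its re-samples differently from the replicate() code above, the same seed will not reproduce the numbers shown earlier.

```r
library(boot)   # 'boot' is a recommended package shipped with standard R installs

set.seed(2021)
x = round(rgamma(20, 3, 1/10))      # data as in the note at the end

# Statistic in the form boot() expects: data first, re-sample indices second
mean.fun = function(d, i) mean(d[i])

set.seed(1234)
b   = boot(x, statistic = mean.fun, R = 2000)
out = boot.ci(b, conf = 0.95, type = c("perc", "bca"))
out$percent[4:5]   # percentile interval limits
out$bca[4:5]       # BCa interval limits
```

For a roughly symmetric bootstrap distribution like the one in the histogram, the two intervals would be expected to be close.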

Note on data: The data used in the discussion above were sampled in R as shown below. If we knew that the population distribution is gamma, then a parametric CI based on the particulars of that family would be preferable. The true population mean is $\mu = 30.$

set.seed(2021)
x = round(rgamma(20, 3, 1/10))
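
If one were willing to assume a gamma population, one possible route (a sketch, not necessarily the parametric CI intended above) is a parametric bootstrap: fit a gamma distribution to x, simulate new samples from the fitted distribution, and take quantiles of their means. The method-of-moments fit, the seed 2022, and the names shape.hat, rate.hat, m.star and ci.par are all choices made for this illustration. Maximum likelihood fitting would be another option.

```r
set.seed(2021)
x = round(rgamma(20, 3, 1/10))
n = length(x)

# Method-of-moments estimates (an assumption of this sketch)
shape.hat = mean(x)^2 / var(x)
rate.hat  = mean(x) / var(x)

# Parametric bootstrap: simulate samples from the fitted gamma, record each mean
set.seed(2022)
m.star = replicate(2000, mean(rgamma(n, shape.hat, rate.hat)))
ci.par = quantile(m.star, c(.025, .975))
ci.par
```

The fitted gamma has mean shape.hat/rate.hat, which equals mean(x) by construction, so this interval is centered near the sample mean just like the ones above.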
BruceET