Questions tagged [sampling]

Creating samples from a well-specified population using a probabilistic method and/or producing random numbers from a specified distribution. As this tag is ambiguous, please consider [survey-sampling] for the former and [monte-carlo] or [simulation] for the latter. For questions regarding creating random samples from known distributions, please consider using the [random-generation] tag.

Sampling is used to collect data when observing the whole population is impractical or infeasible (e.g., too expensive or conceptually impossible). To draw valid statistical inferences from sampled data, the mechanism by which the samples are drawn must be specified, and it must involve randomization (selecting units using random numbers or random events). Randomization is what makes probabilistic statements possible: one can talk about the mean or a tail probability of the sampling distribution of a statistic by looking at the histogram of that statistic obtained (hypothetically, or by actual exhaustive enumeration) by taking all possible samples from the population and computing the statistic of interest on each of them.
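
The "all possible samples" idea can be made concrete on a toy example. The following is only a sketch: the five-value population, the sample size and the threshold 8 are hypothetical, and every sample of size $n$ is assumed to be equally likely.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy population of N = 5 values (illustrative only).
population = [2, 4, 6, 8, 10]
n = 2

# Enumerate every possible sample of size n; each is assumed equally likely.
all_samples = list(combinations(population, n))

# The statistic of interest (here the sample mean) computed on every sample.
sample_means = [mean(s) for s in all_samples]

# Mean of the sampling distribution, compared with the population mean.
expected_sample_mean = mean(sample_means)
population_mean = mean(population)

# A tail probability of the sampling distribution:
# the share of samples whose mean is at least 8.
tail_prob = sum(m >= 8 for m in sample_means) / len(sample_means)

print(len(all_samples), expected_sample_mean, population_mean, tail_prob)
```

A histogram of `sample_means` is exactly the sampling distribution referred to above. For realistic population sizes the enumeration is infeasible, which is why the distribution is usually treated hypothetically.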

The simplest sampling method is simple random sampling (SRS): for a population of $N$ units, the SRS of size $n$ is a sampling design that assigns to each sample of size $n$ the same probability of selection $1/\binom{N}{n}$. This simplest method allows for inference that is nearly equivalent to the textbook "i.i.d." assumption. E.g., the minimum variance unbiased estimate of the population mean is the sample mean $\bar x$, and its estimated variance is $s^2(1-n/N)/n$, where $s^2 = \sum (x_i - \bar x)^2/(n-1)$ and the factor $1-n/N$ is the finite population correction. However, if any other selection method was used to obtain the sample, the analysis methods must be modified to account for the features of that selection method. For instance, a naive understanding of sampling may entail thinking that if every unit in the population has the same probability of selection $n/N$, then the "i.i.d." analysis methods are applicable. This is not so: for a systematic sampling design (all units are arranged in a list, a starting point $k$ is chosen randomly as a number between 1 and $[N/n]$, and the units $k, k+[N/n], k+2[N/n], ...$ are taken into the sample), every unit has selection probability $n/N$, yet the sampling variance cannot even be estimated from a single such sample without further assumptions!
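
As a rough sketch of the two designs just described, the code below computes the SRS estimate of the mean with its finite-population-corrected variance, and lists every sample a systematic design can produce. The function names, the simulated population and the sizes are hypothetical, and the systematic helper assumes that $n$ divides $N$.

```python
import random
from statistics import mean, variance


def srs_mean_estimate(sample, N):
    """Return (sample mean, estimated variance of the mean) for an SRS
    of len(sample) units drawn without replacement from N units."""
    n = len(sample)
    s2 = variance(sample)      # sum of squared deviations divided by n - 1
    fpc = 1 - n / N            # finite population correction
    return mean(sample), s2 * fpc / n


def systematic_samples(N, n):
    """List all possible systematic samples as 1-based unit labels,
    assuming n divides N so that the step is exactly N / n."""
    step = N // n
    return [list(range(k, N + 1, step)) for k in range(1, step + 1)]


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
population = [rng.gauss(50, 10) for _ in range(200)]  # hypothetical data
sample = rng.sample(population, 20)                   # SRS without replacement
xbar, var_hat = srs_mean_estimate(sample, N=len(population))

designs = systematic_samples(N=12, n=3)
print(xbar, var_hat)
print(designs)
```

With $N = 12$ and $n = 3$ the systematic design can produce only 4 distinct samples, against $\binom{12}{3} = 220$ under SRS. Each unit still appears in exactly one of the 4, so its selection probability is $1/4 = n/N$, but units that never appear together have zero joint selection probability.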

In samples of human and natural resource populations, the most typical twists on sampling selection methods include (a combination of):

  1. Stratification: selecting units independently within well-defined groups (e.g., regions or states in geographic samples; industry and size of an enterprise in establishment surveys; type of land use in natural resource surveys; etc.). Typically, although not necessarily, stratification leads to a reduction of sampling variance.
  2. Multistage selection: selecting units within a specific hierarchy (schools within districts, then students within schools in education surveys; counties within states, then city blocks within counties, then households within city blocks in geographic samples; etc.). Multistage samples are also known as cluster samples (clusters of units rather than individual units are sampled at the early stages of selection). Clustering typically increases sampling variances.
  3. Unequal probability of selection, usually associated either with a need to obtain a sufficient number of observations for certain subgroups of the population, or with a need to balance the costs of the survey. Unequal probabilities of selection must be accounted for by specifying (and using in analysis) sampling weights. Unweighted estimates will typically be biased, and hence of no real interest.
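
To illustrate why weights matter, here is a minimal sketch with two hypothetical strata, one of which is heavily oversampled. Each unit's weight is taken to be the inverse of its selection probability $n_h/N_h$; the strata sizes, the data and the helper name `weighted_mean` are all made up for the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w * v divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total_weight


# Hypothetical strata: A is large with low values, B is small and oversampled.
strata = {
    "A": {"N": 900, "sample": [10, 12, 11, 9, 8]},
    "B": {"N": 100, "sample": [50, 52, 48, 51, 49]},
}

values, weights = [], []
for stratum in strata.values():
    n_h = len(stratum["sample"])
    w_h = stratum["N"] / n_h   # inverse of the selection probability n_h / N_h
    values.extend(stratum["sample"])
    weights.extend([w_h] * n_h)

weighted = weighted_mean(values, weights)
unweighted = sum(values) / len(values)
print(weighted, unweighted)
```

Here the weighted estimate is 14, which matches $(900 \cdot 10 + 100 \cdot 50)/1000$, while the unweighted mean of the ten observations is 30, because the small stratum B contributes half of the sample but only a tenth of the population.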

In some disciplines, the term "sample" is intended to mean "an observation", a single record containing data on one particular unit of analysis. More often, the term "sample" is used to denote a collection of units for which observations were made, measurements were taken, responses were obtained, etc. Furthermore, some disciplines use the term "sampling" rather loosely to indicate the process of collecting data on arbitrarily chosen units from the population. However, scientifically rigorous inferences can only be obtained from samples that are random, i.e., samples for which a randomization mechanism is built into the data collection process.

To find out more, visit the Wikipedia page, take a look at the What Is a Survey? booklet of the American Statistical Association, or read introductory textbooks such as Lohr (2009), Kish (1995) or Cochran (1977). A complete and thorough discussion of how survey statistics should be analyzed in R is given in Lumley (2010).

Potentially related tags: survey, sample-size, response-rate, stratification, svy

Another, more algorithmic, meaning of the word "sampling" is to describe the procedures of drawing random numbers that have a specified distribution. Assuming that a (pseudo) random number generator is available that creates (pseudo) random numbers from $U[0,1)$, the simplest method is by inverting the distribution function: $X = F^{-1}(U)$. In more complicated cases, one has to utilize more sophisticated algorithms, such as acceptance-rejection sampling, importance sampling, etc. Understanding sampling methods is crucial in computational Bayesian statistics.
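
Two of the methods named above can be sketched in a few lines. The target distributions are chosen only for illustration: an exponential distribution for inversion (where $F^{-1}(u) = -\ln(1-u)/\lambda$) and the density $f(x) = 6x(1-x)$ on $[0,1]$ for acceptance-rejection with a uniform proposal. The function names and the seed are hypothetical.

```python
import math
import random


def sample_exponential(rate, rng):
    """Inverse-transform draw: F(x) = 1 - exp(-rate * x),
    so X = F^{-1}(U) = -ln(1 - U) / rate."""
    u = rng.random()                    # U in [0, 1)
    return -math.log(1.0 - u) / rate    # 1 - u lies in (0, 1], so the log is finite


def sample_beta22(rng):
    """Acceptance-rejection draw from f(x) = 6x(1 - x) on [0, 1],
    using a Uniform(0, 1) proposal and the bound f(x) <= 1.5."""
    bound = 1.5                         # maximum of f, attained at x = 0.5
    while True:
        x = rng.random()                # candidate from the proposal
        u = rng.random()                # uniform used for the accept/reject step
        if u * bound <= 6.0 * x * (1.0 - x):
            return x


rng = random.Random(42)  # fixed seed only so the sketch is reproducible
exp_draws = [sample_exponential(2.0, rng) for _ in range(20000)]
beta_draws = [sample_beta22(rng) for _ in range(20000)]

exp_mean = sum(exp_draws) / len(exp_draws)     # should be close to 1 / rate = 0.5
beta_mean = sum(beta_draws) / len(beta_draws)  # should be close to 0.5
print(exp_mean, beta_mean)
```

With a bound of 1.5, the acceptance-rejection sampler keeps about two out of every three candidates on average; a tighter proposal would waste fewer draws.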

Potentially related tags: Bayesian, MCMC

Other meanings are also discussed at the Wikipedia disambiguation page.

3244 questions
23
votes
4 answers

An unbiased estimate of the median

Suppose we have a random variable $X$ supported on $[0,1]$ from which we can draw samples. How can we come up with an unbiased estimate of the median of $X$? We can, of course, generate some samples and take the sample median, but I understand this…
robinson
20
votes
1 answer

Sampling model for crowdsourced data?

I'm working on an open health survey application, planned to be used in developing country. The basic idea is that survey interviews are crowdsourced - they are performed by unorganized volunteers who submit forms data of the interviews they…
19
votes
4 answers

How to generate a non-integer amount of consecutive Bernoulli successes?

Given: A coin with unknown bias $p$ (Head). A strictly positive real $a > 0$. Problem: Generate a random Bernoulli variate with bias $p^{a}$. Does anyone know how to do this? For instance, when $a$ is a positive integer, then one can flip the coin…
Pedro A. Ortega
17
votes
2 answers

Is "every blue t-shirted person" a systematic sample?

I'm teaching an intro stats class and was reviewing the types of sampling, including systematic sampling where you sample every kth individual or object. A student asked if sampling every person with a particular characteristic would accomplish the…
drury
12
votes
5 answers

Can I use "left eye" and "right eye" in my sample as two different subjects?

My data is as follows. I have two groups of patients. Patients in each group had a different type of eye surgery. 5 variables were measured on patients in each group. I want to compare those variables between the two groups using a permutation test…
sara
11
votes
1 answer

Definition of quantile

Given N sampled values, what does the "p-th quantile of the sampled values" mean?
bit-question
9
votes
1 answer

How to generate a Bernoulli variate with bias $a/\mathbb{E}[X]$ given a sampler of $X$ and uniform variates?

Given: A loaded "die" with unknown probabilities generating a discrete, positive random variable $X$ taking on values in $\mathcal{X}$. A real number $a$, such that $0 \leq a \leq \mathbb{E}[X]$. Uniform random variates. Problem: Generate a…
Pedro A. Ortega
9
votes
2 answers

Sampling with or without replacement?

I don't know a lot about sampling methods. I have a large population of size 2,000,000. I used one of those sample size calculators. It says that I need sample size of approximately 10,000. I am trying to find the probability p of success for…
Martin Velez
9
votes
2 answers

Do low-discrepancy sequences work in discrete spaces?

Low-discrepancy sequences in a real space ($[0,1]^n$) seem like a really excellent tool for evenly sampling a sample space. As far as I can tell, they generalise well to any real space, if you use an appropriate map (eg. $[0,1]\to[a,b]$ linear…
naught101
7
votes
1 answer

Whether to stratify or do a simple random sampling from a set of papers to be compared?

I have a population/set of papers (~350) that have been categorized into non-mutually exclusive types of in vivo biological target each with different numbers of paper. Another way of categorizing these papers is the method by which some drug…
user1447630
6
votes
1 answer

Sampling technique to estimate how many toxic waste sites are in a country?

I work for an environmental health nonprofit and I have moderate expertise in statistics. We want to estimate the total number of toxic industrial waste sites within a small African country. I would love to hear your thoughts on how we should start.…
Slyron
6
votes
2 answers

Calculating % unsampled in sampling with replacement

You sample N of N items with replacement. How do you calculate the expected percent not sampled from original population N? Extra Credit: Generalize to sampling k of N items with replacement.
6
votes
1 answer

Sampling from a marginal when full density is given

I want to sample from a distribution with density $$ f(\mathbf x) = \int f(c) \prod_{i=1}^n f(x_i|c) dc $$ where $\mathbf x=(x_1,x_2,...,x_n)$. In my particular setup, is easy to sample according to the densities $f(c)$ and $f(x_i|c)$, but it is…
Mhc
5
votes
1 answer

What does $\pi(x)$ usually mean in importance sampling?

I'm learning about particle filters, and most expositions start with importance sampling, so now I'm learning about importance sampling. I'm not sure if I'm correctly interpreting a paragraph in this article: Arulampalam, Maskell, Gordon, and Clapp…
5
votes
1 answer

Information content of examples and undersampling

As I have written in my question "How much undersampling should be done?", I want to predict defaults, where a default is per se really unlikely (average ~ 0.3 percent). My models are not affected by the unequal distribution: It's all about saving…
RichardN