Questions tagged [stratification]

A sampling technique in which the population of interest is partitioned into subsets ("strata") based on characteristics known for all units before sampling.

Stratification is a sampling technique in which the population of interest is split into strata based on characteristics available for all units before sampling. When done wisely, it may offer the following advantages:

  1. Efficiency of the estimates may be improved (i.e., variances/standard errors can be made lower).
  2. Data within a stratum or several strata can be analyzed independently of other observations. (Generally, subsets of data collected in a complex design sample require analysis of the complete data set using the techniques specifically formulated for domains or subpopulations.)
  3. Different sampling designs can be implemented within different strata.

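Efficiency gains, if any, come from the following expression for the sampling variance of the weighted (stratified) mean, applicable to a stratified single-stage design with simple random sampling without replacement within each of the $L$ strata, where $N_h$ and $n_h$ are the population and sample sizes in stratum $h$ and $N = \sum_h N_h$: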

$$ V_{\rm str}[\bar y_{\rm str}] = \sum_{h=1}^L \Bigl(1 - \frac{n_h}{N_h} \Bigr) \Bigl( \frac{N_h}{N} \Bigr)^2 \frac{S_h^2}{n_h}, \quad S_h^2 = \frac1{N_h-1} \sum_{i=1}^{N_h} (y_i - \bar y_h)^2 $$

Efficiency gains materialize if the within-stratum variances $S_h^2$ are lower (ideally, much lower) than the overall variance; or, in other words, when similar units are put together in the same stratum. In the extreme case, when the population consists of replicates of a small number of distinct units, the ideal stratification strategy can achieve zero sampling variance by putting the identical units together into corresponding strata and taking just one unit from each stratum.
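To make the comparison concrete, here is a minimal Python sketch that evaluates the variance expression above on a toy population and compares it with the variance of a simple random sample of the same total size. The function names and the toy values are illustrative assumptions, not part of any survey package, and the strata are assumed to be sampled by simple random sampling without replacement.

```python
from statistics import variance

def stratified_mean_variance(strata, sample_sizes):
    """Variance of the stratified mean.

    strata       : list of lists with the population y-values of each stratum
    sample_sizes : list with the sample size n_h taken from each stratum
    """
    N = sum(len(y) for y in strata)
    total = 0.0
    for y, n_h in zip(strata, sample_sizes):
        N_h = len(y)
        # statistics.variance uses the N_h - 1 divisor, matching S_h^2
        S2_h = variance(y) if N_h > 1 else 0.0
        total += (1 - n_h / N_h) * (N_h / N) ** 2 * S2_h / n_h
    return total

def srs_mean_variance(population, n):
    """Variance of the mean under simple random sampling without replacement."""
    N = len(population)
    return (1 - n / N) * variance(population) / n

# Toy population (made-up values): three tight clusters, one per stratum
strata = [[10, 11, 9, 10], [50, 51, 49, 50], [100, 101, 99, 100]]
population = [y for s in strata for y in s]

v_str = stratified_mean_variance(strata, [2, 2, 2])
v_srs = srs_mean_variance(population, 6)
print(f"stratified: {v_str:.4f}, SRS: {v_srs:.4f}")

# Extreme case from the text: replicates of a few distinct units,
# one unit drawn per stratum
replicates = [[5] * 4, [20] * 4, [70] * 4]
v_zero = stratified_mean_variance(replicates, [1, 1, 1])
print(f"replicated population: {v_zero}")
```

With homogeneous strata the stratified variance is far below the simple random sampling variance, and it drops to zero when every stratum contains identical units.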

Feasibility and efficiency of stratification crucially depend on the available sampling frames, and the auxiliary information that can be found on these frames (i.e., whether there are any additional data besides the frame identifier and the contact information).

In human population sampling, where most behaviors and outcomes are at least weakly associated with demographics, European statistical agencies can stratify on age and gender in the countries that have population registers where these individual characteristics are recorded, and draw samples from such registers. For such rich frames, the frame identifier is usually the person's tax number; the contact information includes the address and the phone number; and additional information may include all sorts of government records associated with that tax number.
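As a rough illustration of drawing a sample from such a register, the sketch below groups frame records by age group and gender and takes a simple random sample of the same fraction from each stratum (proportional allocation). The record layout, the field names `age_group` and `gender`, and the helper `stratified_sample` are hypothetical; a real register would have its own schema and access rules.

```python
import random
from collections import defaultdict

def stratified_sample(frame, key, fraction, seed=0):
    """Draw the same sampling fraction independently within each stratum.

    frame    : iterable of frame records (dicts here, an assumed layout)
    key      : function mapping a record to its stratum label
    fraction : sampling fraction applied in every stratum
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[key(unit)].append(unit)
    sample = []
    for label in sorted(strata):
        units = strata[label]
        # at least one unit per stratum so that no stratum is left empty
        n_h = max(1, round(fraction * len(units)))
        sample.extend(rng.sample(units, n_h))
    return sample

# Hypothetical register with 600 persons spread evenly over 6 strata
frame = [
    {"id": i,
     "age_group": ("18-39", "40-64", "65+")[i % 3],
     "gender": "FM"[i % 2]}
    for i in range(600)
]

sample = stratified_sample(
    frame, key=lambda u: (u["age_group"], u["gender"]), fraction=0.1
)
print(len(sample))
```

Because the strata are sampled independently, the sample composition by age group and gender matches the register exactly rather than only on average.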

In the U.S., where collection of such detailed data by the government is considered overstepping the limits of personal privacy, stratification by age and gender is not feasible. Often, the only possible stratification of the U.S. general population is by geography (although such stratification can also be used to target and/or oversample recognizable minorities known to reside compactly in relatively homogeneous enclaves). Hence large-scale in-person surveys have to be designed from scratch using area samples, enumerating the dwellings, and collecting household rosters within dwellings. In such designs, the sample is selected in several stages. At the first, top-most stage, census tracts may be selected: the frame identifier is the census tract number in a GIS, the contact information for that unit is the map of the tract, and auxiliary information at the tract level may include detailed summaries of the tract's population based on the most recent Census data.

Geographic stratification is also used in the U.S. survey industry for random digit dialing phone surveys. It relies on the implementation details of how the telecom industry has been assigning phone numbers. Targeting specific geographic areas using the landline frame has coverage limitations (not every household has a home landline phone), while on the cell phone frame, the available geographic resolution is never more detailed than about 100K people. For the phone frame, there is often no additional contact information (e.g., householder's name) available: the phone number itself serves as both the frame identifier and the contact information, and the only auxiliary information is the area code (the first three digits of the phone number) and the exchange (the next three digits), which point to the geography in which the number is likely to be located.

In establishment surveys, firms are usually put into strata defined by industry and a relevant measure of size (revenues or employment in the past year, often available either in commercial databases or in establishment registers). Larger firms are sampled at a much higher rate, up to 100% (i.e., sampling with certainty), because they account for a disproportionate fraction of total employment or revenue: a large firm included with certainty contributes zero sampling error to the estimated total, which reduces the overall sampling error.
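The effect of a certainty stratum can be sketched numerically. The example below uses the variance of the estimated total, $\sum_h N_h^2 (1 - n_h/N_h) S_h^2 / n_h$, which is the variance of the stratified mean multiplied by $N^2$. The revenue figures and the function name `total_variance` are made up for illustration, and simple random sampling without replacement within strata is assumed.

```python
from statistics import variance

def total_variance(strata, sample_sizes):
    """Variance of the estimated population total under stratified SRS."""
    total = 0.0
    for y, n_h in zip(strata, sample_sizes):
        N_h = len(y)
        S2_h = variance(y) if N_h > 1 else 0.0
        # the factor (1 - n_h / N_h) is zero when the stratum is fully enumerated
        total += N_h ** 2 * (1 - n_h / N_h) * S2_h / n_h
    return total

# Hypothetical revenues: eight small firms and two very large ones
small = [1, 2, 2, 3, 1, 2, 3, 2]
large = [400, 900]
firms = small + large
n = 4  # total sample size in both designs

# Design A: simple random sample of 4 firms from all 10
v_srs = total_variance([firms], [n])

# Design B: both large firms taken with certainty, 2 small firms sampled
v_cert = total_variance([small, large], [n - len(large), len(large)])

print(f"SRS: {v_srs:.1f}, certainty stratum: {v_cert:.1f}")
```

In this toy setup all of the remaining variance comes from the small-firm stratum, since the take-all stratum has a sampling fraction of one.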

Other surveys and their specialized sampling frames may allow for their own idiosyncratic stratification strategies corresponding to the research questions and data collection needs.

270 questions
7
votes
1 answer

Post-stratification & quantitative variables

I'm in charge of contacting customers of a company in order to analyse their satisfaction. The problem is I contact them by phone and the people I contact (the sample) are not representative of the full population. Then I consider…
Ophelie
  • 393
5
votes
2 answers

Stratified sampling question

Suppose that I am conducting a questionnaire study that is trying to measure level of awareness of subjects about a programming language and find the relation of those level of awareness to working conditions and methods etc. To improve my precision…
3
votes
0 answers

Can unintentional sample stratification be problematic?

An example of what I mean: I have a certain essay from all students at a university. I take a 1% random sample (not stratified) and run some time consuming computational analysis on each essay independently. I realize I forgot to include students…
1
vote
0 answers

Relative Error when using an erroneous Neyman allocation

I'm having a lot of issues trying to derive an equation for the relative error in the following problem. Someone has used the following incorrect formula to perform Neyman allocation $n_{h,e}=n\frac{W_hS_{U_h}^2}{\sum_{h=1}^{H}W_hS_{U_h}^2}$ instead…
1
vote
0 answers

Can I perform a stratified test against a true value?

I have a study where they randomized patients into 2 groups (A and B). The primary aim is to compare group A against a fixed/ true value of 0.75. The binomial test is not stratified and cochran mantel haenszel needs 2 groups. How can I perform a…
0
votes
0 answers

Stratified sampling for household dataset. Multiple targets or single target?

I am new in stratified sampling. I have a regression task to analyse the relationship between income, household size and expenditure, using python. Both income and expenditure are continuous variables, while household size is categorical. I have…
Lu Cas
  • 11
  • 1
0
votes
0 answers

Is there a method of stratum selection based on minimising variance within strata?

I have a population of about 7200 businesses from which to sample 2100 for a survey. The sample is to be stratified, but I have no information whatsoever on the usual way to stratify this population, except that it is usually based on revenue…
SiKiHe
  • 465