Let's say that I have a categorical variable which can take the values A, B, C and D. How can I generate 10000 random data points and control for the frequency of each? For example:

A = 10% B = 20% C = 65% D = 5%

Any ideas how I can do this?

chl
user333

4 Answers

Do you want the proportions in the sample to match the stated proportions exactly? Or do you want to represent sampling from a very large population with those proportions, so that the sample proportions will be close but not exact?

If you want the exact proportions then you can follow Brandon's suggestion and use the R sample function to randomize the order of a vector that has the exact proportions.

If you want to sample from the population, but not restrict the proportions to be exact then you can still use the sample function in R with the prob argument like so:

> x <- sample( LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
> prop.table(table(x))
x
     A      B      C      D 
0.0965 0.1972 0.6544 0.0519 
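
To make the difference between the two options concrete, here is a minimal side-by-side sketch. The names `lev`, `p`, `n`, `x_exact` and `x_random` and the seed value are only illustrative assumptions, not part of the question.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

lev <- LETTERS[1:4]                # categories A, B, C, D
p   <- c(0.1, 0.2, 0.65, 0.05)     # target proportions
n   <- 10000

# Exact proportions: fix the count of each category, then shuffle the order.
# round() guards against n * p landing just below a whole number.
x_exact <- sample(rep(lev, times = round(n * p)))

# Sampled proportions: every draw is independent with probabilities p.
x_random <- sample(lev, n, replace = TRUE, prob = p)

table(x_exact)               # counts are fixed by construction
prop.table(table(x_random))  # proportions fluctuate around p from run to run
```

With the first approach only the order is random; with the second, the counts themselves are random.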
Greg Snow

Using R (http://cran.r-project.org/). All I'm doing here is creating a random list with the proportions you specified.

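# Build a vector containing exactly the requested number of each category
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
# cheating: the counts are fixed in advance, sample() only shuffles the order
x <- sample(x, 10000)

# Check the resulting proportions
prop.table(summary(as.factor(x)))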

/me Waits patiently for argument over how truly random this is

Brandon Bertelsen
  • You can shorten/simplify your first line to x <- rep( c("A","B","C","D"), 10000*c(0.1,0.2,0.65,0.05) ) and you don't need to specify the 10000 in the call to sample, that would be the default (though for clarity it does not hurt to specify it). – Greg Snow Aug 11 '11 at 16:23

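    n <- 10000
    blah <- character(n)
    # One uniform draw per data point; the cut-offs below are the cumulative probabilities
    u <- runif(n)
    blah[u<=0.1] <- "A"              # 10%
    blah[u>0.1 & u<=0.3] <- "B"      # 20%
    blah[u>0.3 & u<=0.95] <- "C"     # 65%
    blah[u>0.95] <- "D"              # 5%
    table(blah)
    prop.table(summary(as.factor(blah)))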

I have no doubt this is truly random. I mean, to the extent that runif() is random :)
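
The hard-coded cut-offs above are just the cumulative sums of the probabilities, so the same idea can be written for any probability vector. This is only a sketch of that generalization: `lev`, `p` and `lower` are illustrative names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

n   <- 10000
lev <- c("A", "B", "C", "D")
p   <- c(0.1, 0.2, 0.65, 0.05)

# Lower edge of each category's slice of [0, 1): 0, 0.1, 0.3, 0.95
lower <- c(0, cumsum(p)[-length(p)])

u <- runif(n)

# findInterval() counts, for each u, how many lower edges are <= u,
# which is the index of the slice that u falls into (1 to 4 here).
blah <- lev[findInterval(u, lower)]

prop.table(table(blah))
```

The only difference from the explicit version is which category gets a draw that lands exactly on a cut-off, which has essentially zero probability.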

StasK
  • If the desired frequencies are really probabilities, it would be easier to use the prob argument for sample(): sample(LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05)) – caracal Aug 11 '11 at 16:01
  • Yeah, that's much cuter. Mine is just a brute force. – StasK Aug 11 '11 at 16:04
  • I have actually upvoted this because it shows how sample(, prob=) works (at least in Polish it is called the roulette algorithm). –  Aug 11 '11 at 18:15

If you're a SAS user, recent versions provide a similar ability to draw from what SAS calls a "table" distribution, which is what you are looking for, as part of the Rand() function. See http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001466748.htm

Fomite