How to generate random categorical data?

Question

Let's say that I have a categorical variable which can take the values A, B, C and D. How can I generate 10000 random data points and control for the frequency of each? For example:

A = 10%, B = 20%, C = 65%, D = 5%

Any ideas how I can do this?

score 45 · Accepted Answer · answered Aug 11 '11 at 16:06

Do you want the proportions in the sample to be exactly the proportions stated, or do you want to represent sampling from a very large population with those proportions (so that the sample proportions will be close but not exact)?

If you want the exact proportions, then you can follow Brandon's suggestion and use the R sample function to randomize the order of a vector that has the exact proportions.

If you want to sample from the population without restricting the proportions to be exact, you can still use the sample function in R with the prob argument, like so:

> x <- sample( LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
> prop.table(table(x))
x
     A      B      C      D 
0.0965 0.1972 0.6544 0.0519
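
For a reproducible run, one option is to fix the seed before the call above and to store the result as a factor, so that every category shows up in the table even if it was never drawn. A minimal sketch, assuming an arbitrary seed of 42 and the illustrative names `lev`, `p`, `n`, and `obs`, none of which come from the answer itself:

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
lev <- LETTERS[1:4]               # categories A, B, C, D
p   <- c(0.1, 0.2, 0.65, 0.05)    # target probabilities, should sum to 1
n   <- 10000

# Draw n values with replacement; keep all four levels in the factor
x <- factor(sample(lev, n, replace = TRUE, prob = p), levels = lev)

obs <- prop.table(table(x))       # observed proportions, close to p but not exact
round(obs - p, 4)                 # sampling error per category
```

The observed proportions will differ slightly from `p` on every run with a different seed, which is the point of this approach.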

score 6 · Answer 2 · answered Aug 11 '11 at 15:33

This uses R (http://cran.r-project.org/). All I'm doing here is creating a vector with the proportions you specified and then shuffling it.

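# Build a vector with the exact count for each category
x <- c(rep("A", 0.1*10000), rep("B", 0.2*10000), rep("C", 0.65*10000), rep("D", 0.05*10000))
# cheating: shuffle the fixed vector instead of drawing each value at random
x <- sample(x, 10000)

# Check the resulting proportions
prop.table(summary(as.factor(x)))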

/me Waits patiently for argument over how truly random this is

Brandon Bertelsen

You can shorten/simplify your first line to x <- rep( c("A","B","C","D"), 10000*c(0.1,0.2,0.65,0.05) ) and you don't need to specify the 10000 in the call to sample, that would be the default (though for clarity it does not hurt to specify it). – Greg Snow Aug 11 '11 at 16:23
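
Building on this shortened form: `rep()` truncates non-integer counts, so products such as `10000 * 0.65` are safer when rounded first. A minimal sketch, where `lev`, `p`, `n`, and `counts` are names chosen for illustration only:

```r
lev <- c("A", "B", "C", "D")
p   <- c(0.1, 0.2, 0.65, 0.05)
n   <- 10000

counts <- round(n * p)            # integer count per category
stopifnot(sum(counts) == n)       # stops if the rounded counts do not add up to n

# sample() with no size argument returns a random permutation of its input
x <- sample(rep(lev, times = counts))
table(x)                          # counts match `counts` exactly
```

For other values of `n` or `p` the rounded counts may not sum to `n`, in which case the check stops the script and the counts need adjusting by hand.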

score 6 · Answer 3 · answered Aug 11 '11 at 15:42

    n <- 10000
    blah <- character(n)
    u <- runif(n)
    blah[u<=0.1] <- "A"
    blah[u>0.1 & u<=0.3] <- "B"
    blah[u>0.3 & u<=0.95] <- "C"
    blah[u>0.95] <- "D"
    table(blah)
    prop.table(summary(as.factor(blah)))

I have no doubt this is truly random. I mean, to the extent that runif() is random :)

StasK

If the desired frequencies are really probabilities, it would be easier to use the prob argument for sample(): sample(LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05)) – caracal Aug 11 '11 at 16:01
Yeah, that's much cuter. Mine is just a brute force. – StasK Aug 11 '11 at 16:04
I have actually upvoted this because it shows how sample(, prob=) works (at least in Polish, it is called the roulette algorithm). – Aug 11 '11 at 18:15
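
The interval logic in this answer extends to any number of categories by comparing uniform draws against cumulative probabilities, which is roughly what the roulette description in the last comment refers to. A minimal sketch, not a claim about how `sample()` is implemented internally; `lev`, `p`, and `breaks` are illustrative names:

```r
lev <- c("A", "B", "C", "D")
p   <- c(0.1, 0.2, 0.65, 0.05)
n   <- 10000

breaks <- cumsum(p)[-length(p)]   # interior cut points, about 0.1, 0.3, 0.95
u      <- runif(n)                # uniform draws on (0, 1)

# findInterval() counts how many cut points are <= u; adding 1 gives the category index
blah <- lev[findInterval(u, breaks) + 1]
prop.table(table(factor(blah, levels = lev)))
```

Dropping the last cumulative value avoids an out-of-range index if floating-point error leaves `cumsum(p)` slightly below 1. A draw that lands exactly on a cut point goes to the next category here rather than the previous one, which has probability zero and does not affect the result in practice.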

score 4 · Answer 4 · answered Aug 15 '11 at 11:06

If you're a SAS user, recent versions provide a similar ability to draw from what SAS calls a "table" distribution, which is what you are looking for, as part of the Rand() function. See http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001466748.htm
