Let's say that I have a categorical variable which can take the values A, B, C and D. How can I generate 10000 random data points and control for the frequency of each? For example:
A = 10% B = 20% C = 65% D = 5%
Any ideas how I can do this?
Do you want the proportions in the sample to be exactly the proportions stated? Or do you want to represent sampling from a very large population with those proportions, so that the sample proportions will be close but not exact?
If you want the exact proportions then you can follow Brandon's suggestion and use the R sample function to randomize the order of a vector that has the exact proportions.
If you want to sample from the population, but not restrict the proportions to be exact then you can still use the sample function in R with the prob argument like so:
> x <- sample( LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05) )
> prop.table(table(x))
x
     A      B      C      D
0.0965 0.1972 0.6544 0.0519
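To see how far such a sample drifts from the stated targets, here is a minimal sketch. The seed value and the names `target` and `observed` are my own choices for illustration and are not part of the answer above.

```r
set.seed(42)  # arbitrary seed, used only so the run is reproducible

# Target proportions from the question, named by category
target <- c(A = 0.1, B = 0.2, C = 0.65, D = 0.05)

# Draw with replacement, weighting each category by its target probability
x <- sample(names(target), 10000, replace = TRUE, prob = target)

# Observed proportions, in the same category order as `target`
observed <- as.numeric(prop.table(table(factor(x, levels = names(target)))))

# Deviation of each observed proportion from its target
round(observed - target, 4)
```

With 10000 draws the deviations should typically be well under one percentage point, but they will not be exactly zero.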
Using R (http://cran.r-project.org/). All I'm doing here is creating a random list with the proportions you specified.
x <- c(rep("A",0.1*10000),rep("B",0.2*10000),rep("C",0.65*10000),rep("D",0.05*10000))
# cheating
x <- sample(x, 10000)
prop.table(summary(as.factor(x)))
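One caveat when building the vector from n * proportion: rep() truncates a non-integer count, and a floating-point product can land just below a whole number (0.29 * 100 evaluates to 28.999999999999996 in R, so rep() would give 28 copies). The values in this question happen to work out, but a hedged sketch that rounds the counts and checks the total might look like this. The names `n`, `probs` and `counts` are illustrative only.

```r
n <- 10000
probs <- c(A = 0.1, B = 0.2, C = 0.65, D = 0.05)

# Round to whole counts so floating-point error cannot drop an element
counts <- round(n * probs)

# Fail early if the rounded counts do not add up to n
stopifnot(sum(counts) == n)

# Build the exact-proportion vector, then shuffle its order
x <- sample(rep(names(probs), times = counts))
table(x)
```

If the rounded counts do not sum to n (for proportions that do not divide n evenly), you would have to decide which categories absorb the difference; this sketch simply stops.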
/me Waits patiently for argument over how truly random this is
n <- 10000
blah <- character(n)
u <- runif(n)
blah[u<=0.1] <- "A"
blah[u>0.1 & u<=0.3] <- "B"
blah[u>0.3 & u<=0.95] <- "C"
blah[u>0.95] <- "D"
table(blah)
prop.table(summary(as.factor(blah)))
I have no doubt this is truly random. I mean, to the extent that runif() is random :)
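The hand-written cut points above can be derived from the probabilities with cumsum(). Below is a sketch of that generalization; `draw_categories` is a hypothetical helper name I made up for illustration, and it relies on the `left.open` argument of findInterval() (available in R 3.3.0 and later) to reproduce the same `<=` boundaries.

```r
# Hypothetical helper, not from the original answer: draw n labels from
# `levels` with probabilities `probs` by thresholding uniform draws.
draw_categories <- function(n, levels, probs) {
  stopifnot(length(levels) == length(probs), abs(sum(probs) - 1) < 1e-8)
  u <- runif(n)
  # Interior cut points, e.g. 0.1, 0.3, 0.95 for the question's proportions
  cuts <- head(cumsum(probs), -1)
  # Number of cut points strictly below u, plus 1, is the category index
  idx <- findInterval(u, cuts, left.open = TRUE) + 1L
  levels[idx]
}

blah <- draw_categories(10000, c("A", "B", "C", "D"), c(0.1, 0.2, 0.65, 0.05))
prop.table(table(blah))
```

This does the same thing as the four assignments above, just without typing the thresholds by hand.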
prob argument for sample(): sample(LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05))
– caracal
Aug 11 '11 at 16:01
sample(, prob=) works (at least in Polish, this is called the roulette algorithm).
–
Aug 11 '11 at 18:15
If you're a SAS user, recent versions provide a similar ability to pull from what it calls a "table" distribution - which is what you are looking for, as part of the Rand() function. See http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001466748.htm
x <- rep( c("A","B","C","D"), 10000*c(0.1,0.2,0.65,0.05) ) and you don't need to specify the 10000 in the call to sample; that would be the default (though for clarity it does not hurt to specify it). – Greg Snow Aug 11 '11 at 16:23