
I have a population of about 7200 businesses from which to sample 2100 for a survey.

The sample is to be stratified, but I have no information whatsoever on the usual way to stratify this population, except that it is usually based on revenue intervals.

Rather than just finding a good stratification by trial and error, I was wondering if there is an algorithm which creates strata based on minimisation of variance within strata.

I've tried running my data through a clustering algorithm, but it doesn't work as no clear revenue intervals emerge from the clusters.

Is there some other algorithm or procedure to go by?

I'm using R or SAS.

Thanks in advance.

SiKiHe
  • First, you need to decide what you are trying to minimize the variance of. Once you do that, then any clustering algorithm (e.g., K-means) should get you what you need. What clustering approach did you use and on what variable? –  Apr 01 '16 at 11:03
  • The strata are based on revenue so it's the variance of revenue within the strata that needs to be minimised. I used the cluster R package, and I think it was the k-means algorithm, but the revenue intervals in the clusters all overlap, so that approach was useless! – SiKiHe Apr 01 '16 at 17:52
  • How many clusters did you use? You may need more clusters. –  Apr 01 '16 at 18:21
  • Try plotting a simple histogram of the variable you intend to use as the basis for your clustering. If the curve is very smooth, with little departure from a flat line or a standard probability distribution, there may be nothing to cluster on. In that case you may want to rethink your approach and use something like quartiles instead. This isn't the most rigorous analysis, but it is very quick and easy to do. Visually, you should be able to see whether the data can be clustered in some meaningful way. – drobertson Apr 01 '16 at 18:30
  • A second thing to remember is that, unless there is a reason to categorize things in a certain way, all categorizations or stratifications are arbitrary. I find that the majority of groupings are effectively arbitrary. If that is the case, consider the audience who will look at your work. Are there standard strata that they use often, industry standards, or common terms with definitions, e.g. Mid Cap vs. Small Cap companies? Put the results into terms they understand. Your work will have more value if it can be related to in different ways and compared with other analyses. – drobertson Apr 01 '16 at 18:38
  • @Bey: I've tried 3, 4, 5 and 6 clusters. The scree graph doesn't show a clear bend and it has a bump at around 10-12, but 12 clusters is too many. The usual approach is to make about 5 revenue intervals and use them for stratification. The businesses are divided into 10 business groups. Making histograms of them by group shows that they're all right-skewed, but not to the same degree. – SiKiHe Apr 01 '16 at 19:56
  • @SiKiHe well, as drobertson said, your sample may not be well stratified by revenue. –  Apr 01 '16 at 21:44
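
To make the suggestions in the comments above concrete, here is a minimal sketch, not a definitive method. It is written in Python purely for illustration (the thread itself uses R or SAS), and the function names `optimal_strata`, `quantile_cuts` and `total_within_ss`, the synthetic log-normal revenues, and the choice of 5 strata are all assumptions of this example. It compares equal-count (quantile) cut points with cut points chosen by dynamic programming to minimise the total within-stratum sum of squares of revenue. On a single variable, any such minimisation yields contiguous, non-overlapping intervals, so overlapping revenue ranges may indicate that the clustering was run on more than one variable.

```python
import random


def _prefix_sums(x):
    """Running sums of x and x**2, so any segment's sum of squares is O(1)."""
    s, sq = [0.0], [0.0]
    for v in x:
        s.append(s[-1] + v)
        sq.append(sq[-1] + v * v)
    return s, sq


def _segment_ss(s, sq, i, j):
    """Sum of squared deviations from the mean for the sorted slice x[i:j]."""
    total = s[j] - s[i]
    return (sq[j] - sq[i]) - total * total / (j - i)


def total_within_ss(x_sorted, cuts):
    """Total within-stratum sum of squares for strata starting at `cuts`."""
    s, sq = _prefix_sums(x_sorted)
    bounds = [0] + list(cuts) + [len(x_sorted)]
    return sum(_segment_ss(s, sq, a, b) for a, b in zip(bounds, bounds[1:]))


def quantile_cuts(n, k):
    """Start indices of strata 2..k when each stratum gets about n/k units."""
    return [n * g // k for g in range(1, k)]


def optimal_strata(values, k):
    """Exact 1-D split into k contiguous strata minimising within-stratum SS.

    Returns (sorted values, start indices of strata 2..k, minimal SS).
    Cost is O(k * n**2), so this is only meant for small illustrations.
    """
    x = sorted(values)
    n = len(x)
    s, sq = _prefix_sums(x)
    inf = float("inf")
    # cost[g][j]: best SS when the first j sorted values form g strata
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    start = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):  # last stratum is x[i:j]
                c = cost[g - 1][i] + _segment_ss(s, sq, i, j)
                if c < cost[g][j]:
                    cost[g][j] = c
                    start[g][j] = i
    # Walk back from the full data set to recover where each stratum starts.
    cuts, j = [], n
    for g in range(k, 1, -1):
        j = start[g][j]
        cuts.append(j)
    cuts.reverse()
    return x, cuts, cost[k][n]


# Synthetic, right-skewed "revenue" values (an assumption, not real data).
random.seed(1)
revenue = [random.lognormvariate(13, 1.2) for _ in range(300)]

x, opt_cuts, opt_ss = optimal_strata(revenue, k=5)
q_cuts = quantile_cuts(len(x), 5)

print("optimal upper bounds: ", [round(x[c - 1]) for c in opt_cuts])
print("quantile upper bounds:", [round(x[c - 1]) for c in q_cuts])
print("SS ratio (optimal / quantile):", opt_ss / total_within_ss(x, q_cuts))
```

The exhaustive search is quadratic in the number of businesses, so for roughly 7200 records it may be more practical to run it on binned revenue or within each business group. With a strongly right-skewed variable, the variance-minimising cut points tend to isolate the few largest businesses in small top strata, which is why they can differ considerably from quantile cut points.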
