I keep reading that people are estimating the size of the Google+ user population statistically:

My model is simple. I start with US Census Bureau data about surname popularity in the U.S., and compare it to the number of Google+ users with each surname. I split the U.S. users from the non-U.S. users. By using a sample of 100-200 surnames, I am able to accurately estimate the total percentage of the U.S. population that has signed up for Google+. Then I use that number and a calculated ratio of U.S. to non-U.S. users to generate my worldwide estimates. My ratio is 1 US user for every 2.12 non-U.S. users. That ratio was calculated on July 4th through a laborious effort, and I haven't updated it since. That is definitely a weakness in my model that I hope to address soon. The ratio will likely change over time.

How is this possible? I don't see how a fixed sample size tells you what percentage of the U.S. population is participating. Let's take 2 cases:

  • case 1: there are 10,000 Google+ users
  • case 2: there are 1,000,000 Google+ users

Why would the samples be statistically different?

Jason S
  • Ehrm. At the end of the link you provide, there's the text: "To estimate Google+'s population, Allen tracks surnames that appear on the site. Here's how he describes his methodology", after which it is explained how this is possible. What more do you need? – Nick Sabbe Jul 12 '11 at 12:15
  • @Nick: I read that, and I'm still confused. Elaborated above. – Jason S Jul 12 '11 at 12:55

2 Answers

This exercise will be pretty useless unless the sample of surnames is statistically sound, i.e., a random sample with known probabilities of selection. Otherwise, you are estimating the number of female drivers by first picking a color (say yellow), counting the fraction of female drivers in the yellow cars, and then obtaining the estimate of the population total as (total # of cars) * (fraction of women drivers based on the yellow cars). If you did not pick your color at random (and preferably repeat this selection a bunch of times to ensure better coverage of different types of cars), God only knows how good your estimate might be.

Getting a good sample of US surnames is far from a trivial task. The studied distributions of surnames are very odd, to say the least. Most surnames are very rare, with just a handful of people sharing each one (mine is an example). On the other hand, a few surnames (Smith, Johnson, Williams) may cover as much as 1% of the population.

The problem of weird distributions is frequently encountered in establishment surveys, where you have monstrous corporations like Microsoft and Adobe alongside millions of corner shops with two or three local geeks. One practice in working with distributions like that is to perform probability proportional to size sampling: you take the whole list, but you sample the surnames (or companies) with greater probability if they represent a greater share of the total. You then control for the unequal probabilities of selection with weights. Another approach is to use cut-off sampling: you sample all the surnames with frequency greater than (businesses with sales greater than) 0.1% of the total, and then wave hands about the potential non-sampling error for the remaining surnames.
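To make the two designs concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the surname counts are made up, the helper names (`pps_sample`, `hansen_hurwitz_total`, `cutoff_sample`) are hypothetical, and the PPS draw is done with replacement so that the simple Hansen-Hurwitz weighting (average of count / selection probability) applies. It is not a reconstruction of how the Google+ estimate was actually sampled.

```python
import random

def pps_sample(sizes, n, rng):
    """Draw n surnames with replacement, probability proportional to size."""
    names = list(sizes)
    total = sum(sizes.values())
    # Per-draw selection probability of each surname.
    probs = {name: sizes[name] / total for name in names}
    draws = rng.choices(names, weights=[sizes[name] for name in names], k=n)
    return draws, probs

def hansen_hurwitz_total(draws, probs, y):
    """Estimate the population total of y as the mean of y_i / p_i over the draws."""
    return sum(y[name] / probs[name] for name in draws) / len(draws)

def cutoff_sample(sizes, threshold_share):
    """Keep every surname whose share of the total exceeds the threshold."""
    total = sum(sizes.values())
    return [name for name, size in sizes.items() if size / total > threshold_share]

# Hypothetical surname counts (NOT census figures): three huge names, three rare ones.
surname_counts = {
    "Smith": 2_400_000, "Johnson": 1_900_000, "Williams": 1_600_000,
    "RareA": 3_000, "RareB": 2_000, "RareC": 1_000,
}
# Hypothetical Google+ user counts: exactly 1% of each surname has signed up.
user_counts = {name: size // 100 for name, size in surname_counts.items()}

rng = random.Random(0)  # fixed seed so the draw is reproducible
draws, probs = pps_sample(surname_counts, n=4, rng=rng)
estimate = hansen_hurwitz_total(draws, probs, user_counts)
true_total = sum(user_counts.values())
big_names = cutoff_sample(surname_counts, threshold_share=0.001)

print(f"PPS estimate of total users: {estimate:.0f} (true total {true_total})")
print("Cut-off sample (> 0.1% share):", big_names)
```

Because the made-up sign-up rate is identical for every surname, the weighted estimate matches the true total whichever surnames get drawn. With real data the rate varies by surname, and the estimate would scatter around the truth. The cut-off sample simply ignores the rare surnames, which is where the hand-waving comes in.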

StasK

Two assumptions are made: (1) the proportion of US citizens among all people is the same within the Google+ population as in the global population, and (2) for US citizens, the proportion of people with any given surname among all US citizens is (on average) the same within the Google+ population as in the US population as a whole.

So: you take, say, 200 surnames, and count how many US Google+ subscribers there are with these surnames ($USG_s$). Given assumption (2) (say the rate is $r_s$, found by dividing the number of US citizens with these surnames by the total number of US citizens), an estimate for the total number of US subscribers is found like this:

$USG\sim USG_s/r_s$

Then, using assumption (1) you can use the same 'trick' to find an estimate of the total number of Google+ users.
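As a quick numeric sketch of the estimate $USG\sim USG_s/r_s$, here is the arithmetic in Python. All the counts are hypothetical, and so are the function names. For the worldwide step I used the ratio of 1 US user to 2.12 non-US users quoted in the question instead of global population shares, so treat that part as one possible reading of the method.

```python
def estimate_us_users(sample_user_count, sample_population_share):
    """USG ~ USG_s / r_s: scale the sampled-surname count up to all US users."""
    if not 0 < sample_population_share <= 1:
        raise ValueError("r_s must be a share in (0, 1]")
    return sample_user_count / sample_population_share

def estimate_worldwide_users(us_users, non_us_per_us_user=2.12):
    """Add the non-US users implied by the quoted 1 : 2.12 ratio."""
    return us_users * (1 + non_us_per_us_user)

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
us_population = 300_000_000
people_with_sampled_surnames = 15_000_000   # US residents bearing the ~200 sampled surnames
r_s = people_with_sampled_surnames / us_population   # share of the US population: 0.05
usg_s = 50_000                               # US Google+ users found with those surnames

usg = estimate_us_users(usg_s, r_s)          # about 1,000,000 US users
world = estimate_worldwide_users(usg)        # about 3,120,000 users worldwide

print(f"r_s = {r_s:.3f}")
print(f"estimated US users: {usg:,.0f}")
print(f"estimated worldwide users: {world:,.0f}")
```

Note that $r_s$ is computed once for the whole set of sampled surnames, not name by name. The count $USG_s$ is what changes as Google+ grows, which is why a fixed set of surnames can still track the total.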

Simply put: if there are fewer people who are Google+ subscribers, there will be fewer US citizens who are Google+ subscribers (assumption (1)). By this and assumption (2), there will also be fewer US citizens with a given surname who are Google+ subscribers.

Nick Sabbe
  • ah -- somehow I was thinking he was using a sample of 200 users, not 200 surnames. Got it. – Jason S Jul 12 '11 at 13:41
  • @Nick Sabbe: can you explain in a bit more detail what this line means: $USG\sim USG_s/r_s$? $USG_s$ = the count of Google+ users with the given sample of 200 names; $r_s$ = the number of US citizens with a given surname divided by the total number of US citizens. Does $\sim$ mean that you do this for every single name, and then divide $USG_s$ by $r_s$ for every single name? –  Jul 22 '11 at 03:19
  • Also, when estimating the non-US population, do you just multiply the US estimates by 2.12? –  Jul 22 '11 at 03:23