6

I asked 312 people how many times they visited their preferred local supermarket in a month.

The results are as follows:

  • 5% did not visit at all
  • 7% visited once a month
  • 33% visited twice a month
  • 22% visited three times a month
  • 15% visited four times a month
  • 18% visited five and more times a month

In the absence of the actual number of visits (I only have the percentage of patrons, as above), how do you calculate the mean and standard deviation for reporting purposes?

Jeromy Anglim
  • 44,984
Adhesh Josh
  • 3,255
  • 2
    Do you have any domain specific knowledge about how the "five and more times" category would be distributed across counts of 5, 6, 7, etc.? – Jeromy Anglim Oct 19 '11 at 06:54
  • @JeromyAnglim Not really (sorry!). It is all lumped together in this big category. Is there a way out of this predicament? – Adhesh Josh Oct 19 '11 at 12:16

2 Answers

4

You need to be creative, because these data are consistent with any mean of at least $0\times .05 + 1\times .07 + \cdots + 5\times .18 = 2.89$ and any standard deviation of at least $1.38$ (both minima are attained by assuming nobody visited more than five times per month).
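As a check on those two bounds, here is the arithmetic written out, assuming everyone in the top class made exactly five visits ($X$ is a symbol introduced here for the number of monthly visits):

$$E[X^2] = 0^2\times .05 + 1^2\times .07 + 2^2\times .33 + 3^2\times .22 + 4^2\times .15 + 5^2\times .18 = 10.27$$

$$\mathrm{SD}(X) = \sqrt{E[X^2] - E[X]^2} = \sqrt{10.27 - 2.89^2} = \sqrt{1.9179} \approx 1.38$$

Shifting any of that 18% to six or more visits can only raise both figures.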

For reporting purposes, simply tabulate or graph the raw data:

Bar chart

If you must have a summary of location and spread, use alternative measures that can uniquely be found from these data. The median is between 2 and 3, because 45% visited 2 times or fewer and 67% visited 3 times or fewer. You might simply interpolate linearly and report a median of 2.3 visits per month. For the spread, use (say) an interquartile range, also computed with linear interpolation. I find Q1 is 1.4 and Q3 is 3.3, for an IQR of 1.9.
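A minimal sketch of one way to do such an interpolation, assuming the cumulative share is attached to each integer count and joined by straight lines. The function name and this convention are illustrative assumptions; other conventions, or reading values off a graph, can shift the results by a few tenths, so the output need not reproduce the rounded figures quoted above.

```python
# Shares from the question for 0, 1, 2, 3 and 4 visits.
# The open-ended "five and more" class only matters above the 82nd percentile.
shares = [0.05, 0.07, 0.33, 0.22, 0.15]


def interpolated_quantile(q, shares):
    """Quantile by linear interpolation of the cumulative shares.

    Assumed convention: the cumulative share F(k) sits at the integer k, and
    the quantile is read off the line joining (k-1, F(k-1)) and (k, F(k)).
    """
    cum_prev = 0.0
    for k, share in enumerate(shares):
        cum = cum_prev + share
        if q <= cum:
            if k == 0:
                return 0.0
            return (k - 1) + (q - cum_prev) / share
        cum_prev = cum
    raise ValueError("quantile falls in the open-ended top class")


median = interpolated_quantile(0.50, shares)
q1 = interpolated_quantile(0.25, shares)
q3 = interpolated_quantile(0.75, shares)
iqr = q3 - q1
print(f"median={median:.2f}  Q1={q1:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
```

Unlike the mean and SD, none of these quantities depends on what value is assigned to the top class, which is why they can be computed without further assumptions.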

To go beyond that, you need to fit the data with a distribution, which requires assumptions and therefore is not just reporting. But it can be useful. However, these data are elusive: they will not fit standard models like Binomial or Poisson. (I recommend against trying to fit discretized versions of continuous distributions, such as Lognormal, because it's hard to find any reason why they should fit: they don't form informative bases for comparison. Moreover, since there are only six values here, it would be almost worthless to use more than one parameter in the modeling: two or more parameters give too much flexibility.)

As an example of the insight that might be afforded by a simple distributional fit, suppose the visits are made randomly over time by individuals and each individual has the same probability (per unit time) of visiting. This is potentially a useful and interesting framework against which these data can be compared. It leads to a Poisson distribution. The best fit (in a chi-squared sense) is achieved with an intensity of 3.185 per month; this also is the variance (whence the standard deviation is $\sqrt{3.185} \approx 1.8$).

Data and Poisson fit

This is not a good fit (as a chi-squared test will show, but the eye plainly sees): there are too many people reporting 2 visits and too few reporting 1 visit. That perhaps is the most interesting thing about this analysis. You could announce these results like this:

The median number of monthly visits among the respondents is 2.3 (with an IQR of 1.9). The data depart significantly from a (best fit) Poisson distribution with a mean of 3.18 visits per month in that 19 fewer people than expected report one visit and 37 more people than expected report two visits.
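The comparison behind a statement like this can be sketched as follows. The sketch takes the reported intensity of 3.185 as given and does not re-fit it; the variable names are illustrative, and the "5 or more" class is compared with the whole Poisson upper tail.

```python
import math

N_RESPONDENTS = 312
LAMBDA = 3.185  # intensity reported in the answer; taken as given, not re-fitted

# Observed shares for 0, 1, 2, 3, 4 and "5 or more" visits.
observed_share = [0.05, 0.07, 0.33, 0.22, 0.15, 0.18]


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events at intensity lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Model shares for 0..4, then the whole upper tail 5, 6, 7, ... as one class.
model_share = [poisson_pmf(k, LAMBDA) for k in range(5)]
model_share.append(1.0 - sum(model_share))

observed = [N_RESPONDENTS * s for s in observed_share]
expected = [N_RESPONDENTS * s for s in model_share]
excess = [o - e for o, e in zip(observed, expected)]

# Pearson chi-squared statistic over the six classes.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

labels = ["0", "1", "2", "3", "4", "5+"]
for label, o, e, d in zip(labels, observed, expected, excess):
    print(f"{label:>2}: observed {o:6.1f}  expected {e:6.1f}  difference {d:+6.1f}")
print(f"chi-squared = {chi_squared:.1f}")
```

The differences for one and two visits are the ones quoted in the announcement.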

Incidentally, a Poisson fit suggestively fills in the upper tail of "5 or more visits," providing quantitative hypotheses that could be tested in follow-on surveys:

Poisson fit

Other distributions would give different extrapolations into this upper range.
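As a rough illustration of such an extrapolation, the sketch below lists how a Poisson model with the reported intensity of 3.185 would spread respondents across 5, 6, 7, ... visits, and the mean it implies within that class. The names are illustrative, and the numbers are model-implied hypotheses, not data.

```python
import math

N_RESPONDENTS = 312
LAMBDA = 3.185  # intensity reported in the answer; taken as given


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events at intensity lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Model probability of the whole "5 or more" class.
tail_prob = 1.0 - sum(poisson_pmf(k, LAMBDA) for k in range(5))

# Model-implied number of respondents at each count inside the tail.
for k in range(5, 11):
    count = N_RESPONDENTS * poisson_pmf(k, LAMBDA)
    print(f"{k:2d} visits: about {count:5.1f} respondents")

# Model-implied mean within the tail:
# E[X | X >= 5] = (lambda - sum_{k<5} k * P(X = k)) / P(X >= 5)
tail_mean = (LAMBDA - sum(k * poisson_pmf(k, LAMBDA) for k in range(5))) / tail_prob
print(f"P(X >= 5) = {tail_prob:.3f}, E[X | X >= 5] = {tail_mean:.2f}")
```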

whuber
  • 322,774
  • About the curve fitting: would you use all the available categories? And do you use a numerical value of 5 for the last one? – Simone Oct 19 '11 at 18:36
  • @Simone For the fitting, the category $\ge 5$ was replaced by all values $5, 6, 7, \ldots$. In this fashion we avoid making an arbitrary assignment of a single value to this category. The price we pay is to push the arbitrariness back to our original assumption of a Poisson distribution: there's no avoiding making some assumption, if meaningful statistics like a mean or SD are to be computed. But at least this assumption is clearly stated and it serves as a useful, interesting point of reference for evaluating the entire distribution. – whuber Oct 19 '11 at 19:43
  • Interesting. And how do you do it in practice? I think a fitting procedure takes as input the values and their empirical frequencies. Do you need to assign a particular frequency to $5,6,7,\dots$? – Simone Oct 19 '11 at 20:03
  • @Simone Yes: the frequency assigned to ${5,6,7,\ldots}$ is given in the question as 0.18. – whuber Oct 19 '11 at 20:19
2

You definitely have to associate a numerical value with the class "visited five and more times a month".

Apart from that, I would calculate the mean and the standard deviation in the usual way. Here the $x_i$ are your values and the $p_i$ are their empirical frequencies estimated from the sample. In your case $$x_0=0, \quad x_1=1, \quad x_2=2, \quad x_3=3, \quad x_4=4, \quad x_5=6$$ (you have to choose $x_5$) and $$p_0=0.05, \quad p_1=0.07, \quad p_2=0.33, \quad p_3=0.22, \quad p_4=0.15, \quad p_5=0.18$$
Thus $$\bar{x} = \sum_{i=0}^{5}x_i p_i$$ and $$\sigma=\sqrt{\sum_{i=0}^{5}(x_i - \bar{x})^2 p_i}$$ It could also be interesting to drop $x_0$ and $p_0$ and rescale the remaining $p_i$ so that they sum to 1. That gives the average number of visits among people who visit the supermarket at all.
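A minimal sketch of these two formulas, including the rescaling that drops the non-visitors. The value $x_5=6$ is the placeholder used above and is an arbitrary choice; the function and variable names are illustrative.

```python
import math

values = [0, 1, 2, 3, 4, 6]  # x_5 = 6 is a placeholder for "five and more", an arbitrary choice
shares = [0.05, 0.07, 0.33, 0.22, 0.15, 0.18]


def weighted_mean_sd(values, shares):
    """Mean and (population) standard deviation of values weighted by shares."""
    mean = sum(x * p for x, p in zip(values, shares))
    variance = sum((x - mean) ** 2 * p for x, p in zip(values, shares))
    return mean, math.sqrt(variance)


mean_all, sd_all = weighted_mean_sd(values, shares)

# Drop the non-visitors and rescale the remaining shares so they sum to 1.
visitor_total = sum(shares[1:])
visitor_shares = [p / visitor_total for p in shares[1:]]
mean_visitors, sd_visitors = weighted_mean_sd(values[1:], visitor_shares)

print(f"all respondents: mean={mean_all:.2f}, sd={sd_all:.2f}")
print(f"visitors only:   mean={mean_visitors:.2f}, sd={sd_visitors:.2f}")
```

Changing the last entry of `values` shows how strongly both statistics depend on the value chosen for the top class.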

Simone
  • 7,078
  • 1
    Thanks! How do I associate a numerical value with the class "five and more times"? – Adhesh Josh Oct 19 '11 at 12:18
  • Can you ask some of the 56 people who gave that answer how many visits they meant? Otherwise, speaking empirically, the distribution is likely to be roughly normal with a mean of 2-3 (judging by the shape of the histogram) and some variance, so any value around 6 could be a reasonable choice for that category. – Simone Oct 19 '11 at 14:01
  • Now I'm really guessing. A more technical and complicated approach, which might not be necessary in this case, would be to fit $x_i$ and $p_i$ for $i=0,\dots,4$ with a log-normal distribution using statistical software (e.g. Matlab) and then calculate the centroid of that function over the interval $[4.5, +\infty)$. But it might not be a log-normal distribution, because the pdf is not continuous at 0. Maybe simple curve fitting with statistical software, without reference to a particular pdf, is better. – Simone Oct 19 '11 at 14:04
  • 1
    Both the mean and the SD computed with the nominal data values will be biased low if you use $x_5=5$; if you use any other value for $x_5$, you're being arbitrary. – whuber Oct 19 '11 at 16:32