Weighting the averages of 11 groups with very different sample sizes

Question

I've read through a lot of answers on here but haven't found one that addresses the following problem. I've got a dataset of product ratings, each a rating out of 5.

I've got this data for 11 different products, with the number of samples in each group as follows:

Product 1: n = 437
Product 2: n = 261
Product 3: n = 28
Product 4: n = 29
Product 5: n = 37
Product 6: n = 25
Product 7: n = 165
Product 8: n = 105
Product 9: n = 31
Product 10: n = 31
Product 11: n = 34

I want to find the best product, but can't just get all the averages because the sample sizes are so different. What process is best applied here to weight the average scores based on sample size?

I've had a look at ANOVA and run an analysis in Excel, but not sure if this is the right way to go or how to interpret the results.

Cheers!

score 1 · Answer 1 · answered Sep 30 '22 at 09:49

If I understand you correctly, you have eleven products with a mean rating for each, and you want to pick the "best" product, meaning the one with the highest rating. Still, you worry that the ratings may not be comparable because of the varying sample sizes.

You are correct that the sample size will affect how precise the means are. The standard error of a mean is $\sigma/\sqrt n$, so it shrinks in proportion to $1/\sqrt n$ as the sample size grows; this is how much the sample size alone affects the precision of the ratings.

Do you know the standard deviations of the ratings as well? If you knew them, you could calculate the confidence intervals for the means. If product $A$ has a higher rating than $B$, but $B$'s confidence interval overlaps with the mean rating of $A$ then we cannot conclude that they are statistically different. This wouldn't tell you which product is best, but will tell you which products aren't necessarily different from each other in terms of ratings.
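As a rough sketch of that comparison, the snippet below computes normal-approximation 95% confidence intervals from a mean, a standard deviation, and a sample size. The sample sizes are the ones listed in the question for Products 1 and 10, and the means are the ones the asker quotes elsewhere in this thread. The standard deviation of 1.2 is a made-up placeholder, and `mean_ci` is just an illustrative helper name.

```python
from statistics import NormalDist

def mean_ci(mean, sd, n, level=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    half_width = z * sd / n ** 0.5             # z times the standard error
    return mean - half_width, mean + half_width

# Sample sizes come from the question; the SD of 1.2 is an ASSUMED placeholder.
ci_p1 = mean_ci(mean=2.89, sd=1.2, n=437)
ci_p10 = mean_ci(mean=3.01, sd=1.2, n=31)

print("Product 1 :", ci_p1)
print("Product 10:", ci_p10)

# Two intervals overlap when each one starts before the other one ends.
overlaps = ci_p1[0] <= ci_p10[1] and ci_p10[0] <= ci_p1[1]
print("Intervals overlap:", overlaps)
```

Under these assumed numbers the interval for the product with 31 ratings is much wider than the one for the product with 437 ratings, and the two intervals overlap, so this check alone would not separate them.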

The idea of considering the uncertainty in optimization is not new. In fact, in Bayesian optimization, we use acquisition functions for that. One of the optimization criteria used is the upper confidence bound, where we look at $\mu + c\sigma$ (where $c$ is some constant) to pick the most promising value to explore. Such an optimization strategy considers the exploration-exploitation trade-off and picks the value that is "best among uncertain", where the $c$ parameter controls how much uncertainty you want to consider, or how eager you are to explore versus exploit.

The last example of Bayesian optimization shows an important problem here: what you consider "best" is subjective. It will depend on how risk-averse you are. For the products with less data, you are simply more uncertain about the ratings, and the question is how much uncertainty you are willing to accept. If you can, gather more data; if you can't, it's your bet.
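To make the role of $c$ concrete, here is a minimal sketch that scores each product by $\mu + c\sigma$, taking $\sigma$ to be the standard error of the mean (that choice is an assumption made for this illustration, not part of the Bayesian optimization literature). The sample sizes are from the question, the means are the ones the asker quotes elsewhere in this thread, and the standard deviation of 1.2 and the names `bound_score` and `best_product` are hypothetical.

```python
def bound_score(mean, sd, n, c):
    """Mean plus c standard errors; c > 0 is optimistic, c < 0 is cautious."""
    return mean + c * sd / n ** 0.5

# name -> (mean, sd, n); n is from the question, the SD of 1.2 is ASSUMED.
products = {
    "Product 1": (2.89, 1.2, 437),
    "Product 10": (3.01, 1.2, 31),
}

def best_product(c):
    """Return the product name with the highest mean + c * standard error."""
    return max(products, key=lambda name: bound_score(*products[name], c))

print(best_product(c=1.0))   # optimistic: rewards uncertainty
print(best_product(c=-1.0))  # risk-averse: penalizes uncertainty
```

With these placeholder numbers, $c = 1$ picks Product 10 and $c = -1$ picks Product 1, which is the subjectivity described above: the ranking flips depending on how you treat uncertainty.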

Finally, people sometimes "vote with their feet" so the information about the number of ratings can be important by itself.

Brant Inman · Answer 2 · 2021-02-05T02:45:49.323

0

Let's say you have two products with the following ratings.

Rating ID	Product	Value (0-5)
1	A	5
2	A	4
3	A	2
4	A	3
5	A	5
6	A	5
7	A	4
8	B	2
9	B	3
10	B	3

The overall average rating would be: $$\mu_{overall}=\frac{\sum{Value}}{n}=\frac{5+4+2+3+5+5+4+2+3+3}{10}=3.6$$

The average rating for each product would be: $$\mu_{A}=\frac{5+4+2+3+5+5+4}{7}=4$$ $$\mu_{B}=\frac{2+3+3}{3}= 2.67$$

If the product with the highest rating (Value in the example dataset) is "best", then product A is better than B. Note that each product's average already accounts for its sample size through its denominator (the number of times the product was rated), and that these denominators add up to the total number of ratings.

Note that if you only had the product averages, you could calculate a weighted overall average. Here the weights are equivalent to the proportion of the total ratings that the particular product got. For example,

$$\mu_{overall}=\mu_A*Wt_A+\mu_B*Wt_B=(4*\frac{7}{10})+(2.66667*\frac{3}{10}) = 3.6$$
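The same calculation can be sketched in a few lines of Python, using the example ratings for products A and B from the table above. The function name `weighted_overall_mean` is only illustrative.

```python
def weighted_overall_mean(means, counts):
    """Combine per-product means, weighting each by its share of all ratings."""
    total = sum(counts)
    return sum(m * n / total for m, n in zip(means, counts))

ratings_a = [5, 4, 2, 3, 5, 5, 4]
ratings_b = [2, 3, 3]

mean_a = sum(ratings_a) / len(ratings_a)
mean_b = sum(ratings_b) / len(ratings_b)

overall_from_means = weighted_overall_mean(
    [mean_a, mean_b], [len(ratings_a), len(ratings_b)]
)
# Pooling every individual rating gives the same overall average.
overall_from_raw = sum(ratings_a + ratings_b) / len(ratings_a + ratings_b)

print(overall_from_means, overall_from_raw)
```

Both routes give 3.6 (up to floating-point rounding), which is why the product averages plus their counts are enough to recover the overall average.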

edited Feb 05 '21 at 02:45

answered Feb 05 '21 at 02:39


Thank you! My question from here, though, is whether the averages of each product are equally valid if the sample sizes are very different.
E.g., Product 1 has 437 ratings and gets an average score of 2.89, while Product 10 has 31 ratings but an average score of 3.01.

Can we really say Product 10 is the better product? Does the difference in the sample size need to be considered?
– eightyfish Feb 05 '21 at 03:58
Yes, they are equally valid point estimates of the mean. However, the precision of the estimates will differ for each product, and hence so will their confidence intervals. For example, if the scores are normally distributed (which may or may not be a good assumption for an ordinal variable with 6 potential values, ratings 0 through 5; you can check this by plotting their histograms), you might use the formula for the confidence interval given here: https://www.statisticshowto.com/probability-and-statistics/confidence-interval/ – Brant Inman Feb 06 '21 at 12:38
Thank you! I followed your advice and found that the highest ranking product in my data (based on average score) has an average which is higher than one 95% confidence interval above the average of the product below it. I feel this means my average scores are usable? Thanks again. – eightyfish Feb 08 '21 at 02:04
I am not sure what you mean by "usable". If the question is "Which product has the highest average rating?", then the product with the highest average rating is the best. If the question is "Is the highest-rated product statistically significantly higher than the next closest one?", you could examine the 95% confidence intervals of the two averages; if they do not overlap, the averages are likely statistically different. Another way to test this is to use a two-sample t-test to compare both product scores. – Brant Inman Feb 09 '21 at 03:27
Ok cool, that's basically what I've done. Thanks again Brant, I really appreciate your help :) – eightyfish Feb 10 '21 at 05:48
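For reference, the two-sample t-test mentioned in the comments can be sketched with the standard library alone. This is the Welch (unequal-variance) version, chosen here as an assumption because the group sizes differ so much; the data are the example ratings for products A and B from the second answer, and `welch_t` is a hypothetical helper name. It returns only the test statistic and the approximate degrees of freedom; a statistics package would be needed to turn these into a p-value.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

ratings_a = [5, 4, 2, 3, 5, 5, 4]
ratings_b = [2, 3, 3]

t_stat, dof = welch_t(ratings_a, ratings_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

With only three ratings for product B this is purely illustrative, and the normality caveat raised in the comments applies here as well.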
