17

Thanks in advance for bearing with me, I am not a statistician of any kind and don't know how to describe what I'm imagining, so Google isn't helping me here...

I'm including a rating system in a web application I'm working on. Each user can rate each item exactly once.

I was imagining a scale with 4 values: "strongly dislike", "dislike", "like", and "strongly like", and I had planned on assigning these values of -5, -2, +2, and +5 respectively.

Now, if every item were going to have the same number of ratings, then I would be quite comfortable with this scoring system as clearly differentiating the most liked and least liked items. However, the items will not have the same number of ratings, and the disparity between the number of votes on different items may be quite dramatic.

In that case, comparing the cumulative scores on two items means that an old item with a lot of mediocre ratings is going to have a much higher score than an exceptional new item with many fewer votes.

So, the first obvious thing I thought of was to take an average... but now if an item has only one rating of "+5", it has a better average than an item with 99 "+5" ratings and 1 "+2" rating. Intuitively that isn't an accurate representation of the popularity of an item.

I imagine this problem is common and you guys don't need me to belabor it with more examples, so I'll stop at this point and elaborate in comments if needed.

My questions are:

  1. What is this kind of problem called, and is there a term for the techniques used to solve it? I'd like to know this so I can read up on it.
  2. If you happen to know of any lay-friendly resources on the subject, I'd very much appreciate a link.
  3. Finally, I'd appreciate any other suggestions about how to effectively collect and analyze this kind of data.
Andrew

2 Answers

22

One way you can combat this is to use the proportions in each category, which does not require you to assign numbers to each category (you can leave it as "80% rated it 'strongly like'"). However, proportions do suffer from the small-number-of-ratings issue. This shows up in your example: the photo with one +5 rating would get a higher average score (and proportion) than the one with 99 +5 ratings and one +2 rating. This doesn't fit well with my intuition (and, I suspect, most people's).

One way to get around this small sample size issue is to use a Bayesian technique known as "Laplace's rule of succession" (searching this term may be useful). It simply involves adding 1 "observation" to each category before calculating the probabilities. If you wanted to take an average for a numerical value, I would suggest a weighted average where the weights are the probabilities calculated by the rule of succession.

For the mathematical form, let $n_{sd},n_{d},n_{l},n_{sl}$ denote the number of responses of "strongly dislike", "dislike", "like", and "strongly like" respectively (in the two examples, $n_{sl}=1,n_{sd}=n_{d}=n_{l}=0$ and $n_{sl}=99,n_{l}=1,n_{sd}=n_{d}=0$). You then calculate the probability (or weight) for "strongly like" as

$$Pr(\text{"Strongly Like"}) = \frac{n_{sl}+1}{n_{sd}+n_{d}+n_{l}+n_{sl}+4}$$

For the two examples you give, these give probabilities of "strongly like" of $\frac{1+1}{1+0+0+0+4}=\frac{2}{5}$ and $\frac{99+1}{99+1+0+0+4}=\frac{100}{104}$, which I think agree more closely with "common sense". Removing the added constants gives $\frac{1}{1}$ and $\frac{99}{100}$, which makes the first outcome seem higher than it should be (at least to me anyway).
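This probability calculation can be sketched in Python as follows (the function name is my own, not from the answer):

```python
def succession_probabilities(n_sd, n_d, n_l, n_sl):
    """Laplace's rule of succession: add one pseudo-observation to each
    of the four rating categories before normalizing to probabilities.
    Returns probabilities in the order (sd, d, l, sl)."""
    counts = [n_sd, n_d, n_l, n_sl]
    total = sum(counts) + len(counts)  # +1 pseudo-count per category
    return [(c + 1) / total for c in counts]

# The two examples from the answer:
p1 = succession_probabilities(0, 0, 0, 1)    # a single "+5" rating
p2 = succession_probabilities(0, 0, 1, 99)   # 99 "+5" and 1 "+2" ratings
# p1[3] is 2/5 and p2[3] is 100/104, matching the fractions above
```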

The respective scores are just given by the weighted average, which I have written below as:

$$Score=\begin{array}{l} 5\frac{n_{sl}+1}{n_{sd}+n_{d}+n_{l}+n_{sl}+4}+2\frac{n_{l}+1}{n_{sd}+n_{d}+n_{l}+n_{sl}+4} \\ \quad - 2\frac{n_{d}+1}{n_{sd}+n_{d}+n_{l}+n_{sl}+4} -5\frac{n_{sd}+1}{n_{sd}+n_{d}+n_{l}+n_{sl}+4}\end{array}$$

Or more succinctly as

$$Score=\frac{5 n_{sl}+ 2 n_{l} - 2 n_{d} - 5 n_{sd}}{n_{sd}+n_{d}+n_{l}+n_{sl}+4}$$

This gives scores in the two examples of $\frac{5}{5}=1$ and $\frac{497}{104}\approx 4.8$, which I think shows an appropriate difference between the two cases.
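A minimal Python sketch of the closed-form score (again, the function name is mine), reproducing the two worked examples:

```python
def succession_score(n_sd, n_d, n_l, n_sl):
    """Weighted average of the category values (-5, -2, +2, +5) using
    rule-of-succession weights. The +1 pseudo-counts cancel in the
    numerator (5 + 2 - 2 - 5 = 0), leaving the closed form above."""
    total = n_sd + n_d + n_l + n_sl + 4
    return (5 * n_sl + 2 * n_l - 2 * n_d - 5 * n_sd) / total

print(succession_score(0, 0, 0, 1))    # 5/5 = 1.0
print(succession_score(0, 0, 1, 99))   # 497/104, approximately 4.78
```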

This may have been a bit "mathsy" so let me know if you need more explanation.

onestop
  • That was a bit "mathsy" for me, and initially I didn't understand the formula, but I read it carefully about three times and it clicked! This is exactly what I was looking for, and your explanation was very clear, even for someone who isn't a mathematician or statistician at all. Thank you very much! – Andrew Jan 19 '11 at 00:11
  • 6
    Very nice non-technical answer, and an approach I wouldn't have thought of myself. I'd only add that it's possible to add any number of fake 'observations' to each category instead of 1, including non-integer numbers. This gives you flexibility to decide how much you want to 'shrink' towards zero the scores of items with few votes. And if you happen to want a technical-sounding description of this method, you could say you're performing a Bayesian analysis of data from a multinomial distribution using a symmetric Dirichlet prior. – onestop Jan 19 '11 at 09:28
  • 2
    While they may seem like "fake" observations, they do have a well defined meaning when it is +1 (as opposed to +2 or higher, which really are "fake" numbers, or numbers from a previous data collection). It basically describes a state of knowledge that it is possible for each category to be voted for, prior to observing any data. This is precisely what the flat prior on the (N-1) simplex does. – probabilityislogic Jan 19 '11 at 13:25
  • 1
    One more observation, for future people who find this post: In implementing this in my model I took the final score and multiplied it by 20, which gives a range of -100 to 100 from worst to best possible score (though I suppose technically those are limits you can't ever quite reach, but you get the idea). This makes the output for users in my app very intuitive! – Andrew Jan 19 '11 at 20:14
  • @probabilityislogic: surely any strictly positive parameters for the Dirichlet prior describe that all the probabilities are strictly between 0 and 1? And this argument suggests setting them to 2/m, where m is the number of categories, rather than 1: http://en.wikipedia.org/wiki/Rule_of_succession#Generalization_to_any_number_of_possibilities – onestop Jan 20 '11 at 22:32
  • @onestop - I would argue that the wiki page you describe is only considering the case where one of two possible outcomes can happen. So if we are indifferent between "strongly like" and everything else (the other three categories combined), then we are not indifferent between "strongly like" and "like". But the question as posed clearly gives four categories, not two, and from the information in the question there is no a priori reason to favour any one rating. I'll have another think about it, because the issue may be more subtle (e.g. why not use the reference prior?). – probabilityislogic Jan 21 '11 at 04:44
  • Okay, I've had a bit of a think: basically, using the $Dir(k,\dots,k)$ prior gives a posterior of $Dir(k+n_1,\dots,k+n_m)$. The [wiki page](http://en.wikipedia.org/wiki/Dirichlet_distribution#Aggregation) shows that when collapsing categories, you just add the parameters. This means that collapsing to 2 categories gives you a Beta posterior. So the denominator will be the sum of all Dirichlet parameters, $n_1+\dots+n_m+mk = n+mk$, and the numerator will be the sum over all the categories labelled "success", which will be of the form $s+ck$ where $c$ = number of categories for "success". – probabilityislogic Jan 21 '11 at 06:34
  • ... continuing, this means the probability of success is given by $\frac{s+ck}{n+mk}$. So setting $k=\frac{2}{m}$ gives $\frac{s+\frac{2c}{m}}{n+2}$. Now if m is "evenly divided" as the wiki page says, then 2c=m and we are left with the original rule of succession. So what does this mean for the uniform prior (k=1)? To me, it means that mere knowledge of the existence of more than two possible categories (that we consider them possible is essential to this argument) represents important information about an equal aggregation of them. more later – probabilityislogic Jan 21 '11 at 06:47
  • ...continuing again... So what state of knowledge does the constant $\frac{2}{m}$ represent? Here's what I think. It represents that we only know that one category in the "success" labeled group and one category in the "failure" group are possible; for the remaining $m-2$ categories we are not even sure they are possible, and will only consider them possible after observing them. This easily generalizes to a new constant $\frac{q}{m}$ where there are $q$ categories possible and $m-q$ categories where we are not sure of even the possibility that they exist. – probabilityislogic Jan 21 '11 at 06:55
  • ...continuing again (apologies for the lengthy "comment")... Note that changing from $2$ to $q$ changes the probability to $\frac{s+\frac{q}{m}}{n+q}$. One thing I have wondered with this argument is: do we need to specify which categories within "success" are the ones we are willing to assume are possible a priori? Or can we just say "one of them, but not sure which" and then spread this out "evenly" over the aggregated categories? If we can, then this gives an interpretation of the Jeffreys prior as assuming that we consider only one outcome possible, but are not sure which one, i.e. $q=1$. – probabilityislogic Jan 21 '11 at 07:07
  • Apologies, I made a small error in the above comment, the probability should be $\frac{s+\frac{cq}{m}}{n+q}$ instead of $\frac{s+\frac{q}{m}}{n+q}$. – probabilityislogic Jan 21 '11 at 08:10
  • @probabilityislogic: Very, very informative! What if providing any rating is considered "success"? I presume the same logic holds, no? Also, do you have any insight on how to perform such a calculation if, instead of "like" and "dislike" etc., the rating is any numerical value on a scale from, say, 0-9, including fractions? Does some sort of integral apply? – xuinkrbin. Jan 08 '14 at 16:28
3

I'd take a graphical approach. The x-axis could be average rating and the y could be number of ratings. I used to do this with sports statistics to compare the contribution of young phenoms with that of veteran stars. The nearer a point is to the upper right corner, the closer to the ideal. Of course, deciding on the "best" item would still be a subjective decision, but this would provide some structure.

If you want to plot average rating against another variable, then you could represent the number of ratings as a third variable using bubble size, in a bubble plot (e.g., in Excel or SAS).
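As an illustrative sketch of this graphical approach in Python (the item names and ratings are made-up data, and matplotlib is assumed to be available):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical items, each with a list of ratings on the -5/-2/+2/+5 scale
items = {
    "old photo": [2, 2, -2, 2, 5, 2, 2, -2, 2, 2],  # many mediocre ratings
    "new photo": [5, 5, 5],                          # few but excellent
    "mixed":     [-5, 5, 2, -2],
}

avg    = {name: sum(r) / len(r) for name, r in items.items()}
counts = {name: len(r) for name, r in items.items()}

# Scatter plot: x = average rating, y = number of ratings.
# Points nearer the upper-right corner are closer to the ideal.
fig, ax = plt.subplots()
ax.scatter([avg[n] for n in items], [counts[n] for n in items])
for name in items:
    ax.annotate(name, (avg[name], counts[name]))
ax.set_xlabel("average rating")
ax.set_ylabel("number of ratings")
fig.savefig("ratings_scatter.png")
```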

rolando2