
A very simple example would be restaurants with online reviews. On a platform where every restaurant may have a different number of reviews, each with a rating from 1 to 5, which restaurant is the best?

Obviously, one restaurant might have an average rating of 4.9 from only 10 reviews, while another might have an average of 4.7 from 10,000 reviews, and most people would probably consider the second one "better".

What I am asking is: what are possible ways to formalize this intuition? How can we measure whether an average rating is "robust"?

One idea would be to recompute the average after adding some extra 1-star and 5-star votes: the closer the new average moves toward 3, the less robust the original average was. But I'm pretty sure there are much better ways to handle this.
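A minimal sketch of that perturbation idea (the function name and the choice of 10 extra votes per end are illustrative, not part of any established method):

```r
# Perturb an average by adding k extra 1-star and k extra 5-star votes;
# the further the perturbed mean moves toward 3, the less robust the original mean.
perturbed_mean <- function(ratings, k = 10) {
  (sum(ratings) + k * 1 + k * 5) / (length(ratings) + 2 * k)
}

small <- rep(5, 10)       # 10 reviews, average 5.0
large <- rep(5, 10000)    # 10,000 reviews, average 5.0

perturbed_mean(small)     # → about 3.67 (pulled strongly toward 3)
perturbed_mean(large)     # → about 4.996 (barely moves)
```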

Any idea?

  • One possibility would be to calculate a confidence interval around the mean rating. This would be wider for restaurants with fewer ratings. Then order restaurants by the lower end of the CI. Three problems: (1) Often, ratings bump up against the upper limit, and I'm not sure whether this invalidates approximative/asymptotic CIs also at the lower end. (2) The more interesting situations are where there are few ratings, where asymptotic approximations are dubious, (3) We need to be careful about the case of a single review - maybe add a dash of Bayes. – Stephan Kolassa Feb 08 '23 at 13:52
  • I thought IMDb had figured out something like this. – Dave Feb 08 '23 at 13:55
  • Related: https://stats.stackexchange.com/questions/6418/rating-system-taking-account-of-number-of-votes and https://stats.stackexchange.com/questions/6358/weight-a-rating-system-to-favor-items-rated-highly-by-more-people-over-items-rat – Henry Feb 08 '23 at 14:08
  • The underlying question is addressed at https://stats.stackexchange.com/questions/9358. The issue is one of making trade-offs among two (or more) characteristics to create a single numeric representation of the overall "quality" or "goodness" or "value" of an object: in this case, rating and confidence in that rating. The related literature is comprehensive, showing both what is mathematically possible and practically achievable, as well as how to go about constructing such valuations. Many people have attempted ad hoc solutions out of ignorance of this literature, so beware! – whuber Feb 08 '23 at 14:53

1 Answer


One natural option is a Bayesian approach. If you assume that a priori most restaurants are about average, you can then, given the ratings, calculate the posterior probability that a given restaurant is the best/second-best/third-best/etc. in town, or simply compute a posterior mean rating.

E.g. you might have a 5-star rating system, where you can give between 1 and 5 stars. In that case, you might assign e.g. a Dirichlet(1, 2, 4, 2, 1) prior, equivalent to 10 previous ratings (1 one-star, 2 two-star, 4 three-star, 2 four-star, and 1 five-star), or whatever else matches what you might a priori believe about a new restaurant. The posterior then gets updated with each new rating by adding 1 to the count for the rating category that was given. That's one reason this approach is attractive: it's incredibly easy to implement and to update the posterior (the Dirichlet distribution is the conjugate prior for the probabilities of unordered categories, which is how, for simplicity, we are treating these ratings).
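A sketch of this conjugate update (the function name and the made-up review counts are illustrative):

```r
# Pseudo-counts for 1..5 stars, i.e. the Dirichlet(1, 2, 4, 2, 1) prior above.
# Its prior mean rating is exactly 3.
prior <- c(1, 2, 4, 2, 1)

posterior_mean_rating <- function(observed_counts, prior_counts = prior) {
  post <- prior_counts + observed_counts  # conjugate update: just add the counts
  sum(1:5 * post / sum(post))             # expected star rating under the posterior
}

# Restaurant A: 10 reviews averaging 4.9 (1 four-star, 9 five-star)
posterior_mean_rating(c(0, 0, 0, 1, 9))          # → 3.95, pulled toward the prior
# Restaurant B: 10,000 reviews averaging 4.7
posterior_mean_rating(c(0, 0, 100, 2800, 7100))  # → about 4.70, barely shrunk
```

Note how the small sample is shrunk strongly toward the prior mean of 3, so restaurant B now ranks above restaurant A despite its lower raw average.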

A more sophisticated approach is to treat this as a hierarchical model (treating the star outcome either as if it were continuous or as ordered categories), which induces shrinkage towards the average restaurant in a data-driven way; the more data a restaurant has, the more its individual rating can move away from the average. With a simple ordinal probit model, this could look as follows (to give an R example):

library(tidyverse)
library(brms)

softmax <- function(x) exp(x) / sum(exp(x))

# Simulate 100 restaurants: restaurant 1 gets 100 ratings, restaurant 100 gets 1,
# and higher-numbered restaurants tend to receive better ratings.
rating_data <- tibble(restaurant = 1:100, ratings = seq(100, 1, -1)) %>%
  mutate(stars = map2(restaurant, ratings, function(x, y) {
    sample(1:5, size = y, replace = TRUE,
           prob = softmax(c(0, (x - 50) / 10, (x - 50) / 5,
                           (x - 50) / 5, (x - 50) / 2.5)))
  })) %>%
  unnest(stars)

# Raw mean rating per restaurant, colored by the number of ratings
rating_data %>%
  group_by(restaurant, ratings) %>%
  summarize(stars = mean(stars), .groups = "drop") %>%
  ggplot(aes(x = restaurant, y = stars, col = ratings)) +
  geom_point()

# Fit the hierarchical ordinal probit model (likely needs more iterations in practice)
brmfit1 <- brm(stars ~ 1 + (1 | restaurant),
               family = cumulative(probit), data = rating_data)

# Plot each restaurant's estimated effect with a 50% posterior interval
brmfit1 %>%
  posterior::as_draws_df() %>%
  pivot_longer(cols = everything(), names_to = "param", values_to = "stars") %>%
  filter(str_detect(param, "r_restaurant")) %>%
  mutate(restaurant = as.integer(str_extract(param, "[0-9]+"))) %>%
  dplyr::select(-param) %>%
  group_by(restaurant) %>%
  summarize(median_stars = median(stars),
            stars_low = quantile(stars, probs = 0.25),
            stars_hi = quantile(stars, probs = 0.75)) %>%
  ggplot(aes(x = median_stars, xmin = stars_low, xmax = stars_hi, y = restaurant)) +
  geom_point() +
  geom_errorbarh() +
  xlab("Restaurant effect on the probit scale (0 = avg. restaurant)")

Björn