I am investigating the sentiment of television viewing, armed with a matrix of sentiment scores like this:
#                       sad happy angry surprised disgusted
# A Word                  1     0     0         0         0
# Backstrom               0     1     0         0         1
# Good Witch Hallmark     0     0     0         0         0
# Shark Tank              0     0     1         1         0
# Above the Rim           0     0     1         0         0
# O'Reilly Factor         0     0     0         0         0
# Jack the Giant Slayer   0     1     1         1         0
# Late Night Snack        0     0     1         0         1
# Outlander               0     0     1         0         0
# Cake Wars               0     0     0         0         0
Each show has a 1 if it includes the sentiment and a 0 if it doesn't. I can match this to an individual's viewing history, but there is a problem: if the person did not watch many shows, their averages can look very extreme.
In this example, an individual watches only one show, 'Shark Tank'. Of all the shows they watch, 100% indicate 'anger'. On average, it would appear that this person really loves shows with anger, but that figure is skewed by the fact that they only watched one show.
x1 <- x[4,,drop=FALSE]
x1
#            sad happy angry surprised disgusted
# Shark Tank   0     0     1         1         0
Another individual watched all of the shows but one. This individual would get a 5/9 rating for anger. It would appear that they like angry shows less than the first viewer does, even though their rating rests on nine shows rather than one.
x2 <- x[1:9,]
x2
#                       sad happy angry surprised disgusted
# A Word                  1     0     0         0         0
# Backstrom               0     1     0         0         1
# Good Witch Hallmark     0     0     0         0         0
# Shark Tank              0     0     1         1         0
# Above the Rim           0     0     1         0         0
# O'Reilly Factor         0     0     0         0         0
# Jack the Giant Slayer   0     1     1         1         0
# Late Night Snack        0     0     1         0         1
# Outlander               0     0     1         0         0
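The two raw proportions being compared can be sketched as follows. This is only an illustration: the `angry` vector is copied by hand from the matrix above, and the names `viewer1`, `viewer2`, `raw1`, and `raw2` are hypothetical.

```r
# 'angry' column, copied by hand from the matrix shown above
angry <- c("A Word" = 0, "Backstrom" = 0, "Good Witch Hallmark" = 0,
           "Shark Tank" = 1, "Above the Rim" = 1, "O'Reilly Factor" = 0,
           "Jack the Giant Slayer" = 1, "Late Night Snack" = 1,
           "Outlander" = 1, "Cake Wars" = 0)

viewer1 <- "Shark Tank"           # hypothetical viewer: one show
viewer2 <- names(angry)[1:9]      # hypothetical viewer: all but "Cake Wars"

raw1 <- mean(angry[viewer1])      # 1 of 1 shows tagged angry
raw2 <- mean(angry[viewer2])      # 5 of 9 shows tagged angry

c(viewer1 = raw1, viewer2 = raw2)
```

The raw mean ignores how many shows stand behind each number, which is exactly why the one-show viewer looks more extreme.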
I researched how sites like Yelp and TripAdvisor do weighted restaurant ratings. If a restaurant gets one rating, it does not carry as much weight as a restaurant with 1000 ratings. But that Bayesian paradigm wouldn't seem to work here because the analogous 'rating' is just a one or zero.
It reminds me of this famous beta distribution answer. Is it possible to apply a beta distribution approach to viewing, just as it is applied to a batting average?
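One way to read the batting-average idea here is as a beta-binomial update: treat each watched show as a trial that counts as a hit when the show carries the sentiment, start from a Beta(alpha0, beta0) prior, and report the posterior mean (alpha0 + hits) / (alpha0 + beta0 + n). The following is a minimal sketch under that reading, not a settled method. The helper `shrunk_rate()` is hypothetical, and the Beta(5, 5) prior is an assumption taken from the 5 of 10 shows tagged 'angry' in the matrix above.

```r
# Hypothetical helper: posterior mean and 95% credible interval for a
# viewer's rate, given `hits` sentiment shows out of `n` watched and an
# assumed Beta(alpha0, beta0) prior.
shrunk_rate <- function(hits, n, alpha0, beta0) {
  a <- alpha0 + hits          # prior hits plus observed hits
  b <- beta0 + (n - hits)     # prior misses plus observed misses
  c(mean  = a / (a + b),
    lower = qbeta(0.025, a, b),
    upper = qbeta(0.975, a, b))
}

# Assumed prior for 'angry': 5 of the 10 shows carry it
v1 <- shrunk_rate(hits = 1, n = 1, alpha0 = 5, beta0 = 5)  # Shark Tank only
v2 <- shrunk_rate(hits = 5, n = 9, alpha0 = 5, beta0 = 5)  # nine shows

rbind(viewer1 = v1, viewer2 = v2)
```

Under this assumed prior the one-show viewer moves from a raw rate of 1 to a posterior mean of 6/11, and the nine-show viewer from 5/9 to 10/19, so the two estimates end up much closer together. A weaker or stronger prior would change how far each one is pulled.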
Data
set.seed(45)
x <- matrix(sample(0:1, 50, replace=TRUE, prob=c(0.6, 0.4)), nrow=10)
dimnames(x) <- list(c("A Word",
"Backstrom", "Good Witch Hallmark", "Shark Tank", "Above the Rim",
"O'Reilly Factor", "Jack the Giant Slayer", "Late Night Snack",
"Outlander", "Cake Wars"), c("sad", "happy", "angry", "surprised",
"disgusted"))
With 1 of 10 hits among the population of shows, I was going to use that as the prior. So for that category alpha = 1 and beta = 9, and the prior mean is alpha / (alpha + beta), or 0.1. The risk is that my population estimates are based on the entire show schedule rather than on what viewers have actually watched. – Pierre L Dec 13 '16 at 16:37
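The prior described in this comment could be built for every category at once from the column totals of the show matrix. What follows is a rough sketch under that assumption. The matrix is retyped from the question, and `alpha0`, `beta0`, `watched`, `hits`, and `post_mean` are illustrative names.

```r
shows <- c("A Word", "Backstrom", "Good Witch Hallmark", "Shark Tank",
           "Above the Rim", "O'Reilly Factor", "Jack the Giant Slayer",
           "Late Night Snack", "Outlander", "Cake Wars")
x <- matrix(c(1,0,0,0,0,  0,1,0,0,1,  0,0,0,0,0,  0,0,1,1,0,  0,0,1,0,0,
              0,0,0,0,0,  0,1,1,1,0,  0,0,1,0,1,  0,0,1,0,0,  0,0,0,0,0),
            nrow = 10, byrow = TRUE,
            dimnames = list(shows, c("sad", "happy", "angry",
                                     "surprised", "disgusted")))

# Prior per category from the whole schedule (the assumption in the comment)
alpha0 <- colSums(x)              # shows that carry the sentiment
beta0  <- nrow(x) - alpha0        # shows that do not
prior_mean <- alpha0 / (alpha0 + beta0)

# Update with one viewer's history; here the single-show viewer
watched <- "Shark Tank"
hits <- colSums(x[watched, , drop = FALSE])
n    <- length(watched)
post_mean <- (alpha0 + hits) / (alpha0 + beta0 + n)

round(rbind(prior = prior_mean, posterior = post_mean), 3)
```

If the full schedule feels too strong or too weak as a prior, `alpha0` and `beta0` can both be multiplied by a common factor, which keeps the prior mean but changes how much weight it carries against the viewer's own counts.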