75

PSY's music video "Gangnam Style" is popular: a little more than 2 months after release it has about 540 million views. I learned this from my preteen children at dinner last week, and the discussion soon turned to whether it is possible to predict how many views there will be in 10-12 days, and when (or if) the video will pass 800 million or 1 billion views.

Here is the plot of the number of views since the video was posted: (image: PSY OGS)

Here are the plots of the number of views of the No. 1 ("Justin Bieber - Baby") and No. 2 ("Eminem - Love the Way You Lie") music videos, both of which have been around for much longer: (images: Justin, Eminem)

My first thought was that the model should be an S-curve, but this doesn't seem to fit the No. 1 and No. 2 videos, and it also doesn't fit the fact that there is no limit on how many views a music video can have, only slower growth.

So my question is: what kind of model should I use to predict the number of views of the music video?

mpiktas
  • 35,099
FredrikD
  • 853
  • 23
    +1 for managing to steer the dinner table conversation from Gangnam to statistics. We need people like you! – Stephan Kolassa Oct 27 '12 at 20:22
  • 4
    What I can add to the discussion that I hope will be useful to gui11aume or others who are writing equations to try to model this, is that in the KONY example, geographic clustering was a significant aspect of the viral spreading. The fact that PSY is a Korean and then Asian phenomenon first, is an important part of the story. Not sure exactly how that would be modeled, but it might be a clue. –  Oct 27 '12 at 23:49
  • Data regarding views, comments, likes and dislikes of the video during November 2012, can be found at https://docs.google.com/spreadsheet/ccc?key=0AstJzCCxOXH1dFlhX3F2Z3dBc0xQS01ZeUpHVUt4VkE – FredrikD Nov 05 '12 at 15:02

6 Answers

40

Aha, excellent question!!

I would also have naively proposed an S-shaped logistic curve, but this is obviously a poor fit. As far as I know, the constant increase is an approximation because YouTube counts the unique views (one per IP address), so there cannot be more views than computers.

We could use an epidemiological model where people have different susceptibility. To keep it simple, we could divide the population into a high risk group (say the children) and a low risk group (say the adults). Let's call $x(t)$ the number of "infected" children and $y(t)$ the number of "infected" adults at time $t$. I will call $X$ the (unknown) number of individuals in the high risk group and $Y$ the (also unknown) number of individuals in the low risk group.

$$\dot{x}(t) = r_1(x(t)+y(t))(X-x(t))$$ $$\dot{y}(t) = r_2(x(t)+y(t))(Y-y(t)),$$

where $r_1 > r_2$. I don't know how to solve that system (maybe @EpiGrad would), but looking at your graphs, we could make a couple of simplifying assumptions. Because the growth does not saturate, we can assume that $Y$ is very large and $y$ is small, or

$$\dot{x}(t) = r_1x(t)(X-x(t))$$ $$\dot{y}(t) = r_2x(t),$$

which predicts linear growth once the high risk group is completely infected. Note that with this model there is no reason to assume $r_1 > r_2$, quite the contrary because the large term $Y-y(t)$ is now subsumed in $r_2$.
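To spell out where the linear regime comes from (this only restates the simplified system above): the first equation is logistic, so $x(t) \to X$ as $t$ grows, and the second equation then tends to a constant rate,

$$\lim_{t \to \infty} \dot{y}(t) = r_2 \lim_{t \to \infty} x(t) = r_2 X,$$

so for large $t$ the total $x(t) + y(t)$ grows approximately like $r_2 X t$ plus a constant.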

This system solves to

$$x(t) = X \frac{C_1e^{Xr_1t}}{1 + C_1e^{Xr_1t}}$$ $$y(t) = r_2 \int x(t)dt + C_2 = \frac{r_2}{r_1} \log(1+C_1e^{Xr_1t})+C_2,$$

where $C_1$ and $C_2$ are integration constants. The total "infected" population is then $x(t) + y(t)$, which has 3 parameters and 2 integration constants (initial conditions). I don't know how easy it would be to fit...
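As a rough aid for anyone who wants to play with the fit, here is a minimal R sketch that evaluates the closed-form solution above. The function name `two_group_views` and all numeric values are illustrative assumptions, not fitted estimates, and the time unit is assumed to be days.

```r
# Closed-form solution of the simplified two-group model (sketch).
# X, r1, r2 are the model parameters; C1, C2 are the integration constants.
two_group_views <- function(t, X, r1, r2, C1, C2 = 0) {
  e <- C1 * exp(X * r1 * t)
  x <- X * e / (1 + e)             # logistic growth in the high risk group
  y <- (r2 / r1) * log1p(e) + C2   # r2 times the integral of x(t)
  x + y
}

# Hypothetical values, chosen only to show the shape of the curve.
X <- 6e8; r1 <- 3.667e-10; r2 <- 0.005
C1 <- 1 / (X - 1)                  # corresponds to x(0) = 1
t <- 0:365
total <- two_group_views(t, X, r1, r2, C1)
plot(t, total, type = "l", lwd = 2, xlab = "days", ylab = "x(t) + y(t)")
```

The function is vectorised in `t`, so it could in principle be passed to a least-squares routine; for much longer horizons the `exp()` term may overflow and would need a more careful formulation.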

Update: playing around with the parameters, I could not reproduce the shape of the top curve with this model: the transition from $0$ to $600,000,000$ is always sharper than in the data. Continuing with the same idea, we could again assume that there are two kinds of Internet users: the "sharers" $x(t)$ and the "loners" $y(t)$. The sharers infect each other, the loners bump into the video by chance. The model is

$$\dot{x}(t) = r_1x(t)(X-x(t))$$ $$\dot{y}(t) = r_2,$$

and solves to

$$x(t) = X \frac{C_1e^{Xr_1t}}{1 + C_1e^{Xr_1t}}$$ $$y(t) = r_2 t+C_2.$$

We could assume that $x(0) = 1$, i.e. that there is only patient 0 at $t=0$, which yields $C_1 = \frac{1}{X-1} \approx \frac{1}{X}$ because $X$ is a large number. $C_2 = y(0)$ so we can assume that $C_2 = 0$. Now only the 3 parameters $X$, $r_1$ and $r_2$ determine the dynamics.

Even with this model, the inflection is very sharp. It is not a good fit, so the model must be wrong. That makes the problem very interesting actually. As an example, the figure below was built with $X = 600,000,000$, $r_1 = 3.667 \cdot 10^{-10}$ and $r_2 = 1,000,000$.

growth model of Gangnam style
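A curve of this kind can be redrawn with a few lines of R. This is only a sketch using the parameter values quoted above; the time unit (days) and the 200-day horizon are my assumptions, so it may not match the original figure exactly.

```r
X  <- 6e8          # size of the "sharers" group (value quoted in the answer)
r1 <- 3.667e-10    # infection rate among sharers (value quoted in the answer)
r2 <- 1e6          # chance views per time unit by "loners" (value quoted in the answer)
C1 <- 1 / (X - 1)  # from x(0) = 1
days <- 0:200      # assumed horizon, assuming the time unit is days

e <- C1 * exp(X * r1 * days)
x <- X * e / (1 + e)   # sharers: logistic growth
y <- r2 * days         # loners: linear growth, with C2 = 0
total <- x + y

plot(days, total, type = "l", lwd = 2, xlab = "days", ylab = "views")
# Time at which the logistic part reaches X / 2 (its inflection point).
t_inflect <- log(1 / C1) / (X * r1)
abline(v = t_inflect, lty = 2)
```

The dashed vertical line marks the inflection of the logistic component, which makes it easy to see how abrupt the transition is compared with the observed curve.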

Update: From the comments I gathered that YouTube counts views (in its secret way) and not unique IPs, which makes a big difference. Back to the drawing board.

To keep it simple, let's assume that the viewers are "infected" by the video. They come back to watch it regularly, until they clear the infection. One of the simplest models is the SIR (Susceptible-Infected-Resistant) which is the following:

$$\dot{S}(t) = -\alpha S(t)I(t)$$ $$\dot{I}(t) = \alpha S(t)I(t) - \beta I(t)$$ $$\dot{R}(t) = \beta I(t)$$

where $\alpha$ is the rate of infection and $\beta$ is the rate of clearance. The total view count $x(t)$ is such that $\dot{x}(t) = kI(t)$, where $k$ is the average views per day per infected individual.

In this model, the view count starts increasing abruptly some time after the onset of the infection, which is not the case in the original data, perhaps because videos also spread in a non viral (or meme) way. I am no expert in estimating the parameters of the SIR model. Just playing with different values, here is what I came up with (in R).

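# Initial susceptibles, infection rate, clearance rate, views per infected per day.
S0 <- 1e7; a <- 5e-8; b <- 0.01; k <- 1.2
views <- 0; S <- S0; I <- 1
# Extrapolate 1 year after the onset (Euler steps of one day).
for (i in 1:365) {
   dS <- -a*I*S
   dI <- a*I*S - b*I
   S <- S + dS
   I <- I + dI
   views[i+1] <- views[i] + k*I
}
par(mfrow=c(2,1))
# Top panel: the first 95 points only.
plot(views[1:95], type='l', lwd=2, ylim=c(0,6e8))
# Bottom panel: solid line for the first 95 points, dashed for the extrapolation.
plot(views, type='n', lwd=2)
lines(views[1:95], type='l', lwd=2)
lines(96:365, views[96:365], type='l', lty=2)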

Extrapolation of the views of the Gangnam style Youtube video

The model is obviously not perfect, and could be complemented in many sound ways. This very rough sketch predicts a billion views somewhere around March 2013, let's see...
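To read the crossing time off the simulation rather than off the plot, one could wrap the same Euler loop in a function and look for the first day at or above a target. This is a sketch: `simulate_views` is a hypothetical helper name, the parameter values are the ones used above, and mapping day 0 to the upload date (taken here as 2012-07-15) is an assumption, since the model's "onset" need not coincide with it.

```r
# Same Euler scheme and parameter values as the loop in the answer.
simulate_views <- function(days, S0 = 1e7, a = 5e-8, b = 0.01, k = 1.2) {
  views <- numeric(days + 1)
  S <- S0; I <- 1
  for (i in seq_len(days)) {
    dS <- -a * I * S
    dI <- a * I * S - b * I
    S <- S + dS
    I <- I + dI
    views[i + 1] <- views[i] + k * I
  }
  views
}

views <- simulate_views(365)
target <- 1e9
# Index of the first entry at or above the target; NA if it is never reached.
first_hit <- which(views >= target)[1]
# Entry j of `views` holds the count after j - 1 simulated days.
day_hit <- first_hit - 1
# Assumption: day 0 corresponds to the upload date.
date_hit <- as.Date("2012-07-15") + day_hit
print(day_hit)
print(date_hit)
```

Changing `target` to `8e8` gives the corresponding day for 800 million views under the same assumptions.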

gui11aume
  • 14,703
  • 6
    (+1) As a first approach. Note that YouTube's policy for counting views is not well understood given that they have not made their algorithm public. They only say: "A view is counted whenever someone watches a video on YouTube. We do not get more specific than this to avoid attempts at artificially inflating view counts" (see). –  Oct 27 '12 at 13:47
  • @Procrastinator Thanks for the tip. That makes it very hard to model then... – gui11aume Oct 27 '12 at 14:06
  • @gui11aume, I like the model of the high and low risk groups, but it seems like the model increases too sharply at the end. Inspired by your model, perhaps the "contagion" phase ends/phases out and then the number of views is proportional to general viewing – FredrikD Oct 27 '12 at 19:02
  • Thanks! Yes, I noticed. I think @Procrastinator raised an important issue. Both models assume that users can view the video (be infected) only once, but this is probably not correct. Your model is interesting, how would you write it? – gui11aume Oct 27 '12 at 19:09
  • FYI YouTube counts a video as watched if at least 90% is played. It doesn't go by IP address either, because sometimes whole companies or schools are behind a proxy server, and YouTube would only see that proxy's IP address. There are also services that sell YouTube views, artificially inflating view counts by making robots and botnets watch the videos. – Chloe Oct 28 '12 at 02:01
  • Using the input from @Procrastinator and others makes a better model. It probably lacks a non viral component. The model can be completed, but parameter estimation will become more and more difficult. – gui11aume Oct 28 '12 at 16:18
  • @gui11aume, agree on the complexity issue. Given the context of the question, your answer is accepted. Also, the team behind "OGS" seems to be doing their best to increase S and k (other marketing activities) and to increase the chance of mutations (dance instruction videos, own "mutations"), so it seems there are things external to what we can observe and model that impact the real number of viewers. – FredrikD Oct 29 '12 at 08:05
  • 3
    @FredrikD thanks. You can still remove the 'accept' in March 2013 if I got it wrong :D – gui11aume Oct 29 '12 at 08:08
  • @gui11aume, I checked the value today, it was almost 630M. The model predicts around 620M which is good. However, if you look at the comment stream around the video, you see that there is a kind of "viewing" frenzy. Some of the "infected" are actively aiming for a billion, i.e. they increase the k-factor in your model. – FredrikD Nov 03 '12 at 07:59
  • @FredrikD yes, so far the model seems to hold up more or less. It is surprising given that parameter estimation was arbitrary. Do you know whether we can get a day-by-day view count in numeric format somewhere? – gui11aume Nov 03 '12 at 13:52
  • @gui11aume, found this on Stackexchange, http://stackoverflow.com/a/8199756/1494569 – FredrikD Nov 03 '12 at 18:38
  • Here is gui11aume's model rewritten in Mathematica, see http://mathematica.stackexchange.com/a/14047/1635 The difference equation itself is rewritten in this answer which might make it easier to estimate the parameters – FredrikD Nov 03 '12 at 20:37
  • 2
    SIR model parameter estimation, see http://rsfs.royalsocietypublishing.org/content/2/2/156.full – FredrikD Nov 05 '12 at 15:11
  • 1
    It seems I am going to lose this one! They may hit the billion even before 2013... – gui11aume Dec 19 '12 at 18:49
  • 2
    http://www.engadget.com/2012/12/21/gangnam-style-one-billion-views/ So the world didn't end but 1 Billion views was hit today. – DanTheMan Dec 21 '12 at 20:36
5

Probably the most common model for forecasting new product adoption is the Bass diffusion model, which - similar to @gui11aume's answer - models interactions between current and potential users. New product adoption is a pretty hot topic in forecasting, searching for this term should yield tons of info (which I unfortunately don't have the time to expand on here...).
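For readers who want to see the shape of a Bass curve, here is a minimal R sketch of the standard cumulative-adoption formula. The function name `bass_cumulative` and the parameter values are hypothetical and not fitted to any data; note also that the model counts each adopter once, so it describes unique viewers rather than repeat views.

```r
# Cumulative adopters under the Bass diffusion model.
# m: market potential, p: coefficient of innovation, q: coefficient of imitation.
bass_cumulative <- function(t, m, p, q) {
  e <- exp(-(p + q) * t)
  m * (1 - e) / (1 + (q / p) * e)
}

# Hypothetical parameter values, for illustration only.
m <- 1e9; p <- 0.001; q <- 0.08
t <- 0:200
adopters <- bass_cumulative(t, m, p, q)
plot(t, adopters, type = "l", lwd = 2,
     xlab = "days", ylab = "cumulative adopters")
```

With `q` much larger than `p`, as here, the curve is S-shaped and saturates at `m`, which is the same limitation the question raises about a plain S-curve.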

Stephan Kolassa
  • 123,354
  • yes, that is also a candidate model. However, it seems like it assumes that you only can be a user once. Here, you view the video a number of times if you are "infected". – FredrikD Oct 29 '12 at 10:01
  • 1
    @FredrikD: point taken. (Though I personally didn't manage to sit even through a single "use" of this "product"...) There should be generalizations of Bass to deal with this. (Shameless plug:) Next year's International Symposium of Forecasting is in Seoul, so anyone should consider presenting his/her favorite Gangnam forecasting model there! ;-) – Stephan Kolassa Oct 29 '12 at 10:04
4

I think you need to separate phenomena like Gangnam Style, which owes much of its views to being a meme/viral thing, from Justin Bieber and Eminem, who are big artists in their own right and who would also spread widely in a traditional setting - JB or Eminem would sell a lot of singles too, I'm not sure that PSY would.

abaumann
  • 2,090
  • Good point. After reading and listening to interviews with PSY and the team behind "OGS" (Oppa Gangnam Style), it is clear that they are well aware of which buttons to press to create a viral thing. From some image analysis of the views picture above, it seems the number of views is linear up to about 90 days after launch; then PSY appears at the Korean Grand Prix and the number of views per time unit increases. – FredrikD Oct 28 '12 at 13:08
  • And how do these two classes differ from "classics" - songs that were presumably well known when they were first uploaded to YouTube (I'm thinking David Bowie)? – abaumann Oct 28 '12 at 14:08