10

I'm trying to model the distribution of my cumulated reputation on one Stack Exchange site over time (that is, each data point is my running reputation total at a moment when it changes, mostly through up- or downvotes).

Here is an example of what the data looks like (the date is a timestamp in Unix time, that is, seconds since 1970-01-01 00:00:00 UTC):

reputation_history_type reputation_change creation_date cumulative_reputation
           post_upvoted                10    1689366017                    10
           post_upvoted                10    1689376446                    20
           post_upvoted                10    1689504809                    30
           post_upvoted                10    1690366268                    40
           post_upvoted                10    1690472012                    50

I'm interested in the relation between creation_date and cumulative reputation. The other columns are just given here to help you better understand the type of data.

I have been criticised for using a generalized linear model, because the variables aren't independent.

What would be a fitting model for such data? Or how do I find an appropriate one?

I'm using R.

Ben
  • 215
  • 7
  • 1
    Is this your own cumulative reputation? – Shawn Hemelstrand Mar 21 '24 at 07:06
  • 2
    @ShawnHemelstrand Yes, as I wrote: "my cumulated reputation" (emphasis added). But what does that have to do with model fitting? – Ben Mar 21 '24 at 08:08
  • The cumulative time series might be more attractive because it looks smoother, with fewer irregular bumps. But if that is the only reason you are interested in the cumulative series, then you might be better off modeling the non-cumulative data and using a moving average for plotting. Also, do you really want to fit a model? What sort of model? Do you have theories about it that you want to test? I don't believe that ARIMA is very useful if you don't have a plan for it. I would instead just create descriptive statistics. – Sextus Empiricus Mar 22 '24 at 09:12
  • @Ben possibly Shawn wants to extract the full dataset to create an example of how to model it. Also, each person's data can be different, and as a result the modelling approach differs. Modeling only your own reputation data is different from creating a more flexible model that adapts to other people's reputation as well. – Sextus Empiricus Mar 22 '24 at 09:15
  • @SextusEmpiricus I want to predict when I will have 10000 rep, assuming my behavior on the site remains the same. – Ben Mar 22 '24 at 09:16
  • In that case it seems easier to model the non-cumulative reputation and compute a prediction from that model, instead of building a model for the cumulative score directly. – Sextus Empiricus Mar 22 '24 at 09:17
  • A complicating factor is that reputation grows faster as you accumulate more posts: while your behaviour of generating new posts stays the same, the voting behaviour does not. Also, there can be random events that cause votes to occur in blocks, and the individual votes are not independent. For example, an HNQ can suddenly bring your post more attention and votes. – Sextus Empiricus Mar 22 '24 at 09:20
  • @SextusEmpiricus And what model would you recommend, assuming that my answering behavior remains consistent and the number of answers grows, but older answers get fewer new up- or downvotes? – Ben Mar 22 '24 at 09:21
  • 1
    I would start by exploring the data: look at smoothed time series and fit a simple Poisson model with a flexible seasonal pattern (e.g. several sine and cosine functions), with an additional trend in the average and amplitudes. Then extrapolate that. It is difficult to account for the randomness, though. The Poisson model is just a suggestion for creating a fit, not a statistically accurate model for computing the error in predictions. – Sextus Empiricus Mar 22 '24 at 09:24
  • Also, here you can see that StackExchange is not in a steady state: popularity and activity on the websites change. https://stats.meta.stackexchange.com/questions/6595/ https://stats.meta.stackexchange.com/questions/6546/ You might want to see which stage of evolution your StackExchange website is in, to improve the scenario for the evolution of your score. (At the same time, those are average scores and individual members may develop differently; e.g. the yearly income of individual people may rise because of raises as they advance in their careers, while the income of the total population decreases.) – Sextus Empiricus Mar 22 '24 at 09:29
  • 1
    Since you only have data for 1 year, you may also analyse the website data from other members over multiple years to see if there are seasonal patterns. By the way, the easiest approach is to just ignore that, draw a reasonable straight line through recent data, and extrapolate. That seems a good enough model for your purpose, unless you have bet a large sum of money on the exact day you reach 10000 rep. – Sextus Empiricus Mar 22 '24 at 09:37
  • @SextusEmpiricus Thank you. That was helpful. I finally understand why it makes more sense to begin looking at the uncumulated data. The distribution actually shows what might be a seasonal pattern. —— In the end, the best "model" is to just wait for the 10000 rep to happen. But that isn't as much fun as trying to calculate the date. In my first exponential GLM I calculated the end of April or beginning of May. The ARIMA and ETS, which predict a more linear progression, indicate sometime in July. I'll try your approach now, and in a few months I'll know which was the better model :-) – Ben Mar 22 '24 at 10:07
  • 1
    You can apply your modelling method to other users who have already reached 10k to see what the typical discrepancy between outcome and prediction is. That may be used to determine the best model and also to create an estimate of the potential error. I suspect that simply assuming growth equal to that of the last 1000 rep won't be too bad. Exponential models may be tricky because they may fit an initial explosive phase that doesn't continue. – Sextus Empiricus Mar 22 '24 at 10:10
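The "growth equal to the last 1000 rep" idea from the comment above can be sketched in base R as follows. This is only an illustration: the function name `eta_linear` and its arguments are hypothetical, and the series is simulated rather than real reputation data.

```r
# Hypothetical helper: estimate the date at which `target` is reached,
# assuming future growth continues at the average daily rate observed
# over (at least) the most recent `window` reputation points.
eta_linear <- function(day, cum_rep, target = 10000, window = 1000) {
  n <- length(cum_rep)
  last_rep <- cum_rep[n]
  # Latest day on which the total was still at least `window` below today's
  start_idx <- max(which(cum_rep <= last_rep - window))
  rate <- (last_rep - cum_rep[start_idx]) / as.numeric(day[n] - day[start_idx])
  list(rate = rate, eta = day[n] + ceiling((target - last_rep) / rate))
}

set.seed(7)
# Simulated day-end cumulative reputation (NOT real data)
day <- seq(as.Date("2023-07-14"), by = "day", length.out = 250)
cum_rep <- cumsum(10 * rpois(250, lambda = 2))

res <- eta_linear(day, cum_rep)
res$rate  # average reputation gained per day over the recent window
res$eta   # estimated date of reaching the target
```

Running the same helper on truncated histories of users who have already passed 10k would give a rough idea of the typical prediction error.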

2 Answers

11

This is a textbook case of a time series, so you could bring some well-developed machinery to bear.

The initial challenge is that you have an irregular series. There are far more tools available for regular series. I would suggest that you pre-process your data to contain only day-end reputation, including for days on which you don't get any new rep (on those days the cumulative rep simply equals the previous day's).
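A minimal base R sketch of this pre-processing, using the five example rows from the question. The object names (`rep_hist`, `daily`) are made up for illustration, and days are taken in UTC, which is an assumption.

```r
# Example rows copied from the question (Unix timestamps)
rep_hist <- data.frame(
  creation_date = c(1689366017, 1689376446, 1689504809, 1690366268, 1690472012),
  cumulative_reputation = c(10, 20, 30, 40, 50)
)
rep_hist <- rep_hist[order(rep_hist$creation_date), ]

# Calendar day (UTC) of each reputation event
when <- as.POSIXct(rep_hist$creation_date, origin = "1970-01-01", tz = "UTC")
event_day <- as.Date(when, tz = "UTC")

# Keep only the last event of each day: that is the day-end reputation
is_last <- !duplicated(event_day, fromLast = TRUE)
day_end <- data.frame(
  day = event_day[is_last],
  rep = rep_hist$cumulative_reputation[is_last]
)

# Regular daily grid; days without events carry the previous value forward
all_days <- seq(min(day_end$day), max(day_end$day), by = "day")
idx <- findInterval(as.numeric(all_days), as.numeric(day_end$day))
daily <- data.frame(day = all_days, cumulative_reputation = day_end$rep[idx])
```

Taking the last event per day (rather than the maximum) keeps the result correct on days where a downvote lowers the total.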

Once you have this regular series, you can start fitting standard methods, e.g.:

  • ARIMA. For a cumulative series that mostly only increases, you want to use first differences in an ARIMA(p,1,q) model, so in effect you would be modeling daily increases and adding these back together afterwards. This is also what an auto-ARIMA tool will likely recommend.

  • Exponential Smoothing. These models can capture your upward trend. Modern implementations like forecast::ets() or fable::ETS() in R can decide on the trend shape, or you can prespecify an additive trend.
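To illustrate "modeling daily increases and adding these back together afterwards", here is a sketch using only base R's `arima()`. The data are simulated, and the MA(1) order is an arbitrary choice for the example, not a recommendation; an automatic tool would pick the order for you.

```r
set.seed(1)
# Simulated stand-in for a regular day-end series (NOT real data):
# 200 days with daily gains in multiples of 10
daily_gain <- 10 * rpois(200, lambda = 1.5)
cum_rep <- cumsum(daily_gain)

# First differences of the cumulative series recover the daily gains
gains <- diff(cum_rep)

# MA(1) with a mean on the differences, which corresponds to an
# ARIMA(0,1,1) with drift on the cumulative series
fit <- arima(gains, order = c(0, 0, 1), include.mean = TRUE)

# Forecast the daily gains, then add them back onto the last observed total
h <- 30
gain_forecast <- as.numeric(predict(fit, n.ahead = h)$pred)
cum_forecast <- tail(cum_rep, 1) + cumsum(gain_forecast)
```

Fitting the mean on the differenced series is what lets the cumulative forecast keep rising; base `arima()` with `d = 1` would not include such a drift term on its own.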

In any case, you might have seasonality, e.g., intra-weekly, getting more rep on certain days of the week. Both ARIMA and Smoothing can deal with this; just tell them you have a seasonal frequency of 7.

It's probably less likely that you have intra-yearly seasonality (although some of our sites show patterns that look like more traffic as the school/college year starts). The tools above have issues with such "long" seasonalities.

In any case, you might have multiple seasonalities, with both kinds of seasonality interacting. There are specialized methods for that. I personally like simple linear regressions with Boolean dummies for days of week and something like harmonics for intra-year seasonality best here.
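A rough sketch of such a regression in base R, with day-of-week dummies and a single pair of yearly harmonics. The data are simulated, the column names are invented, and using just one sine/cosine pair is an assumption made to keep the example short.

```r
set.seed(42)
n <- 730
day <- seq(as.Date("2022-01-01"), by = "day", length.out = n)
day_index <- seq_len(n)

# Simulated daily reputation gains (NOT real data)
dat <- data.frame(
  gain = 10 * rpois(n, lambda = 1.5),
  dow  = factor(as.POSIXlt(day)$wday),  # 0 = Sunday, ..., 6 = Saturday
  s1   = sin(2 * pi * day_index / 365.25),
  c1   = cos(2 * pi * day_index / 365.25)
)

# The factor `dow` expands into Boolean dummies; s1 and c1 are the harmonics
fit <- lm(gain ~ dow + s1 + c1, data = dat)

# Predict the next 28 days using the same regressors
h <- 28
day_new <- max(day) + seq_len(h)
idx_new <- n + seq_len(h)
new <- data.frame(
  dow = factor(as.POSIXlt(day_new)$wday, levels = levels(dat$dow)),
  s1  = sin(2 * pi * idx_new / 365.25),
  c1  = cos(2 * pi * idx_new / 365.25)
)
pred <- predict(fit, newdata = new)
```

More harmonic pairs (periods of 365.25/2, 365.25/3, and so on) can be added in the same way if the yearly pattern looks more complex.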

If you have external predictors, e.g., knowing that you didn't get here a lot during certain periods of time, you could try running a regression on these and then modeling residuals with the tools above.

I would recommend that you start with simple methods like an auto-ARIMA and an automatic Smoothing tool. These are quite well developed and are often very hard to beat. We have references here: Resources/books for project on forecasting models

Stephan Kolassa
  • 123,354
3

You certainly can't use both daily change and total. Those are not only not independent, one is the sum of the other.

You could model either cumulative or daily reputation as a function of time. To do that, I'd create a variable "day" which would start at 1 and increase. You might want to add other variables too such as "day of week" (maybe divided into weekday and weekend/holiday). But maybe you are interested in the time of day as well (I'm not sure what "creation date" captures exactly --- is it date and hour and minute?). In any case, I would convert that variable into ones that are of interest to you and more readable.
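As a small illustration, the variables described above could be derived from the Unix timestamps like this in base R. The timestamps are the five from the question; the new column names are made up, and UTC is assumed for the time zone.

```r
rep_hist <- data.frame(
  creation_date = c(1689366017, 1689376446, 1689504809, 1690366268, 1690472012)
)

# Seconds since 1970-01-01 00:00:00 UTC -> date-time
when <- as.POSIXct(rep_hist$creation_date, origin = "1970-01-01", tz = "UTC")
date <- as.Date(when, tz = "UTC")
lt   <- as.POSIXlt(when)

rep_hist$day        <- as.integer(date - min(date)) + 1L  # starts at 1
rep_hist$wday       <- lt$wday                            # 0 = Sunday, ..., 6 = Saturday
rep_hist$is_weekend <- lt$wday %in% c(0, 6)
rep_hist$hour       <- lt$hour                            # hour of day, 0 to 23
```

Holidays are not handled here; they would need a separate calendar.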

But, if you've been on the particular site for a while, then I think some sort of time series analysis might be good.

And, depending on what you are interested in, a simple graph (similar to the ones that the site produces) might be what you need.

Peter Flom
  • 119,535
  • Thank you. As indicated in my question, I am interested in the cumulative reputation. I just included the reputation change (e.g. +10 for an upvote) to indicate how the cumulative reputation is calculated. The creation dates are timestamps, that is, seconds since 1970-01-01 00:00:00. It is straightforward to plot a graph of the cumulative reputation. I want to fit a model to the cumulative reputation (a) to understand whether it follows some function or is random/unpredictable and (b) to be able to make predictions about the future development of my reputation (for fun, obviously). – Ben Mar 21 '24 at 10:13