
Eric Topol posted the graphic below on Twitter, claiming that Europe is "turning COVID around" based on the trend in the past <7 days. However, while the case trend so far appears smooth, I am curious whether the precision of the trend in the tails is degraded, and therefore whether we should "trust" the claimed result that the EU is mitigating trends at this point.

[graphic from Eric Topol's tweet]

Data sources included in the higher resolution graphic below.

[higher-resolution graphic, including data sources]

  • What is the source of this data? Also, as no CIs are shown, we should take these small-scale changes with a pinch of salt... Please note that the precision of the recent trend values is not usually degraded by the smoothing itself but rather by the fact that there is a reporting lag between being infected, being tested for infection, and the test results being reported. Without any other context, a turn-around may have started but it is not very clear. – usεr11852 Nov 13 '20 at 15:02
  • @usεr11852 I pasted a subsequent graphic including the data source. Yes, seeing a CI explode in the tail would help confirm the "forecast" nature of the smoother that I'm asking about. – AdamO Nov 13 '20 at 15:03
  • The FT don't say how these data were combined. In the UK, the 7-day average of cases has unfortunately not yet taken such a visible down-turn (13/Nov/2020) based on the COVID-19 dashboard, but it has indeed slowed down. Do note though that testing is increasing too, so what we actually want are now-casting estimates rather than simple case counts; I trust the MRC Biostatistics Unit (BSU) for that. For the UK we are not out of the woods yet... :( Probably other EU countries are doing better (hopefully) and that feeds the forecast. – usεr11852 Nov 13 '20 at 15:25
  • I wouldn't be surprised if the newspapers smooth the data because it "looks nice," without attention to any statistical issues. – Sycorax Nov 13 '20 at 15:41

2 Answers

0

For daily observations $y_{t}$ the 7-day rolling average is $\frac{y_{t} + y_{t-1} + \dots + y_{t-6}}{7}$. There is an issue at the beginning of the series, since the first few points do not have enough past observations; typically you have to wait until seven observations are available.
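As a minimal sketch (using base R and made-up daily counts, not data from the question), the trailing 7-day average is undefined for the first six days:

# toy daily counts (made-up numbers, purely for illustration)
y <- c(5, 8, 12, 20, 18, 25, 30, 28, 35, 40, 38, 45)

# trailing 7-day rolling average; the first 6 values are NA
# because fewer than 7 observations are available there
roll7 <- stats::filter(y, rep(1/7, 7), sides = 1)
roll7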

This is a crude metric even for ordinary time series analysis, let alone for drawing conclusions about a problem as complicated as a pandemic.

0

About the edges

We can reproduce the image with the code below that obtains data from the European Centre for Disease Prevention and Control (ECDC) (they even provide a script on their website to download it directly into R).

reproduced image

There are two issues near the edges.

  1. When I use a rolling average I often use the Savitzky-Golay filter (which is a bit more general). For a filter of length 7 it uses the values $$x_{k-3},x_{k-2},x_{k-1},x_{k}, x_{k+1}, x_{k+2}, x_{k+3}$$ This is indeed problematic at the edges, because there the filter requires data beyond the edges that do not exist.

    You can deal with this in two ways.

    • You cut off the values at the ends.
    • You extrapolate the filter at the end. In the image below you can see this happening: near the end the smooth curve turns into a straight line. The Savitzky-Golay algorithm in R simply takes the values of the filter as fixed for the last 4 points. I have plotted two different filters: one is zero-order and based on averaging (which gives a flat line); the other is based on a linear fit (which is effectively also a moving average if the points are equally spaced, but near the end-points the extrapolation differs, as it follows a line at an angle). A small toy example of this edge behaviour is given after this list.
  2. The data in the last period might not be up to date, which can give a false idea about the trend. The data show reported cases, not the truly actual cases. In the last week or two some data might be incomplete, which is not the case for the earlier weeks.

    (Actually, besides this reporting effect there is much more going on with these COVID-19 reports, so I am not saying that solving this issue removes all problems with interpreting the trend. The data from March and April are not directly comparable to the data from October and November.)

    A comparable effect is seen in data about posts on Stack Exchange. The image below is from a meta post (We have a very large & widening gap between questions and answers. How do we fix it?), which shows the ratio of answered questions as a function of the score of the question (each curve is a different score) and the date (x-axis). In the last year you see a drop in the yellow curve. This is not because there was a sudden change in the answering rate of such questions in the last year; instead, it is because after one year the system removes/deletes unanswered questions with low scores.

    comparing answer ratios for questions with different scores

I consider this second point to be more problematic than the first. You can see that the first effect only influences the last three days and is barely visible.
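To illustrate the first point in isolation, here is a small self-contained sketch (it assumes the signal package is installed; the noisy growth curve is made up) comparing the zero-order and first-order Savitzky-Golay filters near the end of a series:

library(signal)

set.seed(1)
x <- 1:60
y <- exp(x/20) + rnorm(60, sd = 0.5)   # made-up growth curve with noise

flat  <- sgolayfilt(y, p = 0, n = 7)   # zero-order: local averaging
slope <- sgolayfilt(y, p = 1, n = 7)   # first-order: local linear fit

plot(x, y, col = 8, type = "l")
lines(x, flat)              # flattens out over the last few points
lines(x, slope, lty = 2)    # follows a line at an angle near the edge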

By the way, growth curves like these (which change multiplicatively) are often easier to interpret on a logarithmic scale. Then you would get the following graph:

logarithmic scale
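As a sketch of how such a plot could be made (it reuses the cases and first_10 objects created in the code section below, and assumes all plotted counts are positive), the essential change is the log = "y" argument:

# same curve as in the code below, but on a logarithmic y-axis
idx <- first_10:length(cases)
plot(idx, cases[idx], log = "y", type = "l", col = 8,
     xlab = "day index", ylab = "detected/reported cases (log scale)")
lines(idx, signal::sgolayfilt(cases, p = 0, n = 7)[idx])   # 7-day moving average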

About the simplistic comparison

Also note that these comparisons on a small timescale are not very meaningful. The curve goes up and down a lot on short time scales (mostly due to differences in reporting during the weekends), and you may get an occasional dip or peak. The curve in the image from the New York Times has a very sharp bend. This might be due to some extreme smoothing (maybe they used a higher-order Savitzky-Golay filter), or possibly they picked up a few days with lower reporting.

The media are very eager to report on these short timescale trends, place them out of context, and blow up the story. They make money by doing that.

Edward Tufte has a good example on pages 74-75 of his book "The Visual Display of Quantitative Information". It is about traffic deaths in Connecticut, compared over just two years versus a longer stretch. You can see it being discussed on his blog.

Code

# these libraries need to be loaded
library(utils)

# read the dataset into R; the dataset will be called "data"
data <- read.csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv",
                 na.strings = "", fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)

countries <- c('Austria','Belgium','Bulgaria','Croatia','Cyprus','Czechia','Denmark',
               'Estonia','Finland','Germany','Greece','Hungary','Iceland','Ireland',
               'Italy','Latvia','Liechtenstein','Luxembourg','Malta','Netherlands',
               'Norway','Poland','Portugal','Romania','Slovakia','Slovenia','Spain',
               'Sweden','United_Kingdom')

#View(data[data$countriesAndTerritories == "Croatia",])

### combine all the cases for the European countries

lc <- length(countries)

M <- c()          ### matrix with the cases from each country in each column
first_day <- c()  ### vector with the first day for each country (to verify the data)
                  ### (the "first" day is actually the last day, as the data are in reverse order)
length <- c()     ### vector with the length of each country's series (to verify the data)

for (i in 1:lc) {
  ### info about the date
  fd <- data$dateRep[data$countriesAndTerritories == countries[i]][1]
  l <- length(data$dateRep[data$countriesAndTerritories == countries[i]])
  first_day <- c(first_day, fd)
  length <- c(length, l)
}

### in this example, executed on 19/11/2020, Spain is missing a day,
### so we shift it by one and add a 0
pre <- rep(0, lc)
pre[lc-2] <- 1    # Spain

### we add zeros for the vectors that are shorter
post <- max(length) - length - pre

for (i in 1:lc) {
  ### extract the data for each country
  m <- data$cases[data$countriesAndTerritories == countries[i]]
  mcorr <- c(rep(0, pre[i]), m, rep(0, post[i]))
  M <- cbind(M, mcorr)
}
M <- M[-1,]       ### remove the first row because Spain has a zero there

cases <- rev(rowSums(M))

### extract points for plotting dates on the axis
dates <- data$dateRep[data$countriesAndTerritories == countries[1]]
dates <- rev(dates[1:max(length)][-1])   ### cut to the length of the data and remove 1

date_x <- which(dates %in% c("01/03/2020","01/04/2020","01/05/2020","01/06/2020","01/07/2020",
                             "01/08/2020","01/09/2020","01/10/2020","01/11/2020"))
labels <- c("Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov")

### extract days since first day with 10 cases
first_10 <- min(which(cases >= 10))
axis_length <- floor((max(length-1) - first_10)/20)
f10labs <- c(0:axis_length)*20
f10labs[c(0:axis_length) %% 5 != 0] <- ""

### plotting
par(mar = c(8,5,1,1), mgp = c(3,1,0))
plot(first_10:(max(length)-1), cases[first_10:(max(length)-1)],
     xlab = "", ylab = "detected/reported cases",
     xaxt = "n", yaxt = "n", type = "l",
     xlim = c(first_10, max(length)), bty = "n", col = 8)

axis(2, at = c(0,50,100,150,200)*1000,
     labels = c("0", "50k", "100k", "150k", "200k"), las = 2)
axis(1, at = date_x, labels = rep("", length(date_x)))
axis(1, at = date_x + 15, labels = labels, tck = 0)
axis(1, at = first_10 + c(0:axis_length)*20, labels = f10labs, line = 3)
axis(1, at = first_10 + 0.5*(max(length) - first_10),
     labels = c("days since first day with 10 cases"), tck = 0, line = 4.5)

### zero-order Savitzky-Golay filter (7-day moving average)
rolling1 <- signal::sgolayfilt(cases, p = 0, n = 7)
lines(1:length(cases), rolling1)

### first-order Savitzky-Golay filter (local linear fit)
rolling1 <- signal::sgolayfilt(cases, p = 1, n = 7)
lines(1:length(cases), rolling1, lty = 2)