
I collected pulse rate over the same time period by two different methods. My hypothesis is that the two different methods are statistically the same. My data looks like this:

[Image: table of paired pulse rate measurements from Method 1 and Method 2]

  • Welcome to CV, John in Solon! So you need a pairwise comparison method - I suppose the data have been collected from the same individual, right? Two questions: 1) what do you mean by "are statistically the same"? I assume that you want to confirm that they would give the same mean for a given situation? 2) can you assume that the measurement error of the two methods is independent for any given point in time? – Ute Jul 26 '23 at 17:31
  • Yes, the data is collected from the same individual at the same time, so I'm thinking I need to use a paired T-test? The measurement error is independent; the two methods measure completely different factors. – John in Solon Aug 01 '23 at 19:14
  • Are measurement errors using the same method independent, or would you expect roughly the same error in adjacent time points? Check the time series of differences between the measurements for autocorrelation; the effective sample size might be smaller than the actual number of measurements. If not corrected, you get deceptively small p-values (I reckon you have really many measurements). – Ute Aug 02 '23 at 02:28

1 Answer


Since you clearly have paired data, the easiest way is to calculate the differences between measurement 1 and measurement 2 and test whether their mean is 0, or better: report a confidence interval. With a large dataset and presumably bounded measurement differences, it seems justified to use a z-test or a t-test (a minimal sketch follows the list below). However, you should address two potential problems:

  1. correlation between measurement errors close in time, and
  2. rounded measurements (avoid if possible)
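
For the basic paired analysis, a minimal sketch (method1 and method2 are placeholder names for your two measurement series, not from the question) looks like this:

# differences between the two methods, one value per time point
# (method1 and method2 are hypothetical numeric vectors of equal length)
d <- method1 - method2
t.test(d)    # tests H0: mean difference = 0 and gives a confidence interval
# equivalently: t.test(method1, method2, paired = TRUE)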

1. Are differences independent? If not, then what?

Positive correlation between measurement differences close in time can result in highly liberal tests. This means that you get small p-values way too often (demonstrated below).

Since you have time series data taken at equally spaced time steps, you should inspect the autocorrelation function to see whether there is a correlation problem. In R, you can simply use the function acf. This gives you a plot of the autocorrelation at time lags 0, 1, 2, ..., together with critical values (dashed lines) beyond which an estimated correlation is significantly different from zero. For lag 0, the autocorrelation is naturally equal to 1.

Here is a simulation that demonstrates this; the measurements are represented as AR(1) time series with equal variance. The autocorrelation is positive and outside the critical band for lags 1 to 5 (shaded).

set.seed(1234)
n <- 1000 # number of steps
# simulate dependent data, both with mean 0
M1 <- arima.sim(model = list(ar=c(0.2)), sd = sqrt(.96), n=n)
M2 <- arima.sim(model = list(ar=c(0.6)), sd = sqrt(.64), n=n)

Mdiff <- M1 - M2
acf(Mdiff)

Simple remedy: subsampling

When you have a large dataset, just use only every $m$-th value for the analysis.
You can read the necessary step size off the autocorrelation function: you should be on the safe side when you choose $m$ a bit larger than the lag where the acf drops below the critical values (dashed lines). If you are very concerned about throwing data away, have a look at more sophisticated corrections via the effective sample size, as explained by @Ben in this thread.
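
As an illustrative sketch (reusing Mdiff from the simulation above, and assuming the acf falls inside the critical band after lag 5, so that $m = 6$ is a safe choice):

m <- 6                                               # a bit larger than the last significant lag
Mdiff_sub <- Mdiff[seq(1, length(Mdiff), by = m)]    # keep only every m-th difference
t.test(Mdiff_sub)                                    # t-test / confidence interval on the thinned series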

The graph below shows rejection rates obtained by simulation from the same model as before. When using the data as they are ($m=1$), you would reject the null in 24% of the cases with a test of nominal size 5%.

This is the code used to simulate the rejection rates:

# simulation of p-values, using different subsampling step sizes
nsim <- 100000
nlags <- 10 # maximum subsampling step size m
pvals <- array(numeric(nsim*nlags), c(nlags, nsim))

n <- 1000 # original data size
for (i in 1:nsim){
  M1 <- arima.sim(model = list(ar=c(0.2)), sd = sqrt(.96), n=n)
  M2 <- arima.sim(model = list(ar=c(0.6)), sd = sqrt(.64), n=n)
  Mdiff <- M1 - M2
  for(m in 1:nlags) pvals[m, i] <- t.test(Mdiff[seq(1, n, m)])$p.value
}

rejectionrate <- apply(pvals, 1, function(x) mean(x < .05))

plot(1:nlags, rejectionrate, type = "b", xlab = "m", ylab = "rejection rate")
abline(h = 0.05, col = "gray", lty = "dashed")

2. Rounded measurements

Rounding means discretizing your data and a loss of information. In the example table shown in your question, the only possible values for the differences Method 2 - Method 1 are 0 and 1. If your sample size is small, you risk that all differences are the same; in that case, a confidence interval or test would be pointless. While tests are often quite robust to rounding with respect to the rejection rate, their power can drop considerably when the true difference is small (e.g. below 0.5 when rounding to integers).
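
Here is a small illustrative simulation of this effect (not part of the original data; the true difference, spread, and sample size are made up purely for demonstration), comparing the power of the t-test on exact versus integer-rounded differences:

# power of the one-sample t-test on exact vs. integer-rounded differences
sim_p <- function(n, delta, s, rounded = FALSE) {
  x <- rnorm(n, mean = delta, sd = s)   # simulated differences
  if (rounded) x <- round(x)            # discretize to integers
  if (sd(x) == 0) return(NA)            # all values identical: t-test undefined
  t.test(x)$p.value
}

set.seed(1)
nsim <- 5000; n <- 30; delta <- 0.3; s <- 0.4
mean(replicate(nsim, sim_p(n, delta, s)) < .05)                               # power without rounding
mean(replicate(nsim, sim_p(n, delta, s, rounded = TRUE)) < .05, na.rm = TRUE) # power with rounding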

Ute