
I have a dataset of ambulance response times before and after a quality improvement change, and I'm looking to see if there is a difference in response times between the two periods. Specifically, I want to report the 90th percentile values in the two groups (Before and After), the difference between their two 90th percentiles, the $95\%$ confidence interval (CI) around that difference, and a $p$-value.

In R, the dataset might look like this:

set.seed(123)
data <- data.frame(
  Group = sample(c("Before", "After"), 100, replace = TRUE),
  ResponseTime = rnorm(100, mean = c(10, 15), sd = 2)
)

I can easily test the difference of means using the t.test function:

t.test(data$ResponseTime[data$Group == "Before"],
       data$ResponseTime[data$Group == "After"])

I can also calculate the 90th percentiles like so:

before <- quantile(data$ResponseTime[data$Group == "Before"], probs = 0.9, na.rm = TRUE)
after  <- quantile(data$ResponseTime[data$Group == "After"],  probs = 0.9, na.rm = TRUE)

But I can't figure out how to compare the two.

My questions:

  1. Does this even make sense?
  2. If it does make sense, what test would I use to compare the 90th percentiles?
  • A bootstrap and a permutation test will get you a CI and P-value respectively, without many distributional assumptions. Inference towards the tails of a distribution is always a bit less reliable compared to some central measure; how big is your sample? – PBulls Dec 17 '23 at 05:24
  • total sample is 28,981. 14,551 in the After, 14,430 in the before groups. – Jeff Jarvis Dec 17 '23 at 05:39
  • With times I'd be inclined to consider ratios rather than differences; it's a lot easier to save 5 minutes on a 1 hour time than on a 10 minute time. Meanwhile, you can't do it at all with a 4 minute time. ... but you could maybe save say 8% in each of those cases. – Glen_b Dec 17 '23 at 07:07
  • Can you plot the (combined) distribution? It may be possible to estimate full distribution parameters. – seanv507 Dec 17 '23 at 09:18
  • An interesting alternative to the bootstrapping method could be a Fiducial approach where we might consider probabilities that specific percentiles in the sample are the 90th percentile of the true distribution and based on that compute a Fiducial distribution for the difference between the percentiles. – Sextus Empiricus Dec 18 '23 at 15:00
  • A confidence interval for a percentile is called a tolerance limit -- that's worth investigating. (The two endpoints of the CI are different kinds of tolerance limits.) A nonparametric confidence interval (which might be a good choice here) is described at https://stats.stackexchange.com/questions/99829. A similar approach (based on rank statistics of the combined dataset) provides an excellent nonparametric test of equality of any two percentiles. @PBulls Usually both the bootstrap and permutation tests can be computed using combinatorics: subsampling isn't even needed. – whuber Dec 19 '23 at 20:00

5 Answers


You can use a bootstrap to generate a confidence interval on this quantity, and a permutation test for the P-value. Let's go over an example for both.

I'm not going to assume your response times are normally distributed, and I'm going to use a smaller sample than your actual dataset of $n=28{,}981$. The large sample size will work to your advantage for inference but will make these methods more computationally heavy than they already are.

To show you what we'll work with for inferring the difference in 90th-percentile response times ('Before' minus 'After', which is the sign convention the code below computes):

n <- 220
set.seed(1)
Group <- sample(1:2, size=n, replace=TRUE)
Resp <- rexp(n, rate=1/c(10, 15)[Group])

data <- data.frame( G = c("Before", "After")[Group], R = Resp )

[Figure: theoretical cumulative distribution functions of the simulated 'Before' and 'After' response times]

We know the exact quantity that we're trying to estimate here: qexp(.9, 1/10) - qexp(.9, 1/15) $=-11.51$; you don't have this luxury in your real data. The assumed distribution has a pretty flat cumulative distribution function towards the top, however, so the upper quantiles are going to be quite susceptible to sampling effects (read: they will vary from draw to draw).

The bootstrap is easy enough to implement manually, but let's use boot.

resamples <- 1E4
## Function to calculate statistic of interest
## Relies on reference level sorting *last*!
qdiff <- \(d) diff(tapply(d$R, d$G, quantile, probs=.9))

## Wrapper for boot
bootfun <- function(data, i) qdiff(data[i,])

bootstrap <- boot::boot(data, statistic = bootfun, R = resamples)
boot::boot.ci(bootstrap, type = c("perc", "basic", "norm"))
> Intervals :
> Level      Normal              Basic              Percentile
> 95%   (-17.94,  -0.86 )   (-16.29,  -0.21 )   (-20.98,  -4.90 )
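
A minimal manual version of the same percentile bootstrap, as a sketch (it reuses data, resamples and qdiff from above, and only reproduces the raw percentile interval, not boot's other interval types):

boot_stats <- replicate(resamples, {
  idx <- sample(nrow(data), replace = TRUE)  ## resample rows with replacement
  qdiff(data[idx, ])                         ## recompute the quantile difference
})
quantile(boot_stats, probs = c(0.025, 0.975)) ## raw percentile 95% CI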

There is probably more than one package that'll do permutation testing for you as well, but you can also do it by hand:

nA <- sum(data$G == "After")
rows <- seq_len(nrow(data))

permute <- vapply(seq_len(resamples), \(i) {
  data$G <- c("B", "A")[1 + (rows %in% sample(rows, size=nA))]
  qdiff(data)
}, numeric(1))

permute_P <- mean(abs(permute) > abs(qdiff(data)))
permute_P
> 0.0102
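
A small aside: the plain proportion above can be exactly 0 when no permuted statistic exceeds the observed one; a common convention (Phipson & Smyth 2010, "Permutation P-values should never be zero") counts the observed statistic as one of the permutations:

permute_P_adj <- (1 + sum(abs(permute) >= abs(qdiff(data)))) / (1 + resamples)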

This is really just scratching the surface & doesn't address a few possible extensions or things to be worked out still:

  • which bootstrap confidence interval to use (they all come with trade-offs).
  • directly estimating the quantile sampling variance (requires assumption on its distribution).
  • the bootstrap CI and permutation P-value do not have the direct relation that most parametric tests have, so you might get inconsistent results (e.g. P < 0.05 but 95% CI includes 0). You could back-transform the bootstrap CI into a P-value but I wouldn't recommend it.
  • while these methods come with relatively few assumptions, they can still go awry (e.g. highly skewed distributions, non-exchangeability), though your large sample size should help. A particular point I didn't go into is that this assumes your 'Before' and 'After' samples are independent; if this is some kind of within-cluster sample (same unit measured twice), you'd do well to account for that too. In the same vein, this bootstrap sampled from both groups randomly (because their size wasn't fixed in the design either); you may want to preserve some sampling features there (see the sketch after this list).
  • as was mentioned in the comments, maybe you want to look at other quantities than absolute time difference; the same methods can be applied then.
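
On that last resampling point, boot's strata argument is one way to hold the two group sizes fixed in every resample; a sketch, reusing data, bootfun and resamples from above:

## Stratified bootstrap: resample within each group separately
bootstrap_strat <- boot::boot(data, statistic = bootfun, R = resamples,
                              strata = factor(data$G))
boot::boot.ci(bootstrap_strat, type = "perc")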
PBulls
  • (+1) Excellent answer. Flipping the axes on your graph would underline the problem even more. Naturally each graph shows the same idea with different conventions. – Nick Cox Dec 17 '23 at 15:31
  • How are these confidence intervals computed? Are they a 95% region created by using the quantiles of the bootstrapping sampling distribution? But that's not a confidence interval. – Sextus Empiricus Dec 17 '23 at 20:23
  • @SextusEmpiricus I explicitly left open which interval because it'd depend on the distribution & has ample discussion on this site already. See also the boot.ci docs or any intro text; in display order they are normal-approximate (based on $\sigma$ of bootstrap means), 'hybrid' Efron's quantiles (basic), and raw quantiles. Other options with more parametric assumptions exist. What is your definition of a CI then? I went ahead & checked 95% coverage in the first 1E4 seeds here; the raw quantile does best at just under 96% (so slightly conservative), followed by normal at <93% and basic at <90%. – PBulls Dec 18 '23 at 08:44
  • @NickCox you make a very fair point, but it felt wrong not having the horizontal axis be time (maybe I've made too many survival curves). – PBulls Dec 18 '23 at 08:51
  • Indeed; each tribal tradition usually boils down to whatever is considered the outcome or result of interest being plotted vertically. Or the convention that time is plotted horizontally is triumphant (not necessarily in Earth sciences or archaeology!). – Nick Cox Dec 18 '23 at 11:31
  • A confidence interval is an interval for which the frequency that the interval contains the true value is equal to the nominal percentage when conditioning on the true value of the parameters. A simple example of a bootstrap distribution that will never give a correct confidence interval is the estimation of the parameter $\theta$ for a uniform distribution $U(0,\theta)$ (like the German tank problem). The use of bootstrap distributions is an approximation with the assumption that the distribution is symmetrical and not changing much when the location is shifted. – Sextus Empiricus Dec 18 '23 at 14:32
  • The problem with the ambulance response times here (estimating the 90th percentile) is not exactly the same as the German tank problem (estimating the 100th percentile), and it is also about estimating a difference instead of an absolute value. So the bootstrapping approach may well give a good indication. However, I would not blindly apply it without some analysis of the type of problem and distribution, and checks whether a bootstrap distribution might give a false idea of the fiducial distribution. – Sextus Empiricus Dec 18 '23 at 14:38

In R, the quantreg package has an rq function that can handle this.

library(quantreg)
library(tidyverse)

set.seed(123)

Here I made some minor modifications to your data, so the expected q90 difference is 5.0:

Group = sample(c("Before", "After"), 100, replace = TRUE)
data <- data.frame(
  Group = factor(Group, levels = c('Before', 'After')),
  ResponseTime = if_else(
    Group == 'Before',
    rnorm(100, mean = 10, sd = 2),
    rnorm(100, mean = 15, sd = 2)
  )
)

model <- rq(ResponseTime ~ Group, data, tau = 0.90)
summary.rq(model, se = 'boot')

The result is:

Call: rq(formula = ResponseTime ~ Group, tau = 0.9, data = data)

tau: [1] 0.9

Coefficients:
            Value     Std. Error  t value   Pr(>|t|)
(Intercept) 12.22249   0.87235    14.01104   0.00000
GroupAfter   4.89747   0.92706     5.28282   0.00000

For more detail, refer to the summary.rq help document.
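
If a confidence interval for the q90 difference is also wanted, one rough option (a sketch, assuming an approximately normal sampling distribution of the estimate) is to build it from the reported value and bootstrap standard error:

s  <- summary.rq(model, se = 'boot')
b  <- s$coefficients["GroupAfter", 1]  ## estimated q90 difference
se <- s$coefficients["GroupAfter", 2]  ## bootstrap standard error
b + c(-1.96, 1.96) * se                ## approximate 95% CI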


The answer by @PBulls is excellent, and I would just like to add a thought here:

What do you need the P value for, actually? The use of P values is widespread in empirical research, but even experts like statisticians, epidemiologists or psychometricians are bound to interpret them incorrectly (see for example the study quoted in Gigerenzer et al's "The Null Ritual"). An extensive list of misinterpretations is collected in Greenland et al's paper, and with your sample size of nearly 29,000 most tests will yield a "significant result" anyway, even for tiny differences. My recommendation is therefore not to bother too much with P values (short of abandoning them altogether as suggested by McShane et al) and to focus on something else instead.

If you have the possibility, in communicating your results, focus on effect sizes and variability, displayed in graphs or summarized in confidence intervals, as encouraged for example by Dunkler et al. for transplantation medicine. This will hopefully lead to discussions and further insights, like "under which circumstances was response time reduced more, and under which less so?", such that the merit of current measures to reduce response times can be discussed and improvements can be suggested.

geeeero
  • Thanks, Geeeero. I'm a physician who does some research and tries to understand and use good methodology. I agree completely with you and hate p-values. I've found that I get papers returned from peer-reviewers who insist on p-values, even when using large datasets. I make the same arguments you presented, but sometimes the nature of peer-review is that if you want to get published, you have to relent. I really do appreciate your thoughts! – Jeff Jarvis Dec 19 '23 at 04:28

Here is an approach for the computation of the intervals using a fiducial distribution. The steps are as follows:

  • Compute fiducial distributions for the two individual percentiles. This can be done by considering the probabilities that the true percentile $\theta_{90}$ is below a particular order statistic, $P(\theta_{90}<x_{(i)})$. This probability can be computed with a beta distribution (the identity is spelled out after this list).

    See also: How to obtain a confidence interval for a percentile?

  • Interpret the fiducial distributions as probability distributions (this is an approximation; the fiducial distribution does not behave exactly like a probability distribution) and compute the joint probabilities of the two parameters/percentiles being inside the 2-d bins/cells created by the grid.

  • Compute boundaries based on the distribution and consider the bounds for the difference as boundaries of regions that ensure the probability to be at least $\alpha \%$ (the computed joint probability is coarse/discrete and we cannot integrate it nicely, but we can sum the probabilities of all the cells through which the boundary passes).
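
To spell out the beta-distribution step (this matches the 1-pbeta(q, k, n+1-k) lines in the code below): the number of observations at or below the true $q$-th quantile $\theta_q$ in a sample of size $n$ is binomial, so for the $i$-th order statistic $x_{(i)}$

$$P(\theta_q < x_{(i)}) = P\big(\mathrm{Bin}(n,q) \le i-1\big) = 1 - I_q(i,\; n+1-i),$$

where $I_q(a,b)$ is the regularized incomplete beta function, computed by pbeta(q, a, b) in R.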

The image below illustrates how this goes. For two samples $x$ and $y$ we draw a grid with grid lines corresponding to the values of the sample. We compute for each cell probabilities (depicted by a color in the image, purple is more probable and yellow less probable). For a line corresponding to some constant value of $x-y$ we consider all the cells that it passes through and in addition the cells above/below it (for lower/upper bound) and add up the probability to compute an estimate for the fiducial probability. The lines in the image below correspond to a 95% fiducial interval and it is computed such that the light shaded area is at least 95%. The actual coverage will be larger because the lines cover a larger area than the light shaded area.

[Figure: the joint fiducial distribution on the grid, showing the cells that are added up for the 95% interval bounds]

Note 1: I created this approach quickly and do not know whether this has been described elsewhere before. What annoys me is that the actual coverage 0.9701 is different from the nominal coverage 0.95, even with large samples such as $n=500$. The method is an approximation, but I had hoped that the difference would have become smaller quickly with increasing sample size. I am not sure whether the discrepancy is because of the approximation errors (using discrete boundaries of the square grid) or because of the intrinsic error in the method which is the use of the fiducial distribution as if it is a true probability density independent from the sample.

Note 2: the problem with the discrete estimate of the density might be reduced by smoothing the distribution. I have left this out as it complicates the problem a lot and is not really necessary for a proof of principle.

Note 3: The computation with the matrices for the joint distribution takes a lot of time. As an alternative one can also use a Monte Carlo simulation after having obtained the one-dimensional fiducial distributions. Also, for repetitions with the same sample sizes and quantiles, the computations with the beta distributions can be stored and re-used.

Edit: I have implemented notes 2 and 3 and it appears that now for n=500 the actual coverage might be closer to the nominal coverage. This indicates that the discreteness of the fiducial distribution (not allowing to select exact 95% regions) plays a big role.

R-Code:

### function to compute fi% fiducial interval for difference in q-th quantiles 
fi_dq = function(x, y, q = 0.9, fi = 0.95) {
  nx = length(x)
  ny = length(y)
  x = x[order(x)]
  y = y[order(y)]
  kx = 1:nx
  ky = 1:ny

  ## compute one-dimensional fiducial probabilities and related probabilities
  Fx = 1-pbeta(q,kx,nx+1-kx)  ## P(theta < x_k)
  Fy = 1-pbeta(q,ky,ny+1-ky)  ## P(theta < y_k)
  px = diff(c(0,Fx,1))
  py = diff(c(0,Fy,1))

  pxy    = outer(px,py,"*")   ### cell probabilities
  diffxy = outer(x,y,"-")     ### cell values of difference

  ## compute fiducial lower boundaries
  lowerxy = rbind(diffxy[1,],diffxy)     ### add rows/columns for edges
  lowerxy = cbind(lowerxy,lowerxy[,ny])

  lowerord    = order(lowerxy)           ### order values and add up associated probabilities
  lower_diff  = lowerxy[lowerord]
  lower_F     = cumsum(pxy[lowerord])
  lower_k     = max( which(lower_F < (1-fi)/2) )  ### compute a 1-alpha/2 boundary
  lower_bound = lower_diff[lower_k]

  ## compute fiducial upper boundaries
  upperxy = rbind(diffxy,diffxy[nx,])
  upperxy = cbind(upperxy[,1],upperxy)

  upperord    = rev(order(upperxy))
  upper_diff  = upperxy[upperord]
  upper_F     = cumsum(pxy[upperord])
  upper_k     = max( which(upper_F < (1-fi)/2) )
  upper_bound = upper_diff[upper_k]

  return(c(lower_bound, upper_bound))
}

## some settings
set.seed(1)
n  = 500
q  = 0.9
fi = 0.95

## simulations to test true coverage probability
m = 10^4
set.seed(1)
bounds = sapply(1:m, FUN = function(s) {
  x = rexp(n,1/10)
  y = rexp(n,1/15)
  fi_dq(x, y, q, fi)
})

## effective coverage
true = qexp(q,1/10) - qexp(q,1/15)
mean((bounds[1,] < true)*(true < bounds[2,]))

## example from question
set.seed(123)
data <- data.frame(
  Group = sample(c("Before", "After"), 100, replace = TRUE),
  ResponseTime = rnorm(100, mean = c(10, 15), sd = 2)
)

fi_dq(x = data$ResponseTime[data$Group == "After"],
      y = data$ResponseTime[data$Group == "Before"],
      q = 0.9, fi = 0.95)
## results in -0.8429189 2.7977309

## illustration of joint fiducial distribution
set.seed(1)
n  = 100
x  = rexp(n,1/10)
y  = rexp(n,1/15)
q  = 0.9
fi = 0.95

nx = length(x)
ny = length(y)
x  = x[order(x)]
y  = y[order(y)]
kx = 1:nx
ky = 1:ny

Fx = 1-pbeta(q,kx,nx+1-kx)  ## P(theta < x_k)
Fy = 1-pbeta(q,ky,ny+1-ky)  ## P(theta < y_k)
fx = diff(Fx)
fy = diff(Fy)
pxy    = outer(fx,fy,"*")
diffxy = outer(x,y,"-")

plot(-100,-100, xlim = range(x), ylim = range(y), xlab = "x", ylab = "y")

bounds = fi_dq(x, y, q, fi)

## color polygons based on probabilities (not densities)
p_area = 0
for (l in c(2:nx)) {
  for (m in c(2:ny)) {
    area1 = sum(diffxy[l+c(-1,0),m+c(-1,0)] > bounds[1])
    area2 = sum(diffxy[l+c(-1,0),m+c(-1,0)] < bounds[2])
    p  = pxy[l-1,m-1]
    rp = (p/max(pxy))
    polygon(x[l+c(-1,0,0,-1)], y[m+c(-1,-1,0,0)],
            col = hsv(0.15+0.55*rp, 0.7-0.5*rp, 1-0.3*rp), lwd = 0.1)
    if (area1 < 4) {
      polygon(x[l+c(-1,0,0,-1)], y[m+c(-1,-1,0,0)], col = hsv(0,0,0,0.2), lwd = 0.1)
    }
    if (area2 < 4) {
      polygon(x[l+c(-1,0,0,-1)], y[m+c(-1,-1,0,0)], col = hsv(0,0,0,0.2), lwd = 0.1)
    }
    if ((area1 == 4)*(area2 == 4)) {
      p_area = p_area + p
    }
  }
}
p_area   ### manually computed check; p_area should be at least 'fi' or larger

lines(c(-10,70), c(-10,70)-bounds[1], lwd = 2, col = 1)
lines(c(-10,70), c(-10,70)-bounds[2], lwd = 2, col = 1)

Alternative code that combines note 2 and note 3, computing with Monte Carlo simulations and making the discrete fiducial distribution a piecewise-linear continuous function:

### function to compute fi% fiducial interval for difference in q-th quantiles 
### using a monte carlo approach
fi_dq_mc = function(x, y, q = 0.9, fi = 0.95, m = 10^3) {
  nx = length(x)
  ny = length(y)
  x = x[order(x)]
  y = y[order(y)]
  kx = 1:nx
  ky = 1:ny

  ## compute one-dimensional fiducial probabilities and related probabilities
  Fx = 1-pbeta(q,kx,nx+1-kx)  ## P(theta < x_k)
  Fy = 1-pbeta(q,ky,ny+1-ky)  ## P(theta < y_k)
  px = diff(c(Fx))
  py = diff(c(Fy))

  draw_xy = sapply(1:m, FUN = function(k) {
    draw_x = sample(1:(nx-1), 1, prob = px)
    draw_y = sample(1:(ny-1), 1, prob = py)
    samp_x = runif(1, x[draw_x], x[draw_x+1])
    samp_y = runif(1, y[draw_y], y[draw_y+1])
    return(samp_x - samp_y)
  })

  lower_bound = quantile(draw_xy, (1-fi)/2)
  upper_bound = quantile(draw_xy, 1-(1-fi)/2)

  return(c(lower_bound, upper_bound))
}

## some settings
set.seed(1)
n  = 500
q  = 0.9
fi = 0.95

## simulations to test true coverage probability
m = 10^3
set.seed(1)
bounds = sapply(1:m, FUN = function(s) {
  x = rexp(n,1/10)
  y = rexp(n,1/15)
  fi_dq_mc(x, y, q, fi)
})

## effective coverage; result is 0.955
true = qexp(q,1/10) - qexp(q,1/15)
mean((bounds[1,] < true)*(true < bounds[2,]))


In line with a poster above who suggested using ratios, you could graph the results as a series of ratios depending on the 'Before' response time. The x-axis would be the 'Before' response times, say in 10-minute intervals, and the y-axis would be the after/before ratios shown for each 10-minute interval. A clustered boxplot would work very well for that too, so the range, median, mean and outliers of response times could be graphed for both before and after at each 10-minute interval.

And needless to say, the response times could also depend on the distance from the ambulance station to the incident, the time of day, etc. You might try getting someone to do a GPS analysis.
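
A rough sketch of the ratio boxplot idea, on hypothetical paired data (the same unit measured before and after, which, as the comment below notes, this plan requires):

## Hypothetical paired data: one row per matched unit (e.g. station or region)
set.seed(1)
paired <- data.frame(before = rexp(300, 1/20))
paired$after <- paired$before * rnorm(300, mean = 0.9, sd = 0.1)     ## ~10% improvement
paired$bin   <- cut(paired$before, breaks = c(seq(0, 60, 10), Inf))  ## 10-minute bins
paired$ratio <- paired$after / paired$before
boxplot(ratio ~ bin, data = paired,
        xlab = "'Before' response time (10-minute bins)", ylab = "After/Before ratio")
abline(h = 1, lty = 2)  ## ratio of 1 = no change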

user115931
  • This is an interesting suggestion but how are you gonna get the response times for the y-axis and x-axis in 10-minute intervals? This requires some way to categorize the data according to additional variables and have several pairs (or larger groups) of before and after times. You suggest the use of extra variables in the final paragraph that may achieve this, but it might be good to explain this better. – Sextus Empiricus Dec 20 '23 at 11:36