
I am trying to make simple median and IQR calculations that involve numbers (in percentages) appearing in a range and inequalities in addition to whole numbers. My sample dataset looks like this:

5, 10, < 1, 10 - 20, 25, > 90

Any suggestions on how I can perform these two calculations on such a dataset?

Adam
  • Do you mean $>90$ for the last one? – Dave Apr 18 '22 at 11:36
  • Sorry, yes, it's supposed to be $> 90$. – Adam Apr 18 '22 at 11:42
  • 1
    Do you have counts or probabilities accompanying those values? – Tim Apr 18 '22 at 15:59
  • 2
    @Tim: this is a dataset. It has six values, of which three are intervals. Interval arithmetic permits computing quantiles, which generally would themselves be intervals. – whuber Apr 18 '22 at 18:55
  • I also suggest interval arithmetic (IA) as it (1) allows the desired calculations, and (2) assumes very little. While other complications can arise with IA, like having to solve optimization problems, you will find perhaps at worst here that IA is going to give conservative (i.e. wider than you might get with a well-chosen probability model) bounds. – Galen Apr 18 '22 at 19:45
  • What do these numbers mean: 5, 10, < 1, 10 - 20, 25, > 90? Are they each single observations, some expressed as exact numbers and some as ranges? Or do they indicate that you have observations falling inside these ranges (e.g., you know the 5th percentile, the 10th percentile, etc.)? – Sextus Empiricus Apr 19 '22 at 17:49
  • Can the inequalities be refined into bounded intervals? For example if $x < 1$ then $x \in (-\infty, 1)$, but often things cannot be unbounded in physically realistic scenarios. Can you make a reasonable inference for a finite value for the lower bound? Same goes for the $> 90$ upper bound. – Galen Apr 19 '22 at 19:47
  • 1
    @AgnesianOperator I describe and illustrate the interval arithmetic solution at https://stats.stackexchange.com/a/86120/919. – whuber Apr 20 '22 at 03:29
  • 1
    @whuber I would upvote your answer there, but I have already done that! We have a shared interest in interval arithmetic. Definitely relevant to share here though. I would recommend it to others. Hopefully the OP here checks it out (hint, hint). – Galen Apr 20 '22 at 03:34

3 Answers


In this case you can make a non-parametric estimate, although without important things like confidence intervals. You have a combination of left-censored (<1), right-censored (>90) and interval-censored (10-20) values.

Although you might not think of what you have as a survival problem, a survival function $S(t)$ is just 1 minus the corresponding (cumulative) distribution function $F(t)$ (i.e., $S(t)=1-F(t)$). The median of $F(t)$ is therefore the value at a "survival" fraction of 0.5, the first quartile the value at a survival fraction of 0.75, and so on. So you can use a survival modeling method designed to handle arbitrarily censored data to get estimates of quantiles.

The R icenReg package can calculate the Turnbull nonparametric maximum-likelihood estimate of a survival curve based on such data (a generalization of the Kaplan-Meier method for interval-censored data). That should be more generally useful than a method that requires you to pre-rank the exact and interval values.

To get a single non-parametric survival curve this way, provide a 2-column matrix with the lower and upper limits for each data point. For a known data point, those two values are identical. With your example percentage data (lower limit, 0; upper limit, 100):

library(icenReg)

## One row per observation: (lower, upper) limits, with identical
## limits for exactly observed values. Percentage scale: 0 to 100.
datMat <- matrix(c( 0,   1,    # "< 1"     (left-censored)
                    5,   5,
                   10,  10,
                   10,  20,    # "10 - 20" (interval-censored)
                   25,  25,
                   90, 100),   # "> 90"    (right-censored)
                 ncol = 2, byrow = TRUE)
datMat
##      [,1] [,2]
## [1,]    0    1
## [2,]    5    5
## [3,]   10   10
## [4,]   10   20
## [5,]   25   25
## [6,]   90  100
icTest <- ic_np(datMat)   # Turnbull NPMLE of the survival curve
plot(icTest, bty = "n")

[Plot: Turnbull "survival" estimate from the data]

I didn't change the default axis labels, so your values correspond to "time" here. $S(t)$ is the survival function for your data, although the boxes might look strange. The package vignette explains:

Looking at the plots, we can see a unique feature about the NPMLE for interval censored data. That is, there are two lines used to represent the survival curve. This is because with interval censored data, the NPMLE is not always unique; any curve that lies between the two lines has the same likelihood.

Based on the plot, you would accept the range 10 - 20, where $S(t) = 0.5$, as including the median. The IQR would span 5 to 25 (the values where $S(t) = 0.75$ and $S(t) = 0.25$, respectively).

If you have a reasonable parametric form for your data you can do much more with this type of modeling, as Frank Harrell suggests in his answer.
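The interval-arithmetic approach mentioned in the comments gives a quick cross-check that needs no survival machinery: because sample quantiles are monotone in each observation, the $p$-quantile of the lower endpoints and the $p$-quantile of the upper endpoints bracket the $p$-quantile of the true values. A sketch in Python (the 0 and 100 bounds for the censored values are the same assumption as above, and the exact numbers depend on which quantile convention you use):

```python
import numpy as np

# Each observation as (lower, upper); exact values have equal endpoints.
# "< 1" -> (0, 1) and "> 90" -> (90, 100), assuming a 0-100 percentage scale.
lower = np.array([0.0, 5.0, 10.0, 10.0, 25.0, 90.0])
upper = np.array([1.0, 5.0, 10.0, 20.0, 25.0, 100.0])

for p in (0.25, 0.50, 0.75):
    lo = np.quantile(lower, p)  # cannot exceed the true p-quantile
    hi = np.quantile(upper, p)  # cannot fall below the true p-quantile
    print(f"p = {p:.2f}: quantile lies in [{lo:.2f}, {hi:.2f}]")
```

With NumPy's default linear-interpolation convention this brackets the median in [10, 15] and pins the first quartile at 6.25 exactly; conventions that stay on order statistics (like R's type 1) give different brackets.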

EdM

If you can order the levels of the observations, you can determine the median and the 1st and 3rd quartiles.

  1. In your sample data, the observations can be ordered, except that you have to decide whether "10 - 20" is greater than "10" or whether the two are tied when ranked.

  2. The IQR itself, as the difference between the 1st and 3rd quartiles, wouldn't necessarily make sense.

  3. There are different methods to determine the value of quantiles. For example, R has 9 options ( www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile ). Some are more appropriate for non-continuous data.

  4. With discontinuous data, you may get answers like "Between 'good' and 'very good'". In your sample data, the median would be between "10" and "10 - 20", assuming those two receive different ranks.

  5. The following can be run in R (or at e.g. rdrr.io/snippets/). It assumes that "10 - 20" is greater than "10", and uses R quantile type 1, which will not return answers that straddle two levels.


Observed = c("5", "10", "< 1", "10 - 20", "25", "> 90")

Obs.factor = factor(Observed, ordered = TRUE,
                    levels = c("< 1", "5", "10", "10 - 20", "25", "> 90"))

quantile(Obs.factor, type = 1, probs = 0.50)
quantile(Obs.factor, type = 1, probs = 0.25)
quantile(Obs.factor, type = 1, probs = 0.75)
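For readers who want the same computation outside R, the type-1 (inverse-ECDF) quantile is easy to reproduce. A Python sketch (the level ordering is the assumption from step 5; because each observation occurs once here, the sorted data coincide with the level list):

```python
import math

# Observations sorted by the assumed level ordering (step 5 above).
ordered_obs = ["< 1", "5", "10", "10 - 20", "25", "> 90"]

def quantile_type1(sorted_values, p):
    """R's type-1 quantile: the smallest value whose ECDF reaches p."""
    n = len(sorted_values)
    k = max(math.ceil(n * p), 1)  # 1-based order statistic
    return sorted_values[k - 1]

print(quantile_type1(ordered_obs, 0.50))  # "10"
print(quantile_type1(ordered_obs, 0.25))  # "5"
print(quantile_type1(ordered_obs, 0.75))  # "25"
```

So under this convention the median is the "10" level and the quartiles are "5" and "25", single levels rather than straddles.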

Sal Mangiafico

If you only had "< 1" occurring, you could compute the median. In general you can't estimate what you want unless you assume a smooth parametric distribution and explicitly handle left, right, and interval censoring in computing the likelihood function, so that you can get maximum likelihood estimates of the parameters of that distribution. Then you compute the mean and quantiles, which are functions of those underlying parameters. It's pretty involved.
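To make the recipe concrete: each exact value contributes its density to the likelihood, a left-censored value contributes $F(\text{bound})$, a right-censored value contributes $1 - F(\text{bound})$, and an interval-censored value contributes $F(\text{upper}) - F(\text{lower})$. A minimal sketch in Python with scipy, using a normal distribution purely for illustration (a bounded percentage would usually call for a transformed or bounded family):

```python
import numpy as np
from scipy import stats, optimize

# The six observations, split by how they were recorded.
exact = np.array([5.0, 10.0, 25.0])
left_bound = np.array([1.0])           # "< 1": only an upper bound known
right_bound = np.array([90.0])         # "> 90": only a lower bound known
intervals = np.array([[10.0, 20.0]])   # "10 - 20": interval-censored

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)          # keeps sigma positive
    ll = stats.norm.logpdf(exact, mu, sigma).sum()        # density
    ll += stats.norm.logcdf(left_bound, mu, sigma).sum()  # F(bound)
    ll += stats.norm.logsf(right_bound, mu, sigma).sum()  # 1 - F(bound)
    ll += np.log(stats.norm.cdf(intervals[:, 1], mu, sigma)
                 - stats.norm.cdf(intervals[:, 0], mu, sigma)).sum()
    return -ll

res = optimize.minimize(neg_log_lik, x0=[20.0, np.log(20.0)],
                        method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# Quantiles follow from the fitted parameters:
q1, med, q3 = stats.norm.ppf([0.25, 0.5, 0.75], loc=mu_hat, scale=sigma_hat)
```

In R, the same fit is available through survival::survreg with Surv(lower, upper, type = "interval2"), which handles all three censoring types in one formula.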

Frank Harrell
  • 91,879
  • 6
  • 178
  • 397
  • Thanks, so there is really no straightforward way of solving this, even if the numbers represented percentages on a scale of 0% to 100%? Would you recommend taking the midpoint of such ranges to get whole numbers (e.g. using a value of 15% for the "10% - 20%" cell and 95% for the "> 90%" cell)? Perhaps these could then be used along with the other whole-number values to calculate the median and IQR? – Adam Apr 18 '22 at 12:52
  • No straightforward way, and I don't recommend assuming the midrange characterizes an interval. Best you can do easily is to compute proportions of all possible levels/intervals. – Frank Harrell Apr 18 '22 at 13:16
  • 5
    I think this answer is way too pessimistic. For instance, 10-20 clearly is the median and if Tukey's hinges would be acceptable as quartiles, they are both uniquely defined at 7.5 and 22.5. – whuber Apr 18 '22 at 18:54
  • I don't see why you can't proceed with (some) combinations of left and right censoring. The median of {x: x < 1}, 1, and {y: y > 1} is pretty clearly 1, regardless of the values of x and y. As the proportion of censored data approaches the median's breakdown point, you might not be able to provide an exact value, but you can still say something sensible: the median of (e.g., < 1, 5, < 10, > 50) is no more than 7.5. – Matt Krause Apr 19 '22 at 16:03
  • 2
    The interval censored values cause special problems if the interval is wide. Yes if you want to state the median as an interval you can stay honest in some cases without resorting to parametric distribution fits. – Frank Harrell Apr 19 '22 at 16:59