I have been given quantiles (min, 25%, med, 75%, max) for items of data, along with the size of the data n. From these pieces of information I would like to obtain a random sample of data points.

Apart from the trivial solutions where n ≤ 5, is there any way of doing this?

My attempt at a solution:

After some research, I believe my best option is to obtain a distribution from these quantiles and then use inverse transform sampling to draw n items from it. That would give me n random data points that roughly agree with the given quantiles.

However, I am struggling to find digestible reading material on how to obtain this distribution. From domain knowledge, I suspect it will be highly negatively skewed (Gumbel minimum / minimum extreme value distribution).

Here are some related threads:

Estimating a distribution based on three percentiles

Estimate distribution from 4 quantiles

https://www.johndcook.com/blog/2010/01/31/parameters-from-percentiles/

Stephan Kolassa
  • 123,354
JDraper
  • 217

2 Answers

The raw quantiles do not uniquely define a distribution, unless you have additional information, for example that the data are normal. In that case, the question is whether the quantiles are actually consistent with a normal distribution.

I would recommend that you draw

  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_0, q_{.25}]$
  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_{.25}, q_{.5}]$
  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_{.5}, q_{.75}]$
  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_{.75}, q_1]$
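In Python, this recipe could be sketched roughly as below, using only the standard library. The function name `sample_between_quantiles` and the rule for assigning leftover points when $n$ is not divisible by 4 (randomly chosen intervals get one extra point) are assumptions made for illustration, not part of the recipe itself.

```python
import random


def sample_between_quantiles(quantiles, n, seed=None):
    """Draw n points, spread as evenly as possible over the inter-quantile intervals.

    quantiles: [min, 25%, median, 75%, max] in increasing order.
    """
    rng = random.Random(seed)
    bins = list(zip(quantiles[:-1], quantiles[1:]))  # four (low, high) intervals
    base, extra = divmod(n, len(bins))
    # Assumption: when n is not a multiple of 4, the leftover points go to
    # randomly chosen intervals, one each.
    lucky = set(rng.sample(range(len(bins)), extra))
    points = []
    for i, (low, high) in enumerate(bins):
        count = base + (1 if i in lucky else 0)
        points.extend(rng.uniform(low, high) for _ in range(count))
    rng.shuffle(points)  # remove the ordering by interval
    return points


points = sample_between_quantiles([0, 2, 6, 12, 20], n=10, seed=1)
```

For large $n$ the empirical quartiles of `points` should land close to the inputs, but the sample minimum and maximum will generally fall strictly inside $[q_0, q_1]$ rather than on the endpoints.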

If your $n$ is large and the distances between the quantiles vary a lot, this may yield a somewhat "unnatural" histogram:

[Figure: histogram of xx, produced by the R code below]

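nn <- 1e6                        # total number of points to draw
quantiles <- c(0, 2, 6, 12, 20)  # min, 25%, median, 75%, max

set.seed(1)

# Draw nn/4 points uniformly within each of the four inter-quantile intervals
xx <- c(
    runif(nn/4, quantiles[1], quantiles[2]),
    runif(nn/4, quantiles[2], quantiles[3]),
    runif(nn/4, quantiles[3], quantiles[4]),
    runif(nn/4, quantiles[4], quantiles[5]))

hist(xx)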

If this is a problem for you, then you may want to prespecify a distribution, fit it to the quantiles provided and sample from the fitted distribution, as you propose in the question. Or try fitting a kernel density estimate to your quantiles and sample from that.
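As a rough sketch of the "prespecify and fit" route, take the Gumbel minimum distribution suspected in the question. Its quantile function is $Q(p) = \mu + \beta \ln(-\ln(1 - p))$, which is linear in $\mu$ and $\beta$, so an ordinary least-squares line through the interior quantiles gives a fit. The helper names are hypothetical, only the 25%, median and 75% values are used (the Gumbel has unbounded support, so min and max have no exact counterpart), and the input values are the illustrative ones from the R code, which are right-skewed and so will not match a Gumbel minimum well.

```python
import math
import random


def gumbel_min_z(p):
    """Standard Gumbel-minimum quantile: solves 1 - exp(-exp(z)) = p."""
    return math.log(-math.log(1.0 - p))


def fit_gumbel_min(ps, xs):
    """Least-squares fit of Q(p) = loc + scale * z(p) to (p, x) pairs."""
    zs = [gumbel_min_z(p) for p in ps]
    z_bar = sum(zs) / len(zs)
    x_bar = sum(xs) / len(xs)
    sxz = sum((z - z_bar) * (x - x_bar) for z, x in zip(zs, xs))
    szz = sum((z - z_bar) ** 2 for z in zs)
    scale = sxz / szz
    loc = x_bar - scale * z_bar
    return loc, scale


def sample_gumbel_min(loc, scale, n, seed=None):
    """Inverse transform sampling from the fitted Gumbel minimum."""
    rng = random.Random(seed)
    # Assumption: u is kept slightly inside (0, 1) so both logs stay finite.
    us = (rng.uniform(1e-12, 1.0 - 1e-12) for _ in range(n))
    return [loc + scale * gumbel_min_z(u) for u in us]


# Interior quantiles only; values taken from the illustrative R example.
loc, scale = fit_gumbel_min([0.25, 0.5, 0.75], [2, 6, 12])
draws = sample_gumbel_min(loc, scale, n=1000, seed=1)
```

With three quantiles and two parameters the fit is generally not exact, so it is worth comparing `loc + scale * gumbel_min_z(p)` against the given quantiles before trusting the samples.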

Stephan Kolassa
  • 123,354

With the help of the metalog distribution, it is possible to get a nice fit to a set of given quantiles. Here is some example code in Python using the metalogistic package:

import sys

from metalogistic import MetaLogistic

eps = sys.float_info.epsilon

# For numeric reasons we cannot use min=0.0 and max=1.0.
# Instead we set min=eps and max=1.0 - eps.
quantiles = [eps, 0.25, 0.5, 0.75, 1.0 - eps]
xs = [0, 2, 6, 12, 20]

metalog = MetaLogistic(cdf_xs=xs, cdf_ps=quantiles)
metalog.print_summary()

The output is

Fit method used: Linear least squares
    Distribution is valid: True
    Method for determining distribution validity: SmallMReciprocal
    Mean square error: 1.1886423615385787e-28
    a vector: [6.00000000e+00 9.01924897e-17 1.97954724e-15 2.00000000e+01
     1.60000000e+01]
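One way to read the `a` vector, assuming it follows the usual five-term metalog quantile function $Q(y) = a_1 + a_2 \ln\frac{y}{1-y} + a_3 (y - 0.5) \ln\frac{y}{1-y} + a_4 (y - 0.5) + a_5 (y - 0.5)^2$ with coefficients listed in that order: $a_2$ and $a_3$ are numerically zero, so the fit is essentially the quadratic $6 + 20(y - 0.5) + 16(y - 0.5)^2$. The sketch below checks this against the inputs and draws from it by inverse transform sampling. `metalog5_quantile` is a hypothetical helper, not part of the metalogistic package, and rounding $a_2$ and $a_3$ to zero is an assumption of the sketch.

```python
import math
import random
import sys


def metalog5_quantile(y, a):
    """Five-term metalog quantile function evaluated at probability y in (0, 1)."""
    logit = math.log(y / (1.0 - y))
    c = y - 0.5
    return a[0] + a[1] * logit + a[2] * c * logit + a[3] * c + a[4] * c * c


# Printed a vector, with the tiny a2 and a3 rounded to zero (assumption).
a = [6.0, 0.0, 0.0, 20.0, 16.0]

# The interior quantiles should come back as 2, 6 and 12.
recovered = [metalog5_quantile(p, a) for p in (0.25, 0.5, 0.75)]

# Inverse transform sampling: push uniform draws through the quantile function.
eps = sys.float_info.epsilon
rng = random.Random(0)
sample = []
for _ in range(10):
    u = min(max(rng.random(), eps), 1.0 - eps)  # keep u strictly inside (0, 1)
    sample.append(metalog5_quantile(u, a))
```

Because the derivative $20 + 32(y - 0.5)$ stays positive on $[0, 1]$, this quadratic is increasing there, which is consistent with the summary reporting a valid distribution.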

To show the distribution simply do

metalog.display_plot();

You can sample from the fitted distribution like this:

>>> metalog.rvs(size=(10,))
array([16.82096623, 13.92429775, 19.62488196,  7.96498741,  7.69518981,
    3.90956535,  0.3221118 ,  0.03950283, 12.72069787,  1.54214087])
asmaier
  • 381