Online estimation of quartiles without storing observations

Question

I need to compute quartiles (Q1, median, and Q3) in real time on a large data set without storing the observations. I first tried the P² algorithm (Jain/Chlamtac), but I was not satisfied with it: it used a bit too much CPU, and I was not convinced by its precision, at least on my data set.

I now use the FAME algorithm (Feldman/Shavitt) to estimate the median on the fly, and I have tried to adapt it to compute Q1 and Q3 as well:

def running_quartiles(stream, step):
    # `step` is a small initial value; all three estimates start from it.
    stream = iter(stream)
    M = Q1 = Q3 = next(stream)  # first data value
    step_Q1 = step_Q3 = step
    for data in stream:
        # update median M
        if M > data:
            M = M - step
        elif M < data:
            M = M + step
        if abs(data - M) < step:
            step = step / 2

        # estimate Q1 using M (only data below the running median)
        if data < M:
            if Q1 > data:
                Q1 = Q1 - step_Q1
            elif Q1 < data:
                Q1 = Q1 + step_Q1
            if abs(data - Q1) < step_Q1:
                step_Q1 = step_Q1 / 2
        # estimate Q3 using M (only data above the running median)
        elif data > M:
            if Q3 > data:
                Q3 = Q3 - step_Q3
            elif Q3 < data:
                Q3 = Q3 + step_Q3
            if abs(data - Q3) < step_Q3:
                step_Q3 = step_Q3 / 2
    return Q1, M, Q3

To summarize, it simply uses the median M obtained on the fly to split the data set in two, and then reuses the same algorithm on each half to get Q1 and Q3.

This appears to work reasonably well, but I am not able to prove it (I am not a mathematician). Is it flawed? I would appreciate any suggestions, or any other technique that fits the problem.

Thank you very much for your help!

==== EDIT =====

For those interested in such questions: after a few weeks, I ended up simply using reservoir sampling with a reservoir of 100 values, and it gave very satisfying results (to me).
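
For readers who want to try the reservoir approach, here is a minimal sketch. It is an illustration, not the code actually used: it assumes the classic "Algorithm R" replacement rule, and the class name `ReservoirQuartiles`, the seed and the stream length are hypothetical choices. The quartiles are read off the reservoir with `statistics.quantiles` from the standard library.

```python
import random
import statistics


class ReservoirQuartiles:
    """Keep a uniform random sample of fixed size from a stream (Algorithm R)."""

    def __init__(self, capacity=100, rng=None):
        self.capacity = capacity          # reservoir size (100 in the post above)
        self.reservoir = []
        self.seen = 0                     # number of observations pushed so far
        self.rng = rng or random.Random()

    def push(self, observation):
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            # Fill phase: keep everything until the reservoir is full.
            self.reservoir.append(observation)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = observation

    def quartiles(self):
        # Needs at least two stored values.
        return tuple(statistics.quantiles(self.reservoir, n=4))


rng = random.Random(0)                    # fixed seed, only for reproducibility
sampler = ReservoirQuartiles(capacity=100, rng=rng)
for _ in range(50_000):
    sampler.push(rng.gauss(0.0, 1.0))     # stand-in for the real data stream
q1, median, q3 = sampler.quartiles()
print(q1, median, q3)
```

Memory stays fixed at `capacity` values however long the stream is. The price is sampling noise: the quartiles come from only 100 points, so the estimate stops improving once the reservoir is full.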

Are you looking for a proof that Q1 and Q3 converge to the true quantiles as the number of examples increases, in a manner similar to the Markov chain analysis in the slides you linked? In terms of implementation, the above algorithm does not seem flawed (I tested approximating quantiles for a standard normal in R and the algorithm works fine). – Theja Tulabandhula, Jun 13 '14 at 15:25
@Theja thank you. I am not looking for a proof (too much work), merely advice and comments. The main problem I see is basing the computation on a running estimate of the median, as whuber has pointed out. – Louis Hugues, Jun 13 '14 at 15:33

score 6 · Answer 1 · edited May 23 '17 at 12:39

The median is the point at which half of the observations fall below and half above. Similarly, the 25th percentile is the median of the data between the min and the median, and the 75th percentile is the median of the data between the median and the max. So yes, I think you're on solid ground applying whatever median algorithm you use first on the entire data set to partition it, and then on the two resulting pieces.
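
To make the "median of each half" idea concrete on a stored batch, here is a small sketch. The function name `hinges` is hypothetical, and it assumes Tukey's convention that the median belongs to both halves when the number of values is odd. It prints the result next to the interpolated quartiles from `statistics.quantiles`.

```python
import statistics


def hinges(values):
    """Return (lower hinge, median, upper hinge) of a batch of values.

    Assumption: Tukey's convention, where the median itself is counted in
    both halves when the number of values is odd.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2                   # size of each half
    lower = data[:half]                   # min .. median
    upper = data[n - half:]               # median .. max
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))


sample = list(range(1, 11))               # 1, 2, ..., 10
lower_hinge, median, upper_hinge = hinges(sample)
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
print("hinges:   ", lower_hinge, median, upper_hinge)
print("quartiles:", q1, q2, q3)
```

On small samples the two lines need not agree exactly, since quartile definitions interpolate in different ways. The gap shrinks as the sample grows.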

Update:

This question on Stack Overflow leads to this paper: Raj Jain, Imrich Chlamtac: The P² Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations. Commun. ACM 28(10): 1076-1085 (1985), whose abstract indicates it's probably of great interest to you:

A heuristic algorithm is proposed for dynamic calculation of the median and other quantiles. The estimates are produced dynamically as the observations are generated. The observations are not stored; therefore, the algorithm has a very small and fixed storage requirement regardless of the number of observations. This makes it ideal for implementing in a quantile chip that can be used in industrial controllers and recorders. The algorithm is further extended to histogram plotting. The accuracy of the algorithm is analyzed.

answered Jun 13 '14 at 14:11 by Avraham · edited May 23 '17 at 12:39 by Community

This reply overlooks two subtle points, one unimportant but the other possibly very important. The unimportant one is that the double-splitting technique computes the upper and lower hinges, which can differ slightly from the quartiles, depending on sample sizes. The important one is that the double-splitting appears to be based on a running estimate of the median. Any variation between this estimate and the actual median will cause the hinges to vary as well. Intuitively, this should not be a problem as the amount of data grows larger, but it is an issue that needs some analysis. – whuber Jun 13 '14 at 15:07
Wouldn't directly estimating the quartiles be subject to similar issues? Direct estimation would partition the $n$ data points into a $1:3$ ratio. This partitions the elements into $2:2$ and then takes one of those "2"s and splits it $1:1$. I'm no theoretician, true, but, in general, wouldn't the two results differ by at most one position to the left or right, and converge as $n$ increases? Yes, a pathological distribution could be created, but that would suffer from direct median estimation as well. Of course, storing all the values is better. – Avraham Jun 13 '14 at 15:32
@Avraham, thanks for pointing out the paper. As I mentioned, I already tried the P-square algorithm from Jain and Chlamtac. On my data set the algorithm I described gives a better result (MSE) and is faster, so I was wondering whether it could nevertheless have some problem. As whuber remarked, the fact that it uses a running estimate is a potential problem, but I do not know whether it's really important or not. – Louis Hugues Jun 13 '14 at 15:44
Whoops, saw that and forgot it. My apologies. – Avraham Jun 13 '14 at 15:46

score 4 · Answer 2 · answered Jan 20 '20 at 23:41

A very slight change to the method you posted and you can compute any arbitrary percentile, without having to compute all of the quantiles. Here's the Python code:

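class RunningPercentile:
    def __init__(self, percentile=0.5, step=0.1):
        self.step = step
        # Asymmetric weights: down-moves (weight 1 - percentile) and
        # up-moves (weight percentile) balance on average when a fraction
        # `percentile` of the observations falls below x.
        self.step_up = 1.0 - percentile
        self.step_down = percentile
        self.x = None  # current estimate; set from the first observation

    def push(self, observation):
        if self.x is None:
            self.x = observation
            return

        # Naming caveat: step_up scales the downward move, step_down the upward one.
        if self.x > observation:
            self.x -= self.step * self.step_up
        elif self.x < observation:
            self.x += self.step * self.step_down
        # Halve the step whenever an observation lands within one step of the estimate.
        if abs(observation - self.x) < self.step:
            self.step /= 2.0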

and an example:

import numpy as np
import matplotlib.pyplot as plt

distribution = np.random.normal
running_percentile = RunningPercentile(0.841)
observations = []
for _ in range(1000000):
    observation = distribution()
    running_percentile.push(observation)
    observations.append(observation)

plt.figure(figsize=(10, 3))
plt.hist(observations, bins=100)
plt.axvline(running_percentile.x, c='k')
plt.show()

Small correction: according to the FAME paper, step is initialized as self.step = max(abs(observation/2), self.step). I think in this case, it would become self.step = max(abs(observation), self.step), since the halving is applied by the step_up/down multiples. – abroekhof, Jul 07 '21 at 21:38
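
A sketch of what that suggestion might look like in code. This is an assumption-laden illustration: the class name `ScaledRunningPercentile` is hypothetical, the initialisation rule is taken from the comment above rather than checked against the paper, and the test stream (normal data centred on 1000) is made up to show a case where the data scale is far from the default step of 0.1.

```python
import random


class ScaledRunningPercentile:
    """RunningPercentile variant whose step is scaled by the first observation.

    Assumption: the rule max(abs(observation), step) comes from the comment
    above; it has not been verified against the FAME paper.
    """

    def __init__(self, percentile=0.5, step=0.1):
        self.step = step
        self.step_up = 1.0 - percentile   # weight of a downward move
        self.step_down = percentile       # weight of an upward move
        self.x = None

    def push(self, observation):
        if self.x is None:
            self.x = observation
            # Start with a step on the order of the data's magnitude.
            self.step = max(abs(observation), self.step)
            return
        if self.x > observation:
            self.x -= self.step * self.step_up
        elif self.x < observation:
            self.x += self.step * self.step_down
        if abs(observation - self.x) < self.step:
            self.step /= 2.0


rng = random.Random(42)                   # fixed seed, only for reproducibility
estimator = ScaledRunningPercentile(0.5)
for _ in range(200_000):
    estimator.push(rng.gauss(1000.0, 50.0))
print(estimator.x)                        # running estimate of the median
```

One caveat with this rule: if the first observation happens to be close to zero, the step falls back to the default, so the scaling only helps when the data sit well away from the origin.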
