I need to compute quartiles (Q1,median and Q3) in real-time on a large set of data without storing the observations. I first tried the P square algorithm (Jain/Chlamtac) but I was no satisfied with it (a bit too much cpu use and not convinced by the precision at least on my dataset).
I use now the FAME algorithm (Feldman/Shavitt) for estimating the median on the fly and try to derivate the algorithm to compute also Q1 and Q3 :
M = Q1 = Q3 = first data value
step =step_Q1 = step_Q3 = a small value
for each new data :
# update median M
if M > data:
M = M - step
elif M < data:
M = M + step
if abs(data-M) < step:
step = step /2
# estimate Q1 using M
if data < M:
if Q1 > data:
Q1 = Q1 - step_Q1
elif Q1 < data:
Q1 = Q1 + step_Q1
if abs(data - Q1) < step_Q1:
step_Q1 = step_Q1/2
# estimate Q3 using M
elif data > M:
if Q3 > data:
Q3 = Q3 - step_Q3
elif Q3 < data:
Q3 = Q3 + step_Q3
if abs(data-Q3) < step_Q3:
step_Q3 = step_Q3 /2
To resume, it simply uses median M obtained on the fly to divide the data set in two and then reuse the same algorithm for both Q1 and Q3.
This appears to work somehow but I am not able to demonstrate (I am not a mathematician) . Is it flawned ? I would appreciate any suggestion or eventual other technique fitting the problem.
Thank you very much for your Help !
==== EDIT =====
For those who are interested by such questions, after a few weeks, I finally ended by simply using Reservoir Sampling with a revervoir of 100 values and it gave very satistfying results (to me).
