
I saw the question "Storing a probability distribution without saving single values" on Stack Exchange and thought it deserved a statistical answer.

Example Scenario

I could see this problem arising if one would like to store a percentile, like P90, at a granular level, but would also like the ability to aggregate to higher grains.

Say, for example, one has a probabilistic forecast at the hourly level: one could store the median and a few percentiles (P70, P90, etc.). However, if one then aggregates those hourly values to get daily-level percentiles, the resulting daily percentiles may not be valid.
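
A quick simulation can make this concrete (my own illustration, not part of the original question; the 24 independent lognormal hours and their parameters are purely hypothetical):

    # Illustration: summing hourly P90s is not the same as the P90 of the daily total.
    set.seed(1)
    n_paths <- 10000
    hourly <- matrix(rlnorm(24 * n_paths, meanlog = 3, sdlog = 0.5),
                     nrow = n_paths, ncol = 24)        # 24 hypothetical hourly demands

    sum_of_hourly_p90 <- sum(apply(hourly, 2, quantile, probs = 0.90))   # add the 24 hourly P90s
    p90_of_daily_sum  <- unname(quantile(rowSums(hourly), probs = 0.90)) # P90 of the daily total

    c(sum_of_hourly_p90 = sum_of_hourly_p90, p90_of_daily_sum = p90_of_daily_sum)

Here the summed hourly P90s overstate the daily P90 because the hours are not perfectly correlated; the two only agree when the hours are comonotonic.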

It seems that the most convenient thing to do statistically is to store all the samples. This is expensive but feasible at the daily grain; at the hourly grain it is too expensive to be feasible.

Possibilities

Here I make a list of the possibilities I am aware of for getting valid percentiles at both the daily and hourly grain without having to store all the samples. The constraint is that whatever these procedures produce must be storable by a database engine (so I cannot store fitted model objects); anything sophisticated would have to be precomputed and then stored in a table.

I would like to know:

a) whether I am missing something from this list

b) whether I am making proper assumptions about the items in the list

  1. Store all samples: If I store all samples at the daily grain and I know the proportion of daily demand each hour makes up, then I can multiply the daily samples by those proportions to get hourly samples and calculate percentiles from them. As stated above, this is not feasible: it is too expensive.
  2. Store bins: I know I can get bins from average shifted histograms (ASH) and KDE; in fact, any density estimation technique could be used (for an overview see Density estimation in R). The idea is that my 1000 samples would be reduced to some smaller number of bins, which I can set to just about anything I want. In the paper above the standard grid seems to be 512 points, but that does not reduce my samples much: going from 1000 to 512 is only a small reduction. Also, my ability to calculate extreme percentiles, say P99, degrades as the number of bins decreases. More reduction could be possible because we only really care about the percentiles above the median, so we could keep only the ~250 bins above the median. Still, this would be more data than I would want to store at a granular level. (A sketch of this option follows the list.)
  3. Store parameters of a functional fit: The metalog with 6 or so parameters is very flexible and accurate. The idea is to store the 6 parameters in a table. To obtain percentiles, the parameters would be plugged into the metalog quantile function to generate a sample; that sample would be multiplied by the hourly proportions, and the percentiles would be calculated from the scaled samples. The problem is that fitting the metalog is slow relative to ASH and KDE, and I am not sure whether this really fixes the problem or just pushes it downstream. (A sketch of this option also follows the list.)
  4. Take random samples of the original distribution: This feels unsafe because a random subsample is not guaranteed to be representative; there is a small probability that it would draw more from the upper tail than from the lower tail. This could be changed to taking a grid (systematic) sample, but that feels like it has the same problems as density estimation without the benefits.
  5. Store moments (mean, sd, skew, kurtosis): This would be my option of choice if the distribution could be generated accurately from these moments, but that is not my area of specialty; if there are high-fidelity ways to do this I would like to know them. I assume I can calculate the moments at the daily level and then scale them by the hourly proportions to disaggregate. Conversely, for aggregation I do not see a good solution.
  6. Store daily percentiles and adjust them with multipliers for the hourly grain: Here I would take my samples, calculate the daily percentiles, and then multiply them by the hourly proportions. I am not sure whether this is valid; I know percentiles do not aggregate (add).
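
For option 2, here is a minimal base-R sketch of the bin idea (my own illustration; the lognormal sample, the 64-point grid, and the use of density() are arbitrary choices): store a coarse density grid and later recover approximate percentiles by inverting the implied CDF.

    # Option 2 sketch: store a coarse density grid instead of the raw samples.
    set.seed(1)
    x <- rlnorm(1000, meanlog = 3, sdlog = 0.5)   # hypothetical daily samples

    d <- density(x, n = 64)                       # KDE on a 64-point grid (well below 512)
    stored <- data.frame(value = d$x, dens = d$y) # the 64 rows that would go in the table

    # Later: rebuild an approximate CDF by cumulative (rectangle-rule) integration
    # over the equally spaced grid, then invert it to read off percentiles.
    cdf <- cumsum(stored$dens) * diff(stored$value)[1]
    cdf <- cdf / max(cdf)
    approx_p90 <- approx(cdf, stored$value, xout = 0.90)$y

    c(true_p90 = unname(quantile(x, 0.90)), approx_p90 = approx_p90)

As noted above, accuracy in the far tail (P99 and beyond) degrades as the grid gets coarser.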
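
For option 3, here is a rough base-R sketch of a six-term metalog fit by ordinary least squares (my own minimal illustration of Keelin's formulation, not the rmetalog package; the sample, the plotting positions, and the 0.07 hourly proportion are hypothetical). The six coefficients are what would be stored, and percentiles are read directly off the fitted quantile function.

    # Option 3 sketch: fit a 6-term (unbounded) metalog by OLS and store its coefficients.
    set.seed(1)
    x <- rlnorm(1000, meanlog = 3, sdlog = 0.5)    # hypothetical daily samples

    n <- length(x)
    y <- (1:n - 0.5) / n                           # plotting positions (probabilities)
    q <- sort(x)                                   # matching empirical quantiles

    # Six-term metalog basis evaluated at probabilities y.
    metalog_basis <- function(y) {
      L <- log(y / (1 - y))
      cbind(1, L, (y - 0.5) * L, (y - 0.5), (y - 0.5)^2, (y - 0.5)^2 * L)
    }

    a <- qr.solve(metalog_basis(y), q)             # least-squares fit: the 6 numbers to store

    # Quantile function from the stored coefficients; scaling by a (hypothetical)
    # hourly proportion gives an approximate hourly percentile.
    metalog_q <- function(p, a) as.vector(metalog_basis(p) %*% a)
    c(daily_p90           = metalog_q(0.90, a),
      hourly_p90          = 0.07 * metalog_q(0.90, a),
      empirical_daily_p90 = unname(quantile(x, 0.90)))

A full implementation would also check feasibility, i.e. that the fitted quantile function is strictly increasing on (0, 1).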

These are all the ideas I have at this time.

Alex
  • Asked and answered (in several ways) at https://stats.stackexchange.com/questions/35220. There, the question concerned approximating a probability plot. But the ECDF is in the same form, whence you can store arbitrarily close approximations to any ECDF, often with huge compression. The solutions I posted there amount to adaptive versions of your idea (6). – whuber Dec 17 '22 at 15:26
  • Thanks for the comment. I see your code uses the normal as an illustration. Would your solution hold for claw and Cauchy? – Alex Dec 17 '22 at 15:44
  • It doesn't matter. You could always use the uniform distribution: that produces the ECDF, which contains full information about the distribution. By choosing a (continuous) reference distribution likely to capture the tail behavior of your data, you might achieve a better compression (because long stretches of the graph might be close to linear), but the technique is guaranteed to work no matter what. – whuber Dec 17 '22 at 16:56
  • Is there a reference for this: `quants <- (1 + sin(1:m / (m+1) * pi - pi/2))/2; sort(c(x[1:e], quantile(x, probs=quants), x[(n+1-e):n]))`? I could have grabbed a grid, letter values, or some other cuts. What is special about this draw? – Alex Dec 17 '22 at 17:06
  • I don't think so. I recall it was inspired by the arcsine transformation for binomial variables, but I don't believe there's any more theory to it than that. This is merely a starting guess. It might have been better just to position those quantiles evenly. You would actually do fine with a random selection of quantiles. – whuber Dec 17 '22 at 18:10
  • @whuber This is making sense. I just wanted to make one of your statements more rigorous. You said, "this technique is guaranteed to work". What are the conditions? It seems there should be something said about the number of points needed and the spread. One percentile of the ecdf would not accurately reproduce the distribution, neither would two. So, how many? Also if 1000 draws were between the 0.01 percentile and the 0.02 percentile there would not be an accuracy reproduction either. I am curious if there is some theory behind this. – Alex Dec 18 '22 at 01:59
  • @whuber also does this technique have a name? – Alex Dec 18 '22 at 02:00
  • In the GIS world this is called the Douglas-Peucker algorithm for simplifying a polyline. There are no conditions on the guarantee: in the worst case you will wind up storing all the data, which obviously reproduces the data distribution exactly. The amount of compression depends on your target accuracy, the distribution of the data, and (to a small extent) your starting values. One could obtain bounds based on the largest jumps in the data, but that would be of little interest. You're better off just monitoring the progress at each step of the algorithm. – whuber Dec 18 '22 at 15:46
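
For readers following along, here is a rough base-R sketch of the adaptive idea described in these comments (my own reading of it, not whuber's posted code): start from the two extremes, repeatedly add the quantile where a piecewise-linear approximation of the ECDF errs most, and stop at a tolerance. Only the retained (probability, value) pairs would be stored; the tolerance and point cap below are arbitrary.

    # Greedy, Douglas-Peucker-flavoured compression of an ECDF:
    # keep adding the point where the piecewise-linear approximation is worst.
    compress_ecdf <- function(x, tol = 0.01, max_pts = 100) {
      q <- sort(x); n <- length(q)
      p <- (1:n) / n                              # ECDF heights
      keep <- c(1, n)                             # always keep the extremes
      repeat {
        keep <- sort(keep)
        approx_p <- approx(q[keep], p[keep], xout = q, ties = "ordered")$y
        err <- abs(approx_p - p)
        if (max(err) < tol || length(keep) >= max_pts) break
        keep <- c(keep, which.max(err))           # add the worst-fitting point
      }
      data.frame(prob = p[keep], value = q[keep]) # what would be stored
    }

    set.seed(1)
    x <- rlnorm(1000, meanlog = 3, sdlog = 0.5)   # hypothetical daily samples
    stored <- compress_ecdf(x, tol = 0.005)
    nrow(stored)                                  # typically far fewer than 1000 rows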

0 Answers