I saw the question "Storing a probability distribution without saving single values" on Stack Exchange and thought it deserved a statistical answer.
Example Scenario
I could see this problem arising if one would like to store a percentile, like P90, at a granular level, but would also like the ability to aggregate to higher grains.
Say, for example, one has a probabilistic forecast at the hourly level; one could store the median and a few percentiles (P70, P90, etc.). However, if one would like to aggregate those hourly values up to daily-level percentiles, the resulting daily percentiles may not be valid, because percentiles do not add across hours.
It seems that the most convenient thing to do statistically is to store all the samples. This is expensive but feasible at the daily grain. However, it is too expensive to be feasible at the hourly grain.
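To see why naively combining hourly percentiles fails, here is a minimal R sketch with made-up numbers (independent lognormal hourly demand, purely illustrative): the sum of the 24 hourly P90s comes out noticeably larger than the P90 of the daily total, because the hours are not perfectly correlated.

```r
set.seed(1)
n_samp <- 1000
hours  <- 24
# made-up hourly demand: independent lognormal draws for each hour
hourly <- matrix(rlnorm(n_samp * hours, meanlog = 3, sdlog = 0.5),
                 nrow = n_samp, ncol = hours)
daily  <- rowSums(hourly)                                  # daily samples = sum over hours

c(sum_of_hourly_p90 = sum(apply(hourly, 2, quantile, probs = 0.90)),
  daily_p90         = unname(quantile(daily, probs = 0.90)))
# the summed hourly P90s overstate the true daily P90 in this example
```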
Possibilities
Here I list the possibilities I am aware of for getting valid percentiles at both the daily and hourly grain without storing all the samples. The constraint is that whatever is stored must be something a database engine can hold (so I can't store fitted model objects); anything sophisticated would have to be precomputed and then stored in a table. I would like to know:
a) whether I am missing something from this list
b) whether I am making proper assumptions about the items in this list
- Store all samples: If I store all samples at the daily grain and I know the proportion of daily demand that each hour makes up, then I can multiply the samples by the proportions to get hourly samples on which I can calculate percentiles. As stated above, this is too expensive to be feasible.
- Store bins: I know I can get bins from average shifted histograms (ASH) and KDE; in fact any density estimation technique could be used (for an overview see Density estimation in R). The idea is that my 1,000 samples would be reduced to some lower count of bins, which I can set to just about anything I want. In the paper above the standard appears to be 512 bins, but going from 1,000 samples to 512 bins is only a small reduction, and my ability to calculate extreme percentiles, say P99, degrades as the number of bins decreases. More reduction is possible because we only really care about the percentiles above the median, so we could keep just the ~250 bins above it. Still, this is more data than I would want to store at a granular level. (The first sketch after this list stores a reduced quantile grid in this spirit.)
- Store parameters of a functional fit: The metalog with 6 or so parameters is very flexible and accurate. The idea here is to store the 6 parameters in a table. To obtain percentiles, the metalog parameters would be plugged into its quantile function to generate a sample, the sample would be multiplied by the hourly proportions, and the percentiles would be calculated on samples × proportions. The problem is that metalog fitting is slow relative to ASH and KDE, and I am not sure whether this really fixes the problem or just pushes it downstream. (See the second sketch after this list.)
- Store a random subsample of the original samples: This feels unsafe because a random subsample is not guaranteed to be representative; there is a small probability that it grabs more from the upper tail than the lower tail. This could be changed to taking a grid (systematic) sample, but that seems to have the same problems as density estimation without the benefits.
- Store moments (mean, sd, skewness, kurtosis): This would be my option of choice if the distribution can be reconstructed accurately from these moments. This is not my area of specialty, so if there are high-fidelity ways to do it I would like to know. I assume I can calculate the moments at the daily level and then scale them by the hourly proportions to disaggregate (under a pure scaling, the mean and sd scale with the proportion while skewness and kurtosis are unchanged). Conversely, for aggregation I do not see a good solution. (The third sketch after this list uses a Cornish-Fisher expansion.)
- Store daily percentiles and adjust them with multipliers for the hourly grain: Here I would take my samples, calculate the daily percentiles, and then multiply them by the hourly proportions. I am not sure whether this is valid; I know percentiles do not aggregate (add), although scaling does carry over, since the p-th quantile of cX is c times the p-th quantile of X for a positive constant c. (The first sketch after this list illustrates this scaling step.)
These are all the ideas I have at this time.
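To make a few of these concrete, here are rough sketches in R. Everything in them is illustrative: `daily_samples` is a stand-in lognormal forecast sample, and `p_hour` is an assumed share of daily demand falling in one hour; neither comes from a real forecast. The first sketch covers the "store bins / store daily percentiles with multipliers" idea: store a small quantile grid at the daily grain, then reconstruct an hourly percentile by scaling and interpolating.

```r
set.seed(2)
daily_samples <- rlnorm(1000, meanlog = 6, sdlog = 0.4)  # stand-in daily forecast samples

# compress: store 25 quantiles instead of 1000 samples (this row goes in the table)
grid   <- seq(0.01, 0.99, length.out = 25)
stored <- quantile(daily_samples, probs = grid)

# later: reconstruct an hourly percentile by scaling the stored quantiles
p_hour <- 0.08                                 # assumed share of the day in this hour
hourly_p90 <- approx(x = grid, y = stored * p_hour, xout = 0.90)$y

# check against the "store everything" answer
c(from_stored_grid = hourly_p90,
  from_all_samples = unname(quantile(daily_samples * p_hour, probs = 0.90)))
```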
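The second sketch is the "store parameters of a functional fit" option. Rather than the rmetalog package, it hand-rolls an ordinary least-squares fit of the 6-term metalog basis (my reading of Keelin 2016) to the quantile grid from the previous sketch, so treat the basis as an assumption to verify; only the 6 coefficients would need to be stored.

```r
# 6-term metalog basis evaluated at probabilities y (Keelin 2016, as I read it)
metalog_basis <- function(y) {
  L <- log(y / (1 - y))                        # logit term
  cbind(1, L, (y - 0.5) * L, y - 0.5, (y - 0.5)^2, (y - 0.5)^2 * L)
}

# least-squares fit to the (probability, quantile) pairs already stored above
a <- qr.solve(metalog_basis(grid), stored)     # 6 numbers to store in the table

# reconstruct any percentile from the 6 coefficients, then scale to the hourly grain
q_metalog <- function(y, a) as.vector(metalog_basis(y) %*% a)
q_metalog(0.90, a) * p_hour
```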
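The third sketch is the "store moments" option, reconstructing a percentile from mean, sd, skewness and excess kurtosis with a third-order Cornish-Fisher expansion. This is only an approximation and can misbehave for strongly skewed or heavy-tailed distributions, so it is a sketch of the idea rather than a recommendation; it again reuses `daily_samples` and `p_hour` from above.

```r
m_d <- mean(daily_samples)
s_d <- sd(daily_samples)
z   <- (daily_samples - m_d) / s_d
g1  <- mean(z^3)                               # skewness
g2  <- mean(z^4) - 3                           # excess kurtosis

# Cornish-Fisher: adjust the normal quantile using skewness and excess kurtosis
cf_quantile <- function(p, mean, sd, skew, exkurt) {
  q <- qnorm(p)
  q_adj <- q + (q^2 - 1) * skew / 6 +
           (q^3 - 3 * q) * exkurt / 24 -
           (2 * q^3 - 5 * q) * skew^2 / 36
  mean + sd * q_adj
}

# hourly P90 from four stored daily moments: mean and sd scale by the hour's share,
# skewness and excess kurtosis are unchanged by a positive scaling
cf_quantile(0.90, mean = m_d * p_hour, sd = s_d * p_hour, skew = g1, exkurt = g2)
unname(quantile(daily_samples * p_hour, probs = 0.90))   # full-sample comparison
```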