I am dealing with a dataset that has -
- many "measurements" (say X, Y and Z)
- a tremendous amount of data (at least considering we need to query it on the fly)
- and a user base that is interested in joint probabilities of measurements (e.g.
P(X >= x & Y <= y) | P(Y <= y))
**NOTE**: All measurements of interest are binned
- Our univariate distributions are stored as t-digests, which are pretty efficient and accurate enough for our needs
- Currently, we are storing the joint distributions (of, say, X and Y) as nested maps in our database, where the keys of the outer map correspond to bins of X, those of the inner map correspond to bins of Y, and the values of the map correspond to the frequency of (x-bin, y-bin). While this helps us "build & answer" the joint probability queries, nested maps are computationally inefficient
- Consequently, I am exploring whether it's possible to store the bivariate distribution (say, of X and Y) as a univariate distribution of, say, Y projected onto X (for example, I could split each "bin of X" into the total number of bins of Y, and then the frequency of a given x-y bin could be stored as that of this single projected variable) and then reuse the aggregate structures we use for univariate distributions
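To make the "projection" concrete, here is a minimal sketch of the flattening I have in mind, assuming a fixed, known number of Y bins. The names `N_Y_BINS`, `flatten`, and `unflatten`, and the toy counts, are made up for illustration and are not from our actual schema.

```python
from collections import Counter

N_Y_BINS = 4  # hypothetical: total number of Y bins

# Joint distribution as a nested map: x-bin -> {y-bin -> frequency}
joint = {
    0: {0: 5, 1: 3},
    1: {2: 7, 3: 1},
    2: {0: 2, 3: 6},
}

def flatten(x_bin, y_bin, n_y_bins=N_Y_BINS):
    """Map an (x-bin, y-bin) pair to a single projected bin index."""
    return x_bin * n_y_bins + y_bin

def unflatten(k, n_y_bins=N_Y_BINS):
    """Recover the (x-bin, y-bin) pair from a projected bin index."""
    return divmod(k, n_y_bins)

# Frequencies of the single projected variable
flat = Counter()
for x_bin, inner in joint.items():
    for y_bin, freq in inner.items():
        flat[flatten(x_bin, y_bin)] += freq

# Rebuild the nested map from the projected bins
rebuilt = {}
for k, freq in flat.items():
    x_bin, y_bin = unflatten(k)
    rebuilt.setdefault(x_bin, {})[y_bin] = freq

print(rebuilt == joint)
```

As long as the projected bins are stored exactly, the mapping is one-to-one, so the per-bin frequencies themselves survive the round trip in this sketch.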
Problem
- My idea of "projection" isn't quite working out, though. The projected variable's quantiles give very different results from those of the 2-D distribution. I tried to account for two things I noticed:
  - (x-bin, max-y-bin) is too close to (x-bin + 1, 0), so I tried to split each "x-bin" into total-y-bins * 3 and "interleave" the y-bin values
  - I also built the distribution using a variable that is 5X the projected variable, thinking the t-digest algorithm is perhaps merging these bins
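The kind of mismatch I am seeing can be reproduced without any t-digest at all. The sketch below uses exact counts and toy data (the bin layout, `N_Y_BINS`, and both helper functions are hypothetical) to compare the joint CDF with the CDF of the projected variable evaluated at the matching flattened index.

```python
N_Y_BINS = 4  # hypothetical: total number of Y bins

# Toy joint distribution: x-bin -> {y-bin -> frequency}
joint = {
    0: {0: 5, 1: 3},
    1: {2: 7, 3: 1},
    2: {0: 2, 3: 6},
}
total = sum(f for inner in joint.values() for f in inner.values())

def joint_cdf(x, y):
    """P(X <= x & Y <= y), summed over the 2-D rectangle of bins."""
    hits = sum(
        f
        for xb, inner in joint.items()
        for yb, f in inner.items()
        if xb <= x and yb <= y
    )
    return hits / total

def projected_cdf(x, y):
    """P(K <= k) for the projected variable K = x_bin * N_Y_BINS + y_bin."""
    k = x * N_Y_BINS + y
    hits = sum(
        f
        for xb, inner in joint.items()
        for yb, f in inner.items()
        if xb * N_Y_BINS + yb <= k
    )
    return hits / total

print(joint_cdf(2, 0), projected_cdf(2, 0))
```

With these toy counts the two values are 7/24 and 18/24: the projected CDF accumulates every Y bin of the lower X bins, so an interval of the projected variable does not correspond to a rectangle in (X, Y), at least under this particular encoding.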
Question
Notwithstanding the t-digest approach to aggregating the univariate projection of the bivariate distribution (which may be introducing issues, at least in part, because of how the algorithm works relative to this problem), is it even possible, in principle, to capture the distribution of a 2-D variable as a univariate distribution?
flatten the matrix was the approach I was trying to take. Is there another approach that helps store a 2 dimensional distribution as one dimensional? – Bi Act Feb 15 '22 at 15:26

flattened, and b) used T digest, the result of a) and b) did not yield correct answers. I guess my question was does flattening distort the 2d distribution beyond repair? – Bi Act Feb 17 '22 at 05:52