
I am dealing with a dataset that has:

  • many "measurements" (say X, Y and Z)
  • a tremendous amount of data (at least considering we need to query it on the fly)
  • and a user base that is interested in joint probabilities of measurements (e.g. P(X >= x & Y <= y) | P(Y <= y))

**NOTE**: All measurements of interest are binned

  1. Our univariate distributions are stored as t-digests, which are efficient and accurate enough for our needs.
  2. Currently, we store the joint distribution (of, say, X and Y) as nested maps in our database: the keys of the outer map correspond to bins of X, the keys of the inner map correspond to bins of Y, and the values are the frequencies of each (x-bin, y-bin) pair. While this lets us "build & answer" joint probability queries, nested maps are computationally inefficient.
  3. Consequently, I am exploring whether it's possible to store the bivariate distribution (say, of X and Y) as a univariate distribution of Y projected onto X. For example, I could split each bin of X into as many sub-bins as there are bins of Y, so that the frequency of a given (x-bin, y-bin) pair is stored as the frequency of a single projected variable, and then reuse the aggregate structures we already use for univariate distributions.
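To make the "projection" in item 3 concrete, here is a minimal sketch of one way to do it. It assumes the flattened index is `x_bin * n_y_bins + y_bin`. The names `encode`, `decode`, `N_Y_BINS` and the sample frequencies are hypothetical, chosen only for illustration.

```python
from collections import Counter

N_Y_BINS = 4  # hypothetical number of Y bins


def encode(x_bin, y_bin, n_y_bins=N_Y_BINS):
    """Map an (x-bin, y-bin) pair to a single projected bin index."""
    return x_bin * n_y_bins + y_bin


def decode(z, n_y_bins=N_Y_BINS):
    """Recover the (x-bin, y-bin) pair from a projected bin index."""
    return divmod(z, n_y_bins)


# Nested-map representation: outer key = x-bin, inner key = y-bin, value = frequency.
nested = {0: {0: 5, 1: 2}, 1: {3: 7}, 2: {1: 1, 2: 4}}

# Flattened representation: one key per (x-bin, y-bin) pair.
flat = Counter()
for x_bin, inner in nested.items():
    for y_bin, freq in inner.items():
        flat[encode(x_bin, y_bin)] += freq
```

With this encoding, every (x-bin, y-bin) pair gets its own integer, and `decode` recovers the pair exactly as long as `y_bin < n_y_bins`.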

Problem

  • My idea of "projection" isn't quite working out, though. The quantiles of the projected variable give very different results from the 2-D distribution. I tried to account for two things I noticed:
    • (x-bin, max-y-bin) is too close to (x-bin + 1, 0), so I tried splitting each x-bin into total-y-bins * 3 and "interleaving" the y-bin values
    • I also built the distribution using a variable that is 5X the projected variable, thinking the t-digest algorithm was perhaps merging these bins

Question

Notwithstanding the t-digest approach to aggregating the univariate projection of the bivariate distribution (which may itself be introducing issues, at least in part, because of how the algorithm works), is it even possible, in principle, to capture the distribution of a 2-D variable as a univariate distribution?

Bi Act
  • IIUC, you consider two binned random variables X and Y for which there is a corresponding matrix that approximates the joint distribution of X and Y. This matrix is currently stored in nested maps and you are thinking about changing that. When you say "projecting", do you just mean to flatten the matrix (i.e. mxn matrix becomes mx1), or do you mean marginalization? – frank Feb 15 '22 at 09:47
  • @frank, yes - flatten the matrix was the approach I was trying to take. Is there another approach that helps store a 2 dimensional distribution as one dimensional? – Bi Act Feb 15 '22 at 15:26
  • I am sorry, I had a typo: flatten (as in https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html ) means turning an mxn matrix into an (m*n)x1 matrix by concatenating all the rows or all the columns. But that is what you are referring to? – frank Feb 15 '22 at 15:31
  • @frank, that's precisely what I was shooting for, but my precision was way off the mark :) – Bi Act Feb 16 '22 at 20:47
  • So, that answers your question? Or is there still something you want to know? I am not sure... – frank Feb 17 '22 at 04:20
  • Well, in my case I a) flattened and b) used t-digest, and the combination of a) and b) did not yield correct answers. I guess my question was: does flattening distort the 2-D distribution beyond repair? – Bi Act Feb 17 '22 at 05:52

1 Answer


You can flatten the matrix without losing any information. To convince yourself, note that the matrix can easily be re-created from the flat vector, given its original dimensions.
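As a quick illustration of that round trip, here is a minimal pure-Python sketch. The helper names `flatten` and `unflatten` and the sample frequencies are made up for this example. NumPy's `ndarray.flatten` (linked in the comments) and `reshape` would do the same job on arrays.

```python
def flatten(matrix):
    """Concatenate the rows of a matrix into one flat list (row-major order)."""
    return [value for row in matrix for value in row]


def unflatten(flat, n_cols):
    """Rebuild the matrix from the flat list, given the number of columns."""
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


# Hypothetical 3 x 4 frequency matrix: rows are x-bins, columns are y-bins.
matrix = [
    [5, 2, 0, 0],
    [0, 0, 0, 7],
    [0, 1, 4, 0],
]

flat = flatten(matrix)
restored = unflatten(flat, n_cols=4)
```

The only extra thing that has to be stored alongside the flat vector is the number of columns (here, the number of y-bins).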

And, if you want to, you can treat the $(n\cdot m)$-dimensional vector (that is, the flattened $n\times m$ probability matrix) as a new one-dimensional probability distribution and apply your t-digest to it.

But it will then be more cumbersome to compute the joint probabilities you mentioned above. Extracting them from the t-digest of the flattened vector will be difficult, and the problems you run into may outweigh the benefits.
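To see where the difficulty comes from, here is a rough sketch, assuming row-major flattening with index `x_bin * n_y + y_bin` and exact counts rather than a t-digest. The function names and numbers are hypothetical. In the flattened index, the event `X >= x & Y <= y` is not one contiguous range. It is one short range per x-bin, so a single CDF or quantile lookup along the flattened axis answers a different question.

```python
def joint_prob(flat, n_x, n_y, x, y):
    """P(X >= x & Y <= y): sum one slice of the flat vector per qualifying x-bin."""
    total = sum(flat)
    hits = 0
    for x_bin in range(x, n_x):
        start = x_bin * n_y
        hits += sum(flat[start:start + y + 1])  # y-bins 0..y within this x-bin
    return hits / total


def flat_tail_prob(flat, n_y, x):
    """P(Z >= x * n_y) for the flattened variable Z: a single contiguous tail."""
    return sum(flat[x * n_y:]) / sum(flat)


# Hypothetical flattened 3 x 4 frequency matrix (x-bins by y-bins).
flat = [5, 2, 0, 0, 0, 0, 0, 7, 0, 1, 4, 0]

p_joint = joint_prob(flat, n_x=3, n_y=4, x=1, y=1)  # 1 of 19 observations
p_tail = flat_tail_prob(flat, n_y=4, x=1)           # 12 of 19 observations
```

The tail probability equals the marginal P(X >= x), not the joint probability. Getting the joint value out of a digest of the flattened variable would need one range query per x-bin, each of which is subject to whatever approximation the digest makes near bin boundaries.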

frank
  • Thanks @frank. It is helpful knowing that the solution in concept isn't flawed. I will definitely need to understand how t-digest affects the implementation of this idea. – Bi Act Feb 18 '22 at 05:05
  • If you like the answer, please consider accepting and/or upvoting it. If you think something is missing, please leave a pertinent comment. – frank Feb 18 '22 at 06:34