
Background

Suppose I have data $\mathcal{D}_1, \cdots, \mathcal{D}_n$ with each $\mathcal{D}_i$ containing $m$ observations $X_{i1}, \cdots, X_{im}$; these observations come from an unknown distribution, but we may safely assume they have moments of all orders.

If $\overline{X}_{n-1}$ denotes the sample mean over the $(n-1)m$ observations from $\mathcal{D}_1, \cdots, \mathcal{D}_{n-1}$, then computing $\overline{X}_{n}$ doesn't require recomputing over $\mathcal{D}_1, \cdots, \mathcal{D}_{n-1}$; it follows from the simple updating scheme \begin{align*} \overline{X}_{n} = \frac{(n-1)m\overline{X}_{n-1} + \mathbf{1}^\intercal \mathcal{D}_n}{nm} \end{align*} which is very convenient when dealing with large databases that are constantly being updated. A similar algorithm exists for the sample variance, and the general field studying this kind of serial updating is known as online learning.
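In code, the updating scheme above can be sketched as follows (a minimal illustration; the function and variable names are my own, not from any particular library):

```python
import numpy as np

def update_mean(prev_mean, prev_count, new_batch):
    """Fold a new batch D_n into a running mean without revisiting old batches."""
    new_batch = np.asarray(new_batch, dtype=float)
    new_count = prev_count + new_batch.size
    # numerator is (n-1)m * Xbar_{n-1} + 1' D_n; denominator is the new total nm
    new_mean = (prev_count * prev_mean + new_batch.sum()) / new_count
    return new_mean, new_count
```

Each call only touches the new batch, so the cost per update is $O(m)$ regardless of how many batches have already been absorbed.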

Problem

Unfortunately for the quantiles (specifically in this case, the 0.025 and 0.975 quantiles, to construct 95% confidence intervals), no such serial updating exists due to the rank-based nature of quantiles. One initial solution I considered is computing the sample mean and variance and constructing quantiles under a normal assumption. However, my data can be severely skewed, heavy/light-tailed, etc., so I would like to construct these quantiles either from empirical quantiles or via some "flexible" approach.
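For reference, the normal-assumption approach I considered is itself fully online: alongside the running mean one can maintain a running sum of squared deviations using the pairwise-merge formulas of Chan, Golub & LeVeque (a sketch; the function names are my own):

```python
import math
import numpy as np

def update_mean_m2(mean, m2, count, batch):
    """Merge a new batch into a running (mean, sum of squared deviations, count),
    using the pairwise-merge update of Chan, Golub & LeVeque."""
    batch = np.asarray(batch, dtype=float)
    nb = batch.size
    b_mean = batch.mean()
    b_m2 = ((batch - b_mean) ** 2).sum()   # within-batch sum of squared deviations
    total = count + nb
    delta = b_mean - mean
    new_mean = mean + delta * nb / total
    new_m2 = m2 + b_m2 + delta ** 2 * count * nb / total
    return new_mean, new_m2, total

def normal_95_interval(mean, m2, count):
    """0.025 and 0.975 quantiles under the (possibly badly wrong) normal assumption."""
    sd = math.sqrt(m2 / (count - 1))
    return mean - 1.96 * sd, mean + 1.96 * sd
```

The problem, as noted, is that the resulting interval is only as good as the normality assumption, which my skewed and heavy/light-tailed data can badly violate.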

With the "flexible" approach, I was considering a four-parameter distribution, which would account for skewness and kurtosis. Candidates I've considered include the Generalized Beta of the Second Kind, the Burr Type XII distribution, the Fleishman distribution, etc. But the problem with all these distributions is that their sufficient statistics are just the entire dataset, making them not viable for online learning.
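One partial workaround along these lines (my own sketch, not a full solution): the raw power sums $\sum x^k$, $k = 1, \dots, 4$, are themselves online-updatable, and a four-parameter family fit by matching the first four moments (as in the Fleishman approach) needs only these, at the cost of settling for a method-of-moments fit rather than maximum likelihood:

```python
import numpy as np

def update_power_sums(sums, count, batch):
    """Accumulate raw power sums S_k = sum(x**k), k = 1..4, across batches."""
    batch = np.asarray(batch, dtype=float)
    new_sums = [s + np.sum(batch ** k) for k, s in enumerate(sums, start=1)]
    return new_sums, count + batch.size

def skewness_kurtosis(sums, n):
    """Convert raw power sums to (population-style) skewness and kurtosis."""
    s1, s2, s3, s4 = (s / n for s in sums)      # raw moments E[X^k]
    mu = s1
    m2 = s2 - mu ** 2                            # central moments via binomial expansion
    m3 = s3 - 3 * mu * s2 + 2 * mu ** 3
    m4 = s4 - 4 * mu * s3 + 6 * mu ** 2 * s2 - 3 * mu ** 4
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

The catch is numerical: raw power sums of large datasets can lose precision to cancellation, and moment-matched fits can be far less efficient than likelihood-based ones, so I don't consider this a satisfying answer by itself.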

So my main question is: does there exist a flexible distribution suitable for online learning? Or more concretely, does there exist a flexible distribution whose sufficient statistics aren't just the entire dataset?

It's not necessary to proceed down this flexible distribution route; I'm happy to hear out all possibilities.

Tom Chen
  • I think this is just asking: What four-parameter distribution is good for fitting to data and estimating the 2.5th and 97.5th percentiles of an underlying unknown distribution? A shorter question like that might be clearer, since all the discussion of moments, updating rules, sufficient statistics and online learning can be confusing. – Matt F. Sep 27 '23 at 17:32
  • Unfortunately, the online learning part is just as crucial. The datasets I'm working with are quite massive, and are to be updated constantly. Constantly refitting all the data each time with a four-parameter distribution is unfeasible, and would prefer a quick updating scheme. – Tom Chen Sep 27 '23 at 17:40
  • "their sufficient statistics are just the entire dataset" ... if you want sufficient statistics that are a small, fixed subset, and if the support doesn't depend on the parameter vector you're looking at the exponential family, because of the Pitman–Koopman–Darmois theorem. e.g. see https://en.wikipedia.org/wiki/Sufficient_statistic#Exponential_family – Glen_b Sep 27 '23 at 17:45
  • Perhaps my version could be modified by saying “and we want a consistent updating procedure using only the last 1% of the data and the parameters derived from the first 99%”. (I had previously read the question as not requiring an updating procedure because it seemed to be out of reach.) – Matt F. Sep 27 '23 at 17:51
  • Re "no such serial updating exists due the rank-based nature of quantiles": not so. See https://stats.stackexchange.com/questions/7959 for instance. – whuber Sep 27 '23 at 21:09

0 Answers