
I want to save the ECDF as an approximation of the distribution of my measured values. Each of my data sets contains about 1000 measured values on average, of which about 300 are unique. R's `ecdf` function returns a step function that seems to store as many data points as there are unique values in the data set. Because I have several million data sets, I would like to reduce the number of stored data points. The problem is equivalent to selecting the minimum number N of quantiles that represent the data distribution to the required accuracy. Is there already a method in R that I could use, or a suitable algorithm that I could implement?
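
To make the idea concrete, here is a minimal sketch of what I have in mind, using base R only. The data are synthetic, and the helper name `compress_ecdf`, its arguments, and the tolerance of 0.02 are placeholders made up for illustration. It stores quantiles at equally spaced probabilities, rebuilds the CDF by linear interpolation, and increases the number of stored quantiles until the largest gap to the full ECDF drops below the tolerance. This finds the smallest equally spaced grid, which is not necessarily the true minimum N.

```r
set.seed(1)
# Synthetic stand-in for one data set: 1000 rounded measurements
x <- round(rnorm(1000, mean = 50, sd = 10), 1)

# Hypothetical helper: grow an equally spaced quantile grid until the
# interpolated CDF is within `tol` of the full ECDF at every observed value
compress_ecdf <- function(x, tol = 0.02, n_start = 5) {
  full <- ecdf(x)
  grid <- knots(full)               # sorted unique values kept by ecdf()
  n_max <- length(grid)
  stopifnot(n_max >= n_start)
  for (n in seq(n_start, n_max)) {
    p <- seq(0, 1, length.out = n)
    # type = 1 returns observed values (inverse of the empirical CDF)
    q <- quantile(x, probs = p, names = FALSE, type = 1)
    # ties = max keeps the highest probability when quantiles coincide
    cdf <- approxfun(q, p, yleft = 0, yright = 1, ties = max)
    err <- max(abs(full(grid) - cdf(grid)))
    if (err <= tol) break
  }
  list(p = p, q = q, max_err = err)
}

res <- compress_ecdf(x, tol = 0.02)
length(knots(ecdf(x)))  # points stored by the full ECDF
length(res$q)           # points stored after compression
res$max_err             # largest absolute CDF deviation on the data
```

Only `res$q` needs to be stored per data set, since the probabilities can be regenerated from its length.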

Peter
  • What's your basis for saying that "The ecdf function in R uses step function that seems to store as many data point as unique values included in data set"? The source code (type `ecdf` to see it) shows that it tabulates the unique values of x, which seems to be your desired behaviour – Miff Mar 16 '22 at 11:29
  • Welcome to SO, Peter! Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including attempted code (please be explicit about non-base packages), sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Mar 16 '22 at 13:03
  • I think I understood your question. You want to handle the underlying distribution of your data sets through a small set of representative values of your ECDF (here you have 300 values for each data set, and you want, for instance, k=20 to save space). However, I'm afraid the "best" choice of these k values will depend a lot on the shape of your distributions and on what you wish to do with them afterwards (resampling, for instance?). For instance, you could use a k-means clustering of your data set (in 1d, I suggest `Ckmeans.1d.dp` in R, it's efficient and deterministic) – Clej Mar 22 '22 at 10:23
  • And once you have your clusters, taking the mean of each cluster could give you something "representative". I've already seen this technique in several fields. However, that does not mean you'll have a suitable representation of the distribution. In particular, you'll probably give too much weight to the tails of your distribution (especially if it is skewed...). Anyway, I think this question is more likely to get an answer on CrossValidated than SO, as it's more a statistics-related problem than a pure coding issue. – Clej Mar 22 '22 at 10:27
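
The clustering idea from the last two comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic and skewed, k = 20 is arbitrary, and base R's `kmeans` with quantile-based starting centers is used as a stand-in for the suggested 1d package, so the clusters are not guaranteed to be optimal. Each cluster is reduced to its mean plus the share of observations it contains, and the resulting weighted step function is compared with the full ECDF.

```r
set.seed(1)
# Synthetic, right-skewed stand-in for one data set
x <- rexp(1000, rate = 0.1)

k <- 20
# Deterministic start: initial centers at evenly spaced quantiles of x
init <- quantile(x, probs = (seq_len(k) - 0.5) / k, names = FALSE)
fit <- kmeans(x, centers = init, iter.max = 100)

# Sort clusters by location; keep the mean and the relative size of each
ord <- order(fit$centers[, 1])
centers <- unname(fit$centers[ord, 1])  # k representative values
weights <- fit$size[ord] / length(x)    # share of observations per cluster

# Weighted step CDF built from the k (value, weight) pairs
approx_cdf <- stepfun(centers, c(0, cumsum(weights)))

# Largest absolute gap to the full ECDF, checked at the observed values
grid <- sort(unique(x))
max_err <- max(abs(ecdf(x)(grid) - approx_cdf(grid)))
max_err
```

Storing `centers` and `weights` takes 2k numbers per data set. Whether the resulting error is acceptable depends on the shape of the distribution, as noted in the comments.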

0 Answers