1

I have several datasets that I wish to analyse and summarize - and I'm starting with the basic median, and percentile/quartiles to begin with (unless there are other approaches I should be considering). The data are not normally distributed. Each dataset contains readings from a single sensor.

However, my raw dataset is not a column of single numbers, but of two: the first giving the value, and the second column indicating the number of times that value has been observed. (Actually three, with the first column identifying which sensor the measurement was taken from, but I think we can ignore that to begin with).

Is it possible to calculate percentiles without first "un-binning" (eg. using a loop)? If so, how?

I am working towards a final objective of:

  • data, either as comma or space separated rows, to be preferably piped to a script from bash, to perform the analysis (alternatively it could be pulled from a postgresql database if I can pass variables into the sql query).
  • script to output data in a form suitable for graphing in gnuplot as a boxplot (there will be intermediate steps between calculation of quartiles and the graphing).
  • I can thereby visually compare the reading from one sensor against the others.

At this point, I have few other constraints. I'm currently looking at julia, octave, or R -- but I'm open to suggestions.

Data File (can be awk'd, or change SQL, to remove the first column if needed)

Sensor A,0.1,1
Sensor A,8.3,2
Sensor A,19.1,3
Sensor A,22.4,7
Sensor A,67.3,3
Sensor A,102,2
Sensor A,428,1
Sensor A,1024,1

Output:

sensor name, 10th-percentile, lower quartile, median, upper quartile, 90th-percentile.
(repeated for other sensors if I do this in one run)

Additional notes about a typical (real) data sample:

Number of rows: 2,000
Number of readings: 32,000 (approx)
Min: 0.01
Max: 11,000

If I manually bucket the readings for one sensors, I have

Number of Sensor Readings

Bucket Sensor A Sensor B
0-1 1,447 2,638
1-10 26,285 16,858
10-100 4,138 10,326
100-1,000 152 1,972
1,000-10,000 1 320

(the buckets currently are not detailed enough for a histogram, but tht is separate work)

Phil
  • 11
  • 2
  • Why don't you want to un-bin the data? In R if you unbin using (eg) inverse.rle() then you'll be able to get all the summaries you want very easily. Without unbinning there are functions to calculate summary stats on frequency weighted data (eg DescTools::Quantile() – George Savva Feb 15 '23 at 13:03
  • 1
    I don't understand the description of your data. What are the three columns in your table? Is the "value" that you observe the actual reading that is the same for all the components in the bin, or is it some kind of aggregate? – Tim Feb 15 '23 at 13:04
  • @Tim: The table is purely an example of the data I'm working with, nothing more. If not helpful to the question it can be deleted. The important part is paragraph 2. – Phil Feb 15 '23 at 14:36
  • @GeorgeSavva: It's not that I don't want to un-bin -- it's that due to the number of data samples (which for one run could potentially involve >2m samples) I was concerned about efficiency. I also have the design consideration of whether to run the analysis multiple times (one for each sensor), or, as briefly mentioned in para 2, whether I can feed all data in at once and get out stats for each. – Phil Feb 15 '23 at 14:43
  • If you supply a small subset of data as a reproducible example then it will be easier to give specific advice. Whether you analyse all the data together or one sensor at a time depends on exactly what you want to know? – George Savva Feb 15 '23 at 14:47
  • For efficient, flexible ways to summarize large univariate datasets, take a look at https://stats.stackexchange.com/questions/35220, which discusses methods of accurately describing the entire distribution of results (without binning) to any given level of accuracy using minimal amounts of storage. With such a summary available for each sensor you can readily combine them. Another approach for specific quantiles is to update them sequentially: see https://stats.stackexchange.com/questions/7959. – whuber Feb 15 '23 at 15:38
  • @whuber: Aha! 65710 does good -- thank you. I'd missed that earlier. Alas! The two links in your comment are beyond my comprehension. A QQPlot seems to compare two sets of quantiles, but it's not clear to me how I can use this. For Q7957, the P2 algorithm link is stale, the extended one is behind a paywall, and I suspect that I'm not academically able to grasp it anyway. – Phil Feb 15 '23 at 21:11
  • A QQ plot against the uniform distribution, aka an empirical CDF, enables you to look up quantiles in $O(1)$ time: you just read them off the plot. – whuber Feb 15 '23 at 21:22

0 Answers0