I have several datasets that I wish to analyse and summarise, starting with the median and percentiles/quartiles (unless there are other approaches I should be considering). The data are not normally distributed. Each dataset contains readings from a single sensor.
However, my raw dataset is not a single column of numbers but two: the first gives the value, and the second the number of times that value has been observed. (Actually three: the first column identifies which sensor the measurement was taken from, but I think we can ignore that to begin with.)
Is it possible to calculate percentiles without first "un-binning" the data (e.g. with a loop)? If so, how?
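It should be, because the (value, count) pairs are effectively a run-length-encoded sorted sample: a percentile is the first value whose cumulative count reaches p × n. A minimal sketch of the idea, in Python purely for illustration (using the simple type-1 / inverse-ECDF quantile rule, since no particular rule was specified):

```python
def binned_quantile(pairs, p):
    """Type-1 (inverse-ECDF) quantile from (value, count) pairs,
    read off the cumulative counts with no un-binning."""
    pairs = sorted(pairs)                   # sort by value
    target = p * sum(c for _, c in pairs)   # position in the sorted sample
    cum = 0
    for value, count in pairs:
        cum += count
        if cum >= target:
            return value
    return pairs[-1][0]                     # float-safety fallback

# Sensor A sample from the question (20 readings in total)
sample = [(0.1, 1), (8.3, 2), (19.1, 3), (22.4, 7),
          (67.3, 3), (102, 2), (428, 1), (1024, 1)]
print(binned_quantile(sample, 0.5))   # → 22.4
```

Note that different quantile definitions (interpolating between observations, etc.) would give slightly different values; this one always returns an observed value.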
I am working towards a final objective of:
- Data, as comma- or space-separated rows, preferably piped to the script from bash (alternatively it could be pulled from a PostgreSQL database, if I can pass variables into the SQL query).
- The script outputs data in a form suitable for graphing as a boxplot in gnuplot (there will be intermediate steps between calculating the quartiles and the graphing).
- I can thereby visually compare the readings from one sensor against the others.
At this point I have few other constraints. I'm currently looking at Julia, Octave, or R, but I'm open to suggestions.
Data file (the first column can be removed with awk, or by changing the SQL, if needed):
Sensor A,0.1,1
Sensor A,8.3,2
Sensor A,19.1,3
Sensor A,22.4,7
Sensor A,67.3,3
Sensor A,102,2
Sensor A,428,1
Sensor A,1024,1
Output:
sensor name, 10th percentile, lower quartile, median, upper quartile, 90th percentile
(repeated for other sensors if I do this in one run)
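Toward that objective, here is a sketch of what such a script could look like, assuming three-column CSV on stdin as in the sample above (Python as one candidate language; the quantile rule is the simple type-1 one, so swap in another definition if preferred):

```python
#!/usr/bin/env python3
"""Sketch: read 'sensor,value,count' CSV rows on stdin and print one
summary row per sensor: name, p10, Q1, median, Q3, p90."""
import csv
import sys
from collections import defaultdict

def binned_quantile(pairs, p):
    """Smallest value whose cumulative count reaches p * n (type-1 quantile)."""
    pairs = sorted(pairs)
    target = p * sum(c for _, c in pairs)
    cum = 0
    for value, count in pairs:
        cum += count
        if cum >= target:
            return value
    return pairs[-1][0]

def summarise(rows):
    """Group (sensor, value, count) rows by sensor and yield summary lines."""
    by_sensor = defaultdict(list)
    for sensor, value, count in rows:
        by_sensor[sensor].append((float(value), int(count)))
    for sensor, pairs in by_sensor.items():
        qs = (binned_quantile(pairs, p) for p in (0.10, 0.25, 0.50, 0.75, 0.90))
        yield sensor + "," + ",".join(f"{q:g}" for q in qs)

if __name__ == "__main__":
    for line in summarise(csv.reader(sys.stdin)):
        print(line)
```

It could then be invoked as, say, `psql -c "..." --csv | ./summarise.py` or fed from a file. For the later graphing step, gnuplot's `candlesticks` style takes columns in the order x, box_min, whisker_min, whisker_max, box_max, so a suitable `using` clause can map the summary columns onto that layout.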
Additional notes about a typical (real) data sample:
Number of rows: 2,000
Number of readings: 32,000 (approx)
Min: 0.01
Max: 11,000
If I manually bucket the readings per sensor, I have:
Number of Sensor Readings
| Bucket | Sensor A | Sensor B |
|---|---|---|
| 0-1 | 1,447 | 2,638 |
| 1-10 | 26,285 | 16,858 |
| 10-100 | 4,138 | 10,326 |
| 100-1,000 | 152 | 1,972 |
| 1,000-10,000 | 1 | 320 |
(the buckets currently are not detailed enough for a histogram, but that is separate work)
Use `inverse.rle()`, then you'll be able to get all the summaries you want very easily. Without unbinning there are functions to calculate summary stats on frequency-weighted data (e.g. `DescTools::Quantile()`). – George Savva Feb 15 '23 at 13:03
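For completeness, the un-binning route the comment describes (R's `inverse.rle()` followed by ordinary summary functions) translates roughly as follows in Python, expanding each value by its count:

```python
import statistics
from itertools import chain, repeat

# Sensor A sample from the question
values = [0.1, 8.3, 19.1, 22.4, 67.3, 102, 428, 1024]
counts = [1, 2, 3, 7, 3, 2, 1, 1]

# Expand each value by its count: a rough analogue of R's inverse.rle()
expanded = list(chain.from_iterable(repeat(v, n)
                                    for v, n in zip(values, counts)))

print(statistics.median(expanded))          # → 22.4
print(statistics.quantiles(expanded, n=4))  # quartiles (exclusive method)
```

This materialises all ~32,000 readings per run, which is harmless at this scale, but the cumulative-count approach avoids the expansion entirely.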