
I have a set of measurements from an air pollution sensor. I want to determine the min and the max value over a period of time (say, a day).

The min and the max don't have to be the true mathematical min and max, and I want to determine them robustly, because I suspect that there are outliers in the sensor data.

I want to use the 1st percentile and the 99th percentile. Is that okay?

  • What exactly are those min and max supposed to be? If you really want to model the “smallest possible” value, “outliers” should not be your concern unless they are wrong (the measurement device malfunctioned, etc.). – Tim May 06 '23 at 07:13

3 Answers


I often see that people spend too little time planning the experiment and too much time evaluating a corrupted dataset. Therefore, I get suspicious and ask: why have you chosen these percentiles? Have you thought about the frequency of corrupted data points, and their origin, before evaluating the dataset? Do the values 1% and 99% just enhance the "argument" you are trying to make, or are you being conservative? You should ask yourself these questions and test whether the answers are satisfying.

To the question: state what you are using to evaluate the data. Do not say that you are evaluating min and max values; say that you are using the 1st and 99th percentiles instead. It is also good practice to run the evaluation with different percentile values and check that the result is robust against the subjective choice (1%, 99%); a sketch of such a check follows.
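For instance, a minimal base-R sketch of such a robustness check (the candidate cutoffs below are illustrative choices on fake data, not recommendations):

## vary the percentile pair and see how much the estimates move
x     = rnorm(48)                   # one day of fake sensor readings
probs = c(0.005, 0.01, 0.02, 0.05)  # candidate lower cutoffs
sapply(probs, function(p) quantile(x, c(p, 1 - p)))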

Other than this, I do not take issue with the analysis. Here is sample R code.

## generate fake data 
nDays = 200   # we take data for 200 days
nData = 24*2  # we take one data point every 30min => 48 data points per day
data  = rnorm(nDays*nData) # fake data
day   = factor(rep(1:nDays, each=nData))

Store the data in a data frame:

df = data.frame(data, day)

Calculate quantiles for each day:

library(dplyr)
q01 = df %>% group_by(day) %>% reframe(q = quantile(data, c(1e-2)))
q99 = df %>% group_by(day) %>% reframe(q = quantile(data, c(99e-2)))

Plot them:

dfq = data.frame(
  data = c(q01$q, q99$q),
  grp  = factor(c(rep('1%', nDays), rep('99%', nDays)))
)
boxplot(data ~ grp, dfq)

Semoi
    Good points (+1). The OP does not state that they are using R. It is evidently software you use, and it's perhaps the best guess at what people posting here use, but there is no CV default on software. – Nick Cox May 06 '23 at 09:26

Denote by $X_1,\dots,X_n$ the sensor data from which you want to compute the max.

A preliminary approach could be to take $$\widehat{max}(X_1,\dots,X_n) = Median(X_1,\dots,X_n)+\Phi^{-1}\left(\frac{n-\alpha}{n-2\alpha+1} \right)\frac{IQR(X_1,\dots,X_n)}{\Phi^{-1}(3/4)-\Phi^{-1}(1/4)} $$ with $\alpha=0.375$, $\Phi$ the Gaussian CDF and $IQR$ the interquartile range. The idea is to take the approximation of the maximum order statistic found here and replace $\mu$ by the median and $\sigma$ by $\frac{IQR(X_1,\dots,X_n)}{\Phi^{-1}(3/4)-\Phi^{-1}(1/4)}$.

Then, if the data were Gaussian, you would get an approximation of the expectation of the maximum. On the other hand, if the data are Gaussian but contaminated with outliers, you still get a robust estimator of the max, because you use only the median and the IQR, which can both tolerate up to $25\%$ of outliers. This is very preliminary because it supposes a Gaussian model for the inliers, but if your data are well behaved (we would need to see the data to assess that, typically with a QQ-plot), then this should work.
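For concreteness, here is a sketch of this estimator in R (my illustration of the formula above; the name robust_max is made up):

## robust estimate of the expected maximum, assuming Gaussian inliers
robust_max = function(x, alpha = 0.375) {
  n = length(x)
  sigma_hat = IQR(x) / (qnorm(3/4) - qnorm(1/4))  # robust scale estimate
  median(x) + qnorm((n - alpha) / (n - 2*alpha + 1)) * sigma_hat
}

x = c(rnorm(48), 50, 60)  # fake data with two gross outliers
robust_max(x)             # barely affected by the outliers
max(x)                    # dominated by the outliers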

TMat

If you have a list of all values throughout a window of time (24 hours), then:

  1. Sort the values in the list in ascending or descending order.
  2. Calculate the median of the first n values (e.g. take the median of five values).
  3. Check the values manually for outliers by looking at the list.
  4. If all n values turn out to be outliers, increase n to include more normal values (see the sketch after this list).
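A literal R sketch of this recipe, with n as a free parameter (an illustration of the steps only; see whuber's objections in the comments below):

## "min" as the median of the n smallest values; step 3 (the manual
## inspection of the sorted list) has no code equivalent
recipe_min = function(x, n = 5) median(sort(x)[1:n])

x = c(rnorm(48), -30)  # one day of fake data plus one outlier
recipe_min(x, n = 5)   # steps 1-2
recipe_min(x, n = 9)   # step 4: increase n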
Thulfiqar
    This recipe is problematic for several reasons. First, it ignores the likelihood of serial correlation. Second, it's too vague because it doesn't define (or even describe) what an "outlier" might be. Third, it simply won't work: try it out on some data. – whuber Jan 21 '21 at 13:09
  • @whuber thank you for your informative comment. I had assumed the outlier to be a huge value due to a faulty sensor – Thulfiqar Jan 21 '21 at 13:42