
I'm not sure if this is the right place to ask, so forgive me if this sounds unrelated. Here is my situation:

My dataset is continuous in time, and there is some erroneous data that I need to handle: sudden increases in value relative to the neighboring points.

Here is a part of the dataset. As you can see, there is a sudden increase in the 4th value (28.3). (Values are in the last column.)

19741212,0700,200,1,N,  4.6
19741212,0800,190,1,N,  4.6
19741212,0900,180,1,N,  5.7
19741212,1000,160,1,N, 28.3   # wrong data, need interpolate from neighbors
19741212,1100,170,1,N,  4.6
19741212,1200,200,1,N,  5.1
19741212,1300,230,1,N,  5.1

I need to identify them, and then interpolate from nearby data to replace them. I'm wondering if there is an existing algorithm for this?

If I were to implement a method from scratch, I would:

  1. Calculate the increment from the nearest data point
  2. Select a proper threshold for detecting the corrupted data
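
A minimal sketch of those two steps with pandas (the series values are taken from the sample above; the threshold is illustrative):

```python
import pandas as pd

# last column of the sample data
s = pd.Series([4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1])

# 1. calculate the increment from the previous data point
increment = s.diff().abs()

# 2. flag points whose increment exceeds an illustrative threshold
threshold = 10.0
suspect = increment > threshold
```

Note that a plain difference flags both the jump into the bad value and the drop back out of it (indices 3 and 4 here), which is one source of the false positives mentioned below.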

But I'm not sure if this is good enough; maybe I'm neglecting something, which would result in a huge number of false positives.

Additionally, I'm using Python and pandas for dealing with the data, so related resources would be great.

cqcn1991

2 Answers


One way to detect corrupted data or outliers is to first calculate the rolling median of the series (the median is robust to outliers), and then compute the distance between each actual observation and the rolling median. Filter out the observations whose distance is larger than a threshold.

# your data
# ====================================
print(df)


             A    B  C  D     E
19741212   700  200  1  N   4.6
19741212   800  190  1  N   4.6
19741212   900  180  1  N   5.7
19741212  1000  160  1  N  28.3
19741212  1100  170  1  N   4.6
19741212  1200  200  1  N   5.1
19741212  1300  230  1  N   5.1

# rolling median, 3-term centered window
# (pd.rolling_median was removed in newer pandas; use .rolling().median())
# =================================================
res = df['E'].rolling(window=3, center=True).median()
print(res)

19741212    NaN
19741212    4.6
19741212    5.7
19741212    5.7
19741212    5.1
19741212    5.1
19741212    NaN
dtype: float64

# threshold 20% from rolling median
threshold = 0.2
mask = abs(df['E'] - res)/res > threshold
# replace outliers with rolling medians
df.loc[mask, 'E'] = res[mask]

print(df)

             A    B  C  D    E
19741212   700  200  1  N  4.6
19741212   800  190  1  N  4.6
19741212   900  180  1  N  5.7
19741212  1000  160  1  N  5.7
19741212  1100  170  1  N  4.6
19741212  1200  200  1  N  5.1
19741212  1300  230  1  N  5.1
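
If you would rather interpolate from the neighbors than substitute the rolling median itself, one variant (a sketch on the same series, using the modern `.rolling()` API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'E': [4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1]})

# distance from a 3-term centered rolling median, as above
res = df['E'].rolling(window=3, center=True).median()
mask = (df['E'] - res).abs() / res > 0.2

# set the flagged points to NaN and linearly interpolate from the neighbors
df.loc[mask, 'E'] = np.nan
df['E'] = df['E'].interpolate()
```

The 4th value then becomes 5.15, the midpoint of its neighbors 5.7 and 4.6.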
Jianxun Li

You can also identify the outliers by testing how far they are from the mean and setting a standard-deviation threshold.

Based on https://stackoverflow.com/a/11686764/2477491 , you can set your outliers to NaN with:

import numpy as np

def reject_outliers(data, m=2):  # m is the std threshold; tune it for your needs
    return data[abs(data - np.mean(data)) < m * np.std(data)]

# the returned series is shorter, but index alignment fills the
# rejected rows of the new column with NaN
data[6] = reject_outliers(data[5])

          0     1    2  3  4     5    6  
0  19741212   700  200  1  N   4.6  4.6  
1  19741212   800  190  1  N   4.6  4.6  
2  19741212   900  180  1  N   5.7  5.7  
3  19741212  1000  160  1  N  28.3  NaN  
4  19741212  1100  170  1  N   4.6  4.6  
5  19741212  1200  200  1  N   5.1  5.1  
6  19741212  1300  230  1  N   5.1  5.1  

If there are trends in your series, you may rather apply it on a moving time window instead of the whole series.
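
Applied to a rolling window with pandas, that idea could look like the sketch below (window size and threshold are illustrative). One caveat: a large outlier inflates the rolling standard deviation of its own window, so the threshold may need to be lower than for the whole-series version; the rolling median from the other answer avoids this.

```python
import pandas as pd

s = pd.Series([4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1])

# rolling mean and std over an illustrative 5-point centered window
roll_mean = s.rolling(window=5, center=True, min_periods=1).mean()
roll_std = s.rolling(window=5, center=True, min_periods=1).std()

# flag points further than m rolling standard deviations from the rolling mean
m = 1.5
outliers = (s - roll_mean).abs() > m * roll_std
```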

For applying custom functions on a window, I usually use scipy.ndimage.filters.generic_filter, which also works with 1-D arrays and returns a scalar after applying a function to a moving window defined by a footprint. Here is an example of how to interpolate the mean value, for NaN only, in a 1x3 footprint:

import numpy as np
from scipy import ndimage as im

def interpNan(win):  # win is the 1x3 window
    if win[1] != win[1]:  # NaN != NaN, so this is true only when the center is NaN
        return round(np.nanmean(win), 1)
    else:
        return round(win[1], 1)

footprint = np.array([1, 1, 1])
data[7] = im.generic_filter(data[6], interpNan, footprint=footprint)

          0     1    2  3  4     5    6    7
0  19741212   700  200  1  N   4.6  4.6  4.6
1  19741212   800  190  1  N   4.6  4.6  4.6
2  19741212   900  180  1  N   5.7  5.7  5.7
3  19741212  1000  160  1  N  28.3  NaN  5.2
4  19741212  1100  170  1  N   4.6  4.6  4.6
5  19741212  1200  200  1  N   5.1  5.1  5.1
6  19741212  1300  230  1  N   5.1  5.1  5.1

[7 rows x 8 columns]

You can also merge the two functions together, but for quality analysis I don't: I always keep the raw data, the valid data, and the interpolated data.

Delforge