
I'm not sure if this is the right place to ask, so forgive me if this sounds unrelated. Here is my situation:

My dataset is continuous in time, and there is some erroneous data that I need to handle: sudden increases in value relative to the neighboring points.

Here is a part of the dataset. As you can see, there is a sudden increase in the 4th value (28.3). (Values are in the last column.)

19741212,0700,200,1,N,  4.6
19741212,0800,190,1,N,  4.6
19741212,0900,180,1,N,  5.7
19741212,1000,160,1,N, 28.3   # wrong data, need interpolate from neighbors
19741212,1100,170,1,N,  4.6
19741212,1200,200,1,N,  5.1
19741212,1300,230,1,N,  5.1

I need to identify them, and then interpolate from nearby data to replace them. I'm wondering if there is an existing algorithm for this?

If I were to implement a method from scratch, I would:

  1. Calculate the increment from the nearest data point
  2. Select a proper threshold for detecting the corrupted data
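
A minimal sketch of those two steps with pandas (the series values are taken from the sample above; the threshold is illustrative):

```python
import pandas as pd

# last column of the sample data
s = pd.Series([4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1])

# 1. calculate the increment from the previous data point
increment = s.diff().abs()

# 2. flag points whose increment exceeds an illustrative threshold
threshold = 10.0
suspect = increment > threshold
```

Note that a plain difference flags both the jump into the bad value and the drop back out of it (indices 3 and 4 here), which is one source of the false positives mentioned below.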

But I'm not sure if this is good enough; maybe I'm neglecting something, which would result in a huge number of false positives.

Additionally, I'm using Python and pandas for dealing with the data, so related resources would be great.

cqcn1991

2 Answers


One way to detect corrupted data or outliers is to first calculate the rolling median of the series (the median is robust to outliers), and then compute the distance between each actual observation and the rolling median. Filter out the observations whose distance is larger than a threshold.

# your data
# ====================================
print(df)


             A    B  C  D     E
19741212   700  200  1  N   4.6
19741212   800  190  1  N   4.6
19741212   900  180  1  N   5.7
19741212  1000  160  1  N  28.3
19741212  1100  170  1  N   4.6
19741212  1200  200  1  N   5.1
19741212  1300  230  1  N   5.1

# rolling median, 3-term centered window
# (pd.rolling_median was removed in newer pandas; use .rolling().median())
# =================================================
res = df['E'].rolling(window=3, center=True).median()
print(res)

19741212    NaN
19741212    4.6
19741212    5.7
19741212    5.7
19741212    5.1
19741212    5.1
19741212    NaN
dtype: float64

# threshold 20% from rolling median
threshold = 0.2
mask = abs(df['E'] - res)/res > threshold
# replace outliers with rolling medians
df.loc[mask, 'E'] = res[mask]

print(df)

             A    B  C  D    E
19741212   700  200  1  N  4.6
19741212   800  190  1  N  4.6
19741212   900  180  1  N  5.7
19741212  1000  160  1  N  5.7
19741212  1100  170  1  N  4.6
19741212  1200  200  1  N  5.1
19741212  1300  230  1  N  5.1
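
If you would rather interpolate from the neighbors than substitute the rolling median itself, one variant (a sketch on the same series, using the modern `.rolling()` API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'E': [4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1]})

# distance from a 3-term centered rolling median, as above
res = df['E'].rolling(window=3, center=True).median()
mask = (df['E'] - res).abs() / res > 0.2

# set the flagged points to NaN and linearly interpolate from the neighbors
df.loc[mask, 'E'] = np.nan
df['E'] = df['E'].interpolate()
```

The 4th value then becomes 5.15, the midpoint of its neighbors 5.7 and 4.6.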
Jianxun Li

You can also identify the outliers by testing how far they are from the mean and setting a standard-deviation threshold.

Based on https://stackoverflow.com/a/11686764/2477491 , you can set your outliers to NaN with:

import numpy as np

def reject_outliers(data, m=2):  # m is the std threshold; tune it for your needs
    return data[abs(data - np.mean(data)) < m * np.std(data)]

# the returned series is shorter, but index alignment fills the
# rejected rows of the new column with NaN
data[6] = reject_outliers(data[5])

          0     1    2  3  4     5    6  
0  19741212   700  200  1  N   4.6  4.6  
1  19741212   800  190  1  N   4.6  4.6  
2  19741212   900  180  1  N   5.7  5.7  
3  19741212  1000  160  1  N  28.3  NaN  
4  19741212  1100  170  1  N   4.6  4.6  
5  19741212  1200  200  1  N   5.1  5.1  
6  19741212  1300  230  1  N   5.1  5.1  

If there are trends in your series, you may rather apply it on a moving time window instead of the whole series.
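
Applied to a rolling window with pandas, that idea could look like the sketch below (window size and threshold are illustrative). One caveat: a large outlier inflates the rolling standard deviation of its own window, so the threshold may need to be lower than for the whole-series version; the rolling median from the other answer avoids this.

```python
import pandas as pd

s = pd.Series([4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1])

# rolling mean and std over an illustrative 5-point centered window
roll_mean = s.rolling(window=5, center=True, min_periods=1).mean()
roll_std = s.rolling(window=5, center=True, min_periods=1).std()

# flag points further than m rolling standard deviations from the rolling mean
m = 1.5
outliers = (s - roll_mean).abs() > m * roll_std
```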

For applying custom functions on a window, I usually use scipy.ndimage.filters.generic_filter, which also works with 1-D arrays and returns a scalar after applying a function to a moving window defined by a footprint. Here is an example of how to interpolate the mean value, for NaN only, in a 1x3 footprint:

import numpy as np
from scipy import ndimage as im

def interpNan(win):  # win is the 1x3 window
    if win[1] != win[1]:  # NaN != NaN, so this is true only when the center is NaN
        return round(np.nanmean(win), 1)
    else:
        return round(win[1], 1)

footprint = np.array([1, 1, 1])
data[7] = im.generic_filter(data[6], interpNan, footprint=footprint)

          0     1    2  3  4     5    6    7
0  19741212   700  200  1  N   4.6  4.6  4.6
1  19741212   800  190  1  N   4.6  4.6  4.6
2  19741212   900  180  1  N   5.7  5.7  5.7
3  19741212  1000  160  1  N  28.3  NaN  5.2
4  19741212  1100  170  1  N   4.6  4.6  4.6
5  19741212  1200  200  1  N   5.1  5.1  5.1
6  19741212  1300  230  1  N   5.1  5.1  5.1

[7 rows x 8 columns]

You can also merge the two functions together, but for quality analysis I don't: I always keep the raw data, the valid data, and the interpolated data.

Delforge