Scanning a stock database for errors/flaws

Question

I'm currently working on some MATLAB code that checks a stock database for errors (missing values, wrong values, etc.). After reading this post, I concluded that I'll probably have to write some data-cleaning code to get accurate and reliable results when backtesting with this database.

The database was downloaded from Yahoo Finance and contains the following columns for each stock: Date, Open, High, Low, Close, Volume, AdjClose.

So far the program scans for the following trivial errors:

Close > High
Close < Low
Open > High
Open < Low
High < Low

The program also checks if any of the data columns contains values less than zero or NaN.
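
For reference, the checks listed above could be sketched as follows. This is a minimal illustration in Python rather than MATLAB; it assumes each daily bar is a dict keyed by the column names from the question, and the helper names `is_missing` and `check_bar` are made up for this example.

```python
import math

FIELDS = ("Open", "High", "Low", "Close", "Volume", "AdjClose")


def is_missing(value):
    """True if the value is absent (None) or a floating-point NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def check_bar(bar):
    """Return a list of problem labels for one daily bar (a dict)."""
    problems = []
    for name in FIELDS:
        value = bar.get(name)
        if is_missing(value):
            problems.append(f"{name} missing")
        elif value < 0:
            problems.append(f"{name} < 0")

    o, h, l, c = (bar.get(k) for k in ("Open", "High", "Low", "Close"))
    # Comparisons against NaN are always False, so the relational checks
    # only run when all four prices are present.
    if not any(is_missing(v) for v in (o, h, l, c)):
        if c > h:
            problems.append("Close > High")
        if c < l:
            problems.append("Close < Low")
        if o > h:
            problems.append("Open > High")
        if o < l:
            problems.append("Open < Low")
        if h < l:
            problems.append("High < Low")
    return problems
```

Running `check_bar` over every row and collecting the non-empty results gives a per-date error report that can be reviewed by hand.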

What other errors/flaws could I look for in the database?

It will help if you explain the reasoning behind these checks. Why do you expect such errors to appear? — , Mar 15 '13 at 01:25
@EugeneS Data "scrubbing" is a common ritual in this industry. — chrisaycock, Mar 15 '13 at 01:47
You could check for duplicate entries, you could make sure that your dates are all unique. — Akavall, Mar 15 '13 at 02:58
@chrisaycock Hi and thanks for your comment. However I wonder how these errors might appear? Wouldn't it mean a faulty data source in the first place? Thanks! — , Mar 15 '13 at 07:39
@chrisaycock: Could you elaborate a bit more on data scrubbing? What procedures/routines are used to clean the data? — , Mar 15 '13 at 10:24
@classifire There are lots of things to look for: missing days, prices that seem abnormal compared to previous ticks, adjustments that went wrong, etc. — chrisaycock, Mar 15 '13 at 11:19
Can you give any practical advice on how to detect "adjustments that went wrong"? — , Mar 15 '13 at 11:55
@classifire Via an external data source, which this guy didn't seem to understand. — chrisaycock, Mar 15 '13 at 12:53
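
The duplicate-date and missing-day checks suggested in the comments above might look roughly like this. It is a sketch in Python that assumes the Date column has already been parsed into `datetime.date` objects. The function names are hypothetical, and exchange holidays are not modelled, so they will show up as "missing" weekdays and need to be filtered against a holiday calendar.

```python
from collections import Counter
from datetime import date, timedelta


def duplicate_dates(dates):
    """Return the dates that occur more than once, in sorted order."""
    counts = Counter(dates)
    return sorted(d for d, n in counts.items() if n > 1)


def missing_weekdays(dates):
    """Return Monday-to-Friday dates absent between the first and last date.

    Exchange holidays are reported too, since no calendar is consulted.
    """
    present = set(dates)
    if not present:
        return []
    day, last = min(present), max(present)
    missing = []
    while day <= last:
        if day.weekday() < 5 and day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```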

score 6 · Answer 1 · answered Mar 17 '13 at 00:57

Few points from my experience:

1. Another filter to consider is for price = 999 or 999.99, which appears in data from some providers.

2. Another set of checks looks at the cross-section of, e.g., range = (high - low)/close over all names. Check the smallest and largest ranges to see whether the values make sense. You can also check the daily % change from one day to the next and inspect all of the largest moves for errors in the data. The Flash Crash in the US created huge ranges, but if you see abnormal ranges on other days, check the quality of the data. Also, in September 2008 there are many nonsensical values, even in very liquid products.

3. Be careful when using Yahoo (and other sources) for companies that change names or go into or out of bankruptcy.
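
Points 1 and 2 could be sketched like this. It is an illustrative Python example, not a definitive implementation: bars are assumed to be dicts with Open/High/Low/Close keys, the function names are made up, and anything flagged is only a candidate for manual review, since a stock can legitimately trade at 999 or move sharply.

```python
SENTINELS = (999.0, 999.99)
PRICE_FIELDS = ("Open", "High", "Low", "Close")


def sentinel_rows(bars):
    """Indices of bars where any price field equals a sentinel value."""
    return [
        i for i, bar in enumerate(bars)
        if any(abs(bar[k] - s) < 1e-9 for k in PRICE_FIELDS for s in SENTINELS)
    ]


def relative_range(bar):
    """(high - low) / close; assumes Close > 0 was verified earlier."""
    return (bar["High"] - bar["Low"]) / bar["Close"]


def extreme_ranges(bars_by_name, n=3):
    """Given {ticker: bar} for one day, return (n smallest, n largest)
    as lists of (range, ticker) tuples sorted ascending."""
    ranked = sorted((relative_range(bar), name) for name, bar in bars_by_name.items())
    return ranked[:n], ranked[-n:]


def largest_moves(closes, n=3):
    """Given closes in chronological order, return the n largest absolute
    daily changes as (abs_fractional_change, index) tuples, largest first."""
    moves = [(abs(closes[i] / closes[i - 1] - 1.0), i) for i in range(1, len(closes))]
    return sorted(moves, reverse=True)[:n]
```

The extremes returned by `extreme_ranges` and `largest_moves` are the rows worth comparing against a second data source or a known event such as the Flash Crash.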

+1 Very good points - I'm definitely gonna add those checks. — , Mar 17 '13 at 09:49

score 2 · Answer 2 · edited Mar 19 '13 at 11:16

The adjusted close changes after dividends and stock splits, so previously downloaded data has to be replaced with the new values. It is therefore usually a good idea to check the adjusted close of the downloaded values against the current values.
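
As a rough sketch of that comparison in Python: assume the stored and freshly downloaded adjusted closes are both available as dicts keyed by ISO date strings. The function name and the tolerance are arbitrary choices for illustration, and how the fresh values are downloaded is left out.

```python
def stale_adj_close(stored, fresh, rel_tol=1e-4):
    """Return the dates present in both series whose adjusted close
    differs by more than rel_tol (relative), in chronological order.

    stored, fresh: {"YYYY-MM-DD": adj_close}
    """
    stale = []
    for day in sorted(stored.keys() & fresh.keys()):
        old, new = stored[day], fresh[day]
        if abs(old - new) > rel_tol * max(abs(old), abs(new)):
            stale.append(day)
    return stale
```

If the returned list is non-empty, the simplest remedy is to re-download the full history for that stock instead of patching individual rows.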

I also like to check the downloaded data against some other source (like Google). I do this by writing a unit test that randomly picks a date, downloads the data from Google, and checks it against Yahoo's.
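
A minimal Python sketch of such a spot check is shown below. It assumes both sources have already been loaded into dicts of closing prices keyed by ISO date strings; the download step, the function name, and the 0.5% tolerance are placeholders and not part of any real API.

```python
import random


def spot_check(primary, secondary, n_samples=5, rel_tol=0.005, seed=None):
    """Compare closes on randomly chosen common dates.

    Returns a list of (date, primary_value, secondary_value) tuples for
    the sampled dates where the two sources disagree by more than rel_tol.
    """
    rng = random.Random(seed)
    common = sorted(primary.keys() & secondary.keys())
    picked = rng.sample(common, min(n_samples, len(common)))
    mismatches = []
    for day in sorted(picked):
        a, b = primary[day], secondary[day]
        if abs(a - b) > rel_tol * max(abs(a), abs(b)):
            mismatches.append((day, a, b))
    return mismatches
```

Inside a unit test this reduces to asserting that `spot_check(yahoo, other)` returns an empty list, with a fixed `seed` if the run needs to be reproducible.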

score 0 · Answer 3 · answered Mar 16 '13 at 21:37

What you're describing are sanity checks, and if you really have to do them, that is good grounds to classify the data source as unreliable (or, alternatively, your understanding of the data source is inadequate). Typically, when getting data from Yahoo, you shouldn't need to do this.

Dmitri Nesteruk

And yet this question clearly shows trouble with Yahoo's data. – chrisaycock Mar 17 '13 at 01:12
I partially agree with both of you: I just ran the data-cleaning code on my stock database downloaded from Yahoo (258 stocks, about 10 years each). It didn't detect any of the trivial errors mentioned in my question (e.g. close > high), which supports Dmitri's answer. But I also agree with chrisaycock: there are a couple of posts indicating other serious issues with Yahoo's data (e.g. this one). – Mar 18 '13 at 13:02
