I'm currently working on some MATLAB code that checks a stock database for errors (missing values, wrong values, etc.). After reading this post I came to the conclusion that I'll probably have to write some data-cleaning code in order to get accurate and reliable results when backtesting with this database.

The database has been downloaded from Yahoo Finance and contains the following columns for each stock: Date, Open, High, Low, Close, Volume, AdjClose.

So far the program scans for the following trivial errors:

  • Close > High
  • Close < Low
  • Open > High
  • Open < Low
  • High < Low

The program also checks whether any of the data columns contain NaN or values less than zero.
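For reference, the checks listed above could be sketched roughly as follows. This is Python rather than MATLAB, and the function name `check_bar` and the dict-per-row layout are assumptions made for illustration, not the actual program.

```python
import math

# Column names as listed in the question; Date is not checked here.
NUMERIC_FIELDS = ("Open", "High", "Low", "Close", "Volume", "AdjClose")

def check_bar(bar):
    """Return a list of problems found in one daily bar (a dict of floats)."""
    problems = []
    for name in NUMERIC_FIELDS:
        value = bar[name]
        if math.isnan(value):
            problems.append(f"{name} is NaN")
        elif value < 0:
            problems.append(f"{name} is negative")
    # Comparisons involving NaN are always False, so a missing value
    # never triggers a spurious ordering error below.
    if bar["High"] < bar["Low"]:
        problems.append("High < Low")
    for name in ("Open", "Close"):
        if bar[name] > bar["High"]:
            problems.append(f"{name} > High")
        if bar[name] < bar["Low"]:
            problems.append(f"{name} < Low")
    return problems
```

An empty list means the bar passed every check; anything else names the rule that failed, which makes it easy to log per stock and date.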

What other errors/flaws could I look for in the database?

3 Answers

A few points from my experience:

1 Another filter you should consider is for price = 999 or 999.99, which appears in data from some providers.

2 Another set of checks is to look at a cross-section of, e.g., range = (high-low)/close over all names. Check the smallest and the largest range to see if the values make sense. You can also check the daily % change from one day to the next, and check all of the largest moves for errors in the data. The Flash Crash in the US created huge ranges, but if you see abnormal ranges on other days, check the quality of the data. Also, in September 2008 there are many nonsensical values, even in very liquid products.

3 You have to be careful using Yahoo (and other sources) for companies that change names or go in and out of bankruptcy.
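Points 1 and 2 could be sketched along these lines, in Python. The function names and the `bars_by_symbol` layout (symbol mapped to a list of daily dicts, oldest first) are hypothetical, and the code only ranks candidates for a manual look; it does not decide what counts as an error.

```python
SENTINEL_PRICES = {999.0, 999.99}  # placeholder values mentioned in point 1

def sentinel_hits(bars_by_symbol):
    """Return (symbol, date) pairs whose close equals a sentinel price."""
    return [(symbol, bar["Date"])
            for symbol, bars in bars_by_symbol.items()
            for bar in bars
            if round(bar["Close"], 2) in SENTINEL_PRICES]

def range_extremes(bars_by_symbol, n=5):
    """Return the n smallest and n largest (high-low)/close over all names."""
    ranges = [((bar["High"] - bar["Low"]) / bar["Close"], symbol, bar["Date"])
              for symbol, bars in bars_by_symbol.items()
              for bar in bars
              if bar["Close"] > 0]
    ranges.sort(key=lambda r: r[0])
    return ranges[:n], ranges[-n:]

def largest_moves(bars_by_symbol, n=5):
    """Return the n largest absolute close-to-close % changes over all names."""
    moves = []
    for symbol, bars in bars_by_symbol.items():
        for prev, cur in zip(bars, bars[1:]):
            if prev["Close"] > 0:
                change = cur["Close"] / prev["Close"] - 1.0
                moves.append((abs(change), symbol, cur["Date"]))
    moves.sort(key=lambda m: m[0], reverse=True)
    return moves[:n]
```

Ranking rather than thresholding keeps the judgment with the reader: a huge range on a known crash day is real, while the same range on a quiet day is worth checking against another source.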

onlyvix.blogspot.com

The adjusted close changes after dividends and stock splits, so the old data has to be replaced by the new. It is therefore usually a good idea to check the adjusted close of previously downloaded values against the current values.

I also like to check the downloaded data against some other source (like Google). I do this by writing a unit test that randomly picks a date, downloads the data from Google, and checks it against Yahoo's.
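A minimal sketch of that adjusted-close check, assuming both the stored series and a fresh download have already been loaded into dicts mapping date strings to AdjClose. The name `stale_adj_close` and the tolerance are illustrative choices.

```python
import math

def stale_adj_close(stored, fresh, rel_tol=1e-4):
    """Return the dates on which the stored adjusted close no longer
    matches a fresh download (e.g. after a split or dividend)."""
    stale = []
    # Only dates present in both series can be compared.
    for date in sorted(stored.keys() & fresh.keys()):
        if not math.isclose(stored[date], fresh[date], rel_tol=rel_tol):
            stale.append(date)
    return stale
```

If the result is non-empty, the simplest remedy is usually to re-download the whole history for that stock rather than patch individual rows.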
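The random spot check might look something like this in Python. Downloading is left out: `primary` and `secondary` are assumed to be dicts of date string to close price already fetched from the two sources, and `spot_check` is a made-up name.

```python
import math
import random

def spot_check(primary, secondary, n_samples=5, rel_tol=0.005, seed=None):
    """Pick random dates present in both sources and return those
    where the two closes disagree by more than rel_tol."""
    common = sorted(primary.keys() & secondary.keys())
    rng = random.Random(seed)  # pass a seed to make a test run repeatable
    sample = rng.sample(common, min(n_samples, len(common)))
    return sorted(date for date in sample
                  if not math.isclose(primary[date], secondary[date],
                                      rel_tol=rel_tol))
```

A small relative tolerance is used instead of exact equality because two providers can round prices differently.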

chrisaycock
nitin

What you're talking about are sanity checks, and if you really have to do those, that is good grounds to classify the data source as unreliable (or, alternatively, your understanding of said data source is inadequate). Typically, when getting data from Yahoo, you shouldn't need to do this.

Dmitri Nesteruk
  • And yet this question clearly shows trouble with Yahoo's data. – chrisaycock Mar 17 '13 at 01:12
  • I partially agree with both of you. I just ran the data-cleaning code on my stock database downloaded from Yahoo (258 stocks, about 10 years each). It didn't detect any of the trivial errors mentioned in my question (e.g. close > high), which supports Dmitri's answer. But I also agree with chrisaycock: there are a couple of posts indicating other serious issues with Yahoo's data (e.g. this one). –  Mar 18 '13 at 13:02