
There are quite a few discussions here about storage, but I can't quite find what I'm looking for.

I need to design a database to store (mostly) option data (strikes, bid/ask premiums, etc.). The problem I see with an RDBMS is that, given the large number of strikes, tables will become enormously long and hence slow to process. While I'm reluctant to use MongoDB or a similar NoSQL solution, for now it seems a very good alternative (quick, flexible, scalable).

  1. There is no need for tick data; it will be hourly and daily closing prices plus whatever other parameters I'd want to add. So it doesn't need to be updated frequently, and write speed is not that important.

  2. The main performance requirement is data mining, stats and research, so it should be as quick (and preferably easy) as possible to pull and aggregate data from it. Think of a 10-year backtest performing ~100 transactions weekly over various types of options, or calculating a volatility swap over some extended period of time. So the quicker, the better.

  3. There is a lot of existing historical data which will be transferred into the database, and it will be updated on a daily basis. I'm not sure exactly how much space it will take, but AFAIK storage should not be a constraint at all.

  4. Support in popular programming languages & packages (C++, Java, Python, R) is highly preferable, but would not be a deal breaker.

Any suggestions?

sashkello
  • Have you seen this earlier question? The topic of tick storage comes up a lot on here. – chrisaycock May 08 '13 at 11:23
  • Yep, I've seen it. That's why I'm differentiating this from similar questions: I don't need tick data and so don't need fast writes. Also, that question seems to be mostly about flat time-series storage, whereas I want a DB designed to fit options data particularly well. – sashkello May 08 '13 at 11:28
  • @sashkello, I recommend you think about your requirements first. Don't get confused by someone who is ecstatic about Redis or SQL or what have you. You want to store data (write speed is not so important), you want to query data fast and flexibly, and you probably want to query it in R as well, since you mentioned you want to profile and analyze said data. You want a solution that can grow dynamically and which is extensible. Read up on what people use to store time series. After that, decide whether any SQL solution actually makes sense here or whether other solutions solve the problem better – Matt Wolf May 09 '13 at 03:53
  • "Enourmously long"? My execution table in my backtest archive had - we just wiped it due to some code issues we found - 2.5 billion rows, and is expected to grow. We collect in db form minute price data (for fast charts) and market profile per hour. I would not call that "a lot of data". Putting in a couple of SSD is cheap - and gives you HUGH read speed. What exactly is the problem you have here? – TomTom Jun 04 '13 at 02:36
  • @sashkello Have you come up with the database schema and model? What did you use, SQL or NoSQL? I have a requirement quite similar to yours. – Arun Raja Oct 24 '14 at 04:31
  • @ArunRaja Ended up with mongodb, since I don't care about write speed, only read speed. So just raw timestamped data with a couple of proper indexes did it for me. – sashkello Oct 28 '14 at 13:13
  • @sashkello I am planning to use a Windows machine, so could you please pass me the details of the database design and installation? I would be writing the data once or twice per day. Can you please mail the details to arun1989.vj@hotmail.com – Arun Raja Oct 29 '14 at 07:27
  • Just wondering what you ended up doing and how it worked out for you. I'm in the same place as you were when you asked this question and would really like to hear about your experience. – Cuedrah Apr 29 '16 at 21:26
  • @Cuedrah I ended up with mongodb. The big benefit for me is that it's quite flexible, so for someone not well versed in data management who has to reorganize and reshuffle stuff often (every time I realize I did things wrong), NoSQL is much better in that sense. I use a flat structure with no arrays within documents or anything like that, so there's lots of duplicated information, which is bad for space but very good for quick reads. Once you feel comfortable with aggregate queries, summarizing and grouping data works awesome as well. – sashkello Apr 30 '16 at 04:18
  • @Cuedrah Among the negatives, it's a bit of a steeper learning curve than SQL. Aggregation is not very easy to understand, and indexing is very important and not that straightforward either. One of the things that's hard to get past is that in SQL you just have a table in an easily readable format you can browse through. Not so with Mongo: you first need to build up a library of functions in the programming language of your choice which will display data in an easily readable manner. Once you have that, it's easier to manage... Otherwise, at first it might get quite annoying. – sashkello Apr 30 '16 at 04:25
  • @Cuedrah Retrospectively, I'm happy with my choice and feel like, for my purposes, the pluses outweigh the minuses, but there were a few moments when I seriously thought I should migrate the whole thing... – sashkello Apr 30 '16 at 04:27
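
For illustration, here is a minimal sketch (Python with pymongo) of the flat, timestamped layout sashkello describes in the comments above: one document per option per day, duplicated fields instead of nested arrays, and a compound index to keep reads fast. All collection and field names here (option_eod, underlying, strike, etc.) are assumptions for illustration, not details from the original posts.

```python
# Minimal sketch of the flat, timestamped document layout described above.
# Collection and field names are assumptions for illustration only.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient()                      # local MongoDB instance
coll = client["marketdata"]["option_eod"]   # hypothetical db/collection names

# Compound index covering the typical backtest lookup path:
# "all quotes for this underlying/expiry/strike over a date range".
coll.create_index([("underlying", ASCENDING),
                   ("expiry", ASCENDING),
                   ("strike", ASCENDING),
                   ("quote_date", ASCENDING)])

# One flat document per option per day: duplicated fields, no nested arrays.
coll.insert_one({
    "underlying": "SPY",
    "expiry": datetime(2013, 6, 21),
    "strike": 165.0,
    "type": "call",
    "quote_date": datetime(2013, 5, 8),
    "bid": 1.25,
    "ask": 1.31,
    "volume": 5400,
})

# Aggregation example: average bid/ask spread per quote date for one underlying.
pipeline = [
    {"$match": {"underlying": "SPY", "type": "call"}},
    {"$project": {"quote_date": 1,
                  "spread": {"$subtract": ["$ask", "$bid"]}}},
    {"$group": {"_id": "$quote_date", "avg_spread": {"$avg": "$spread"}}},
    {"$sort": {"_id": 1}},
]
for row in coll.aggregate(pipeline):
    print(row["_id"], row["avg_spread"])
```

The compound index mirrors the typical research lookup (one underlying over a date range), which is what makes the duplicated flat layout cheap to read despite being wasteful on disk.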

5 Answers


I recommend you optimize your SQL implementation instead of going for NoSQL and throwing more expensive hardware at the problem.

  1. Always benchmark first. I say this because I've seen MS SQL Server scale perfectly fine for options data of the magnitude you're describing, and "given the large number of strikes, tables will become enormously long and hence slow to process" is an assumption, not a measurement. (A minimal schema to benchmark against is sketched after this list.)
  2. Redis is a very bad idea for what you're trying to accomplish. From what I can see in the other post, all it has going for it is that it has R bindings. But quite frankly, almost everything has multi-language bindings nowadays, so that's hardly a selling point. Redis is designed to trade off consistency and durability for speed. Mongo is similar (it's not that there is no durability; you'd look to WALs to recover, which is rather sketchy, but that's a topic for another time). To put this in perspective:

    • This trade-off becomes a necessary evil if you're doing FB social-ad metrics, logging 30 million events per second in real time. But if you're logging 2000 options * 50 records per hour ≈ 28 records per second in batch, you don't need those trade-offs. The risks are asymmetric: if you lose your market data, you have to find a vendor, pay, and spend time adapting the backfill to your own storage format. If FB misses a few clicks in its user statistics to meet latency requirements, everything still moves along smoothly. To work around this, you'd have to set up persistence servers, and they should be physically separated (e.g. you'd put them in NY2/NY4) in case of a localized failure. That adds up to be much costlier than mirroring your disks.

    • You need a lot of memory. This is less of a problem if you're hosting everything in the cloud (but that comes with other issues, and chances are you aren't). With 16 cores and 244 GB of memory and a Redis slave per core, you're down to about 15 GB of memory per instance. See: https://moot.it/blog/technology/redis-as-primary-datastore-wtf.html

    • NOTE: The problems stated above are distinct from the management concerns that keep NoSQL adoption limited in established firms; the latter are usually misguided.
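
To make point (1) concrete, here is a minimal, hedged sketch of the kind of indexed end-of-day schema worth benchmarking before writing off an RDBMS. The table layout and all names are assumptions, and sqlite3 merely stands in for whatever engine you would actually use (MS SQL Server, PostgreSQL, etc.):

```python
# Hedged sketch: a conventional end-of-day options table with a covering index.
# Schema and names are illustrative assumptions, not a prescribed design.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for your production RDBMS
conn.executescript("""
CREATE TABLE option_eod (
    underlying  TEXT    NOT NULL,
    expiry      DATE    NOT NULL,
    strike      REAL    NOT NULL,
    opt_type    TEXT    NOT NULL,   -- 'C' or 'P'
    quote_date  DATE    NOT NULL,
    bid         REAL,
    ask         REAL,
    volume      INTEGER,
    PRIMARY KEY (underlying, expiry, strike, opt_type, quote_date)
);
-- Index matching the backtest access pattern: all strikes for a
-- given underlying over a date range.
CREATE INDEX idx_quote_date ON option_eod (underlying, quote_date);
""")

conn.execute(
    "INSERT INTO option_eod VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("SPY", "2013-06-21", 165.0, "C", "2013-05-08", 1.25, 1.31, 5400),
)

# Typical research query: average bid/ask spread per day for one underlying.
for quote_date, avg_spread in conn.execute("""
    SELECT quote_date, AVG(ask - bid)
    FROM option_eod
    WHERE underlying = ?
    GROUP BY quote_date
    ORDER BY quote_date
""", ("SPY",)):
    print(quote_date, avg_spread)
```

Run your actual backtest queries against a realistically sized version of such a table first; at ~28 writes per second and a read-heavy research workload, a well-indexed RDBMS is unlikely to be the bottleneck.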

madilyn
  • AFAIK, FB uses SQL. The problem, as I mentioned, is not write speed (what you mean by latency, I reckon) but read speed, which I think is slower for SQL. Also, what's the actual difference in memory usage? – sashkello May 08 '13 at 23:08
  • @Freddy: Thanks, Freddy. Please read point (2). It's expensive to colocate your persistence server, and the trade-off is unnecessary for the OP since his latency/bandwidth objectives are not just smaller by several orders of magnitude, they're smaller by astronomical orders of magnitude (28 per second vs 30,000,000 per second). – madilyn May 09 '13 at 04:57
  • @kristine, as I pointed out, I cited nobody. I derive my conclusions and recommendation from the many implementations I have seen, that have been presented to me, and that I have worked with. Do a simple Google search, and if you still believe professionals store any sort of time-series data, whether options chain data, tick data, or other time-compressed data, in SQL tables, then all the power to you. I would never recommend anyone touch SQL for time-series data. I respect your answer (though I disagree) and hope you can also pay respect to others who put in time to help. – Matt Wolf May 09 '13 at 06:16
  • @Freddy, kristine: Please try to keep any discussion in the comments brief and to the point. Irrelevant comments deleted. – olaker May 09 '13 at 09:21