
There are quite a few discussions here about storage, but I can't quite find what I'm looking for.

I need to design a database to store (mostly) option data (strikes, bid/ask premiums, etc.). The problem I see with an RDBMS is that, given the large number of strikes, tables will become enormously long and hence slow to process. While I'm reluctant to use MongoDB or a similar NoSQL solution, for now it seems a very good alternative (quick, flexible, scalable).

  1. There is no need for tick data; it will be hourly and daily closing prices plus whatever other parameters I'd want to add. So it doesn't need to be updated frequently, and write speed is not that important.

  2. The main performance requirement is data mining, stats and research, so it should be as quick (and preferably easy) as possible to pull and aggregate data from it. Think of a 10-year backtest performing ~100 transactions weekly over various types of options, or calculating a volatility swap over some extended period of time. So the quicker, the better.

  3. There is a lot of existing historical data which will be transferred into the database, and it will be updated on a daily basis. I'm not sure exactly how much space it will take, but AFAIK storage should not be a constraint at all.

  4. Support in popular programming languages & packages (C++, Java, Python, R) is highly preferable, but would not be a deal breaker.

Any suggestions?

sashkello
  • Have you seen this earlier question? The topic of tick storage comes up a lot on here. – chrisaycock May 08 '13 at 11:23
  • Yep, I've seen it. That's why I'm differentiating this from similar questions: I don't need tick data and so don't need fast writes. Also, that question seems to be mostly about flat time-series storage, whereas I want a DB designed to fit options data particularly well. – sashkello May 08 '13 at 11:28
  • @sashkello, I recommend you think about your requirements first. Don't get confused by someone who is ecstatic about Redis or SQL or what have you. You want to store data (write speed is not so important), you want to query data fast and flexibly, and you probably want to query it in R as well, since you mentioned you want to profile and analyze said data. You want a solution that can grow dynamically and which is extensible. Read up on what people use to store time series. After that, decide whether any SQL solution actually makes sense here or whether other solutions solve the problem better – Matt Wolf May 09 '13 at 03:53
  • "Enourmously long"? My execution table in my backtest archive had - we just wiped it due to some code issues we found - 2.5 billion rows, and is expected to grow. We collect in db form minute price data (for fast charts) and market profile per hour. I would not call that "a lot of data". Putting in a couple of SSD is cheap - and gives you HUGH read speed. What exactly is the problem you have here? – TomTom Jun 04 '13 at 02:36
  • @sashkello Have you come up with the database schema and model? What did you use, SQL or NoSQL? I have a requirement quite similar to yours. – Arun Raja Oct 24 '14 at 04:31
  • @ArunRaja Ended up with mongodb, since I don't care about write speed, only read speed. So just raw timestamped data with a couple of proper indexes did it for me. – sashkello Oct 28 '14 at 13:13
  • @sashkello I am planning to use a Windows machine, so could you please pass me the details of the database design and installation? I would be writing the data once or twice per day. Can you please mail the details to arun1989.vj@hotmail.com – Arun Raja Oct 29 '14 at 07:27
  • Just wondering what you ended up doing and how it worked out for you. I'm in the same place as you were when you asked this question and would really like to hear about your experience. – Cuedrah Apr 29 '16 at 21:26
  • @Cuedrah I ended up with mongodb. The big benefit for me is that it's quite flexible, so for someone not well versed in data management who has to reorganize and reshuffle stuff often (every time I realize I did things wrong), NoSQL is much better in that sense. I use a flat structure with no arrays within documents or anything like that, so there's lots of duplicated information, which is bad for space but very good for quick reads. Once you feel comfortable with aggregate queries, summarizing and grouping data works awesome as well. – sashkello Apr 30 '16 at 04:18
  • @Cuedrah Among the negatives, it's a bit of a steeper learning curve than SQL. Aggregation is not very easy to understand, and indexing is very important and not that straightforward either. One of the things that's hard to get past is that in SQL you just have a table in an easily readable format you can browse through. Not so with Mongo: you first need to build up a library of functions in the programming language of your choice which will display data in an easily readable manner. Once you have that, it's easier to manage... Otherwise, at first it might get quite annoying. – sashkello Apr 30 '16 at 04:25
  • @Cuedrah Retrospectively, I'm happy with my choice and feel like, for my purposes, the pluses outweigh the minuses, but there were a few moments when I seriously thought I should migrate the whole thing... – sashkello Apr 30 '16 at 04:27
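
For illustration, here is a minimal sketch (Python with pymongo) of the flat, timestamped layout sashkello describes in the comments above: one document per option per day, duplicated fields instead of nested arrays, and a compound index to keep reads fast. All collection and field names here (option_eod, underlying, strike, etc.) are assumptions for illustration, not details from the original posts.

```python
# Minimal sketch of the flat, timestamped document layout described above.
# Collection and field names are assumptions for illustration only.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient()                      # local MongoDB instance
coll = client["marketdata"]["option_eod"]   # hypothetical db/collection names

# Compound index covering the typical backtest lookup path:
# "all quotes for this underlying/expiry/strike over a date range".
coll.create_index([("underlying", ASCENDING),
                   ("expiry", ASCENDING),
                   ("strike", ASCENDING),
                   ("quote_date", ASCENDING)])

# One flat document per option per day: duplicated fields, no nested arrays.
coll.insert_one({
    "underlying": "SPY",
    "expiry": datetime(2013, 6, 21),
    "strike": 165.0,
    "type": "call",
    "quote_date": datetime(2013, 5, 8),
    "bid": 1.25,
    "ask": 1.31,
    "volume": 5400,
})

# Aggregation example: average bid/ask spread per quote date for one underlying.
pipeline = [
    {"$match": {"underlying": "SPY", "type": "call"}},
    {"$project": {"quote_date": 1,
                  "spread": {"$subtract": ["$ask", "$bid"]}}},
    {"$group": {"_id": "$quote_date", "avg_spread": {"$avg": "$spread"}}},
    {"$sort": {"_id": 1}},
]
for row in coll.aggregate(pipeline):
    print(row["_id"], row["avg_spread"])
```

The compound index mirrors the typical research lookup (one underlying over a date range), which is what makes the duplicated flat layout cheap to read despite being wasteful on disk.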

5 Answers


I recommend you optimize your SQL implementation instead of going for NoSQL and throwing more expensive hardware at the problem.

  1. Always benchmark first. I say this because I've seen MS SQL Server scale perfectly fine for options data of the magnitude you're describing, and "given the large number of strikes, tables will become enormously long and hence slow to process" is an assumption, not a measurement. (A minimal schema to benchmark against is sketched after this list.)
  2. Redis is a very bad idea for what you're trying to accomplish. From what I can see in the other post, all it has going for it is that it has R bindings. But quite frankly, almost everything has multi-language bindings nowadays, so that's hardly a selling point. Redis is designed to trade off consistency and durability for speed. Mongo is similar (it's not that there is no durability; you'd look to WALs to recover, which is rather sketchy, but that's a topic for another time). To put this in perspective:

    • This trade-off becomes a necessary evil if you're doing FB social-ad metrics, logging 30 million events per second in real time. But if you're logging 2000 options * 50 records per hour ≈ 28 records per second in batch, you don't need those trade-offs. The risks are asymmetric: if you lose your market data, you have to find a vendor, pay, and spend time adapting the backfill to your own storage format. If FB misses a few clicks in its user statistics to meet latency requirements, everything still moves along smoothly. To work around this, you'd have to set up persistence servers, and they should be physically separated (e.g. you'd put them in NY2/NY4) in case of a localized failure. That adds up to be much costlier than mirroring your disks.

    • You need a lot of memory. This is less of a problem if you're hosting everything in the cloud (but that comes with other issues, and chances are you aren't). With 16 cores and 244 GB of memory and a Redis slave per core, you're down to about 15 GB of memory per instance. See: https://moot.it/blog/technology/redis-as-primary-datastore-wtf.html

    • NOTE: The problems stated above are distinct from the management concerns that keep NoSQL adoption limited in established firms; the latter are usually misguided.
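
To make point (1) concrete, here is a minimal, hedged sketch of the kind of indexed end-of-day schema worth benchmarking before writing off an RDBMS. The table layout and all names are assumptions, and sqlite3 merely stands in for whatever engine you would actually use (MS SQL Server, PostgreSQL, etc.):

```python
# Hedged sketch: a conventional end-of-day options table with a covering index.
# Schema and names are illustrative assumptions, not a prescribed design.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for your production RDBMS
conn.executescript("""
CREATE TABLE option_eod (
    underlying  TEXT    NOT NULL,
    expiry      DATE    NOT NULL,
    strike      REAL    NOT NULL,
    opt_type    TEXT    NOT NULL,   -- 'C' or 'P'
    quote_date  DATE    NOT NULL,
    bid         REAL,
    ask         REAL,
    volume      INTEGER,
    PRIMARY KEY (underlying, expiry, strike, opt_type, quote_date)
);
-- Index matching the backtest access pattern: all strikes for a
-- given underlying over a date range.
CREATE INDEX idx_quote_date ON option_eod (underlying, quote_date);
""")

conn.execute(
    "INSERT INTO option_eod VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("SPY", "2013-06-21", 165.0, "C", "2013-05-08", 1.25, 1.31, 5400),
)

# Typical research query: average bid/ask spread per day for one underlying.
for quote_date, avg_spread in conn.execute("""
    SELECT quote_date, AVG(ask - bid)
    FROM option_eod
    WHERE underlying = ?
    GROUP BY quote_date
    ORDER BY quote_date
""", ("SPY",)):
    print(quote_date, avg_spread)
```

Run your actual backtest queries against a realistically sized version of such a table first; at ~28 writes per second and a read-heavy research workload, a well-indexed RDBMS is unlikely to be the bottleneck.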

madilyn
  • AFAIK, FB uses SQL. The problem, as I mentioned, is not write speed (what you mean by latency, I reckon) but read speed, which I think is slower for SQL. Also, what's the actual difference in memory usage? – sashkello May 08 '13 at 23:08
  • @Freddy: Thanks, Freddy. Please read point (2). It's expensive to colocate your persistence server, and the trade-off is unnecessary for the OP since his latency/bandwidth objectives are not just smaller by several orders of magnitude, they're smaller by astronomical orders of magnitude (28 per second vs 30,000,000 per second). – madilyn May 09 '13 at 04:57
  • @kristine, as I pointed out, I cited nobody. I derive my conclusions and recommendation from the many implementations I have seen, that have been presented to me, and that I have worked with. Do a simple Google search, and if you still believe professionals store any sort of time-series data, whether options chain data, tick data, or other time-compressed data, in SQL tables, then all the power to you. I would never recommend anyone touch SQL for time-series data. I respect your answer (though I disagree) and hope you can also pay respect to others who put in time to help. – Matt Wolf May 09 '13 at 06:16
  • @Freddy, kristine: Please try to keep any discussion in the comments brief and to the point. Irrelevant comments deleted. – olaker May 09 '13 at 09:21