13

I often hear that column-oriented databases are the best choice method for storing time series data in finance applications. Especially by people selling expensive column-oriented databases.

Yet, at first glance it seems a poor choice. You want to append new ticks, or new bars, at the end (and you need to do this a lot and quickly). That is a classic row operation: you append to one file. In a column DB you have to update three files for a tick (timestamp/price/trade size), or five to six for a bar (datestamp, open, high, low, close, volume). (I said 5-6, as for regularly spaced bar data I suppose datestamp could be implicit from row number.)

For reading I don't normally want to just grab one column; I want to grab the whole bar so I can draw a candlestick (for instance). OK, I may just want the close column, or just want the volume column (but I still need two reads to also get the datestamps in a column-oriented DB, don't I?).

But what seems even more important is that when I want to read historical data I generally want to grab a sub-period, and that will be stored contiguously in the row-oriented DB.

Q1: Is there any good reason to go with column-oriented over row-oriented if all you store is trade ticks?

Q2: Is there any good reason to go with column-oriented over row-oriented if all you store is OHLCV bars?

Q3: If you think no for Q1 and Q2 what kind of columns do you need to have for column-oriented DBs to be the clearly superior choice?

UPDATE

Thanks to Chris Aycock for links to similar questions. Some of the reasoning why column-oriented DBs are better is still not making sense to me, but from the first part of https://quant.stackexchange.com/a/949/1587 I think people may be using row-oriented DBs differently. So, for the purposes of this question, please assume I have only one symbol per database table (as opposed to one huge table with a 'symbol' column). So, following the example in the above answer, the raw on-disk storage looks like:

09:30:01 | 164.05; 09:30:02 | 164.02; ...
Darren Cook
  • 1,427
  • 1
  • 17
  • 26
  • This question gets ask a lot on here [ 1, 2 ]. – chrisaycock Jul 09 '12 at 11:04
  • Thanks @chrisaycock I had read one of those in my hunt yesterday, but the first link I'd missed and it was very informative (I'm still working through the linked 85-page PDF, but that looks useful too). I've added more information to my question to explain why I don't feel the linked-to answers fully answer my question. – Darren Cook Jul 10 '12 at 00:42
  • As to Q2: For a row database, I don't see any other viable option than one huge table with Symbol, Date, O, H, L, C, V, as columns. You'll need two indices built into this table: say a primary key index ordered first by Symbol and then by Date, and another index ordered first by Date and then by Symbol. Essentially, at some level, a column-oriented database will have to do something like this internally to make its operations reasonably efficient. But if you're not willing to have (or your row database can't handle) one huge table with a Symbol column, then you do need a column database. – JL344 Aug 04 '12 at 18:17
  • @JL344 You didn't mention the row-oriented approach of one table per symbol (see the UPDATE in my question). Is there a reason people are not using that approach? – Darren Cook Aug 05 '12 at 02:47
  • The approach of one table per symbol in essence is the column-oriented approach. Row-oriented databases just aren't designed to handle a huge number of tables as effectively as a huge number of rows in one table. They are built on the assumption that different tables in a database hold fundamentally different types of data, so any relations among different tables are ad hoc, and have to be joined up at query time. (The primary key is the row number, by the way. Better to think about what fundamentally makes the row unique rather than just assign an arbitrary autoincremented integer.) – JL344 Aug 05 '12 at 05:56
  • @JL344 One table per symbol of OHLCV data in a row-oriented DB is very different from the column-oriented DB. In the former they are are stored in row order; in the latter they are stored in column order. In the former appending a new row requires appending, say, 44 bytes, to a single file; in the later it requires appending 8-12 bytes to each of 5 separate disk files. That is their advantage; the gist of Q1 and Q2 is for this usage scenario do column-oriented DBs bring any other advantage? – Darren Cook Aug 05 '12 at 06:17
  • @JL344 Re the "Row-oriented databases just aren't designed to handle a huge number of tables as effectively as a huge number of rows in one table." comment, it seems this is a good point. MySQL uses one file per table, so it should be fairly unlimited. However see (http://serverfault.com/q/83438/87322) where meta queries can become notably slow once you have thousands of tables, apparently. (This is a MySQL bug that is fixable, rather than a flaw with the row-oriented concept.) – Darren Cook Aug 05 '12 at 06:28
  • One file per table is not unlimited. It's taxing to the operating system to keep track of so many open files, so in effect, you've just offloaded the task of indexing all these independent tables from the database to the operating system's file descriptor table. The row-oriented concept is not flawed; you're just not taking full advantage of it when you make all these separate tables. – JL344 Aug 05 '12 at 06:49

2 Answers2

1

For Q1 and Q2 I would suggest you should not use a columnar database. The reasons are as follows:

  1. A typical write-access for your data-type would need to update several symbols with both timestamp and price together in different tables. Due to the high cardinality of your data (low no. of duplicates), columnar compression techniques would not be able to provide the promised speed benefits.
  2. Consider whether you would need joins on these large tables when reading them later, because columnar databases don't perform well on joins.
  3. For a time-series database with one symbol per table, I would recommend using a traditional RDBMS whose layout and queries have been fine-tuned considering the machine available and the data size/growth estimates. A properly partitioned conventional modern RDBMS would work well. Indexes may (or may not!) speed up read times, but will definitely slow down writes.

Answer to Q3: Columnar databases are good for low cardinality data, for example status flags - Y/N, male/female, address fields such as state/country, etc. with mostly repeated values throughout the column. A simplistic understanding would be that they break down a table by its fields and record its unique values into a dictionary, the column is then stored as an array of indexes to the dictionary, this allows for high compression and higher speeds as the amount of data retrieved/manipulated is reduced. Actual implementations use many other optimizations such as cache-aware sorting, etc. But the overhead makes writes much slower than modern conventional RDBMS. Columnar databases are specialized software and show great performance only for specific cases, whereas modern RDBMS can be customized and fine-tuned to many different use cases and provide much better help and support to accomplish this.

I've had very good results with fast writes using Oracle and PostgresQL; and materialized views for fast reads/reporting/analytics. For high performance applications, I've benefited greatly by advice from seasoned, experienced DBAs; I would strongly recommend investing in them instead of buying a shiny new columnar database which a consultant recommended.

-3

Like everything, which solution is most suitable completely depends on your specific case. But first I think you confuse couple concepts here. One thing is how fast a DB can retrieve data/read. Another is storing raw data. And an entirely different issue is analytics, queries. Columnar databases shine at reading and writing raw time series based data. Col DBs are not good at performing analytics. Keep in mind that even KDB itself does not shine at aggregating data, KDB itself is just a smart file system with index structures. Its the built-in query language that adds a lot of firepower in terms of query capabilities. Please keep this in mind.

1) Yes, think about how you generally read data. Think about Key/Value, which is essentially what columnar databases are all about (Edit: There is a very close connection, they are not identical). You want to retrieve a specific point in time or a time frame and its associated values. Columnar dbs are very fast at handling such requests. Once such data is in memory it can be operated upon much faster. 2) Same here: Essentially you want to read bars in the same way than raw ticks or any other time series for that matter. You want to acquire bars from Monday 9am to Tuesday 2pm. Whats the difference here? You store each value in its own column. 3) You mean if I answered "yes" to Q1 and/or Q2? Columns are symbol or symbol + open or whatever you chose. Keys are date/time/ticks...

Remember what I said first: Your use case is all that matters. If you constantly need to get prices/bars/... of many different symbols at a specific time point then a row-based database cannot be beat (well given you setup the schema in an intelligent way within an RDBMS). But if you pull out data over time of a single metric (or 4 metrics such as o/h/l/c of bars) then a columnar database is way faster than RDBMS. Why? Because I/O is the most expensive operation and having to only read the columns, needed, is way faster than having to read whole rows. Keep in mind your assertion that each column is stored in a different file is incorrect.

I would read the very same Wiki article you linked to because it answers most of your own question. Also, look at some open source structured, non-SQL, columnar databases to get started on the concepts.

But if you ask me to summarize my points in one sentence then here goes: Columnar databases are optimized for read-operations of time series like data, while row-based databases are more optimized for write operations.

Edit:

For clarification purposes, what I meant with "Think about Key/Value, which is essentially what columnar databases are all about" is the following:

I used the term "key-value" because its essentially the simplest No-SQL data storage approach. The point being is that one cannot run queries on values, cannot aggregate values or search by values such as one could in a purely RDBMS through schemata and indexes. This I think (and I am not alone here) is what sets RDBMS apart from "No-SQL" solutions. My point was that once this concept is understood that No-SQL databases are generally schema-less, lack tables (generally not always), and that, and here is the key similarity between key value and columnar dbs, queries are limited to just by keys, so that the DB knows exactly what node a query can run on. Please note that I am making the comparison looking at things from above 30,000 feet, not a detailed key-value store vs. columnar DB comparison. I just believe that once one understands the concept of key-value and the way key-values are queried then I find it much easier to understand columnar database concepts, EVEN THOUGH on the surface columnar databases look very similar to RDBMS which could not be any further from the truth.

Matt Wolf
  • 14,434
  • 3
  • 27
  • 56
  • Thanks for the reply Freddy. When you say column DBs are all about Key/Value, what is the key? When I think tick/bar data, the key is a datestamp, but that is a row-oriented concept, isn't it? – Darren Cook Jul 09 '12 at 05:38
  • Another question: I usually think in terms of one table per symbol (instrument/contract). Reading between the lines of your answer, when you have a column-oriented DB do you keep all symbols in one table? – Darren Cook Jul 09 '12 at 05:41
  • 4
    -1 Column oriented DBs are not essentially key-value. They do not shine at writing data but they are good at performing analytics. KDB definitely does shine at aggregating data (that's its primary use case). – chrisaycock Jul 09 '12 at 11:12
  • @chrisaycock By "aggregate" do you mean things like turning ticks into 1m bars, 1m bars into hourly bars, etc.? And/or do you mean making moving averages and other more complicated indicators? And/or something else? – Darren Cook Jul 10 '12 at 00:45
  • @chrisaycock I don't know much about column-oriented databases so can't tell who's right, and even though I suspect you are right, I think the down-vote is out of place. We should give people a chance to correct or explain themselves before downvoting. – Tal Fishman Jul 10 '12 at 14:08
  • @TalFishman Have you even seen Freddy's own downvote history here? – chrisaycock Jul 10 '12 at 14:11
  • @chrisaycock fair point. Freddy: If you find Chris's downvote unfair, please consider your own voting patterns. If you find information that you think may be useful and is correct, you should up-vote it. – Tal Fishman Jul 10 '12 at 14:14
  • I edited my post to clarify what I was trying to say. I still disagree with chrisaycock in his downvoting approach which differs fundamentally from how I downvote (please just look at the number downvotes others shared with my decisions in previous threads, plus I only downvoted after asking for clarification which I think should generally be going practice) – Matt Wolf Jul 11 '12 at 02:11
  • 1
    @Freddy Thanks for the update, sorry I only just saw it. It sounds like you are confusing NoSQL DBs (which, to me, means MongoDB, CouchDB, Cassandra, Redis, etc.) with column-oriented databases? Or do you regard them all as NoSQL solutions? – Darren Cook Aug 03 '12 at 13:32
  • @BlackMamba, examples please... – Matt Wolf Mar 24 '16 at 14:37
  • https://msdn.microsoft.com/en-us/library/dn935005.aspx "batch mode execution" section, table row "compute scalar" – Svisstack Mar 24 '16 at 15:41
  • previous comment was unclear - I mean columnstore indexes and optional partitions for better i/o perf – Svisstack Mar 24 '16 at 15:47
  • @BlackMamba, well that specific comment of mine was targeted at traditional row-based relational databases. Obviously a "columnstore index" does not pertain to a row-based relational database. In that I do not see how your comment applies and to what it specifically is supposed to apply. But admittedly this answer of mine was probably one of my weakest answers out of the other 256 to date. Am happy to be criticized as long as it logically applies which your comment imho does not. – Matt Wolf Mar 26 '16 at 16:24
  • @MattWolf Obviously a "columnstore index" does PERTAIN to a row-based relational database. Only available in MSSQL and Oracle, but can't be ignored. – Svisstack Mar 28 '16 at 08:28
  • ...well given that column store indexes were hardly around the time this question was posed your "sorry but it looks like you dont know much about databases" comment seems quite judgmental especially in light of the fact that you formed your judgement based on a few comments I made in one answer. But hey, thanks for the link and information so everyone can dig deeper. I hold to my claim that time series based data should never be handled by any SQL based solution. There are right tools for the trade and SQL is by wide agreement not the right tool to handle time series based data. – Matt Wolf Mar 30 '16 at 08:18
  • @MattWolf I deleted that comment for you. I see nothing wrong with SQL as a language. – Svisstack Apr 16 '16 at 00:24
  • What dbms specific you have in mind? – Svisstack Apr 16 '16 at 01:15