
Thanks a lot for your discussions on the original post.

Following your suggestions, let me rephrase a bit:

kdb+ is known for its efficiency, and that efficiency comes at a steep price. However, with computational power so cheap these days, there should be a sweet spot where we can get comparable data-manipulation efficiency at a more reasonable cost.

For instance, if a kdb+ license costs a small shop $200K per year (I don't know how much it actually costs, do you know?), maybe there is a substitute solution: e.g., we pay $50K to build a decent cluster, store all the data on a network file system, and parallelize all the data queries. This solution might not be as fast or as elegant as kdb+, but it would be much more affordable and, most importantly, you take full control of it.

What do you think? Is there anything like this?
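To make the proposal concrete, here is a rough sketch of what I mean by "split the data by day and parallelize the queries". Everything in it (the file layout, the pickle format, the max-price query) is only an illustration, not a recommendation; a temporary directory stands in for the shared network file system:

```python
# Sketch: one file per trading day on a shared file system, queried in
# parallel. Paths, tick format and the query itself are hypothetical.
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_day(root, day, ticks):
    """Store one day of (time, price) ticks as a single file per day."""
    with open(os.path.join(root, day + ".pkl"), "wb") as f:
        pickle.dump(ticks, f)

def query_day(path):
    """Per-day query: the day's maximum trade price."""
    with open(path, "rb") as f:
        ticks = pickle.load(f)
    return max(price for _, price in ticks)

root = tempfile.mkdtemp()  # stands in for the shared network file system
write_day(root, "2012-04-01", [("09:30:00", 100.0), ("09:30:01", 100.5)])
write_day(root, "2012-04-02", [("09:30:00", 101.0), ("09:30:01", 99.5)])

paths = [os.path.join(root, name) for name in sorted(os.listdir(root))]
with ThreadPoolExecutor() as pool:  # threads are fine for I/O-bound reads
    daily_max = list(pool.map(query_day, paths))
print(daily_max)  # [100.5, 101.0]
```

On a real cluster each worker would run on a different machine against the shared file system, but the shape of the solution is the same: independent per-day files, embarrassingly parallel queries.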

SRKX
Peter Peter
    I'm not sure what you mean. The main trick used by KDB is to store the data in columns instead of rows. This has the advantage that if one column is selected all the data can be read in one long read. This can also be done in Python. – Bob Jansen Apr 01 '12 at 10:52
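The column trick mentioned in the comment above can indeed be sketched in pure Python. In the layout below (file names and data are made up; a real system would use numpy, HDF5 or similar) each column lives in its own binary file, so selecting one column is a single sequential read instead of touching every row:

```python
# Sketch of column-oriented storage with the standard library only:
# each column is stored contiguously in its own file.
import os
import tempfile
from array import array

root = tempfile.mkdtemp()

# Row-wise ticks: (timestamp, price, size) -- illustrative data.
ticks = [(1, 100.0, 200), (2, 100.5, 150), (3, 99.9, 300)]

# Write each column to its own binary file (the "columnar" layout).
cols = {
    "time":  array("q", (t for t, _, _ in ticks)),
    "price": array("d", (p for _, p, _ in ticks)),
    "size":  array("q", (s for _, _, s in ticks)),
}
for name, col in cols.items():
    with open(os.path.join(root, name), "wb") as f:
        col.tofile(f)

# Selecting just the price column is one contiguous read.
prices = array("d")
with open(os.path.join(root, "price"), "rb") as f:
    prices.fromfile(f, len(ticks))
print(list(prices))  # [100.0, 100.5, 99.9]
```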
  • Storing each tick as a separate record/row is neither sensible nor feasible (as you already mentioned). The most common approach is to split tick data by day and store each day's data as plain arrays, either in files or in database LOBs. Such an approach makes the amount of data perfectly manageable.
  • possible duplicate of What is the best data structure/implementation for representing a time series? – wburzyns Apr 01 '12 at 18:50
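The "one day = one plain array" layout from the comment above can be sketched like this (file name and prices are illustrative); memory-mapping the file means a query only touches the pages it actually reads:

```python
# Sketch: a day's prices written as a raw array of doubles, then
# memory-mapped for random access without loading the whole day.
import mmap
import os
import struct
import tempfile
from array import array

path = os.path.join(tempfile.mkdtemp(), "2012-04-01.prices")

# Write the whole day as one plain array (8 bytes per price).
day_prices = array("d", [100.0, 100.5, 99.9, 100.2])
with open(path, "wb") as f:
    day_prices.tofile(f)

# Read only the 3rd price: unpack 8 bytes at offset 2 * 8.
with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    third = struct.unpack_from("d", m, 2 * 8)[0]
print(third)  # 99.9
```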
    The minimal 2-core setup of kdb+ is actually pretty reasonable: $25K per year including maintenance. – user2303 Apr 19 '13 at 01:53
    Arctic could be an option (it's free software, built on top of MongoDB), but I haven't used it, so I cannot give you an opinion on it:

    https://github.com/manahl/arctic

    – Juan Ignacio Gil Jan 24 '18 at 09:46