Just a simple comparison: loading data (ticks or candles, it doesn't matter) from a local DB (using named pipes, the fastest inter-process communication) is 100-200 times slower than reading it from a regular CSV file.
Market data is not suited to relational databases. First of all, there are no relations in it. Pricing data is like streaming video: it belongs in a raw data format (text or binary). The fastest way is to load 1-2 weeks of tick data and keep it all in memory.
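To make that concrete, here is a minimal sketch of the "just read the file into memory" approach, assuming a plain CSV of ticks with timestamp, price and size columns (the file name and column layout are hypothetical, not from the post above):

```python
# Minimal sketch: load ~1 week of tick data from a CSV straight into memory.
# Assumes a file like "EURUSD_ticks.csv" with columns: timestamp,price,size
# (file name and layout are hypothetical examples).
import csv
from datetime import datetime

def load_ticks(path):
    ticks = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ticks.append((
                datetime.fromisoformat(row["timestamp"]),
                float(row["price"]),
                int(row["size"]),
            ))
    return ticks  # plain list of tuples, ready for an in-memory backtest loop

ticks = load_ticks("EURUSD_ticks.csv")
```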
Want to upload it to a cloud service? Pretty easy. We use AWS for iterative backtesting (optimization), allocating ~20 servers at a time with replicated market data. The cheapest data storage is S3, and we replicate it to EC2, as blobs of course. No SQL queries or anything of that sort to kill the performance.
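Here is a rough sketch of what that S3-to-EC2 replication step can look like with boto3; the bucket name, key prefix and local paths are made-up examples, not the actual setup described above:

```python
# Minimal sketch of pulling replicated market-data blobs from S3 onto an EC2 box.
# Bucket name, key prefix, and local paths are hypothetical.
import os
import boto3

s3 = boto3.client("s3")
bucket = "my-market-data"          # hypothetical bucket
prefix = "ticks/EURUSD/2014-08/"   # hypothetical key prefix

os.makedirs("/data/ticks", exist_ok=True)
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        local = os.path.join("/data/ticks", os.path.basename(obj["Key"]))
        s3.download_file(bucket, obj["Key"], local)  # plain blob copy, no SQL involved
```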
A DB has great potential, just not with market data and backtesting.
Sorry for the extreme hypothetical, but I doubt Facebook could run their data collection/storage back-end off of CSV files.
There are more than several builds that individual traders in this thread have alluded to as using DBs (Cassandra, Mongo, etc), and if in context of financial market data, going flat-file CSV is faster, cheaper, and just as easy or easier to scale to some infinite horizon, why would anybody deploy a DB?
I'm going to be pulling unfiltered, streaming tick data for some of the most active futures instruments, such as Crude Oil, E-Mini S&P, etc...the number of rows will be in the tens of millions rather quickly. I get that reading from / writing to a CSV is a lot simpler & faster than a DB table, but when we're talking large-scale setups, seems this argument won't be able to keep up with the reality of the demand.
For backtesting? Actually, I haven't seen any commercial trading app that keeps market data in a DB. NT, MC, SC, TS: they all use their own flat-file formats.
Sure. CSV, like any binary format, is just files. They are not a silver bullet. I'm sure there are hundreds and thousands of cases where a DB works better than plain files, even in trading and backtesting. For my cases (running data in a local environment or using cloud services), raw files are the better fit.
My point in this thread is to plant a seed of doubt. The terms "data" and "database" are not equivalent. The right choice depends on many, many details (storage type, number of users, distributed or parallel workloads, read-heavy vs. write-heavy, etc.).
A DBMS and a flat file are not all that different. Many DBMSes are just executables that sit on top of a flat file on your file system. You can always roll your own and achieve comparable functionality, but if you find yourself replicating most of the functionality of a DBMS, then chances are that years of painstaking B-tree optimization etc. are superior to your own solution.
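SQLite is a convenient illustration of that point: the entire database lives in one ordinary file on disk, and you get the painstakingly optimized B-tree indexing without writing it yourself. A minimal sketch, with made-up file and table names:

```python
# SQLite keeps the entire database in one ordinary file ("ticks.db" is a made-up
# name) while providing B-tree indexing for free.
import sqlite3

conn = sqlite3.connect("ticks.db")          # just a flat file on disk
conn.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        ts    TEXT NOT NULL,    -- ISO-8601 timestamp
        price REAL NOT NULL,
        size  INTEGER NOT NULL
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_ticks_ts ON ticks (ts)")  # B-tree index
conn.commit()
conn.close()
```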
The truth is, platform designers for NT, MC, SC, TS etc. are 20 years behind in their design choices because their users don't run into serious use cases. On the other hand, a company like FB has many flexible and unstructured use cases for their data that they have not yet discovered. The principle of 'big data' in a firm like FB is that they're collecting more data than they can churn analytics on for months to come, whereas most retail traders can backtest through their entire collection of data overnight. So you should let neither sway your decision on what to do.
I can name several use cases where a DBMS is superior:
1. Administrative privileges. This isn't just a matter of having multiple people in a company that you're working with. Even in a 1 man shop, it is poor taste for your backtesting application to have the same write privileges as your storing application.
2. Constraints. For example, let's say you're storing EUR/USD data for FXCM and then you decide to trade EUR/USD on Hotspot. Now maybe you want to rename them to EUR/USD.FXCM and EUR/USD.Hotspot and modify all the associated data (tables). It's easy to make this modification with a `CASCADE` operation (see the first sketch after this list), but troublesome if you rolled your own solution with flat files. Or let's say you rolled a software change to the application that is storing your tick data last Thursday at 5 PM ET, and after a backtest today, you realized that all your data since last Thursday 5 PM ET was glitched. Maybe all of it has incorrect rollover offsets. With careful planning in your schema and a few statements in your DBMS's DSL, you can fix this.
3. Indexing and range selection. Let's say you're trying to select data from 3 AM on Aug 15, 2014 to 4:15 PM on Aug 17, 2014 for a backtest. Walking through the array that backs the file system directory structure is cheap because you probably don't have a large number of files; however, at some point you're going to have to walk through the timestamps. It takes O(2 log n) to do both endpoints (see the second sketch after this list), but many CSV parsing modules/libraries are agnostic to sort order and will brainlessly walk O(2n) through this. Because your handrolled solution is almost surely in row-major order, you're paying as much as 2 days of backtesting in I/O cost just to qualify your data for an actual loop of ~3 days of backtesting. Similarly, hash indices make short work of certain useful queries ("fetch me all ticks at support/resistance level"). It's ugly to serialize a B-tree or hash index blob in CSV or your own binary format to compete, and that's code that doesn't bring you joy or money.
4. One-stop shop for redundancy and backup. It's entirely possible to do backup with RAID and a bunch of Bash scripts (rsync etc.) but for many users, it's cleaner to take care of backup entirely at the DBMS application level. Newer DBMSes ensure you're backed up on commit and can fall back on your backup in a hardware-agnostic way. You could have 2 separate RAID0 machines and it would still work.
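To make point 2 concrete, here is a hedged sketch of the venue-rename cascade, using SQLite as a stand-in DBMS (the post above doesn't name one); the table layout, file name and symbol rows are illustrative only:

```python
# Sketch of point 2: rename an instrument once and let ON UPDATE CASCADE propagate
# the change to every referencing row. All names and values are hypothetical.
import sqlite3

conn = sqlite3.connect("marketdata.db")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is set
conn.executescript("""
    CREATE TABLE IF NOT EXISTS symbols (
        symbol TEXT PRIMARY KEY
    );
    CREATE TABLE IF NOT EXISTS ticks (
        symbol TEXT NOT NULL REFERENCES symbols(symbol)
            ON UPDATE CASCADE,          -- renames propagate to every tick row
        ts     TEXT NOT NULL,
        price  REAL NOT NULL
    );
""")
conn.execute("INSERT OR IGNORE INTO symbols VALUES ('EUR/USD')")
conn.execute("INSERT INTO ticks VALUES ('EUR/USD', '2014-08-15T03:00:01', 1.3363)")

# One statement renames the instrument everywhere it is referenced:
conn.execute("UPDATE symbols SET symbol = 'EUR/USD.FXCM' WHERE symbol = 'EUR/USD'")
conn.commit()
```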
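And a sketch of point 3: if the flat files (or the in-memory arrays loaded from them) are kept sorted by timestamp, you can recover the O(log n) endpoint lookup yourself with a binary search instead of the O(n) walk. Everything below, tick values included, is hypothetical illustration:

```python
# Sketch of point 3: binary-search a timestamp-sorted tick series for a backtest
# window instead of scanning every row.
from bisect import bisect_left, bisect_right
from datetime import datetime

# Hypothetical in-memory tick series (datetime, price, size), sorted by timestamp.
ticks = [
    (datetime(2014, 8, 15, 2, 59, 58),  1.3362, 3),
    (datetime(2014, 8, 15, 3, 0, 1),    1.3363, 1),
    (datetime(2014, 8, 17, 16, 14, 59), 1.3391, 2),
    (datetime(2014, 8, 17, 16, 15, 2),  1.3390, 5),
]

timestamps = [t[0] for t in ticks]  # built once, reused for every range query

def select_range(start, end):
    lo = bisect_left(timestamps, start)   # first tick at or after `start`
    hi = bisect_right(timestamps, end)    # one past the last tick at or before `end`
    return ticks[lo:hi]                   # O(log n) endpoints instead of an O(n) walk

window = select_range(datetime(2014, 8, 15, 3, 0), datetime(2014, 8, 17, 16, 15))
```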