Time Series Databases
A Time Series Database – how do I choose?
A Time Series Database (TSDB) is defined simply as a software system that is designed to
handle time series data the best possible way. The bigger question is, what is a time
series? The answer is that it is a series of points of data arranged in their order in time,
usually captured at specific intervals.
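To make that definition concrete, here is a minimal sketch in Python. It is only an illustration: the sensor readings, the start date and the helper names are made up for this example and do not come from any particular TSDB.

```python
from datetime import datetime, timedelta

# Hypothetical sensor readings: one (timestamp, value) pair per minute.
start = datetime(2016, 4, 9, 12, 0, 0)
values = [21.5, 21.7, 21.6, 22.0]
series = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

def is_time_ordered(points):
    """Return True if the timestamps never go backwards."""
    return all(a[0] <= b[0] for a, b in zip(points, points[1:]))

def sampling_intervals(points):
    """Return the set of gaps between consecutive timestamps."""
    return {b[0] - a[0] for a, b in zip(points, points[1:])}
```

A series captured at a fixed interval, like this one, yields a single gap value from sampling_intervals; irregular data would yield several.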
The TSDB, then, collects, analyzes, plots and otherwise decides what to do with the data,
based on the predefined algorithms of the database used. Outputs from such systems fall
under a range of basic descriptive terms, depending generally on the type of data being
collected over time, such as profiles, curves, traces, etc. For example, a time series for
student grades is often referred to as a student grades time series table, while one used
for analyzing learning outcomes in Ireland is called time series clustering. We
often hear terms such as price curve, bell curve, load profile, etc., which all may be used to
describe time series in different areas.
In fact, the same multiplicity of operations is performed when analyzing a variety of these
time series. TSDBs are optimized for these operations, where other systems may not be
practical. TSDBs impose their model on the time series, rather than the other way around.
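One example of such a common operation is downsampling, where raw points are averaged into fixed time windows. The sketch below shows the idea in plain Python. It is an assumption-laden illustration, not how any specific TSDB implements it: the function name downsample_mean and the sample load profile are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def downsample_mean(points, window):
    """Average (timestamp, value) points into fixed-width time windows.

    Returns a sorted list of (window_start, mean_value) pairs.
    """
    epoch = datetime(1970, 1, 1)
    buckets = defaultdict(list)
    for ts, value in points:
        # Floor each timestamp to the start of its window.
        offset = (ts - epoch) // window  # timedelta // timedelta gives an int
        buckets[epoch + offset * window].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]

# Hypothetical load profile: one reading every 20 minutes.
t0 = datetime(2016, 4, 9, 12, 0)
readings = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
load = [(t0 + timedelta(minutes=20 * i), v) for i, v in enumerate(readings)]
hourly = downsample_mean(load, timedelta(hours=1))
```

Here six 20-minute readings collapse into two hourly averages. A purpose-built TSDB typically offers this kind of rollup as a built-in query or retention feature, so you do not have to write it yourself.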
With the above points in mind, what kind of choices do I have for my particular needs?
The short answer is that there are a lot of choices. Well, as of April 9th, 2016, there were...
well... you can count them here! You will notice if you scroll down to the comment section
there that even then, which is ancient history in Internet time, people were shouting out to
include others. Compare that list to this one to see how much has changed in just months.
Therefore, it might be very handy to have a short list of the top time series databases to
choose from for your IoT project for example. Please remember that grading anything is
somewhat subjective. To quote Steven Acreman, who has more than 12 years of experience in
Operations and DevOps, “Databases are a crazy topic and it seems everyone has an
opinion. The trouble is that opinions are like belly buttons. Just because everyone has one
it doesn’t mean they are useful for anything.” Everyone has their own idea of what makes
something better or best, so please decide for yourself what will serve you most efficiently.
One point I would really like to raise right here is this: Of those who have taken the time to
do the comparisons, it has been determined that time series databases built from scratch
are much faster than those sitting on very popular non-purpose-built databases such as
Hadoop, Riak KV or Cassandra. If you have an issue with this analysis, please share it in
the comments! Remember that these are all open source time series databases, which
means, in my opinion, that they are all developed by artists who do it for their love of
programming.
This is a top ten list, but it isn't an ordered list. It would be interesting to get some
comments below to see how readers think they should be ordered. It would be even more
interesting to see if anyone agrees with every entry on this list, and as Spock would say,
“Fascinating,” if two people agree on the order...
InfluxDB scores right up near the very top on several software blogs, making it into the top
ten multiple times.
Druid scores within the top 10 time series databases, again on multiple lists.
Riak TS is again in the top 10 on several sites. Interestingly enough, they advertise
themselves as being engineered to be faster than Cassandra. Isn't that interesting?
Prometheus has been around forever in Internet time, and it still manages to rank right up
there among users. Users say it may need a few tweaks, due to the fact it wasn't
specifically designed as a time series database, but it's still a powerful option.
Graphite, which Prometheus compared itself to (see above), is a top ten entry,
obviously ranking very high in the opinion of the engineers at Prometheus! It also ranks
high on several other very respected sites. The crew at Graphite must have a real sense
of humor. The statement on their website reads, “Graphite does three things: Kick *ss.
Chew bubblegum. Make it easy to store and graph metrics. (And it's all out of
bubblegum.)” It's almost worth choosing them for the fun you could have with their crew!
They also say that it runs just as well on cheap hardware or on the Cloud. Graphite has
been around since 2006, making it almost prehistory. The fact that it's still here almost
makes it a contender for the top ten list on that fact, alone.
OpenTSDB presents itself as The Scalable Time Series Database, with the ability to “store
and serve massive amounts of time series data without losing granularity.” Multiple users
who have written about this rank it somewhere in their personal top ten.
Elasticsearch has also landed in the top ten on several software blogs, including, but not
limited to, the Netsil Inc blog, which did a comparison of time series databases against
Druid, which they use. Netsil also gave high marks to Cassandra (see above).
DalmatinerDB gets top marks from me because of the dalmatian they have on their home
page. Seriously though, if you want blazing speed in a reliable database, this one does
deserve to be in the top ten. Plus they like dogs. They might be at or near the top on
everyone's list once they've been around a little longer.
Blueflood is built by the Rackspace engineers. They lovingly call it “a giant distributed
calculator that loves numbers.” Blueflood actually uses Cassandra, among other things,
because of its high write throughput peak of 60,000 points/sec on a single box, as well as
the very reliable support, but uses Elasticsearch as an index. It is billed by some as being
a decent replacement for Graphite.
Cassandra is old and slow, but it's still the standard by which many others measure
themselves or are measured. While it is WAY slower than some of the TSDBs
available now, a LOT of engineers still like using it for comparison purposes. It is also still
used as a starting point for new databases.
Scylla is one to look at. There aren't enough people talking about it, yet, but dang... it
looks good. It's billed as the world's fastest NoSQL database. According to their website
and also everyone I've found who has tried it, it's “fully compatible with Apache Cassandra
at 10x the throughput and jaw dropping low latency.” It's also listed on MISFRAME as a
much faster C++ implementation of Cassandra. Every comment I've read on Scylla has
been positive, and they all say it really is exceptionally fast. It isn't in the top ten on this
list, but it might be soon!
Please leave your comments below, and tell us what your thoughts are on this list.
Jean-Christophe Huc (Jay C)
Follow me on Twitter @cto_software, and visit my blog for more articles:
www.software-development.blog