This document discusses the need for a time series database and introduces OpenTSDB as an option. Some key points:
- Time series data is useful for analyzing metrics and patterns over time but is currently scattered across different databases.
- OpenTSDB is an open source time series database that can store trillions of data points, scale using HBase, and never loses precision.
- It is optimized for write throughput and can handle thousands of data points per second. Reads depend on the cardinality of metrics but it supports time-based queries.
- OpenTSDB uses HBase under the hood and stores tags with metrics to allow for flexible filtering of time series data without affecting performance.
Need for Time series Database
1. Need For Time Series Database
Pramit Choudhary, ML Engineer @eHarmony
2. Motivation
Speed Matters
We want to know what's happening NOW
Users access data through different mobile platforms and have no patience for delays
Data is scattered around
MongoDB, Voldemort, Netezza, Hive, Whisper, and possibly more
For cross-platform analytical work, data is still moved around (a cause for worry)
Need to simplify the database tech stack
Complexity increases as we start tracking more metrics for mobile devices
Data-analytics use cases:
Most of the time we study data patterns over a period of time
e.g. 1. What are the probable times for a user to get matches? => need to start tracking the amount of time the user spends during the day
2. Feature exploration and extraction: what other features could we possibly use? => probably more t/f/z/p statistical tests
3. Re-CAP
Consistency: Data remains consistent after the execution
of an operation, e.g. after an update all clients see the same
state of the data.
Availability: Always on (no downtime)
Partition Tolerance: The system continues to function even
when nodes cannot communicate with one another
4. Different Combinations
CA: Single-site cluster, all nodes are always in contact, e.g.
SQL-type RDBMS
CP: Some data may not be accessible, but the rest is
consistent and accurate, e.g. MongoDB, HBase, Redis
AP: Available under partitioning, but no guarantee of
consistency, e.g. Cassandra, Riak, DynamoDB
5. NoSQL World
• Key-Value Store (Redis, Riak)
• Document Store (MongoDB, Couchbase)
• Column Store (Cassandra, HBase, OpenTSDB)
• Graph Store (Neo4j)
7. What is OpenTSDB?
Open Source Time Series Database
Store trillions of data points
Sucks up all data and keeps going
Never loses precision
Scales using HBase
Note: OpenTSDB is used here as an example; KairosDB or InfluxDB may give better results.
They work on similar principles.
Author: Benoit Sigoure and Chris Larsen
8. Use-Cases
MongoDB and Couchbase: user profiles, product catalogs,
geospatial, financial products, social media, digital
content, gaming, metadata, events, bills and invoices
HBase and Cassandra: structured, semi-structured, and
unstructured data, full table scans, read-intensive
operations, time series interval data, geospatial data
10. What are Time Series?
Time Series: Data points for an identity over time
Typical Identity:
Dotted string: web01.sys.cpu.user.0 ( no concept of filters )
OpenTSDB Identity:
Metric: sys.cpu.user
Tags (name/value pairs): act as filters
host=web01 cpu=0
Author: Benoit Sigoure and Chris Larsen
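To illustrate why tags act as filters, here is a minimal Python sketch. It is not OpenTSDB code: the `Series`, `make_series`, and `select` names are hypothetical and exist only for this example. With a dotted string such as web01.sys.cpu.user.0, picking "all CPUs on web01" needs string matching; with a metric plus tags it is a plain lookup on name/value pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """One time series identity: a metric name plus tag name/value pairs."""
    metric: str
    tags: tuple  # sorted tuple of (name, value) pairs, so the identity is hashable


def make_series(metric, **tags):
    """Build a Series from keyword tags (hypothetical helper)."""
    return Series(metric, tuple(sorted(tags.items())))


def select(series_list, metric, **wanted):
    """Return the series for `metric` whose tags include every wanted pair."""
    return [
        s for s in series_list
        if s.metric == metric
        and all(dict(s.tags).get(name) == value for name, value in wanted.items())
    ]


all_series = [
    make_series("sys.cpu.user", host="web01", cpu="0"),
    make_series("sys.cpu.user", host="web01", cpu="1"),
    make_series("sys.cpu.user", host="web02", cpu="0"),
]

web01 = select(all_series, "sys.cpu.user", host="web01")  # every CPU on web01
cpu0 = select(all_series, "sys.cpu.user", cpu="0")        # CPU 0 on every host
```

The same metric name serves both queries; only the tag filter changes.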
11. What are Time Series?
Data Point:
Metric + Tags
+ Value: 42
+ Timestamp: 1234567890
"sys.cpu.user 1234567890 42 host=web01 cpu=0"
Author: Benoit Sigoure and Chris Larsen
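The data point line above is just four whitespace-separated parts: metric, timestamp, value, and tag pairs. The following Python sketch builds and parses such a line. The function names `format_point` and `parse_point` are hypothetical, and the sketch assumes tag names and values contain no spaces or extra "=" signs.

```python
def format_point(metric, timestamp, value, tags):
    """Render one data point as 'metric timestamp value k1=v1 k2=v2'."""
    tag_text = " ".join(f"{name}={val}" for name, val in tags.items())
    return f"{metric} {timestamp} {value} {tag_text}"


def parse_point(line):
    """Split a data point line back into (metric, timestamp, value, tags)."""
    metric, timestamp, value, *tag_parts = line.split()
    tags = dict(part.split("=", 1) for part in tag_parts)
    return metric, int(timestamp), float(value), tags


line = format_point("sys.cpu.user", 1234567890, 42, {"host": "web01", "cpu": "0"})
point = parse_point(line)
```

OpenTSDB's telnet-style interface accepts the same fields prefixed with the `put` command; check the OpenTSDB documentation for the exact rules on allowed characters.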
14. About TSDs
Write throughput
TSDs are CPU bound
Worst case: can handle 2,000 points/sec on an old 2006 dual-core CPU
Read throughput
Depends on the cardinality of a metric
Depends on the timespan and number of data points retrieved
Reliability
No single point of failure: there is no concept of a master daemon
Dependency: needs HBase with ZooKeeper
Has a single point of failure if running over HDFS, but none with
respect to the database.
More info on the Wiki : http://opentsdb.net/faq.html
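As a rough illustration of why read cost depends on cardinality and timespan, the Python sketch below estimates how many rows a query might touch. It assumes one row per series per hour (the 1hr boundary mentioned in the row key slide) and that every combination of tag values exists, so the result is an upper bound. The function name and the example numbers are hypothetical.

```python
import math


def estimate_rows_scanned(tag_value_counts, timespan_seconds, seconds_per_row=3600):
    """Upper-bound estimate of (series count, rows touched) for one metric.

    tag_value_counts maps each tag name to its number of distinct values.
    """
    series = math.prod(tag_value_counts.values())
    rows_per_series = math.ceil(timespan_seconds / seconds_per_row)
    return series, series * rows_per_series


# Hypothetical metric: 100 hosts x 8 CPUs, queried over 24 hours.
series, rows = estimate_rows_scanned({"host": 100, "cpu": 8}, 24 * 3600)
```

Adding one more tag with many distinct values multiplies both numbers, which is why high-cardinality metrics are slower to read.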
15. Simplistic View of the Table
Without OpenTSDB Hbase Table Representation
Author: Oliver Hankeln
16. OpenTSDB Magic
“Compact columns by concatenation”
Author: Oliver Hankeln
• Tags are put at the end of the row key
• Timestamp is normalized on 1hr boundaries
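The two bullets above can be sketched in Python as follows. This is an illustration under stated assumptions, not OpenTSDB's actual code: the 3-byte IDs, the 4-byte timestamp, the numeric IDs in the example, and the names `split_timestamp`, `uid`, and `row_key` are all assumptions made for the example.

```python
import struct

SECONDS_PER_ROW = 3600  # one row covers one hour


def split_timestamp(ts):
    """Return (hour-aligned base timestamp, offset in seconds within the hour)."""
    base = ts - (ts % SECONDS_PER_ROW)
    return base, ts - base


def uid(n):
    """Encode a numeric ID as 3 big-endian bytes (assumed width)."""
    return n.to_bytes(3, "big")


def row_key(metric_uid, ts, tag_uids):
    """Build metric ID + base timestamp + sorted (tag name ID, tag value ID) pairs."""
    base, _ = split_timestamp(ts)
    key = uid(metric_uid) + struct.pack(">I", base)
    for tagk, tagv in sorted(tag_uids):
        key += uid(tagk) + uid(tagv)  # tags go at the end of the row key
    return key


base, offset = split_timestamp(1234567890)
tags = [(1, 10), (2, 20)]  # hypothetical IDs for host=web01 and cpu=0
key_a = row_key(7, 1234567890, tags)
key_b = row_key(7, 1234567890 + 60, tags)    # same hour, same row
key_c = row_key(7, 1234567890 + 3600, tags)  # next hour, new row
```

Points from the same series and the same hour share one row key; only the offset within the hour differs, which is what lets the columns in a row be compacted together.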
23. Is it being extensively used?
OVH: #3 largest cloud/hosting provider. Monitors
everything, including network performance, resource
utilization, application performance, and customer-facing
metrics
35 servers, 100k writes/s, 25 TB raw data
5-day moving window of HBase snapshots
Redis cache on top for customer-facing data
24. Yahoo: Monitoring application performance and
statistics (15 servers, 280k writes/s)
Arista Networks: High-performance network
monitoring
5k writes/s, uses Varnish for caching
MapR
“OpenTSDB is a widely used database intended to store
and analyze time-series data. Originally designed for
only data center monitoring, poor ingest performance
had limited the expansion of its use. This benchmark
demonstrates a viable option for new applications, such
as IoT and other real-time data-analysis applications,
using OpenTSDB running on MapR.” Ted Dunning, Chief
Application Architect
HBase has a clear advantage in writes: with pre-created regions it showed up to 40K ops/sec. Cassandra also performs well during the loading phase, at around 15K ops/sec. MySQL Cluster can show much higher numbers in “just in-memory” mode.
Deferred log flush does the right job for HBase during mutation operations: edits are first committed to the memstore, and the aggregated edits are then flushed to the HLog asynchronously. Cassandra has great write throughput because writes are first appended to the commit log, which is a fast operation. MongoDB’s latency suffers from its global write lock.
Riak behaves more stably than MongoDB.