Making Hadoop and Cassandra Work Together
2. About Altoros
Software delivery acceleration specialist for big data application implementation
services
200+ employees globally (US, Eastern Europe, UK, Denmark, Norway)
Big data practice areas
Automated device analytics
Advertising analytics
Big data warehouse
Customers
Partners
Implementation Partner
© Altoros Systems, Inc.
4. The Problem: Data is Big
10-20 sensors per house
Ability to support tens of thousands of households
1 sensor ~1.1 MB/day
1,000 Households: 11 GB/day
500,000 Households: ~5.5 TB/day
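The sizing above can be checked with a few lines of Python. This is a rough sketch, not the original estimate: it assumes decimal units (1 GB = 1,000 MB) and a hypothetical helper named daily_volume_gb. At 10 sensors per house it gives 11 GB/day for 1,000 households and about 5.5 TB/day for 500,000; at 20 sensors per house both figures double.

```python
MB_PER_SENSOR_PER_DAY = 1.1  # per-sensor figure taken from the slide


def daily_volume_gb(households, sensors_per_house):
    """Raw daily data volume in GB, assuming 1 GB = 1,000 MB."""
    return households * sensors_per_house * MB_PER_SENSOR_PER_DAY / 1000


for households in (1_000, 500_000):
    low = daily_volume_gb(households, 10)   # low end: 10 sensors per house
    high = daily_volume_gb(households, 20)  # high end: 20 sensors per house
    print(f"{households:>7} households: {low:,.0f} to {high:,.0f} GB/day")
```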
7. The Problem: Performance
MySQL showed slow performance under intensive writes
Target throughput isn’t scalable
Disk performance is a bottleneck
Monitoring with iostat -dmx
Old-fashioned single-threaded batch processing is slow
Make it parallel!
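As a generic illustration of moving from single-threaded to parallel batch processing, the sketch below splits the input into batches and hands them to a worker pool. It is not the pipeline described in this talk: process_batch, split and run_parallel are hypothetical names, and threads are assumed because the work is taken to be I/O-bound (database writes).

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch):
    """Hypothetical per-batch work; here it only sums the readings."""
    return sum(batch)


def split(readings, size):
    """Cut the input into consecutive batches of at most `size` items."""
    return [readings[i:i + size] for i in range(0, len(readings), size)]


def run_parallel(readings, batch_size=1000, workers=4):
    """Process all batches on a thread pool and return per-batch results."""
    batches = split(readings, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the input batches.
        return list(pool.map(process_batch, batches))
```

For CPU-bound work in Python, a process pool would be the more likely choice; the structure of the code stays the same.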
8. Requirements
Highly responsive system with parallel processing
Reliable
– Partial failure is acceptable
– Node and data recoverability
Scalable
– Load capacity
– Max throughput
Total cost of ownership
– Data compression
9. NoSQL Database Requirements
– Fast writes are critical
– Querying by column and range of keys
– Secondary indices
– Good map/reduce compatibility using Apache Hadoop
11. Why Cassandra
– Good overall balance of features, scalability, reliability
– We wanted BigTable-like features: columns, column families
– Well suited for large streams of non-transactional data
– Provides good, consistent write throughput
– Tunable trade-offs for distribution and replication (N, R, W)
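The (N, R, W) trade-off can be made concrete with the usual Dynamo-style rule, assuming N is the number of replicas, R the number of replicas that must answer a read, and W the number that must acknowledge a write. When R + W > N, every read set overlaps every write set, so a read sees the latest acknowledged write. The function names below are hypothetical and the snippet only illustrates the rule; it does not call Cassandra.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def overlaps(n, r, w):
    """True when any r read replicas must share a node with any w write replicas."""
    return r + w > n


N = 3
for r, w in [(1, 1), (1, N), (quorum(N), quorum(N))]:
    kind = "overlapping" if overlaps(N, r, w) else "non-overlapping"
    print(f"N={N} R={r} W={w}: {kind} read/write sets")
```

Lower R and W favor latency and availability; higher values favor consistency. For a write-heavy workload, a small W keeps writes fast at the cost of needing a larger R for overlapping reads.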
12. File system
HDFS
– Is a file system behind our Cassandra implementation
– Data coherency: write-once-read-many access
13. Cassandra: Best Used When…
You write more than you read (logging)
Every component of the system must be in Java
You have, or may later have, complex configuration requirements
14. Cassandra Challenges
High, Unpredictable Write Volume
Varying Schema, Variable Msg Size
2 Types of Series - Data, Lookups
All time-series, even metadata - no supplemental DB
15. No Cassandra Compression?
Built-in Cassandra compression claims to compress across columns with identical names.
All our data columns are timestamped, so no two will ever have identical names.
16. Numbers
“Benchmark” Cassandra node
LZO Compression
17. Lessons Learned
Consider hybrid
RDBMS + NoSQL + Hadoop
Hadoop
Is for offline processing and analysis
Is NOT for random reads and writes of individual records
Cassandra complements Hadoop with querying capabilities