3. „Must Haves“ for Big Data?
What do modern businesses need for big data?
A scalable high-performance database
that is easy to use and
cost effective
Scalable
Performance
Cost Effective
Operational Ease
4. „Must Haves“ for Big Data?
„Modern businesses need to be able to manage large
volumes of realtime data and run analytic and enterprise
search operations on that same data as quickly as possible
to make business decisions.“
Real-Time Databases
Analytic/Search Databases
Data Movement
ETL Process
5. Legacy RDBMS ≠ Big Data
„Big data is comprised of (1) Velocity – how fast the data is coming in;
(2) Variety – all types are now being captured; (3) Volume – TBs to
PBs of data; (4) Complexity – multi-location, data center, etc.“
“Big data technologies describe a new generation of technologies and
architectures, designed to economically extract value from very large
volumes of a wide variety of data, by enabling high-velocity capture,
discovery, and/or analysis.”
“Big data is data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain value from this data,
you must choose an alternative way to process it.”
6. Trends & Challenges in Data Mngt.
Trends:
Exponential Data Growth
Cloud
Semi-Structured Data
More Connected Data
Data store types:
Key Value
Wide Column
Document
Graph
7. Trends & Challenges in Data Mngt.
Trends:
Exponential Data Growth
Cloud
Semi-Structured Data
More Connected Data
Data store types:
Key Value
Wide Column: Apache Cassandra
Document
Graph
8. Apache Cassandra
A massively scalable, decentralized, structured
data store (aka database).
Project history:
9. Cassandra is…
O(1) Distributed Hash Table
Sharding, Replication
Elastic
Fault tolerant
No Single Point of Failure
Durable
Token ring (nodes A to H arranged in a ring):
Node Token
A 0
B 4
C 8
D 12
E 16
F 20
G 24
H 28
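The ring of nodes A to H with tokens 0, 4, ..., 28 on this slide can be sketched as a small lookup table. This is only an illustration, not Cassandra's actual code: the token space of 32, the MD5-based `token_for` helper, the "first node at or after the key's token" ownership rule, and the "next nodes clockwise" replica placement are all assumptions made for the example.

```python
import bisect
import hashlib

# Node -> token assignment copied from the slide.
RING = {"A": 0, "B": 4, "C": 8, "D": 12, "E": 16, "F": 20, "G": 24, "H": 28}
TOKEN_SPACE = 32  # assumption: tokens live in the range 0..31

TOKENS = sorted(RING.values())
NODE_BY_TOKEN = {token: node for node, token in RING.items()}


def token_for(key):
    """Hash a row key into the toy token space (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE


def replicas_for_token(token, replication_factor=3):
    """Return the owner of a token plus the next nodes clockwise on the ring."""
    # First node whose token is >= the key's token; past H we wrap back to A.
    start = bisect.bisect_left(TOKENS, token) % len(TOKENS)
    return [
        NODE_BY_TOKEN[TOKENS[(start + i) % len(TOKENS)]]
        for i in range(replication_factor)
    ]


def replicas_for_key(key, replication_factor=3):
    """Sharding and replication in one step: hash the key, then walk the ring."""
    return replicas_for_token(token_for(key), replication_factor)


print(replicas_for_token(5))   # ['C', 'D', 'E']
print(replicas_for_token(29))  # ['A', 'B', 'C'] (wraps around the ring)
```

Because every node can hold this small table locally, any node can work out where a key lives without asking a central coordinator, which is one way to read the "No Single Point of Failure" bullet.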
10. Cassandra is…
AP-System (CAP Theorem)
Eventual consistency
Tunable trade-offs:
Consistency vs. Latency
Choose between synchronous or asynchronous
replication for each update
CAP triangle:
C = Consistency
A = High Availability
P = Partition Tolerance
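The consistency vs. latency trade-off can be illustrated with a little arithmetic. The sketch below assumes the consistency level names ONE, QUORUM and ALL and the usual overlap rule (a read is guaranteed to see the latest acknowledged write when read replicas plus write replicas exceed the replication factor); the function names are hypothetical.

```python
def required_acks(level, replication_factor):
    """How many replicas must answer before an operation is acknowledged."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when the read set and the write set must share at least one replica."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return w + r > replication_factor


# With three replicas: waiting for fewer acks lowers latency but weakens guarantees.
print(reads_see_latest_write("ONE", "ONE", 3))        # False
print(reads_see_latest_write("QUORUM", "QUORUM", 3))  # True
print(reads_see_latest_write("ALL", "ONE", 3))        # True
```

Waiting for a single replica corresponds roughly to asynchronous replication of the remaining copies, while ALL makes the update synchronous on every replica.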
11. Cassandra is…
A BigTable Clone
No schema
Well suited to:
Semi-structured data
Sparse data
Data model (diagram):
Keyspace > Column Family > Row (identified by a Key) > Column
Keyspace > Column Family > Row > SuperColumn > Column
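The Keyspace, Column Family, Row and Column nesting on this slide can be pictured as nested maps. The sketch below is a plain in-memory illustration, not a Cassandra client; the keyspace contents, column names and the `get_column` helper are made up for the example.

```python
# Keyspace -> Column Family -> Row key -> Column name -> value.
# All names and values here are hypothetical sample data.
keyspace = {
    "users": {  # column family
        "alice": {"email": "alice@example.org", "city": "Berlin"},
        "bob": {"email": "bob@example.org"},  # sparse: no "city" column stored
    },
    "profiles": {  # column family using super columns (one extra nesting level)
        "alice": {
            "address": {"street": "Main St", "zip": "10115"},
            "contact": {"phone": "555-0100"},
        },
    },
}


def get_column(ks, column_family, row_key, column, default=None):
    """Look up one column; missing rows or columns simply yield the default."""
    return ks.get(column_family, {}).get(row_key, {}).get(column, default)


print(get_column(keyspace, "users", "alice", "city"))  # Berlin
print(get_column(keyspace, "users", "bob", "city"))    # None
```

Rows in the same column family carry different sets of columns and absent columns take no space, which is what makes the model a fit for semi-structured and sparse data.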
12. Cassandra-based Big Data Solution
Cassandra Cluster (Replication) serving three workloads:
Real-time: real-time queries with Cassandra
Analytics: analytics with Hadoop MapReduce on Cassandra
Search: distributed search with Solr on Cassandra
13. Summary
Apache Cassandra is an elastically scalable, fault-tolerant
data store
Tunable consistency levels
Wide Column: flexible data model without schema
Supports: real-time queries, analytics through
Hadoop integration, Solr-based full-text search