Atlanta meetup presentation discussing big data processing engines (Hive, HBase, Druid, Spark). Weighs the relative strengths of each engine and the use cases each is best suited for.
9. What’s Hive good at?
⬢ Jack of all trades
⬢ Key component of the real-time database
⬢ Familiar interface for analysts – unified SQL
⬢ Can perform joins, filtering, aggregations
⬢ Read structured (CSV) or semi-structured (JSON) data
(Diagram: the Hive interface fronting HBase/Phoenix, Druid, JDBC sources, and files)
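The "unified SQL" point can be made concrete with the kind of join-plus-aggregation query an analyst would submit through Hive. The sketch below uses Python's built-in sqlite3 purely as a stand-in SQL engine; the `customers`/`orders` tables and their data are invented for illustration.

```python
import sqlite3

# Stand-in for a Hive session; tables and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Join + filter + aggregation: the operations the slide credits Hive with.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # [('east', 15.0), ('west', 7.5)]
```

In Hive the same statement would run unchanged over CSV- or JSON-backed external tables, which is the point of the "familiar interface" bullet.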
12. What Are Apache HBase and Phoenix?
Flexible Schema
Millisecond Latency
SQL and NoSQL Interfaces
Store and Process Petabytes of Data
Scale out on Commodity Servers
Integrated with YARN
100% Open Source
(Diagram: HBase RegionServers 1 through N running on YARN, the data operating system, with HDFS as permanent data storage. Callouts: flexible schema, extreme low latency, directly integrated with Hadoop, SQL and NoSQL interfaces)
13. Kinds of Apps Built with HBase
Write heavy, low-latency apps:
⬢ Search / indexing
⬢ Messaging
⬢ Audit / log archive
⬢ Advertising
⬢ Data cubes
⬢ Time series
⬢ Sensor / device
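For the time-series and sensor/device use cases above, a common HBase row-key pattern is `<device id>#<reversed timestamp>`, so that a prefix scan returns the newest readings first (HBase sorts rows lexicographically by key). This is a pure-Python sketch of that pattern; the function name and `MAX_TS` constant are illustrative, not part of any HBase API.

```python
# Sketch of a common HBase row-key pattern for time-series data:
# <device id>#<reversed timestamp>, so scanning a device's key prefix
# yields the newest readings first. Names here are invented.
MAX_TS = 10**13  # larger than any epoch-millis timestamp we expect

def row_key(device_id: str, ts_millis: int) -> str:
    reversed_ts = MAX_TS - ts_millis
    return f"{device_id}#{reversed_ts:013d}"

# Lexicographic order of the keys puts the newest reading (ts=3000) first.
keys = sorted(row_key("sensor-42", ts) for ts in [1000, 3000, 2000])
print(keys)
```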
14. Key HBase Features
High Availability
• Data is stored on multiple nodes and HBase coordinates failover.
• Data stays available if nodes fail.
Strong Consistency
• HBase doesn't sacrifice consistency for scale.
• Improves quality by avoiding difficult-to-detect bugs.
Deep Hadoop Integration
• Add deep insight to your apps through seamless integration with Hadoop tools like Hive.
Multi-Datacenter
• Replicate data between 2 or more datacenters.
• Keeps data safe and available through datacenter outages.
15. Data Storage – Relational vs. HBase
Relational database – a sparse table stores every cell, nulls included:

         Column1          Column2   Column3   Column4
Row1     f – t5, a – t1   null      null      d – t4
Row2     null             b – t1    null      null
Row3     null             null      null      e – t4
Row4     c – t3           null      g – t5    null

HBase stores only the populated cells in the HFile:
Column1: f – t5, a – t1, c – t3 | Column2: b – t1 | Column3: g – t5 | Column4: d – t4, e – t4

HBase data is located by cell coordinates consisting of row key, column family name, column qualifier, and timestamp.
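The cell-coordinate model above can be sketched as a plain dictionary keyed by (row key, column family, qualifier, timestamp): only populated cells exist, and reads pick the newest version. The mapping below mirrors the slide's table; the `"cf"` family name and helper function are invented for illustration.

```python
# Minimal model of HBase's sparse storage: only populated cells exist,
# each addressed by (row key, column family, qualifier, timestamp).
# Data mirrors the slide's table; the "cf" family name is illustrative.
cells = {
    ("Row1", "cf", "Column1", 5): "f",
    ("Row1", "cf", "Column1", 1): "a",
    ("Row1", "cf", "Column4", 4): "d",
    ("Row2", "cf", "Column2", 1): "b",
    ("Row3", "cf", "Column4", 4): "e",
    ("Row4", "cf", "Column1", 3): "c",
    ("Row4", "cf", "Column3", 5): "g",
}

def get_latest(row, family, qualifier):
    """Return the newest version of a cell, or None if it was never written."""
    versions = [(ts, v) for (r, f, q, ts), v in cells.items()
                if (r, f, q) == (row, family, qualifier)]
    return max(versions)[1] if versions else None

print(get_latest("Row1", "cf", "Column1"))  # newest version: 'f' (t5)
print(get_latest("Row2", "cf", "Column3"))  # never written: None
```

Note that the relational table's eleven nulls simply do not exist here: only seven cells are stored, which is why sparse data is cheap in HBase.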
17. Druid is for real-time, providing aggregations and fast access
⬢ Streaming ingestion capability
⬢ Data freshness – analyze events as they occur
⬢ Fast response time (ideally < 1 sec query time)
⬢ Arbitrary slicing and dicing
⬢ Multi-tenancy – 1000s of concurrent users
⬢ Scalability and availability
⬢ Rich real-time visualization with Superset
Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
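A large part of how Druid keeps queries fast is rollup: events are pre-aggregated at ingest time by (time bucket, dimension values), so queries scan far fewer rows. This is a pure-Python sketch of that idea under invented event fields (`ts`, `page`, `clicks`), not Druid's actual ingestion spec.

```python
from collections import defaultdict

# Sketch of Druid-style rollup: incoming events are pre-aggregated at
# ingest time by (timestamp bucket, dimensions). Event fields are invented.
events = [
    {"ts": 1005, "page": "/home", "clicks": 1},
    {"ts": 1007, "page": "/home", "clicks": 2},
    {"ts": 1009, "page": "/buy",  "clicks": 1},
    {"ts": 1062, "page": "/home", "clicks": 4},
]

def rollup(events, granularity=60):
    segments = defaultdict(int)
    for e in events:
        bucket = e["ts"] // granularity * granularity  # truncate to the bucket
        segments[(bucket, e["page"])] += e["clicks"]   # metric summed at ingest
    return dict(segments)

# Four raw events collapse into three pre-aggregated rows.
print(rollup(events))
```

Slicing and dicing then operates on these compact, already-summed rows, which is where the sub-second response times come from.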
18. Who is Using Druid
http://druid.io/druid-powered.html
19. Druid cubing
Here's how Druid usually fits into your architecture:
⬢ Data sources: streaming sources (Kafka, etc.) feed Druid via real-time ingest; jobs, batch processes, and scheduled tasks land data in HDFS, which Druid batch-ingests.
⬢ Storage: Druid and HDFS.
⬢ Query engine: Hive, with Druid-backed Hive tables (predicate pushdown) and HDFS-backed Hive tables.
⬢ Visualization: Superset, or Tableau, Qlik, and Excel querying Hive/Druid via ODBC.
23. What is Apache Spark?
MLlib (Machine Learning)
• Classification and regression: support vector machines, logistic regression
• Collaborative filtering
• Clustering: k-means
• Optimization: stochastic gradient descent
Spark Streaming
• Scalable: high-throughput, fault-tolerant stream processing of live data streams (micro-batches)
• Data ingest sources: Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets
• Reuses Spark APIs: complex algorithms expressed with high-level functions like map, reduce, join, and window
• Data persistence: processed data can be pushed out to file systems, databases, and live dashboards
Spark SQL
• Structured data processing: a programming abstraction called DataFrames, plus a distributed SQL query engine
• Schema inference: automatically infer the schema of a JSON dataset and load it as a DataFrame
(Diagram: Spark Core Engine with Scala, Java, and Python APIs and the MLlib, Spark SQL*, and Spark Streaming* libraries, running over resource management and storage layers)
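The high-level functions named above (map, reduce, join, window) are the heart of Spark's API. This pure-Python word count mimics the shape of the classic RDD pipeline without a cluster; in PySpark the same logic would be expressed with `flatMap`, `map`, and `reduceByKey` and run distributed across executors.

```python
from functools import reduce

# Pure-Python imitation of the Spark word-count pipeline; the input
# lines are invented for illustration.
lines = ["spark streaming", "spark sql", "mllib"]

pairs = [(w, 1) for line in lines for w in line.split()]    # flatMap + map
counts = reduce(lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
                pairs, {})                                   # reduceByKey
print(counts)  # {'spark': 2, 'streaming': 1, 'sql': 1, 'mllib': 1}
```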
24. Benefits of Apache Spark
• Performance
– Delivers high-performance, large-scale data processing and analysis by leveraging in-memory computing
• Ease of Use
– Easy-to-use APIs for operating on large datasets
– Operators for transforming data
– DataFrames provide support for manipulating structured and semi-structured data
• Efficiency
– Enhanced developer productivity through prepackaged libraries that can be combined in the same application:
• SQL queries
• Streaming data
• Machine learning
• Graph processing
(Diagram repeated: Spark Core Engine with Scala, Java, and Python APIs and the MLlib, Spark SQL*, and Spark Streaming* libraries over resource management and storage layers)
* Tech Preview
37. Hive as the Single Interface
(Diagram: the Hive interface fronting HBase/Phoenix, Druid, JDBC sources, and files)
38. Hive Query Delegation by Calcite
Calcite rewrites the time filter, group-by, and order-by portions of a query into a Druid query fragment; complex joins, etc. are computed on the Hive side.
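The delegation described above can be sketched as splitting a logical plan: operators Druid can answer (filter, group-by, order-by) are rewritten into a Druid query fragment, while operators it cannot handle (joins) stay in Hive. The plan structure and operator names below are invented for illustration, not Calcite's actual relational algebra.

```python
# Illustrative split of a logical plan between Druid and Hive.
# Operator names and the plan structure are made up for this sketch.
DRUID_CAPABLE = {"filter", "group_by", "order_by"}

plan = [
    {"op": "filter",   "on": "time >= '2018-01-01'"},
    {"op": "group_by", "on": "page"},
    {"op": "order_by", "on": "count DESC"},
    {"op": "join",     "on": "dim_pages"},
]

druid_fragment = [step for step in plan if step["op"] in DRUID_CAPABLE]
hive_steps = [step for step in plan if step["op"] not in DRUID_CAPABLE]

print([s["op"] for s in druid_fragment])  # ['filter', 'group_by', 'order_by']
print([s["op"] for s in hive_steps])      # ['join']
```

The payoff is that the heavy scan-and-aggregate work runs inside Druid's pre-aggregated segments, and Hive only sees the small result it needs for the join.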
39. BI on Hadoop: Different tools for different use cases
Cold – file / raw storage:
⬢ Unknown questions
⬢ Latency is not an issue
⬢ Unstructured data / data mining / data science
Warm – LLAP:
⬢ Structured data
⬢ Data cleansed / enriched
⬢ Questions are known but not the answers
⬢ Concepts and data regularly updated
Hot – Druid:
⬢ Streaming / low latency
⬢ Pre-aggregation to answer specific questions
⬢ Known questions and answers
⬢ Operational dashboards