© Hortonworks Inc. 2011- 2017. All rights reserved | 1
Counting Rows
Hive: 3.77   HBase: 288.9   Druid: 0.76
Ad-hoc analytics
Random look-ups
Aggregations, drill-downs
Big Data Processing Engines –
Which one do I use?
Ashish Narasimham, Solutions Engineer @ Cloudera
Processing Engines Overview
What’s Hive good at?
⬢ Jack of all trades
⬢ Key component of the real-time database
⬢ Familiar interface for analysts – unified SQL
⬢ Can perform joins, filtering, aggregations
⬢ Read structured (CSV) or semi-structured (JSON) data
[Diagram: Hive as the interface over HBase/Phoenix, Druid, JDBC sources, and files]
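To make the "joins, filtering, aggregations" bullet concrete, here is a minimal sketch that loads a structured CSV source and a semi-structured JSON source into one SQL engine and queries them together. Python's built-in sqlite3 module stands in for Hive purely for illustration; the table names, columns, and sample rows are invented, and real HiveQL and Hive storage handling differ.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample inputs: a CSV file of sales and JSON lines describing stores.
SALES_CSV = "store_id,amount\n1,10.0\n1,5.5\n2,7.0\n3,2.0\n"
STORES_JSON = (
    '{"id": 1, "region": "east"}\n'
    '{"id": 2, "region": "west"}\n'
    '{"id": 3, "region": "east"}\n'
)


def load(conn):
    """Load both sources into tables so one SQL interface can query them."""
    conn.execute("CREATE TABLE sales (store_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE stores (id INTEGER, region TEXT)")
    for row in csv.DictReader(io.StringIO(SALES_CSV)):
        conn.execute(
            "INSERT INTO sales VALUES (?, ?)",
            (int(row["store_id"]), float(row["amount"])),
        )
    for line in STORES_JSON.splitlines():
        record = json.loads(line)
        conn.execute(
            "INSERT INTO stores VALUES (?, ?)", (record["id"], record["region"])
        )


def sales_by_region(conn, min_amount):
    """Join, filter, and aggregate in a single SQL statement."""
    query = """
        SELECT st.region, SUM(s.amount) AS total
        FROM sales s JOIN stores st ON s.store_id = st.id
        WHERE s.amount >= ?
        GROUP BY st.region
        ORDER BY st.region
    """
    return conn.execute(query, (min_amount,)).fetchall()


conn = sqlite3.connect(":memory:")
load(conn)
result = sales_by_region(conn, 5.0)
print(result)
```

The point of the sketch is only that the analyst writes one SQL statement regardless of whether the data started life as CSV or JSON.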
HDP3: EDW analyst pipeline
Clients: Tableau, BI systems
Query Result Cache
• Results return from HDFS/cache directly
• Reduces load from repetitive queries
Workload management
• Allows more queries to be run in parallel
• Reduces resource starvation in large clusters
• Also: Active/Passive HA
Materialized views, surrogate keys, constraints
• More “tools” for the optimizer to use
• More “tools” for DBAs to tune/optimize
• Invisible tuning of the DB from the users’ perspective
ACID v2 & ACID on by default
• ACID v2 is as fast as regular tables
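The "reduce load from repetitive queries" point can be illustrated with a toy result cache. This is a sketch only: the class, its exact-text matching rule, and the `slow_engine` stand-in are all hypothetical, and the slide does not describe how Hive's own cache keys or invalidates entries.

```python
class QueryResultCache:
    """Toy cache: a query whose text was seen before reuses the stored result."""

    def __init__(self, execute):
        self._execute = execute  # function that actually runs a query
        self._results = {}
        self.hits = 0
        self.misses = 0

    def run(self, sql):
        key = sql.strip()  # exact-text match, ignoring outer whitespace only
        if key in self._results:
            self.hits += 1
        else:
            self.misses += 1
            self._results[key] = self._execute(sql)
        return self._results[key]

    def invalidate(self):
        """Drop all entries, e.g. after the underlying tables change."""
        self._results.clear()


calls = []


def slow_engine(sql):
    """Stand-in for the expensive execution path; records each real run."""
    calls.append(sql)
    return [(200,)]


cache = QueryResultCache(slow_engine)
first = cache.run("SELECT COUNT(*) FROM trips")
second = cache.run("SELECT COUNT(*) FROM trips")
print(cache.hits, cache.misses, len(calls))
```

The second call never reaches the engine, which is the load reduction the slide refers to.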
Ad-hoc analytics
Look-ups, updates
Aggregations, drill-downs
What Are Apache HBase and Phoenix?
Flexible Schema
Millisecond Latency
SQL and NoSQL Interfaces
Store and Process Petabytes of Data
Scale out on Commodity Servers
Integrated with YARN
100% Open Source
[Diagram: HBase RegionServers 1 through N running on YARN (Data Operating System), with HDFS as permanent data storage. Callouts: Flexible Schema; Extreme Low Latency; Directly Integrated with Hadoop; SQL and NoSQL Interfaces]
Kinds of Apps Built with HBase
Write Heavy, Low-Latency
• Search / Indexing
• Messaging
• Audit / Log Archive
• Advertising
• Data Cubes
• Time Series
• Sensor / Device
Key HBase Features
High Availability
• Data is stored on multiple nodes and HBase coordinates failover.
• Data stays available if nodes fail.
Strong Consistency
• HBase doesn’t sacrifice consistency for scale.
• Improve quality by avoiding difficult-to-detect bugs.
Deep Hadoop Integration
• Add deep insight to your apps through seamless integration with Hadoop tools like Hive.
Multi Datacenter
• Replicate data between 2 or more datacenters.
• Keeps data safe and available through datacenter outages.
Data Storage – Relational vs. HBase
Relational database:
        Column1          Column2   Column3   Column4
Row1    f – t5, a – t1   null      null      d – t4
Row2    null             b – t1    null      null
Row3    null             null      null      e – t4
Row4    c – t3           null      g – t5    null
HBase (HFile):
Column1: f – t5, a – t1, c – t3
Column2: b – t1
Column3: g – t5
Column4: d – t4, e – t4
HBase data is located by cell coordinates consisting of row key, column family name, column qualifier, and timestamp.
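The cell-coordinate idea can be sketched as a small in-memory store. This is a toy model, not the HBase client API: the class and method names are hypothetical, the column family name "cf" is an assumption (the slide does not name one), and the values and timestamps follow the slide's example.

```python
class SparseCellStore:
    """Toy model: a value is addressed by (row key, family, qualifier, timestamp)."""

    def __init__(self):
        self._cells = {}  # (row, family, qualifier) -> {timestamp: value}

    def put(self, row, family, qualifier, timestamp, value):
        self._cells.setdefault((row, family, qualifier), {})[timestamp] = value

    def get(self, row, family, qualifier, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._cells.get((row, family, qualifier), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def stored_cells(self):
        """Count stored versions; absent cells are simply not stored."""
        return sum(len(v) for v in self._cells.values())


store = SparseCellStore()
store.put("Row1", "cf", "Column1", 1, "a")
store.put("Row1", "cf", "Column1", 5, "f")
store.put("Row1", "cf", "Column4", 4, "d")
store.put("Row2", "cf", "Column2", 1, "b")
store.put("Row3", "cf", "Column4", 4, "e")
store.put("Row4", "cf", "Column1", 3, "c")
store.put("Row4", "cf", "Column3", 5, "g")

print(store.get("Row1", "cf", "Column1"))               # newest version
print(store.get("Row1", "cf", "Column1", timestamp=2))  # older version
print(store.stored_cells())
```

Note that the sixteen-slot relational grid holds nine nulls, while this layout stores only the seven populated cell versions.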
Druid is for real-time, providing aggregations and fast access
• Streaming ingestion capability
• Data freshness – analyze events as they occur
• Fast response time (ideally < 1 sec query time)
• Arbitrary slicing and dicing
• Multi-tenancy – 1000s of concurrent users
• Scalability and availability
• Rich real-time visualization with Superset
Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
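One way to picture why aggregations come back quickly is pre-aggregation at ingest time, which Druid calls rollup. The sketch below is a plain-Python illustration under stated assumptions, not Druid's ingestion API: the function names, the event fields, and the sample data are all hypothetical.

```python
from collections import defaultdict


def rollup(events, granularity_seconds, dimensions, metric):
    """Pre-aggregate raw events into one row per (time bucket, dimension values)."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        # Truncate the timestamp down to the start of its bucket.
        bucket = event["ts"] - event["ts"] % granularity_seconds
        key = (bucket,) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)


def total_by(rolled, position):
    """Slice rolled-up rows by the key element at `position` (0 is the time bucket)."""
    out = defaultdict(int)
    for key, agg in rolled.items():
        out[key[position]] += agg["sum"]
    return dict(out)


# Hypothetical events: epoch-second timestamps, an airport dimension, a departures metric.
events = [
    {"ts": 0, "airport": "ATL", "departures": 3},
    {"ts": 20, "airport": "ATL", "departures": 2},
    {"ts": 45, "airport": "JFK", "departures": 1},
    {"ts": 70, "airport": "ATL", "departures": 4},
]
rolled = rollup(events, 60, ["airport"], "departures")
print(len(rolled), total_by(rolled, 1))
```

Four raw events collapse to three stored rows here; queries then sum over the smaller rolled-up set instead of the raw stream.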
Who is Using Druid
http://druid.io/druid-powered.html
Here’s how Druid usually fits into your architecture
Data sources: streaming data source (Kafka, etc.); jobs, batch processes, scheduled tasks
Storage: Druid (cubing; fed by real-time ingest and batch ingest); HDFS
Query engine: Hive (Druid-backed Hive tables with predicate pushdown; HDFS-backed Hive tables)
Visualization: Superset; Tableau, Qlik, Excel (query Hive/Druid via ODBC)
And one more honorable mention
Complex ETL, ML
What is Apache Spark?
MLlib (Machine Learning)
Classification, Regression
• Support vector machines
• Logistic regression
Collaborative Filtering
Clustering
• K-means
Optimization
• Stochastic Gradient Descent
Spark Streaming
Scalable
• High-throughput, fault-tolerant stream processing of live data streams (micro-batches)
Data Ingest Sources
• Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets
Reuse Spark APIs
• Complex algorithms expressed with high-level functions like map, reduce, join and window
Data Persistence
• Processed data can be pushed out to file systems, databases and live dashboards
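The micro-batch idea can be sketched without Spark: records are grouped into short time intervals and each interval is processed with ordinary map and reduce steps. This is plain Python for illustration only, not the Spark Streaming API; the function names and the sample stream are hypothetical, and a real stream arrives continuously rather than as a finished list.

```python
from collections import Counter
from functools import reduce


def micro_batches(records, interval):
    """Group (timestamp, value) records into consecutive batches of `interval` seconds."""
    batches = {}
    for ts, value in records:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]


def word_counts(batch):
    """Per-batch map and reduce: count words per line, then merge the counts."""
    mapped = map(lambda line: Counter(line.split()), batch)
    return reduce(lambda a, b: a + b, mapped, Counter())


# Hypothetical stream of (arrival second, text line) pairs.
stream = [(0, "a b"), (1, "b"), (5, "c a"), (6, "a")]
per_batch = [word_counts(b) for b in micro_batches(stream, 5)]
print(per_batch)
```

Each batch is processed independently, which is what lets the same high-level functions serve both batch and streaming jobs.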
Spark SQL
Structured Data Processing
• Programming abstraction called DataFrames
• Distributed SQL query engine
Infer Schema
• Automatically infer the schema of a JSON dataset and load it as a DataFrame.
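Schema inference over JSON can be pictured as a pass that collects the fields and types seen across records, followed by a load into uniform rows. The sketch below is a simplified stand-in written in plain Python, not Spark's implementation (which also handles nested structures and type widening); the helper names and sample records are hypothetical.

```python
import json


def infer_schema(json_lines):
    """Union the field names across records and note each field's observed types."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in sorted(schema.items())}


def load_rows(json_lines, schema):
    """Load records as uniform rows; a field missing from a record becomes None."""
    return [
        tuple(json.loads(line).get(field) for field in schema)
        for line in json_lines
    ]


# Hypothetical semi-structured input: the second record lacks "region".
lines = ['{"id": 1, "region": "east"}', '{"id": 2}']
schema = infer_schema(lines)
rows = load_rows(lines, schema)
print(schema, rows)
```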
Benefits of Apache Spark
• Performance
– Delivers high-performance, large-scale data processing and analysis by leveraging in-memory computing
• Ease of Use
– Easy-to-use APIs for operating on large datasets
– Operators for transforming data
– DataFrames provide support for manipulating structured and semi-structured data
• Efficiency
– Enhanced developer productivity through prepackaged libraries that can be combined in the same application
• SQL queries
• Streaming data
• Machine learning
• Graph processing
[Diagram: Spark stack. Components: Applications; libraries (MLlib for machine learning, Spark SQL*, Spark Streaming*); Spark Core Engine (Scala, Java, Python); Resource Management; Storage]
* Tech Preview
Isolate Spark and Hive Catalogs/Tables
Leverage connector for Spark <-> Hive (HWC)
[Diagram: Spark side (Driver, Executors, Spark metastore) connected through HWC to the Hive side (HiveServer + Tez, LLAP daemons, Hive metastore)]
Which one should I choose?
Use Case Analysis - Each engine has its niche
⬢ HBase
– Ultra-low-latency random access (key-based lookup)
– Large-volume OLTP
– Updates
– Deletes
⬢ Hive
– ACID, real-time database, EDW
– Unified SQL interface, JDBC
– Reporting, batch
– Joins, large aggregates, ad-hoc
⬢ Druid
– Low-latency OLAP, concurrent queries
– Aggregations, drilldowns
– Time-series
– Real-time ingestion
⬢ Spark
– Complex ETL
– ML model training
– SparkSQL
– Spark Streaming
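The use-case breakdown above can be read as a lookup from workload traits to engines. The sketch below encodes it that way; the trait keywords are taken from the slide, while the `rank_engines` helper and its simple count-based scoring are hypothetical and only meant to show the "each engine has its niche" idea.

```python
# Traits copied from the use-case slide; the ranking helper is a hypothetical illustration.
ENGINE_NICHES = {
    "HBase": {"random access", "oltp", "updates", "deletes"},
    "Hive": {"acid", "edw", "unified sql", "reporting", "batch", "joins", "ad-hoc"},
    "Druid": {"olap", "concurrent queries", "aggregations", "drilldowns",
              "time-series", "real-time ingestion"},
    "Spark": {"complex etl", "ml model training", "sparksql", "spark streaming"},
}


def rank_engines(requirements):
    """Rank engines by how many of the requested traits fall in their niche."""
    wanted = {r.lower() for r in requirements}
    scores = {
        engine: len(wanted & traits) for engine, traits in ENGINE_NICHES.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(engine, score) for engine, score in ranked if score > 0]


print(rank_engines(["Aggregations", "Time-series", "Joins"]))
```

A workload that matches more than one engine, as in this call, is a hint that the engines may need to be combined rather than one picked outright.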
Which use cases make sense?
⬢ HBase – operational data store, lots of changing data
– Financial transaction data
– Frequent customer updates
– CDC
⬢ Druid – analytics across dataset, sums and other aggregations
– Analyzing number of cars being produced by region
– Number of flights departing from a certain airport
⬢ Hive – large queries across tons of data
– Fact-dimension join across billions of rows, e.g. joining loyalty data to a day’s retail transactions for insights into spending
⬢ Spark – predictive modeling, complex ETL (ELT) jobs
– Building a predictive maintenance model for infrastructure that a transportation company owns
Performance Analysis
Performance Analysis - Setup
⬢ Caching disabled
⬢ Query types
– Simple count
– Select with a where
– Join
– Update
– An aggregation (e.g. Sum)
⬢ Hardware: 8 nodes, 8 cores, 16GB RAM
⬢ Dataset: 30GB, 200MM rows
Comparing Hive, HBase and Druid
[Bar chart, vertical axis from 0.00 to 16.00]
Query type                    HBase/Phoenix   Hive    Druid
Select with filter            1.35            8.71    0.34
Count(*)                      15.00           4.72    0.71
Aggregation with filter       15.00           8.66    1.72
Select with join and filter   15.00           9.16    0.00
Update with filter            1.52            9.75    0.00
Data Load Times
Engine Load Time
Hive <1 hr
HBase 4+ hrs
Druid 2 hrs
⬢ Issues with HBase – sequential, serial
Space Considerations
⬢ You may get better storage from HBase with different compression
Engine Size on Disk with Replication
Hive – ORC w/ Zlib 28.4GB
HBase – Snappy compression 89.5GB
Druid 31.5GB
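As a quick worked calculation, the sizes listed above can be compared with the dataset size from the setup slide. The sketch assumes the 30GB figure is the raw input size, which the slides do not state explicitly; the function name is hypothetical.

```python
RAW_DATASET_GB = 30.0  # assumption: the 30GB in the setup slide is the raw input size

# Size on disk with replication, as listed on the slide.
SIZE_ON_DISK_GB = {
    "Hive (ORC w/ Zlib)": 28.4,
    "HBase (Snappy)": 89.5,
    "Druid": 31.5,
}


def expansion_factors(raw_gb, sizes_gb):
    """Size on disk divided by raw size; below 1.0 means smaller than the raw data."""
    return {engine: round(size / raw_gb, 2) for engine, size in sizes_gb.items()}


factors = expansion_factors(RAW_DATASET_GB, SIZE_ON_DISK_GB)
for engine, factor in factors.items():
    print(f"{engine}: {factor}x of raw")
```

Under that assumption, Hive and Druid land close to the raw size even with replication, while HBase takes roughly three times as much space.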
Unified SQL
Hive as the Single Interface
[Diagram: Hive as the interface over HBase/Phoenix, Druid, JDBC sources, and files]
Hive Query Delegation by Calcite
• filter time, group by, order by: Calcite rewrites these to a Druid query fragment
• Complex joins, etc. would be computed in Hive
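The delegation idea can be sketched as splitting a query plan into the part handed to Druid and the part left for Hive. This is a toy illustration, not Calcite's API or its actual rule-based planning; the operation names, the `split_plan` helper, and the sample plan are hypothetical.

```python
# Operations the slide shows being rewritten into a Druid query fragment.
PUSHABLE = {"filter_time", "group_by", "order_by"}


def split_plan(operations):
    """Split ordered (op, detail) steps into a Druid fragment and a Hive remainder.

    Pushdown stops at the first step Druid cannot take, because every later
    step depends on that step's output.
    """
    druid_fragment, hive_remainder = [], []
    pushing = True
    for op, detail in operations:
        if pushing and op in PUSHABLE:
            druid_fragment.append((op, detail))
        else:
            pushing = False
            hive_remainder.append((op, detail))
    return druid_fragment, hive_remainder


# Hypothetical plan for a query over a Druid-backed table joined to another table.
plan = [
    ("filter_time", "last 7 days"),
    ("group_by", "airport"),
    ("join", "airports_dim"),
    ("order_by", "total DESC"),
]
druid_part, hive_part = split_plan(plan)
print(druid_part, hive_part)
```

The final order by stays on the Hive side in this example because it runs after the join, even though an order by on its own could be pushed down.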
BI on Hadoop: Different tools for different use cases
Cold
• File / raw storage
• Unknown questions
• Latency is not an issue
• Non-structured / data mining / data science
Warm (LLAP)
• Structured data
• Data cleansed / enriched
• Questions are known but not answers
• Concepts and data regularly updated
Hot (Druid)
• Streaming / low latency
• Pre-aggregation to answer specific questions
• Known questions and answers
• Operational dashboards
What conclusions can we draw?
⬢ The use case dictates the tool. This is seen in the numbers
– Druid is extremely fast for aggregations
– HBase is great with lookups and OLTP-style updates on fast-moving data
– Hive is used a lot for analytics on large quantities of data, where the query isn’t known beforehand
– Spark has great libraries for ML and is customizable for complex ETL
⬢ Use case sprawl – watch for this (no one engine does it all)
⬢ Unified SQL – the tools complement each other in the larger enterprise architecture
Further Reading
⬢ Use case discussion on the engines
– https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/
⬢ Performance analysis
– https://community.hortonworks.com/articles/232317/big-data-processing-engines-the-technical-series-p.html
– https://community.hortonworks.com/articles/233083/big-data-processing-engines-the-technical-series-p-1.html

Big data processing engines, Atlanta Meetup 4/30

  • 1. © Hortonworks Inc. 2011- 2017. All rights reserved | 1 3.77 288.9 0.76
  • 2. © Hortonworks Inc. 2011- 2017. All rights reserved | 2 Counting Rows
  • 3. © Hortonworks Inc. 2011- 2017. All rights reserved | 3 3.77 288.9 0.76
  • 4. © Hortonworks Inc. 2011- 2017. All rights reserved | 4 3.77 288.9 0.76 Hive HBase Druid
  • 5. © Hortonworks Inc. 2011- 2017. All rights reserved | 5 Ad-hoc analytics Random look-ups Aggregations drill-downs
  • 6. © Hortonworks Inc. 2011 – 2017 Big Data Processing Engines – Which one do I use? Ashish Narasimham, Solutions Engineer @ Cloudera
  • 7. © Hortonworks Inc. 2011- 2017. All rights reserved | 7 Processing Engines Overview
  • 8. © Hortonworks Inc. 2011- 2017. All rights reserved | 8 Ad-hoc analytics Random look-ups Aggregations drill-downs
  • 9. What’s Hive good at? ⬢ Jack of all trades ⬢ Key component of the real-time database ⬢ Familiar interface for analysts – unified SQL ⬢ Can perform joins, filtering, aggregations ⬢ Read structured (CSV) or semi- structured (JSON) data HiveInterface HBase/Phoenix Druid JDBC Files
  • 10. © Hortonworks Inc. 2011- 2017. All rights reserved | 10 HDP3: EDW analyst pipeline Tableau BI systems Materialized view Surrogate key Constraints Query Result Cache Workload management ACID v2 & ACID on default • Results return from HDFS/cache directly • Reduce load from repetitive queries • Allows more queries to be run in parallel • Reduce resource starvation in large clusters • Also: Active/Passive HA • More “tools” for optimizer to use • More ”tools” for DBAs to tune/optimize • Invisible tuning of DB from users’ perspective • ACID v2 is as fast as regular tables
  • 11. © Hortonworks Inc. 2011- 2017. All rights reserved | 11 Ad-hoc, analytics Look-ups, updates Aggregations, drill-downs
  • 12. What Are Apache HBase and Phoenix? Flexible Schema Millisecond Latency SQL and NoSQL Interfaces Store and Process Petabytes of Data Scale out on Commodity Servers Integrated with YARN 100% Open Source YARN : Data Operating System HBase RegionServer 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS (Permanent Data Storage) HBase RegionServer HBase RegionServer Flexible Schema Extreme Low Latency Directly Integrated with Hadoop SQL and NoSQL Interfaces What Are Apache HBase and Phoenix? Flexible Schema Millisecond Latency SQL and NoSQL Interfaces Store and Process Petabytes of Data Scale out on Commodity Servers Integrated with YARN 100% Open Source YARN : Data Operating System HBase RegionServer 1 ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° ° N HDFS (Permanent Data Storage) HBase RegionServer HBase RegionServer Flexible Schema Extreme Low Latency Directly Integrated with Hadoop SQL and NoSQL Interfaces
  • 13. Kinds of Apps Built with HBase Write Heavy Low-Latency Search / Indexing Messaging Audit / Log Archive AdvertisingData Cubes Time Series Sensor / Device
  • 14. Key HBase Features Page 14 High Availability • Data is stored on multiple nodes and HBase coordinates failover. • Data stays available if nodes fail. Strong Consistency • HBase doesn’t sacrifice consistency for scale. • Improve quality by avoiding difficult-to-detect bugs. Deep Hadoop Integration • Add deep insight to your apps through seamless integration with Hadoop tools like Hive Multi Datacenter • Replicate data between 2 or more datacenters. • Keeps data safe and available through datacenter outages.
  • 15. Data Storage – Relational vs. HBase Column1 Column2 Column3 Column4 Row1 f - t5 a – t1 null null d – t4 Row2 null b – t1 null null Row3 null null null e – t4 Row4 c – t3 null g – t5 null Relational Data Base f – t5 a – t1 C – t3 B – t1 g – t5 d – t4 e – t4 HBase Data is located by cell coordinates consisting of row key, column family name, column qualifier and timestamp Column1 Column2 Column3 Column4 HFile
  • 16. © Hortonworks Inc. 2011- 2017. All rights reserved | 16 Ad-hoc analytics Random look-ups Aggregations drill-downs
  • 17. Druid is for real-time, providing aggregations and fast access  Streaming ingestion capability  Data Freshness – analyze events as they occur  Fast response time (ideally < 1sec query time)  Arbitrary slicing and dicing  Multi-tenancy – 1000s of concurrent users  Scalability and Availability  Rich real-time visualization with Superset Superset Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
  • 18. Who is Using Druid http://druid.io/druid-powered.html
  • 19. Druid cubing Here’s how Druid usually fits into your architecture Streaming data source (Kafka, etc.) Real- time ingest Druid Jobs, batch processes, scheduled tasks HDFS Hive Superset VisualizationQuery engineStorageData sources Druid-backed Hive tables, predicate pushdown HDFS-backed Hive tables Tableau, Qlik, Excel Query Hive/Druid via ODBC Batch ingest
  • 20. © Hortonworks Inc. 2011- 2017. All rights reserved | 20 Ad-hoc analytics Random look-ups Aggregations drill-downs
  • 21. © Hortonworks Inc. 2011- 2017. All rights reserved | 21 And one more honorable mention
  • 22. © Hortonworks Inc. 2011- 2017. All rights reserved | 22 Ad-hoc analytics Random look-ups Aggregations drill-downs Complex ETL, ML
  • 23. What is Apache Spark? Classification Regression • Support vector • logistic regression Collaborative Filtering Clustering • K-means Optimization • Stochastic Gradient Descent ML lib (Machine Learning) Scalable • High-throughput, fault- tolerant stream processing of live data streams (micro-batches) Data Ingest Sources • Kafka, Flume, Twitter, ZeroMQ, Kinesis or TCP sockets Reuse Spark APIs • Complex algorithms expressed with high-level functions like map, reduce, join and window Data Persistence • Processed data can be pushed out to file systems, databases and live dashboards Spark Streaming Structured Data Processing • Programming abstraction called DataFrames • Distributed SQL query engine Infer Schema • Automatically infer scheme of a JSON dataset and load it as a DataFrame. Spark SQL Resource Management Storage Applications Spark Core Engine Scala Java Python libraries MLlib (Machine learning) Spark SQL* Spark Streaming*
  • 24. Benefits of Apache Spark ⬢ Performance – Delivers high-performance, large-scale data processing and analysis by leveraging in-memory computing ⬢ Ease of use – Easy-to-use APIs for operating on large datasets – Operators for transforming data – DataFrames provide support for manipulating structured and semi-structured data ⬢ Efficiency – Enhanced developer productivity through prepackaged libraries that can be combined in the same application: SQL queries, streaming data, machine learning, graph processing ⬢ Stack diagram – Resource Management, Storage, Applications, Spark Core Engine, Scala/Java/Python libraries, MLlib (machine learning), Spark SQL*, Spark Streaming* (* Tech Preview)
  • 25. Isolate Spark and Hive catalogs/tables; leverage the connector (HWC) for Spark <-> Hive ⬢ Spark side – Driver, Executors, Spark Meta ⬢ Hive side – MetaStore (Hive Meta), HiveServer+Tez, LLAP Daemons
  • 26. Which one should I choose?
  • 27. Use Case Analysis - Each engine has its niche ⬢ HBase – Ultra-low latency random access (key-based lookup) – Large-volume OLTP – Updates – Deletes ⬢ Hive – ACID, real-time database, EDW – Unified SQL interface, JDBC – Reporting, batch – Joins, large aggregates, ad-hoc ⬢ Druid – Low-latency OLAP, concurrent queries – Aggregations, drilldowns – Time-series – Real-time ingestion ⬢ Spark – Complex ETL – ML model training – SparkSQL – Spark Streaming
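The use-case matrix on this slide can be read as a lookup from workload keyword to engine. The following is a minimal sketch of that reading, not something from the deck: the dictionary layout and the `engines_for` helper are hypothetical names introduced for illustration, and the niche strings are transcribed from the slide's four columns.

```python
from typing import Dict, List

# Niches transcribed from the slide's use-case matrix (lowercased).
# The structure and function name are illustrative assumptions.
ENGINE_NICHES: Dict[str, List[str]] = {
    "HBase": [
        "ultra-low latency random access (key-based lookup)",
        "large-volume oltp",
        "updates",
        "deletes",
    ],
    "Hive": [
        "acid, real-time database, edw",
        "unified sql interface, jdbc",
        "reporting, batch",
        "joins, large aggregates, ad-hoc",
    ],
    "Druid": [
        "low-latency olap, concurrent queries",
        "aggregations, drilldowns",
        "time-series",
        "real-time ingestion",
    ],
    "Spark": [
        "complex etl",
        "ml model training",
        "sparksql",
        "spark streaming",
    ],
}


def engines_for(need: str) -> List[str]:
    """Return the engines whose listed niches mention the given keyword."""
    key = need.strip().lower()
    if not key:
        return []
    return [
        engine
        for engine, niches in ENGINE_NICHES.items()
        if any(key in niche for niche in niches)
    ]


print(engines_for("time-series"))
print(engines_for("aggregat"))
```

A keyword such as "aggregat" matches more than one engine, which is where the choice depends on latency and data-volume requirements rather than on the feature alone.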
  • 28. Which use cases make sense? ⬢ HBase – operational data store, lots of changing data – Financial transaction data – Frequent customer updates – CDC ⬢ Druid – analytics across dataset, sums and other aggregations – Analyzing number of cars being produced by region – Number of flights departing from a certain airport ⬢ Hive – large queries across tons of data – Fact-dimension join across billions of rows, e.g. joining loyalty data to a day’s retail transactions for insights into spending ⬢ Spark – predictive modeling, complex ETL (ELT) jobs – Building a predictive maintenance model for infrastructure that a transportation company owns
  • 29. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs; Complex ETL, ML
  • 30. Performance Analysis
  • 31. Performance Analysis - Setup ⬢ Caching disabled ⬢ Query types – Simple count – Select with a WHERE clause – Join – Update – An aggregation (e.g. SUM) ⬢ Environment – 8 cores, 16GB RAM, 8 nodes ⬢ Data – 30GB, 200MM rows
  • 32. Comparing Hive, HBase and Druid (bar chart) ⬢ Series – HBase/Phoenix, Hive, Druid ⬢ Query types – Select with filter, Count(*), Aggregation with filter, Select with join and filter, Update with filter ⬢ Bar labels in extraction order – 1.35, 15.00, 15.00, 15.00, 1.52, 8.71, 4.72, 8.66, 9.16, 9.75, 0.34, 0.71, 1.72, 0.00, 0.00
  • 33. Data Load Times ⬢ Load time by engine – Hive: <1 hr – HBase: 4+ hrs – Druid: 2 hrs ⬢ Issues with HBase – sequential, serial
  • 34. Space Considerations ⬢ Size on disk with replication, by engine – Hive (ORC w/ Zlib): 28.4GB – HBase (Snappy compression): 89.5GB – Druid: 31.5GB ⬢ You may get better storage from HBase with different compression
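To put the on-disk sizes in proportion, they can be divided by the 30GB data set listed on the setup slide. This is a small worked calculation rather than anything shown in the deck; it assumes the 30GB figure is the raw input size, and the names `RAW_GB` and `footprint_ratios` are hypothetical.

```python
# Assumption: the 30GB on the setup slide is the raw input size.
RAW_GB = 30.0

# Size on disk with replication, in GB, as reported on the slide.
SIZE_ON_DISK_GB = {
    "Hive (ORC w/ Zlib)": 28.4,
    "HBase (Snappy)": 89.5,
    "Druid": 31.5,
}


def footprint_ratios(raw_gb, sizes):
    """Return on-disk size divided by raw size for each engine."""
    return {engine: size / raw_gb for engine, size in sizes.items()}


for engine, ratio in footprint_ratios(RAW_GB, SIZE_ON_DISK_GB).items():
    print(f"{engine}: {ratio:.2f}x raw")
```

Under that assumption Hive and Druid land close to the raw size even with replication, while HBase takes roughly three times as much space.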
  • 35. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 36. Unified SQL
  • 37. Hive as the Single Interface ⬢ Hive interface over – HBase/Phoenix – Druid – JDBC – Files
  • 38. Hive Query Delegation by Calcite ⬢ filter, time, group by, order by – Calcite rewrites to Druid query fragment ⬢ Complex joins, etc. would be computed here
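The delegation idea on this slide can be pictured as partitioning a query's operators into those handed to Druid and those left to the query engine. The sketch below is a toy illustration only: it is not how Calcite's rule-based planner is implemented, `split_plan` is a hypothetical helper, and it assumes the "computed here" label on the slide refers to Hive.

```python
# Operators the slide lists as rewritten into a Druid query fragment.
DRUID_PUSHABLE = {"filter", "time", "group by", "order by"}


def split_plan(operators):
    """Partition operators into (pushed to Druid, left for Hive).

    Toy model: a real planner rewrites a relational plan tree with rules;
    here each operator is just a label checked against a fixed set.
    """
    pushed = [op for op in operators if op in DRUID_PUSHABLE]
    remaining = [op for op in operators if op not in DRUID_PUSHABLE]
    return pushed, remaining


pushed, remaining = split_plan(["filter", "time", "join", "group by", "order by"])
print("Druid fragment:", pushed)
print("Computed in Hive (assumed):", remaining)
```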
  • 39. BI on Hadoop: Different tools for different use cases ⬢ Cold – File / RAW storage – Unknown questions – Latency is not an issue – Non-structured / Data Mining / Data Science ⬢ Warm – Structured data – Data cleansed / enriched – Questions are known but not answers – Concepts and data regularly updated ⬢ Hot – Streaming / low latency – Pre-aggregation to answer specific questions – Known questions and answers – Operational dashboards ⬢ Engines shown – LLAP, Druid
  • 40. Ad-hoc analytics; Random look-ups; Aggregations, drill-downs
  • 41. What conclusions can we draw? ⬢ The use case dictates the tool. This is seen in the numbers – Druid is extremely fast for aggregations – HBase is great with lookups and OLTP-style updates on fast-moving data – Hive is used a lot for analytics on large quantities of data, where the query isn’t known beforehand – Spark has great libraries for ML and is customizable for complex ETL ⬢ Use case sprawl – watch for this (no one engine does it all) ⬢ Unified SQL – the tools complement each other in the larger enterprise architecture
  • 42. Further Reading ⬢ Use case discussion on the engines – https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/ ⬢ Performance analysis – https://community.hortonworks.com/articles/232317/big-data-processing-engines-the-technical-series-p.html – https://community.hortonworks.com/articles/233083/big-data-processing-engines-the-technical-series-p-1.html