Apache Tez : Accelerating
Hadoop Query Processing
Arun C. Murthy, Founder & Architect (@acmurthy)
Bikas Saha, Hortonworks (@bikassaha)
(@hortonworks)
© Hortonworks Inc. 2013
Hello!
• Founder/Architect at Hortonworks Inc.
–Lead - Map-Reduce/YARN/Tez
–Formerly, Architect Hadoop MapReduce, Yahoo
–Responsible for running Hadoop MapReduce as a service for all
of Yahoo (~50k nodes footprint)
• Apache Hadoop, ASF
–Former VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
–Long-term Committer/PMC member (full time for 7 years)
–Release Manager for hadoop-2.x
Once upon a time …
… long, long ago, there was a kingdom we shall call
Apache Hadoop
http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo
Hadoop begat …
… a two-headed monster on every node in the kingdom;
each belonged to a different clan and answered to a
different master
http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg
Knights of Bytes - HDFS
… stored data uncompromisingly in directories/files, nary a
care about contents
http://whoiscraigmoser.com/Images/identity/knight.png
Prince of Processing - MapReduce
He ruled with an iron fist by mapping,
and then by mercilessly reducing data
http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg
Peace Reigned
… for a while with the odd change in the direction of the wind
http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg
Slowly, but surely …
Human beings define reality through misery and suffering.
- Agent Smith
http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
Slowly, but surely …
… people of the kingdom clamored for more.
A palpable sense of greed & expectation.
http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg
Signs of Distress
SQL said some, others said Machine Learning,
still others said Real-Time Event Processing
http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg
A Meeting at the Summit
MapReduce is dead!
Err… not quite.
We need more options! We need more!
True…
http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp
A Meeting at the Summit
A common thread YARN running through all applications…
Long live the King!
http://whipup.net/wp-content/images/2008/08/yarn.gif
The Edict
Henceforth, in the Kingdom of King YARN…
MapReduce has been relegated to the status
of, merely, one of the applications!
http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg
Reign of King YARN
King YARN came to the throne with promises to return power
to all applications equally, lower performance taxes and
resource management…
http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg
Oh the Shame!
Well, at least, Prince MapReduce still had powerful allies like
Highness Hive, Powerful Pig, Cheery Cascading…
http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg
Things get worse before better
Unfortunately, things got a lot worse for the Prince MapReduce…
http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg
Knight Tez
He did MapReduce, and so much more…
Smartly aligned himself to Kingdom YARN.
http://twomorrows.com/alterego/media/08shiningknight.gif
Knight Tez
Long term alliances of MapReduce with
Hive, Pig, Cascading etc. broke up…
http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png
… they decided to throw their
lot with Knight Tez!
http://informatica.upg-ploiesti.ro/62689/img/partners.jpg
Happily ever after…
(nothing cute to say)
On a more serious note…
Every season has a flavor…
SQL-on-Hadoop is the new black!
SQL-on-Hadoop will be solved within
the existing ecosystem
Looking ahead
What will it be next year?
Real-time event processing?
Machine Learning?
Play to our strengths
Invest in the Apache Hadoop platform
and the ecosystem (Hive et al).
Seriously…
Technical Details
Tez – Introduction
• Distributed execution
framework targeted towards
data-processing applications.
• Based on expressing a
computation as a dataflow
graph.
• Built on top of YARN – the
resource management
framework for Hadoop.
• Open source Apache incubator
project and Apache licensed.
Tez – Design Themes
• Empowering End Users
• Execution Performance
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
Tez – Empowering End Users
• Expressive dataflow definition APIs
–Enable definition of complex data flow pipelines using simple
graph connection APIs. Tez expands the logical plan at runtime.
–Targeted towards data processing applications like Hive/Pig but
not limited to them. Hive/Pig query plans naturally map to Tez dataflow
graphs with no translation impedance.
[Figure: dataflow graph of tasks TaskA-1/TaskA-2 through TaskE-1/TaskE-2; stage labels: Preprocessor Stage, Partition Stage, Aggregate Stage]
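The idea of wiring stages together with simple graph connection calls, and expanding the logical plan into tasks later, can be sketched in a few lines of Python. This is an illustration only and not the Tez API: the `Dag` class, its `add_vertex` and `add_edge` methods, and the stage names and task counts are all assumptions made up for this example.

```python
from collections import defaultdict, deque


class Dag:
    """Toy dataflow graph (hypothetical, not the Tez API)."""

    def __init__(self):
        self.parallelism = {}                # vertex name -> number of tasks
        self.downstream = defaultdict(list)  # vertex name -> consumer vertices
        self.indegree = defaultdict(int)     # vertex name -> number of producers

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        return self

    def add_edge(self, src, dst):
        self.downstream[src].append(dst)
        self.indegree[dst] += 1
        return self

    def execution_order(self):
        """Topological order (Kahn's algorithm): a vertex follows all its producers."""
        indegree = {v: self.indegree[v] for v in self.parallelism}
        ready = deque(v for v in self.parallelism if indegree[v] == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for consumer in self.downstream[vertex]:
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.parallelism):
            raise ValueError("graph has a cycle")
        return order

    def expand(self):
        """Expand the logical plan into task names such as 'preprocess-1'."""
        return [f"{vertex}-{i}"
                for vertex in self.execution_order()
                for i in range(1, self.parallelism[vertex] + 1)]


# A three-stage plan; names and task counts are invented for the example.
dag = (Dag()
       .add_vertex("preprocess", 2)
       .add_vertex("partition", 2)
       .add_vertex("aggregate", 1)
       .add_edge("preprocess", "partition")
       .add_edge("partition", "aggregate"))
```

Calling `dag.expand()` lists the tasks stage by stage, which is the step a framework would perform at runtime once it knows how many tasks each vertex needs.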
Tez – Empowering End Users
• Expressive dataflow definition APIs
[Figure: Distributed Sort example; labels: Sampler, Samples, Ranges, with Task-1 and Task-2 at each of three stages]
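A sample-based distributed sort of the kind the figure labels suggest (sampler, samples, ranges) can be sketched as follows. This is a single-process Python illustration under assumptions: the function names, the sample size and the input data are hypothetical, and a real job would run each stage as parallel tasks.

```python
import bisect
import random


def sample_keys(partitions, per_partition, seed=0):
    """Sampler stage: draw a few keys from every input partition."""
    rng = random.Random(seed)
    samples = []
    for part in partitions:
        samples.extend(rng.sample(part, min(per_partition, len(part))))
    return samples


def split_points(samples, num_ranges):
    """Derive num_ranges - 1 range boundaries from the sorted samples."""
    ordered = sorted(samples)
    return [ordered[len(ordered) * i // num_ranges] for i in range(1, num_ranges)]


def range_partition(partitions, boundaries):
    """Partition stage: route each key to the range that contains it."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for part in partitions:
        for key in part:
            buckets[bisect.bisect_right(boundaries, key)].append(key)
    return buckets


def distributed_sort(partitions, num_ranges, per_partition=4):
    boundaries = split_points(sample_keys(partitions, per_partition), num_ranges)
    # Each range is sorted independently; concatenating them is globally sorted.
    return [sorted(bucket) for bucket in range_partition(partitions, boundaries)]


inputs = [[42, 7, 19, 88, 3], [61, 15, 27, 94, 50], [8, 73, 36, 2, 66]]
sorted_ranges = distributed_sort(inputs, num_ranges=3)
flattened = [key for bucket in sorted_ranges for key in bucket]
```

The sampler only decides how evenly the ranges are filled; the result is correctly ordered for any choice of boundaries.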
Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
–Construct physical runtime executors dynamically by connecting
different inputs, processors and outputs.
–End goal is to have a library of inputs, outputs and processors that
can be programmatically composed to generate useful operators.
[Figure: example compositions]
IntermediateReduce: ShuffleInput → ReduceProcessor → FileSortedOutput
FinalReduce: ShuffleInput → ReduceProcessor → HDFSOutput
PairwiseJoin: Input1, Input2 → JoinProcessor → FileSortedOutput
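The Input-Processor-Output idea, where a task is assembled by plugging together interchangeable parts, can be sketched like this. The classes below are hypothetical stand-ins written for this example, not Tez classes; `SortedMergeInput` only imitates what a shuffle-style input might deliver.

```python
import heapq


class ListInput:
    """Input: yields (key, value) records from an in-memory list."""

    def __init__(self, records):
        self.records = records

    def read(self):
        return iter(self.records)


class SortedMergeInput:
    """Input: merges several key-sorted runs into one sorted stream."""

    def __init__(self, *sorted_runs):
        self.sorted_runs = sorted_runs

    def read(self):
        return heapq.merge(*self.sorted_runs)


class SumByKeyProcessor:
    """Processor: sums the values of consecutive records that share a key."""

    def run(self, records):
        current, total = None, 0
        for key, value in records:
            if current is not None and key != current:
                yield current, total
                total = 0
            current = key
            total += value
        if current is not None:
            yield current, total


class ListOutput:
    """Output: collects records; a real output might write a file instead."""

    def __init__(self):
        self.records = []

    def write(self, records):
        self.records.extend(records)


def run_task(task_input, processor, output):
    """A task is an input, a processor and an output wired together at runtime."""
    output.write(processor.run(task_input.read()))
    return output


reduce_out = run_task(
    SortedMergeInput([("a", 1), ("c", 5)], [("a", 2), ("b", 3)]),
    SumByKeyProcessor(),
    ListOutput(),
)
```

Swapping `SortedMergeInput` for `ListInput` changes where the records come from without touching the processor, which is the kind of reuse a library of inputs, processors and outputs is meant to allow.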
Tez – Empowering End Users
• Data type agnostic
–Tez is only concerned with the movement of data. Files and
streams of bytes.
–Does not impose any data format on the user application. MR
application can use Key-Value pairs on top of Tez. Hive and Pig
can use tuple oriented formats that are natural and native to them.
[Figure: a Tez Task moves Bytes in and out (File, Stream); User Code interprets them as Key Value pairs or Tuples]
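A small sketch of what "data type agnostic" can mean in practice: the framework layer only moves bytes, and each application layers its own record format on top. The encodings and the `move_bytes` function below are assumptions invented for this illustration, not formats used by Tez, MapReduce, Hive or Pig.

```python
import json
import struct


def move_bytes(payload: bytes) -> bytes:
    """Stand-in for the framework: it copies opaque bytes and never parses them."""
    return bytes(payload)


def encode_kv(key: str, value: int) -> bytes:
    """Key-value format: 2-byte key length, UTF-8 key, 8-byte signed value."""
    raw = key.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw + struct.pack(">q", value)


def decode_kv(payload: bytes):
    (key_len,) = struct.unpack_from(">H", payload, 0)
    key = payload[2:2 + key_len].decode("utf-8")
    (value,) = struct.unpack_from(">q", payload, 2 + key_len)
    return key, value


def encode_tuple(row: tuple) -> bytes:
    """Tuple format: here simply a JSON array."""
    return json.dumps(list(row)).encode("utf-8")


def decode_tuple(payload: bytes) -> tuple:
    return tuple(json.loads(payload.decode("utf-8")))
```

Both formats pass through the same `move_bytes` call unchanged, so adding a third format would not require touching the transport layer.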
Tez – Empowering End Users
• Simplifying deployment
–Tez is a completely client side application.
–No deployments to do. Simply upload to any accessible
FileSystem and change local Tez configuration to point to that.
–Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
–Leverages YARN local resources and distributed cache.
[Figure: two Client Machines, each running a TezClient; two Node Managers, each running a TezTask; HDFS holding Tez Lib 1 and Tez Lib 2]
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With great power APIs come great responsibilities
Tez – Execution Performance
• Performance gains over Map Reduce
• Plan reconfiguration at runtime
• Optimal resource management
• Dynamic physical data flow decisions
Tez – Execution Performance
• Performance gains over Map Reduce
–Eliminate replicated write barrier between successive
computations.
–Eliminate job launch overhead of workflow jobs.
–Eliminate extra stage of map reads in every workflow job.
–Eliminate queue and resource contention suffered by workflow
jobs that are started after a predecessor job completes.
[Figure: side-by-side comparison of the same plan as Pig/Hive - MR and as Pig/Hive - Tez]
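The overheads listed above can be counted with a deliberately simplified model. It assumes a plan with a given number of reduce stages, one MapReduce job per reduce stage, and a hand-off through replicated storage between consecutive jobs; the names `Overheads`, `as_mapreduce_chain` and `as_single_dag` are hypothetical, and the model ignores everything except the counts named in the bullets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overheads:
    job_launches: int
    replicated_intermediate_writes: int
    extra_map_read_stages: int


def as_mapreduce_chain(num_reduce_stages: int) -> Overheads:
    """One job per reduce stage; every job but the last hands off via replicated storage."""
    if num_reduce_stages < 1:
        raise ValueError("need at least one reduce stage")
    handoffs = num_reduce_stages - 1
    return Overheads(
        job_launches=num_reduce_stages,
        replicated_intermediate_writes=handoffs,
        extra_map_read_stages=handoffs,
    )


def as_single_dag(num_reduce_stages: int) -> Overheads:
    """One graph: a single launch, with stages connected directly to each other."""
    if num_reduce_stages < 1:
        raise ValueError("need at least one reduce stage")
    return Overheads(job_launches=1,
                     replicated_intermediate_writes=0,
                     extra_map_read_stages=0)
```

For a plan with three reduce stages the model gives three launches, two replicated hand-offs and two extra map read stages for the chain, against a single launch for the graph. The gap grows linearly with the depth of the plan.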
Tez – Execution Performance
• Plan reconfiguration at runtime
–Dynamic runtime concurrency control based on data size, user
operator resources, available cluster resources and locality.
–Advanced changes in dataflow graph structure.
–Progressive graph construction in concert with user optimizer.
[Figure: inputs are HDFS Blocks and YARN Resources. Planned: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. At runtime, with only 10 GB of data, Stage 2 drops from 100 to 10 reducers.]
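The reducer adjustment in the example (100 planned reducers cut to 10 for 10 GB of data) can be reproduced with a small sizing rule. This is a sketch under assumptions: the 1 GB-per-reducer target, the 40 available slots and both function names are invented for the illustration, and the slides do not say which rule Tez applies.

```python
import math


def choose_reducers(planned: int, observed_bytes: int,
                    bytes_per_reducer: int, available_slots: int) -> int:
    """Shrink the planned reducer count to fit the data actually produced."""
    needed = max(1, math.ceil(observed_bytes / bytes_per_reducer))
    return max(1, min(planned, needed, available_slots))


def merge_partitions(num_partitions: int, num_reducers: int):
    """Give each remaining reducer a contiguous range of the original partitions."""
    base, extra = divmod(num_partitions, num_reducers)
    ranges, start = [], 0
    for reducer in range(num_reducers):
        size = base + (1 if reducer < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


GB = 1024 ** 3
reducers = choose_reducers(planned=100, observed_bytes=10 * GB,
                           bytes_per_reducer=1 * GB, available_slots=40)
assignment = merge_partitions(100, reducers)
```

Because the 100 partitions were already written by Stage 1, reducing the reducer count only changes which partitions each reducer fetches; Stage 1 does not need to be rerun.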
Tez – Execution Performance
• Optimal resource management
–Reuse YARN containers to launch new tasks.
–Reuse YARN containers to enable shared objects across tasks.
[Figure: the Tez Application Master in one YARN Container exchanges Start Task / Task Done messages with a TezTask Host in another YARN Container, which runs TezTask1 and TezTask2 and holds SharedObjects]
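Container reuse and shared objects can be illustrated with a toy scheduler. Everything here is hypothetical: `Container`, `schedule` and the lookup table are made up for the example, and the round-robin assignment stands in for whatever policy a real application master uses.

```python
class Container:
    """A reusable worker; shared_objects survives across the tasks it runs."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}
        self.tasks_run = []

    def run(self, task_name, load_fn):
        # Load the expensive object once per container, then reuse it.
        if "lookup" not in self.shared_objects:
            self.shared_objects["lookup"] = load_fn()
        self.tasks_run.append(task_name)
        return self.shared_objects["lookup"]


def schedule(tasks, num_containers, load_fn):
    """Round-robin stand-in for an application master handing out tasks."""
    containers = [Container(i) for i in range(num_containers)]
    for index, task in enumerate(tasks):
        containers[index % num_containers].run(task, load_fn)
    return containers


loads = []


def load_lookup_table():
    """Pretend this is costly, e.g. reading a small table into memory."""
    loads.append(1)
    return {"us": "United States"}


containers = schedule(["t1", "t2", "t3", "t4"], 2, load_lookup_table)
```

Four tasks run but the table is loaded only twice, once per container, which is the saving that sharing objects across tasks is meant to capture.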
Tez – Execution Performance
• Dynamic physical data flow decisions
–Decide the type of physical byte movement and storage on the fly.
–Store intermediate data on distributed store, local store or in-memory.
–Transfer bytes via blocking files or streaming and the spectrum in
between.
[Figure: decided At Runtime. A small-size Producer passes data to the Consumer In-Memory; otherwise the Producer writes a Local File that the Consumer reads.]
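Choosing the physical transport at runtime can be sketched as a size check made when the producer finishes. The 64 KiB threshold, the channel classes and `publish` are assumptions for this illustration; the slides do not state how Tez makes the decision.

```python
import os
import tempfile

IN_MEMORY_LIMIT = 64 * 1024  # hypothetical threshold, in bytes


class InMemoryChannel:
    kind = "in-memory"

    def __init__(self, data: bytes):
        self._data = data

    def read(self) -> bytes:
        return self._data

    def close(self):
        self._data = b""


class LocalFileChannel:
    kind = "local-file"

    def __init__(self, data: bytes):
        fd, self.path = tempfile.mkstemp(prefix="producer-")
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)

    def read(self) -> bytes:
        with open(self.path, "rb") as handle:
            return handle.read()

    def close(self):
        os.remove(self.path)


def publish(data: bytes, limit: int = IN_MEMORY_LIMIT):
    """Pick the channel once the producer's output size is known."""
    if len(data) <= limit:
        return InMemoryChannel(data)
    return LocalFileChannel(data)
```

The consumer calls `read()` either way, so the decision stays invisible to user code. A distributed-store channel could be added as a third class with the same two methods.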
Tez – Current status
• Apache Incubator Project
–Rapid development. Over 270 JIRAs opened. Over 170 resolved.
–Growing community.
• Focus on stability
–Testing and quality are highest priority.
–Code ready and deployed on multi-node clusters.
• DAG of MR processing is working
– Already functionally equivalent to Map Reduce. Existing Map
Reduce jobs can be executed on Tez with few or no changes.
– Working Hive prototype that can target Tez for execution of
queries.
–Work started on prototype of Pig that can target Tez.
Tez – Current status
[Figure: Typical pattern in a TPC-DS query: a Fact Table is joined in turn with Dimension Table 1, Dimension Table 2 and Dimension Table 3, producing Result Table 1, Result Table 2 and Result Table 3. Optimization for small data sets: the Fact Table is joined with the dimension tables directly. Both can now run as a single Tez job.]
Tez – Looking ahead
• Early adopters and contributors welcome
–Adopters to drive more scenarios. Contributors to make them
happen.
• Stay tuned for Tez meetups with deep dives on Tez
architecture and using Tez
• Useful links
–Work tracking: https://issues.apache.org/jira/browse/TEZ
–Code: https://github.com/apache/incubator-tez
–High level design document and API specification:
https://issues.apache.org/jira/browse/TEZ-65
–Developer list: dev@tez.incubator.apache.org
–User list: user@tez.incubator.apache.org
–Issues list: issues@tez.incubator.apache.org
Tez – Takeaways
• Distributed execution framework that works on
computations represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Execution architecture designed to enable dynamic
performance optimizations at runtime
• Open source Apache project – your use-cases and
code are welcome
• It works and is already being used by Hive
Tez
Thanks for your time and attention!
Questions?
Page 44

More Related Content

What's hot

HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 

What's hot (20)

Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 

Similar to Apache Tez: Accelerating Hadoop Query Processing

YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
Hortonworks
 

Similar to Apache Tez: Accelerating Hadoop Query Processing (20)

February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
Running Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache HadoopRunning Non-MapReduce Big Data Applications on Apache Hadoop
Running Non-MapReduce Big Data Applications on Apache Hadoop
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Hadoop past, present and future
Hadoop past, present and futureHadoop past, present and future
Hadoop past, present and future
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
 
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
3. Apache Tez Introducation - Apache Kylin Meetup @Shanghai
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Apache Tez: Accelerating Hadoop Query Processing

  • 1. Apache Tez : Accelerating Hadoop Query Processing Page 1 Arun C. Murthy Bikas Saha Founder & Architect Hortonworks @acmurthy @bikassaha (@hortonworks)
  • 2. © Hortonworks Inc. 2013 Hello! • Founder/Architect at Hortonworks Inc. –Lead - Map-Reduce/YARN/Tez –Formerly, Architect Hadoop MapReduce, Yahoo –Responsible for running Hadoop MapReduce as a service for all of Yahoo (~50k nodes footprint) • Apache Hadoop, ASF –Frmr. VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC) –Long-term Committer/PMC member (full time for 7 years) –Release Manager for hadoop-2.x Page 2
  • 3. Once upon a time … long, long ago, there was a kingdom we shall call Apache Hadoop http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo
  • 4. Hadoop begat … a two-headed monster on every node in the kingdom; each belonged to a different clan and answered to a different master http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg
  • 5. Knights of Bytes - HDFS … stored data uncompromisingly in directories/files, nary a care about contents http://whoiscraigmoser.com/Images/identity/knight.png
  • 6. Prince of Processing - MapReduce: He ruled with an iron fist by mapping, and then by mercilessly reducing data http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg
  • 7. Peace Reigned … for a while with the odd change in the direction of the wind http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg
  • 8. Slowly, but surely … Human beings define reality through misery and suffering. - Agent Smith http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
  • 9. Slowly, but surely … Human beings define reality through misery and suffering. - Agent Smith http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
  • 10. Slowly, but surely … people of the kingdom clamored for more. A palpable sense of greed & expectation. http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg
  • 11. Signs of Distress: SQL said some, others said Machine Learning, still others said Real-Time Event Processing http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg
  • 12. A Meeting at the Summit: MapReduce is dead! Err… not quite. We need more options! We need more! True… http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp
  • 13. A Meeting at the Summit: A common thread YARN running through all applications… Long live the King! http://whipup.net/wp-content/images/2008/08/yarn.gif
  • 14. The Edict: Henceforth, in the Kingdom of King YARN… MapReduce has been relegated to the status of, merely, one of the applications! http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg
  • 15. Reign of King YARN: King YARN came to the throne with promises to return power to all applications equally, lower performance taxes and resource management… http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg
  • 16. Oh the Shame! Well, at least, Prince MapReduce still had powerful allies like Highness Hive, Powerful Pig, Cheery Cascading… http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg
  • 17. Things get worse before better: Unfortunately, things got a lot worse for the Prince MapReduce… http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg
  • 18. Knight Tez: He did MapReduce, and so much more… Smartly aligned himself to Kingdom YARN. http://twomorrows.com/alterego/media/08shiningknight.gif
  • 19. Knight Tez: Long-term alliances of MapReduce with Hive, Pig, Cascading etc. broke up… http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png … they decided to throw in their lot with Knight Tez! http://informatica.upg-ploiesti.ro/62689/img/partners.jpg
  • 20. Happily ever after… (nothing cute to say)
  • 21. On a more serious note…
  • 22. Every season has a flavor… SQL-on-Hadoop is the new black! SQL-on-Hadoop will be solved within the existing ecosystem
  • 23. Looking ahead: What will it be next year? Real-time event processing? Machine Learning?
  • 24. Play to our strengths: Invest in the Apache Hadoop platform and the ecosystem (Hive et al).
  • 25. Seriously… Technical Details
  • 26. Tez – Introduction
      • Distributed execution framework targeted towards data-processing applications.
      • Based on expressing a computation as a dataflow graph.
      • Built on top of YARN – the resource management framework for Hadoop.
      • Open source Apache Incubator project and Apache licensed.
  • 27. Tez – Design Themes
      • Empowering End Users
      • Execution Performance
  • 28. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying deployment
  • 29. Tez – Empowering End Users
      • Expressive dataflow definition APIs
          – Enable definition of complex dataflow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
          – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
      [Diagram: dataflow graph of tasks TaskA-1, TaskA-2 through TaskE-1, TaskE-2]
  • 30. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      [Diagram: Distributed Sort, with a Preprocessor Stage, a Partition Stage and an Aggregate Stage (Task-1 and Task-2 each) plus a Sampler; edge labels: Samples, Ranges]
  • 31. Tez – Empowering End Users
      • Flexible Input-Processor-Output runtime model
          – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
          – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful operators.
      [Diagram: three example tasks. IntermediateReduce: ShuffleInput, ReduceProcessor, FileSortedOutput. FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput. PairwiseJoin: Input1 and Input2, JoinProcessor, FileSortedOutput.]
  • 32. Tez – Empowering End Users
      • Data type agnostic
          – Tez is only concerned with the movement of data: files and streams of bytes.
          – Does not impose any data format on the user application. An MR application can use key-value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.
      [Diagram: a Tez Task moves Bytes (File, Stream); User Code sees Key Value pairs or Tuples]
  • 33. Tez – Empowering End Users
      • Simplifying deployment
          – Tez is a completely client-side application.
          – No deployments to do. Simply upload to any accessible FileSystem and change the local Tez configuration to point to it.
          – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
          – Leverages YARN local resources and distributed cache.
      [Diagram: two Client Machines, each with a TezClient; Tez Lib 1 and Tez Lib 2 stored in HDFS; TezTasks running on Node Managers]
  • 34. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying usage
      With great power APIs come great responsibilities :)
  • 35. Tez – Execution Performance
      • Performance gains over MapReduce
      • Plan reconfiguration at runtime
      • Optimal resource management
      • Dynamic physical data flow decisions
  • 36. Tez – Execution Performance
      • Performance gains over MapReduce
          – Eliminate replicated write barrier between successive computations.
          – Eliminate job launch overhead of workflow jobs.
          – Eliminate extra stage of map reads in every workflow job.
          – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
      [Diagram: Pig/Hive - MR compared with Pig/Hive - Tez]
  • 37. Tez – Execution Performance
      • Plan reconfiguration at runtime
          – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
          – Advanced changes in dataflow graph structure.
          – Progressive graph construction in concert with user optimizer.
      [Diagram: with HDFS Blocks and YARN Resources as inputs, Stage 1 (50 maps, 100 partitions) feeds Stage 2; the planned 100 reducers become 10 at runtime because there is only 10 GB of data]
  • 38. Tez – Execution Performance
      • Optimal resource management
          – Reuse YARN containers to launch new tasks.
          – Reuse YARN containers to enable shared objects across tasks.
      [Diagram: the Tez Application Master exchanges Start Task / Task Done / Start Task messages with a TezTask Host in a YARN Container; TezTask1 and TezTask2 in one YARN Container use SharedObjects]
  • 39. Tez – Execution Performance
      • Dynamic physical data flow decisions
          – Decide the type of physical byte movement and storage on the fly.
          – Store intermediate data on distributed store, local store or in-memory.
          – Transfer bytes via blocking files or streaming and the spectrum in between.
      [Diagram: decided At Runtime. Producer (small size) to In-Memory to Consumer; Producer to Local File to Consumer]
  • 40. Tez – Current status
      • Apache Incubator Project
          – Rapid development. Over 270 JIRA issues opened. Over 170 resolved.
          – Growing community.
      • Focus on stability
          – Testing and quality are highest priority.
          – Code ready and deployed on multi-node clusters.
      • DAG of MR processing is working
          – Already functionally equivalent to MapReduce. Existing MapReduce jobs can be executed on Tez with few or no changes.
          – Working Hive prototype that can target Tez for execution of queries.
          – Work started on prototype of Pig that can target Tez.
  • 41. Tez – Current status
      [Diagram, "Typical pattern in a TPC-DS query": Fact Table, three Joins with Dimension Table 1, 2 and 3, and Result Table 1, 2 and 3]
      [Diagram, "Optimization for small data sets": Fact Table and three boxes labeled Dimension Table 1]
      Both can now run as a single Tez job
  • 42. Tez – Looking ahead
      • Early adopters and contributors welcome
          – Adopters to drive more scenarios. Contributors to make them happen.
      • Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez
      • Useful links
          – Work tracking: https://issues.apache.org/jira/browse/TEZ
          – Code: https://github.com/apache/incubator-tez
          – High level design document and API specification: https://issues.apache.org/jira/browse/TEZ-65
          – Developer list: dev@tez.incubator.apache.org
          – User list: user@tez.incubator.apache.org
          – Issues list: issues@tez.incubator.apache.org
  • 43. Tez – Takeaways
      • Distributed execution framework that works on computations represented as dataflow graphs
      • Naturally maps to execution plans produced by query optimizers
      • Execution architecture designed to enable dynamic performance optimizations at runtime
      • Open source Apache project – your use-cases and code are welcome
      • It works and is already being used by Hive
  • 44. Tez: Thanks for your time and attention! Questions?
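
Slide 29 says that an application defines a dataflow graph through simple connection APIs and that Tez expands the logical plan at runtime. The Python sketch below illustrates that idea only. It is not the Tez API: `Dag`, `add_vertex`, `add_edge` and `expand` are hypothetical names, and the task counts per vertex are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Dag:
    """Hypothetical logical dataflow graph; not the real Tez API."""

    def __init__(self):
        self.parallelism = {}  # vertex name -> number of tasks
        self.upstream = {}     # vertex name -> set of producer vertices

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        self.upstream[name] = set()
        return self

    def add_edge(self, producer, consumer):
        # Both ends must already be declared as vertices.
        if producer not in self.upstream or consumer not in self.upstream:
            raise KeyError("unknown vertex in edge")
        self.upstream[consumer].add(producer)
        return self

    def expand(self):
        """Expand logical vertices into task names, producers first."""
        order = TopologicalSorter(self.upstream).static_order()
        return [f"Task{vertex}-{i}"
                for vertex in order
                for i in range(1, self.parallelism[vertex] + 1)]


# Two producer vertices feeding one consumer, two tasks each.
dag = (Dag()
       .add_vertex("A", 2).add_vertex("B", 2).add_vertex("C", 2)
       .add_edge("A", "C").add_edge("B", "C"))
tasks = dag.expand()
```

The application only states vertices and edges; the number and order of physical tasks fall out of `expand`, which is the part a runtime could adjust without changing the logical plan.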
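
Slide 31 describes building a task by connecting inputs, a processor and an output. A minimal sketch of that composition follows, assuming plain Python functions as stand-ins. The names echo the slide's diagram (reduce, join, sorted output), but `run_task`, `list_input`, `reduce_processor`, `join_processor` and `sorted_output` are hypothetical and are not Tez classes.

```python
from collections import defaultdict


def run_task(inputs, processor, output):
    """Wire inputs -> processor -> output as one hypothetical task."""
    return output(processor(*[read() for read in inputs]))


def list_input(records):
    # Stand-in for an input such as ShuffleInput: yields (key, value) pairs.
    return lambda: list(records)


def reduce_processor(records):
    # Sum the values seen for each key.
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return list(totals.items())


def join_processor(left, right):
    # Inner join of two (key, value) inputs on the key.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]


def sorted_output(records):
    # Stand-in for an output such as FileSortedOutput: returns sorted records.
    return sorted(records)


final_reduce = run_task([list_input([("a", 1), ("b", 2), ("a", 3)])],
                        reduce_processor, sorted_output)
pairwise_join = run_task([list_input([("a", 1), ("b", 2)]),
                          list_input([("a", 10), ("c", 30)])],
                         join_processor, sorted_output)
```

The same `run_task` wiring yields a reducer or a join depending only on which pieces are plugged in, which is the library-of-parts idea the slide describes.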
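
Slide 37 shows a plan with 100 reducers being cut to 10 at runtime because only 10 GB of data arrived. The sketch below shows one way such a decision could be computed. The sizing rule (a target of 1 GB per reducer) and the contiguous partition assignment are assumptions chosen to reproduce the slide's numbers; the slide does not state how Tez actually decides.

```python
def choose_reducers(planned, data_bytes, bytes_per_reducer):
    """Shrink the planned reducer count when the data turns out small."""
    needed = max(1, -(-data_bytes // bytes_per_reducer))  # ceiling division
    return min(planned, needed)


def assign_partitions(num_partitions, num_reducers):
    """Give each reducer a contiguous, near-equal range of partitions."""
    base, extra = divmod(num_partitions, num_reducers)
    ranges, start = [], 0
    for reducer in range(num_reducers):
        size = base + (1 if reducer < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


GB = 1024 ** 3
# Assumed target: roughly 1 GB of input per reducer.
reducers = choose_reducers(planned=100, data_bytes=10 * GB, bytes_per_reducer=GB)
ranges = assign_partitions(100, reducers)
```

Stage 1 still writes 100 partitions as planned; only the number of consumers changes, so each of the 10 reducers reads 10 partitions.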
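
Slide 38 lists two benefits of reusing YARN containers: launching new tasks without a fresh container, and sharing objects across tasks. The sketch below models both in plain Python. `Container`, `lookup_task` and the `stats` counter are hypothetical illustrations and do not correspond to YARN or Tez classes.

```python
class Container:
    """Hypothetical stand-in for a container that outlives a single task."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}  # survives across tasks in this container
        self.tasks_run = 0

    def run(self, task):
        self.tasks_run += 1
        return task(self.shared_objects)


stats = {"loads": 0}  # counts how often the expensive load happens


def lookup_task(key):
    """Build a task that needs a lookup table, loading it at most once."""
    def task(shared):
        if "table" not in shared:
            stats["loads"] += 1  # expensive load only on first use
            shared["table"] = {"a": 1, "b": 2}
        return shared["table"][key]
    return task


container = Container("container_1")
results = [container.run(lookup_task("a")), container.run(lookup_task("b"))]
```

The second task finds the table already in `shared_objects`, so the load cost is paid once per container rather than once per task.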
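
Slide 39 says the physical movement and storage of intermediate data is decided on the fly, with distributed store, local store and in-memory as the options. A minimal sketch of such a decision follows. The policy (a memory budget threshold plus a durability flag) and every name in it are assumptions for illustration; the slide only states that the choice is made at runtime.

```python
from enum import Enum


class Transport(Enum):
    IN_MEMORY = "in-memory"
    LOCAL_FILE = "local file"
    DISTRIBUTED_STORE = "distributed store"


def choose_transport(output_bytes, memory_budget_bytes,
                     must_survive_node_loss=False):
    """Pick where a producer's intermediate output should live."""
    if must_survive_node_loss:
        # Assumed rule: durable data goes to the distributed store.
        return Transport.DISTRIBUTED_STORE
    if output_bytes <= memory_budget_bytes:
        # Small outputs can be handed to the consumer without touching disk.
        return Transport.IN_MEMORY
    return Transport.LOCAL_FILE


MB = 1024 ** 2
small = choose_transport(output_bytes=8 * MB, memory_budget_bytes=64 * MB)
large = choose_transport(output_bytes=512 * MB, memory_budget_bytes=64 * MB)
durable = choose_transport(8 * MB, 64 * MB, must_survive_node_loss=True)
```

Because the producer's output size is only known once it has run, a rule like this has to be evaluated at runtime rather than at planning time.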