Hadoop 2: Going beyond MapReduce
uweseiler, 04.12.2013

About me

• Big Data Nerd
• Hadoop Trainer & MongoDB Author
• Photography Enthusiast
• Travel Pirate

About us

is a bunch of…

• Big Data Nerds
• Agile Ninjas
• Continuous Delivery Gurus
• Enterprise Java Specialists
• Performance Geeks

Join us!

Agenda

• Why Hadoop 2?
• HDFS 2
• YARN
• YARN Apps
• Write your own YARN App
• Stinger Initiative & Hive

Hadoop 1

Built for web-scale batch apps

[Diagram: siloed Hadoop 1 clusters, each dedicated to a single batch app running on top of its own HDFS instance]

MapReduce is good for…

• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
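The staples above (summing, grouping, filtering) can be sketched as a toy map/shuffle/reduce pipeline in plain Python. This is an illustration of the programming model only, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word - the classic word count
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between map and reduce)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts per word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # → 2
```

Each phase is embarrassingly parallel: mappers and reducers never share state, which is exactly why this model scales - and why the workloads on the next slides fit it poorly.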

MapReduce is OK for…

• Iterative jobs (i.e., graph algorithms)
  – Each iteration must read/write data to disk
  – I/O and latency cost of an iteration is high

MapReduce is not good for…

• Jobs that need shared state/coordination
  – Tasks are shared-nothing
  – Shared state requires a scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records

MapReduce limitations

• Scalability
  – Maximum cluster size ~4,500 nodes
  – Maximum concurrent tasks ~40,000
  – Coarse synchronization in JobTracker
• Availability
  – JobTracker failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms and services
  – Iterative applications implemented using MapReduce are 10x slower

A brief history of Hadoop 2

• Originally conceived & architected by the team at Yahoo!
  – Arun Murthy created the original JIRA in 2008 and is now the Hadoop 2 release manager
• The community has been working on Hadoop 2 for over 4 years
• Hadoop 2-based architecture running at scale at Yahoo!
  – Deployed on 35,000+ nodes for 6+ months

Hadoop 2: Next-gen platform

Single-use system → multi-purpose platform
Batch apps → batch, interactive, streaming, …

Hadoop 1:
• MapReduce (cluster resource mgmt. + data processing)
• HDFS (redundant, reliable storage)

Hadoop 2:
• MapReduce and others (data processing)
• YARN (cluster resource management)
• HDFS 2 (redundant, reliable storage)

Taking Hadoop beyond batch

Store all data in one place
Interact with data in multiple ways
Applications run natively in Hadoop

• Batch: MapReduce
• Interactive: Tez
• Online: HOYA
• Streaming: Storm, …
• Graph: Giraph
• In-Memory: Spark
• Other: Search, …

All on YARN (cluster resource management) over HDFS 2 (redundant, reliable storage)


HDFS 2: In a nutshell

• Removes tight coupling of block storage and namespace
• High availability
• Scalability & isolation
• Increased performance

Details: https://issues.apache.org/jira/browse/HDFS-1052
HDFS 2: Federation

• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace (e.g. Namespace 1: logs, finance; Namespace 2: insights, reports)
• Block storage becomes a generic storage service: each namespace gets its own block pool
• Common storage: DataNodes can store blocks managed by any NameNode
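With federation, clients typically see one unified namespace via ViewFs, configured through a client-side mount table. A minimal core-site.xml sketch might look like this (the nameservice name, hosts and paths are made up for illustration):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://clusterX</value>
  </property>
  <!-- /logs served by NameNode 1 -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./logs</name>
    <value>hdfs://namenode1:8020/logs</value>
  </property>
  <!-- /insights served by NameNode 2 -->
  <property>
    <name>fs.viewfs.mounttable.clusterX.link./insights</name>
    <value>hdfs://namenode2:8020/insights</value>
  </property>
</configuration>
```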

HDFS 2: Architecture (shared storage on NFS)

• Active and standby NameNode share state via a shared directory on NFS
• Only the active NameNode writes edits to the shared edits file
• The standby simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one

HDFS 2: Quorum-based storage

• NameNode state is shared on a quorum of JournalNodes
• Only the active NameNode writes edits
• The standby simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one
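An HA pair with quorum journal storage is wired up in hdfs-site.xml roughly as below; the nameservice and host names are illustrative, not from the slides:

```xml
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Edits go to a quorum of JournalNodes -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <!-- Let the ZKFailoverController handle automatic failover -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```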

HDFS 2: High Availability

• A ZKFailoverController process monitors the health of each NameNode
• When healthy, it holds a persistent ZooKeeper session; the active side also holds a special lock znode
• Shared NameNode state lives on the quorum of JournalNodes
• DataNodes send block reports to both the active and the standby NameNode

HDFS 2: Snapshots

• Admin can create point-in-time snapshots of HDFS
  • Of the entire file system (/root)
  • Of a specific data set (sub-tree directory of the file system)
• Restore the state of the entire file system or data set to a snapshot (like Apple Time Machine)
  • Protects against user errors
• Snapshot diffs identify changes made to a data set
  • Keep track of how raw or derived/analytical data changes over time

HDFS 2: NFS Gateway

• NFS v3 standard
  • Supports all HDFS commands: list files; copy, move files; create and delete directories
• Ingest for large-scale analytical workloads
  • Load immutable files as source for analytical processing
  • No random writes
• Stream files into HDFS
  • Log ingest by applications writing directly to an HDFS client mount

HDFS 2: Performance

• Many improvements
  • SSE4.2 CRC32C: ~3x less CPU on the read path
  • Read path improvements for fewer memory copies
  • Short-circuit local reads for 2-3x faster random reads
  • I/O improvements using posix_fadvise()
  • libhdfs improvements for zero-copy reads
• Significant improvements: I/O 2.5-5x faster


YARN: Design Goals

• Build a new abstraction layer by splitting up the two major functions of the JobTracker
  • Cluster resource management
  • Application life-cycle management
• Allow other processing paradigms
  • Flexible API for implementing YARN apps
  • MapReduce becomes a YARN app
  • Lots of different YARN apps

YARN: Concepts

• Application
  – An application is a job submitted to the YARN framework
  – Example: MapReduce job, Storm topology, …
• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (RAM, CPU, disk, network, GPU, etc.)
    • container_0 = 4 GB, 1 CPU
    • container_1 = 512 MB, 6 CPUs
  – Replaces the fixed map/reduce slots
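The win over fixed slots can be shown with a bit of arithmetic: packing right-sized containers onto a node versus carving the same node into fixed 2 GB slots. All numbers here are invented for illustration:

```python
NODE_RAM_GB = 16

# Hadoop 1: the node is carved into fixed 2 GB slots
SLOT_GB = 2
slots = NODE_RAM_GB // SLOT_GB          # 8 slots
small_tasks = 8                         # each task really needs only 512 MB
used_mr1 = slots * SLOT_GB              # 16 GB reserved either way
needed = small_tasks * 0.5              # 4 GB actually needed
print(f"MR1 utilization: {needed / used_mr1:.0%}")   # → 25%

# YARN: containers are sized to the request, so leftover RAM is usable
containers = [0.5] * 8 + [4, 4, 4]      # eight 512 MB tasks + three 4 GB tasks
assert sum(containers) <= NODE_RAM_GB
print(f"YARN utilization: {sum(containers) / NODE_RAM_GB:.0%}")  # → 100%
```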

YARN: Architecture

• ResourceManager
  – Global resource scheduler
  – Hierarchical queues
• NodeManager
  – Per-machine agent
  – Manages the life-cycle of containers
  – Container resource monitoring
• ApplicationMaster
  – Per-application
  – Manages application scheduling and task execution
  – e.g. MapReduce ApplicationMaster

YARN: Architectural Overview

Split up the two major functions of the JobTracker:
cluster resource management & application life-cycle management

[Diagram: a ResourceManager with its Scheduler coordinates many NodeManagers; AM 1 runs containers 1.1 and 1.2, AM 2 runs containers 2.1-2.3, spread across the nodes]

YARN: BigDataOS™

[Diagram: one cluster under a single ResourceManager/Scheduler; the NodeManagers concurrently host map and reduce tasks of two MapReduce jobs, an HBase Master with Region Servers (via HOYA), Storm nimbus workers, and Tez vertices, all side by side]

YARN: Multi-tenancy

[Diagram: the same mixed workload, now governed by a hierarchical queue tree under root, e.g. DWH / Ad-Hoc / User at 60% / 30% / 10%, with Prod/Dev sub-queues (75%/25% and 80%/20%) and Dev further split into Dev1/Dev2 at 60%/40%]

YARN: Capacity Scheduler

• Queues
  • Hierarchical queues
  • Economics as queue capacities
• Resource isolation
  • SLAs via preemption
• Administration
  • Queue ACLs
  • Runtime reconfiguration
  • Charge-back
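A hierarchical queue layout like the one sketched above is declared in capacity-scheduler.xml. The queue names and percentages below follow the multi-tenancy example and are purely illustrative:

```xml
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>dwh,adhoc,user</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dwh.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user.capacity</name>
    <value>10</value>
  </property>
  <!-- Sub-queues: capacities are percentages of the parent queue -->
  <property>
    <name>yarn.scheduler.capacity.root.dwh.queues</name>
    <value>prod,dev</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dwh.prod.capacity</name>
    <value>75</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.dwh.dev.capacity</name>
    <value>25</value>
  </property>
</configuration>
```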


YARN Apps: Overview

• MapReduce 2 – Batch
• Apache Tez – Batch/Interactive
• HOYA – HBase on YARN
• Storm – Stream Processing
• Samza – Stream Processing
• Apache S4 – Stream Processing
• Spark – Iterative/Interactive
• Apache Giraph – Graph Processing
• Apache Hama – Bulk Synchronous Parallel
• Elastic Search – Scalable Search
• Cloudera Llama – Impala on YARN


MapReduce 2: In a nutshell

• MapReduce is now a YARN app
  • No more map and reduce slots, it's containers now
  • No more JobTracker, it's a YarnAppmaster library now
• Multiple versions of MapReduce
  • The older mapred APIs work without modification or recompilation
  • The newer mapreduce APIs may need to be recompiled
• Still has one master server component: the Job History Server
  • The Job History Server stores the execution history of jobs
  • Used to audit prior execution of jobs
  • Will also be used by the YARN framework to store charge-backs at that level
• Better cluster utilization
• Increased scalability & availability

MapReduce 2: Shuffle

• Faster shuffle
  • Better embedded server: Netty
• Encrypted shuffle
  • Secures the shuffle phase as data moves across the cluster
  • Requires 2-way HTTPS, certificates on both sides
  • Causes significant CPU overhead; reserve 1 core for this work
  • Certificates stored on each node (provisioned with the cluster), refreshed every 10 secs
• Pluggable shuffle/sort
  • Shuffle is the first phase in MapReduce that is guaranteed not to be data-local
  • Pluggable shuffle/sort allows application or hardware developers to intercept the network-heavy workload and optimize it
  • Typical implementations have hardware components like fast networks and software components like sorting algorithms
  • API will change with future versions of Hadoop

MapReduce 2: Performance

• Key optimizations
  • No hard segmentation of resources into map and reduce slots
  • YARN scheduler is more efficient
  • The MR2 framework is more efficient than MR1: the shuffle phase in MRv2 is more performant thanks to Netty
• At Yahoo!: 40,000+ nodes running YARN across over 365 PB of data
  • About 400,000 jobs per day for about 10 million hours of compute time
  • Estimated 60%-150% improvement in node usage per day
  • Retired a whole 10,000-node datacenter because of the increased utilization


Apache Tez: In a nutshell

• Low-level data-processing execution engine
• Tez is Hindi for "speed"
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• The new primitive for MapReduce, Hive, Pig, Cascading, etc.
Apache Tez: The new primitive

MapReduce as base → Apache Tez as base

[Diagram: in Hadoop 1, Pig and Hive compile down to MapReduce, which handles both cluster resource management and data processing over HDFS; in Hadoop 2, Pig, Hive and other tools run on the Tez execution engine, next to real-time engines like Storm, all managed by YARN over HDFS]

Apache Tez: Architecture

• Task with pluggable Input, Processor & Output
  • "Classical" map task: HDFS Input → Map Processor → Sorted Output
  • "Classical" reduce task: Shuffle Input → Reduce Processor → HDFS Output
• A YARN ApplicationMaster runs a DAG of Tez tasks
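The pluggable task model can be mimicked in a few lines of Python: a task is just an input, a processor and an output composed together. This is a conceptual sketch, not the Tez API:

```python
def hdfs_input():
    # Stand-in for an HDFS Input: read records from storage
    return ["b", "a", "c"]

def map_processor(records):
    # Stand-in for a Map Processor: transform each record
    return [r.upper() for r in records]

def sorted_output(records):
    # Stand-in for a Sorted Output: sort before handing to the next task
    return sorted(records)

def run_task(inp, processor, output):
    # A Tez-style task is the composition of its three pluggable parts;
    # swapping any part (e.g. Shuffle Input for HDFS Input) yields a new task type
    return output(processor(inp()))

print(run_task(hdfs_input, map_processor, sorted_output))  # → ['A', 'B', 'C']
```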

Apache Tez: Performance

SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS avg
FROM c GROUP BY x
ORDER BY avg;

With MapReduce, each stage ends in a synchronization barrier that spills to HDFS; Tez avoids these unnecessary writes to HDFS.


HOYA: In a nutshell

• Create on-demand HBase clusters
  • Small HBase clusters in a large YARN cluster
  • Dynamic HBase clusters
  • Self-healing HBase clusters
  • Elastic HBase clusters
  • Transient/intermittent clusters for workflows
• Configure custom configurations & versions
• Better isolation
• More efficient utilization/sharing of the cluster

HOYA: Creation of AppMaster

[Diagram: the HOYA client uses YarnClient plus a HOYA-specific API to ask the ResourceManager/Scheduler for a container; a NodeManager launches the HOYA ApplicationMaster in that container]

HOYA: Deployment of HBase

[Diagram: the HOYA ApplicationMaster requests further containers and deploys the HBase Master and Region Servers into them across the NodeManagers]

HOYA: Bind via ZooKeeper

[Diagram: an HBase client discovers the dynamically placed HBase Master and Region Servers via ZooKeeper, then talks to them directly]


Twitter Storm

• Stream processing
• Real-time processing
• Developed as a standalone application
  • https://github.com/nathanmarz/storm
• Ported to YARN
  • https://github.com/yahoo/storm-yarn

Storm: Conceptual view

• Spout: source of streams
• Bolt: consumer of streams; processes tuples, possibly emits new tuples
• Stream: unbound sequence of tuples
• Tuple: list of name-value pairs
• Topology: network with spouts & bolts as the nodes and streams as the edges
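The spout/bolt/topology vocabulary can be mimicked with plain Python generators - a sketch of the concepts only, not Storm's actual API:

```python
def spout():
    # Spout: source of a stream - an unbound sequence of tuples
    for n in range(5):
        yield {"value": n}            # tuple as name-value pairs

def double_bolt(stream):
    # Bolt: consumes tuples and emits new ones
    for t in stream:
        yield {"value": t["value"] * 2}

def filter_bolt(stream):
    # Bolt: may also drop tuples
    for t in stream:
        if t["value"] >= 4:
            yield t

# Topology: spouts and bolts wired together by streams
out = list(filter_bolt(double_bolt(spout())))
print(out)  # → [{'value': 4}, {'value': 6}, {'value': 8}]
```

In real Storm the topology runs forever over an unbounded stream and is distributed across workers; the wiring idea is the same.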


Spark: In a nutshell

• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
  – Speedup of up to 100x
  – On-disk queries 5-10x faster
• Compatible with Hadoop's storage API
• Available as a standalone application
  – https://github.com/mesos/spark
• Experimental support for YARN since 0.6
  – http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html

Spark: Data Sharing

•
•
•
•
•
•
•
•
•
•
•

YARN Apps: Overview

MapReduce 2
Apache Tez
HOYA
Storm
Samza
Apache S4
Spark
Apache Giraph
Apache Hama
Elastic Search
Cloudera Llama

Batch
Batch/Interactive
HBase on YARN
Stream Processing
Stream Processing
Stream Processing
Iterative/Interactive
Graph Processing
Bulk Synchronous Parallel
Scalable Search
Impala on YARN
04.12.2013

Giraph: In a nutshell

• Giraph is a framework for processing semi-structured graph data on a massive scale
• Giraph is loosely based upon Google's Pregel
• Giraph performs iterative calculations on top of an existing Hadoop cluster
• Available on GitHub
  – https://github.com/apache/giraph


YARN: API & client libraries

• Application Client Protocol: client – RM interaction
  • Library: YarnClient
  • Application lifecycle control
  • Access cluster information
• Application Master Protocol: AM – RM interaction
  • Library: AMRMClient / AMRMClientAsync
  • Resource negotiation
  • Heartbeat to the RM
• Container Management Protocol: AM – NM interaction
  • Library: NMClient / NMClientAsync
  • Launching allocated containers
  • Stop running containers

Or use external frameworks like Weave, REEF or Spring.
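The AM-RM loop behind AMRMClient can be sketched as a toy simulation: the AM heartbeats, asks for containers, and tracks whatever the scheduler granted. Pure illustration - real code would use the Java client libraries above against a live ResourceManager:

```python
class ToyScheduler:
    """Stand-in for the RM scheduler: grants containers while capacity lasts."""
    def __init__(self, capacity):
        self.capacity = capacity

    def allocate(self, asks):
        granted = min(asks, self.capacity)
        self.capacity -= granted
        return [f"container_{i}" for i in range(granted)]

class ToyAppMaster:
    def __init__(self, needed):
        self.needed = needed
        self.running = []

    def heartbeat(self, rm):
        # The heartbeat doubles as resource negotiation, like AMRMClient
        for c in rm.allocate(self.needed - len(self.running)):
            self.running.append(c)   # NMClient would launch these on NodeManagers
        return len(self.running) == self.needed

rm = ToyScheduler(capacity=10)
am = ToyAppMaster(needed=4)
while not am.heartbeat(rm):
    pass                             # keep heartbeating until fully allocated
print(len(am.running))  # → 4
```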
YARN: Application flow

[Diagram: an app client uses YarnClient (Application Client Protocol) to submit to the ResourceManager/Scheduler; the launched ApplicationMaster ("myAM") negotiates containers via AMRMClient (Application Master Protocol) and launches them on NodeManagers via NMClient (Container Management Protocol); the client then talks to the AM over an app-specific API]

YARN: More things to consider

• Container / tasks
• Client
• UI
• Recovery
• Container-to-AM communication
• Application history

YARN: Testing / Debugging

• MiniYARNCluster
  • Cluster within the process, in threads (HDFS, MR, HBase)
  • Useful for regression tests
  • Avoid for unit tests: too slow
• Unmanaged AM
  • Support to run the AM outside of a YARN cluster for manual testing
  • Debugger, heap dumps, etc.
• Logs
  • Log aggregation support to push all logs into HDFS
  • Accessible via CLI, UI
  • Useful to enable all application logs by default


Hive: Current Focus Area

[Diagram: query-latency spectrum vs data size - Real-Time (0-5s: online systems, real-time analytics, CEP), Interactive (5s-1m: parameterized reports, drilldown, visualization, exploration), Non-Interactive (1m-1h: data preparation, incremental batch processing, dashboards/scorecards), Batch (1h+: operational batch processing, enterprise reports, data mining). Hive's current sweet spot is the non-interactive and batch end]

Stinger: Extending the sweet spot

[Diagram: the same latency spectrum; future Hive expansion pushes into the interactive (5s-1m) and real-time (0-5s) ranges]

Improve latency & throughput
• Query engine improvements
• New "Optimized RCFile" column store
• Next-gen runtime (eliminates MapReduce latency)

Extend deep analytical ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases

Stinger Initiative: In a nutshell

Stinger: Enhancing SQL Semantics

Stinger: Tez revisited

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Hive on MapReduce: three jobs, with an I/O synchronization barrier (spill to HDFS) between Job 1, Job 2 and Job 3.
Hive on Tez: a single job, with no intermediate HDFS writes.

Stinger: Tez Service

• MapReduce query startup is expensive:
  – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s)
• Solution:
  – Tez Service (= preallocated ApplicationMaster)
    • Removes job-launch overhead (ApplicationMaster)
    • Removes task-launch overhead (pre-warmed containers)
  – Hive (or Pig) submits the query plan to the Tez Service
  – Native Hadoop service, not ad-hoc

Stinger: Tez Performance

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

                        Existing Hive   Hive + Tez   Hive + Tez Service
Parse query                 0.5s            0.5s           0.5s
Create plan                 0.5s            0.5s           0.5s
Launch MapReduce            20s             20s            -
Submit to Tez Service       -               -              0.5s
Process                     10s             2s             2s
Total                       31s             23s            3.5s

* No exact numbers, for illustration only

Hive: Persistence Options

• Built-in formats
  • ORCFile
  • RCFile
  • Avro
  • Delimited Text
  • Regular Expression
  • S3 Logfile
  • Typed Bytes
• 3rd-party add-ons
  • JSON
  • XML

Stinger: ORCFile in a nutshell

• Optimized Row Columnar file
  • Columns stored separately
• Knows types
  • Uses type-specific encoders
  • Stores statistics (min, max, sum, count)
• Has a light-weight index
  • Skips over blocks of rows that don't matter
• Larger blocks
  • 256 MB by default
  • Has an index for block boundaries
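Using ORCFile from Hive is a one-line storage clause (Hive 0.11+). The table and compression choice below are hypothetical, for illustration only:

```sql
-- Illustrative table stored in ORC format
CREATE TABLE page_views (
  user_id    BIGINT,
  url        STRING,
  view_time  TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```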

Stinger: ORCFile Layout

[Diagram: the columnar format arranges columns adjacent within the file for compression and fast access; the large block size is well suited for HDFS]

Stinger: ORCFile advantages

• High compression
  • Many tricks used out-of-the-box to ensure high compression rates: RLE, dictionary encoding, etc.
• High performance
  • Inline indexes record value ranges within blocks of ORCFile data
  • Filter pushdown allows efficient scanning during precise queries
• Flexible data model
  • All Hive types, including maps, structs and unions

Stinger: ORCFile compression

[Chart: file sizes on a data set from TPC-DS; measurements by Owen O'Malley, Hortonworks]

Stinger: More optimizations

• Join optimizations
  • New join types added or improved in Hive 0.11
• More efficient query plan generation
• Vectorization
  • Make the most of L1 and L2 caches
  • Avoid branching whenever possible
• Optimized query planner
  • Automatically determines optimal execution parameters
• Buffering
  • Caches hot data in memory
• …

Stinger: Overall Performance

[Chart: benchmark results - real numbers, but handle with care!]

Hadoop 2: Summary

1. Scale
2. New programming models & services
3. Improved cluster utilization
4. Agility
5. Beyond Java

One more thing…

Let's get started with Hadoop 2!

Hortonworks Sandbox 2.0

http://hortonworks.com/products/hortonworks-sandbox

Dude, where's my Hadoop 2?

[Table of Hadoop 2 support across distributions, by Adam Muise, Hortonworks]

The end…or the beginning?

 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 

En vedette (20)

storm at twitter
storm at twitterstorm at twitter
storm at twitter
 
Big data landscape version 2.0
Big data landscape version 2.0Big data landscape version 2.0
Big data landscape version 2.0
 
Phoenix - A High Performance Open Source SQL Layer over HBase
Phoenix - A High Performance Open Source SQL Layer over HBasePhoenix - A High Performance Open Source SQL Layer over HBase
Phoenix - A High Performance Open Source SQL Layer over HBase
 
Something about Kafka - Why Kafka is so fast
Something about Kafka - Why Kafka is so fastSomething about Kafka - Why Kafka is so fast
Something about Kafka - Why Kafka is so fast
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
How To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and HadoopHow To Analyze Geolocation Data with Hive and Hadoop
How To Analyze Geolocation Data with Hive and Hadoop
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 
Apache Hadoop Big Data Technology
Apache Hadoop Big Data TechnologyApache Hadoop Big Data Technology
Apache Hadoop Big Data Technology
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to BIg Data and Hadoop
Introduction to BIg Data and HadoopIntroduction to BIg Data and Hadoop
Introduction to BIg Data and Hadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Stock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce ImplementationStock Analyzer Hadoop MapReduce Implementation
Stock Analyzer Hadoop MapReduce Implementation
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 

Similaire à Hadoop 2 - Going beyond MapReduce

Similaire à Hadoop 2 - Going beyond MapReduce (20)

Welcome to Hadoop2Land!
Welcome to Hadoop2Land!Welcome to Hadoop2Land!
Welcome to Hadoop2Land!
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
HADOOP
HADOOPHADOOP
HADOOP
 
Anju
AnjuAnju
Anju
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Hadoop
HadoopHadoop
Hadoop
 
Big data applications
Big data applicationsBig data applications
Big data applications
 

Plus de Uwe Printz

Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldUwe Printz
 
Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesUwe Printz
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)Uwe Printz
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)Uwe Printz
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererUwe Printz
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBUwe Printz
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtUwe Printz
 

Plus de Uwe Printz (17)

Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Hadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the fieldHadoop Operations - Best practices from the field
Hadoop Operations - Best practices from the field
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & Databases
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-Programmierer
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDB
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group Frankfurt
 

Dernier

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Dernier (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Hadoop 2 - Going beyond MapReduce

• The community has been working on Hadoop 2 for over 4 years
• Hadoop 2 based architecture running at scale at Yahoo!
– Deployed on 35,000+ nodes for 6+ months
04.12.2013

Hadoop 2: Next-gen platform

• Hadoop 1: Single-use system
– Batch apps only
– MapReduce: cluster resource management + data processing
– HDFS: redundant, reliable storage
• Hadoop 2: Multi-purpose platform
– Batch, interactive, streaming, …
– MapReduce and others: data processing
– YARN: cluster resource management
– HDFS 2: redundant, reliable storage
04.12.2013

Taking Hadoop beyond batch

• Store all data in one place, interact with data in multiple ways
• Applications run natively in Hadoop:
– Batch: MapReduce
– Interactive: Tez
– Online: HOYA
– Streaming: Storm, …
– Graph: Giraph
– In-Memory: Spark
– Other: Search, …
• YARN: cluster resource management
• HDFS 2: redundant, reliable storage
04.12.2013

Agenda

• Why Hadoop 2?
• HDFS 2
• YARN
• YARN Apps
• Write your own YARN App
• Stinger Initiative & Hive
04.12.2013

HDFS 2: In a nutshell

• Removes tight coupling of Block Storage and Namespace
• High Availability
• Scalability & Isolation
• Increased performance

Details: https://issues.apache.org/jira/browse/HDFS-1052
04.12.2013

HDFS 2: Federation

• NameNodes do not talk to each other
• Each NameNode manages only a slice of the namespace (e.g. logs, finance, insights)
• Each namespace has its own block pool
• Block storage becomes a generic storage service: DataNodes can store blocks managed by any NameNode
• DataNodes send block reports per pool to the NameNodes they serve
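The federation layout above can be sketched as a toy model: each NameNode owns one namespace slice and one block pool, while every DataNode acts as generic storage holding blocks from any pool. All class and method names below are illustrative, not the real HDFS internals.

```python
# Toy model of HDFS 2 federation (illustrative names, not real HDFS classes).

class NameNode:
    def __init__(self, namespace):
        self.namespace = namespace      # slice of the namespace, e.g. "/logs"
        self.block_pool = set()         # IDs of blocks owned by this namespace

    def allocate_block(self, block_id):
        self.block_pool.add(block_id)
        return block_id

class DataNode:
    def __init__(self):
        self.blocks = {}                # block_id -> owning namespace

    def store(self, namenode, block_id):
        # DataNodes are generic storage: they accept blocks from any NameNode.
        self.blocks[block_id] = namenode.namespace

def block_report(datanode, namenode):
    # Block reports are per pool: a NameNode only sees its own blocks.
    return {b for b, ns in datanode.blocks.items() if ns == namenode.namespace}

nn_logs, nn_finance = NameNode("/logs"), NameNode("/finance")
dn = DataNode()
dn.store(nn_logs, nn_logs.allocate_block("blk_1"))
dn.store(nn_finance, nn_finance.allocate_block("blk_2"))
print(block_report(dn, nn_logs))        # {'blk_1'}
```

The point of the sketch is the decoupling: removing a NameNode only removes its namespace and block pool, while the DataNodes keep serving every other pool untouched.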
04.12.2013

HDFS 2: Architecture

• Active and Standby NameNode share state via a shared directory on NFS
• Only the active NameNode writes edits
• The standby simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one
04.12.2013

HDFS 2: Quorum based storage

• The state is shared on a quorum of JournalNodes
• Only the active NameNode writes edits
• The standby simultaneously reads and applies the edits
• DataNodes report to both NameNodes but listen only to the orders from the active one
04.12.2013

HDFS 2: High Availability

• A ZKFailoverController per NameNode monitors the health of its NameNode
• When healthy, it holds a persistent ZooKeeper session; the active side additionally holds a special lock znode
• NameNode state is shared via a quorum of JournalNodes
• DataNodes send block reports to both the active and the standby NameNode
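The failover mechanics can be illustrated with a tiny model: each ZKFailoverController keeps trying to hold a lock "znode" while its NameNode is healthy, and whichever controller holds the lock has the active NameNode. This is a plain-Python sketch of the idea, not the real ZooKeeper API.

```python
# Minimal sketch of ZKFC-style leader election (illustrative, not ZooKeeper).

class LockZnode:
    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None

class ZKFailoverController:
    def __init__(self, namenode, lock):
        self.namenode, self.lock, self.healthy = namenode, lock, True

    def monitor(self):
        # Healthy: keep (or try to take) the lock. Unhealthy: drop the session.
        if self.healthy:
            return "active" if self.lock.try_acquire(self.namenode) else "standby"
        self.lock.release(self.namenode)
        return "down"

lock = LockZnode()
zkfc1 = ZKFailoverController("nn1", lock)
zkfc2 = ZKFailoverController("nn2", lock)
assert (zkfc1.monitor(), zkfc2.monitor()) == ("active", "standby")

zkfc1.healthy = False       # nn1 fails its health check
zkfc1.monitor()             # its controller releases the lock
assert zkfc2.monitor() == "active"   # the standby is promoted
```

The real ZKFC relies on ZooKeeper's ephemeral znodes so the lock disappears automatically when the session dies; the explicit `release` here stands in for that.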
04.12.2013

HDFS 2: Snapshots

• Admin can create point-in-time snapshots of HDFS
– Of the entire file system (/root)
– Of a specific data set (sub-tree directory of the file system)
• Restore state of the entire file system or a data set to a snapshot (like Apple Time Machine)
– Protect against user errors
• Snapshot diffs identify changes made to a data set
– Keep track of how raw or derived/analytical data changes over time
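The diff idea is simple to picture: compare two point-in-time images of a data set and report created, deleted and modified paths. The real feature is the `hdfs snapshotDiff` command; the sketch below is just the concept in miniature, with snapshots modelled as plain path-to-content maps.

```python
# Toy snapshot diff: each snapshot maps path -> content.
# "+" = created, "-" = deleted, "M" = modified between the two snapshots.

def snapshot_diff(s_from, s_to):
    created  = sorted(set(s_to) - set(s_from))
    deleted  = sorted(set(s_from) - set(s_to))
    modified = sorted(p for p in set(s_from) & set(s_to) if s_from[p] != s_to[p])
    return {"+": created, "-": deleted, "M": modified}

s0 = {"/data/raw/a.log": "v1", "/data/raw/b.log": "v1"}
s1 = {"/data/raw/a.log": "v2", "/data/raw/c.log": "v1"}
print(snapshot_diff(s0, s1))
# {'+': ['/data/raw/c.log'], '-': ['/data/raw/b.log'], 'M': ['/data/raw/a.log']}
```

HDFS computes this without comparing file contents (it tracks changes per snapshottable directory), but the reported categories are the same.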
04.12.2013

HDFS 2: NFS Gateway

• NFS v3 standard
• Supports all HDFS commands
– List files
– Copy, move files
– Create and delete directories
• Ingest for large-scale analytical workloads
– Load immutable files as source for analytical processing
– No random writes
• Stream files into HDFS
– Log ingest by applications writing directly to the HDFS client mount
04.12.2013

HDFS 2: Performance

• Many improvements
– SSE4.2 CRC32C: ~3x less CPU on the read path
– Read path improvements for fewer memory copies
– Short-circuit local reads for 2-3x faster random reads
– I/O improvements using posix_fadvise()
– libhdfs improvements for zero-copy reads
• Significant improvements: I/O 2.5-5x faster
04.12.2013

Agenda

• Why Hadoop 2?
• HDFS 2
• YARN
• YARN Apps
• Write your own YARN App
• Stinger Initiative & Hive
04.12.2013

YARN: Design Goals

• Build a new abstraction layer by splitting up the two major functions of the JobTracker
– Cluster resource management
– Application life-cycle management
• Allow other processing paradigms
– Flexible API for implementing YARN apps
– MapReduce becomes just another YARN app
– Lots of different YARN apps
04.12.2013

YARN: Concepts

• Application
– A job submitted to the YARN framework
– Example: MapReduce job, Storm topology, …
• Container
– Basic unit of allocation
– Fine-grained resource allocation across multiple resource types (RAM, CPU, disk, network, GPU, etc.)
• container_0 = 4 GB, 1 CPU
• container_1 = 512 MB, 6 CPUs
– Replaces the fixed map/reduce slots
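The difference from fixed slots is easy to show in code: a node no longer offers N map slots and M reduce slots, but a pool of memory and cores from which arbitrarily sized containers are carved. A minimal sketch, with made-up names (not the real YARN scheduler):

```python
# Fine-grained container allocation on a single node (illustrative only).

class Node:
    def __init__(self, mem_mb, vcores):
        self.free_mem, self.free_cores = mem_mb, vcores
        self.containers = []

    def allocate(self, mem_mb, vcores):
        if mem_mb > self.free_mem or vcores > self.free_cores:
            return None                 # request does not fit on this node
        self.free_mem -= mem_mb
        self.free_cores -= vcores
        cid = f"container_{len(self.containers)}"
        self.containers.append((cid, mem_mb, vcores))
        return cid

node = Node(mem_mb=8192, vcores=8)
print(node.allocate(4096, 1))   # container_0 = 4 GB, 1 CPU
print(node.allocate(512, 6))    # container_1 = 512 MB, 6 CPUs
print(node.allocate(8192, 1))   # None: only 3.5 GB left on the node
```

Because requests are arbitrary (mem, cores) pairs, a memory-hungry reducer and a CPU-hungry Storm worker can share the same node instead of wasting fixed slots.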
04.12.2013

YARN: Architecture

• ResourceManager
– Global resource scheduler
– Hierarchical queues
• NodeManager
– Per-machine agent
– Manages the life-cycle of containers
– Container resource monitoring
• ApplicationMaster
– Per-application
– Manages application scheduling and task execution
– e.g. MapReduce ApplicationMaster
04.12.2013

YARN: Architectural Overview

• Splits up the two major functions of the JobTracker: cluster resource management & application life-cycle management
• A central ResourceManager with its Scheduler
• One NodeManager per worker node
• Each application gets its own ApplicationMaster (AM 1, AM 2), which itself runs in a container and requests further containers (container 1.1, 1.2, 2.1, …) on the NodeManagers
04.12.2013

YARN: BigDataOS™

• Containers from many frameworks share the same cluster: MapReduce map/reduce tasks, Tez vertices, Storm nimbus/workers, and HOYA-managed HBase Masters and RegionServers side by side
• The ResourceManager scheduler places all of them onto the NodeManagers
04.12.2013

YARN: Multi-tenancy

• The cluster is shared between frameworks and organizations via hierarchical queues
• Example: root is split into Ad Hoc (30%), DWH (10%) and User (60%) queues
• Sub-queues such as Dev/Prod split their parent's capacity further (75%/25%, 20%/80%, and Dev1/Dev2 at 60%/40%)
• 30. 04.12.2013 YARN: Capacity Scheduler
• Queues
– Hierarchical queues
– Economics as queue capacities
– SLAs via preemption
• Resource Isolation
• Administration
– Queue ACLs
– Runtime reconfiguration
– Charge-back
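Hierarchical queue capacities multiply down the tree: a leaf queue's absolute share of the cluster is the product of the relative capacities along its path to the root. A small Python sketch of that arithmetic (the queue names and percentages are illustrative, loosely mirroring the Dev/Prod split on the slide, not a real scheduler config):

```python
# Each queue declares a capacity relative to its parent
queues = {
    "root":          {"capacity": 1.00, "parent": None},
    "root.prod":     {"capacity": 0.60, "parent": "root"},
    "root.dev":      {"capacity": 0.40, "parent": "root"},
    "root.dev.dev1": {"capacity": 0.75, "parent": "root.dev"},
    "root.dev.dev2": {"capacity": 0.25, "parent": "root.dev"},
}

def absolute_capacity(name):
    """Multiply relative capacities up the hierarchy to the root."""
    cap = 1.0
    while name is not None:
        cap *= queues[name]["capacity"]
        name = queues[name]["parent"]
    return cap

print(round(absolute_capacity("root.dev.dev1"), 2))  # 0.3 -> 30 % of the cluster
```

So a "75 %" dev1 queue under a "40 %" dev queue is guaranteed 30 % of the whole cluster; preemption and elasticity then adjust the actual usage around these guarantees.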
  • 31. 04.12.2013 Agenda • Why Hadoop 2? • HDFS 2 • YARN • YARN Apps • Write your own YARN App • Stinger Initiative & Hive
• 32. 04.12.2013 YARN Apps: Overview
• MapReduce 2 – Batch
• Apache Tez – Batch/Interactive
• HOYA – HBase on YARN
• Storm – Stream Processing
• Samza – Stream Processing
• Apache S4 – Stream Processing
• Spark – Iterative/Interactive
• Apache Giraph – Graph Processing
• Apache Hama – Bulk Synchronous Parallel
• Elastic Search – Scalable Search
• Cloudera Llama – Impala on YARN
• 33. 04.12.2013 YARN Apps: Overview (repeated, highlighting MapReduce 2)
• 34. 04.12.2013 MapReduce 2: In a nutshell
• MapReduce is now a YARN app
– No more map and reduce slots, it’s containers now
– No more JobTracker, it’s a YARN ApplicationMaster library now
• Multiple versions of MapReduce
– The older mapred APIs work without modification or recompilation
– The newer mapreduce APIs may need to be recompiled
• Still has one master server component: the Job History Server
– Stores the execution history of jobs
– Used to audit prior execution of jobs
– Will also be used by the YARN framework to store charge-backs at that level
• Better cluster utilization
• Increased scalability & availability
• 35. 04.12.2013 MapReduce 2: Shuffle
• Faster shuffle
– Better embedded server: Netty
• Encrypted shuffle
– Secures the shuffle phase as data moves across the cluster
– Requires two-way HTTPS with certificates on both sides
– Causes significant CPU overhead; reserve one core for this work
– Certificates are stored on each node (provisioned with the cluster) and refreshed every 10 seconds
• Pluggable shuffle/sort
– Shuffle is the first phase in MapReduce that is guaranteed not to be data-local
– Pluggable shuffle/sort allows application or hardware developers to intercept this network-heavy workload and optimize it
– Typical implementations combine hardware components (fast networks) with software components (sorting algorithms)
– The API will change with future versions of Hadoop
• 36. 04.12.2013 MapReduce 2: Performance
• Key optimizations
– No hard segmentation of resources into map and reduce slots
– The YARN scheduler is more efficient
– The MR2 framework is more efficient than MR1: the shuffle phase is faster thanks to Netty
• Production numbers reported by Yahoo!
– 40,000+ nodes running YARN across over 365 PB of data
– About 400,000 jobs per day for about 10 million hours of compute time
– Estimated 60% – 150% improvement in node usage per day
– Retired a whole 10,000-node datacenter because of the increased utilization
• 37. 04.12.2013 YARN Apps: Overview (repeated, highlighting Apache Tez)
• 38. 04.12.2013 Apache Tez: In a nutshell
• Low-level data-processing execution engine
• Tez is Hindi for “speed”
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• The new primitive for MapReduce, Hive, Pig, Cascading, etc.
• 39. 04.12.2013 Apache Tez: The new primitive
• Hadoop 1 – MapReduce as the base: MapReduce handles both cluster resource management and data processing on top of HDFS, with Pig, Hive and others layered above it
• Hadoop 2 – Apache Tez as the base: Tez is the execution engine for Pig, Hive and others, alongside real-time engines such as Storm, all on top of HDFS (redundant, reliable storage)
• 40. 04.12.2013 Apache Tez: Architecture
• Task with pluggable Input, Processor & Output
– A “classical” map task: HDFS Input → Map Processor → Sorted Output
– A “classical” reduce task: Shuffle Input → Reduce Processor → HDFS Output
• A YARN ApplicationMaster runs a DAG of Tez tasks
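The pluggable Input/Processor/Output idea can be sketched in a few lines of Python. This is a toy model, not the Tez API: the `Task` class, `fake_hdfs` store and the path are invented for illustration, to show how a "classical" map task is just one particular wiring of the three pluggable pieces.

```python
class Task:
    """Tez-style task: a pluggable input, processor and output wired together."""
    def __init__(self, read, process, write):
        self.read, self.process, self.write = read, process, write

    def run(self, source):
        return self.write(self.process(self.read(source)))

# Fake stand-ins for HDFS Input / Map Processor / Sorted Output
fake_hdfs = {"/data/words": ["banana", "apple", "cherry"]}

classic_map = Task(
    read=lambda path: fake_hdfs[path],               # "HDFS Input"
    process=lambda recs: [r.upper() for r in recs],  # "Map Processor"
    write=sorted,                                    # "Sorted Output"
)
result = classic_map.run("/data/words")
print(result)  # ['APPLE', 'BANANA', 'CHERRY']
```

Swapping `read` for a shuffle input and `write` for an HDFS output turns the same skeleton into the "classical" reduce task; that composability is what lets Tez express whole DAGs beyond map and reduce.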
• 41. 04.12.2013 Apache Tez: Performance
SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION
SELECT x, AVERAGE(y) AS avg FROM c GROUP BY x
ORDER BY avg;
• In MapReduce, each stage ends in a synchronization barrier; Tez avoids the unnecessary writes to HDFS in between
• 42. 04.12.2013 YARN Apps: Overview (repeated, highlighting HOYA)
• 43. 04.12.2013 HOYA: In a nutshell
• Create on-demand HBase clusters
– Small HBase clusters in a large YARN cluster
– Dynamic HBase clusters
– Self-healing HBase clusters
– Elastic HBase clusters
– Transient/intermittent clusters for workflows
• Configure custom configurations & versions
• Better isolation
• More efficient utilization/sharing of the cluster
• 44. 04.12.2013 HOYA: Creation of the AppMaster
• (Diagram: the HOYA client uses YarnClient plus a HOYA-specific API to ask the ResourceManager’s scheduler for a container, in which the HOYA Application Master is started on a NodeManager)
• 45. 04.12.2013 HOYA: Deployment of HBase
• (Diagram: the HOYA Application Master requests further containers and deploys the HBase Master and Region Servers into them across the NodeManagers)
• 46. 04.12.2013 HOYA: Bind via ZooKeeper
• (Diagram: an HBase client discovers the dynamically placed HBase Master and Region Servers via ZooKeeper)
• 47. 04.12.2013 YARN Apps: Overview (repeated, highlighting Storm)
• 48. 04.12.2013 Twitter Storm
• Stream processing
• Real-time processing
• Developed as a standalone application
– https://github.com/nathanmarz/storm
• Ported to YARN
– https://github.com/yahoo/storm-yarn
• 49. 04.12.2013 Storm: Conceptual view
• Spout – source of streams
• Bolt – consumer of streams; processes tuples and possibly emits new tuples
• Stream – unbounded sequence of tuples
• Tuple – list of name-value pairs
• Topology – network of spouts & bolts as the nodes and streams as the edges
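The spout/bolt/topology concepts can be mimicked with plain Python generators. This is a conceptual sketch only, not the Storm API: the sentences, tuple shapes and function names are invented, and a real topology would run the bolts in parallel across workers.

```python
def sentence_spout():
    """Spout: the source of the stream (here a fixed list of sentences)."""
    for sentence in ["the quick fox", "the lazy dog"]:
        yield {"sentence": sentence}

def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for t in stream:
        for word in t["sentence"].split():
            yield {"word": word}

def count_bolt(stream):
    """Bolt: consumes word tuples and aggregates counts."""
    counts = {}
    for t in stream:
        counts[t["word"]] = counts.get(t["word"], 0) + 1
    return counts

# Topology: spout -> split bolt -> count bolt
counts = count_bolt(split_bolt(sentence_spout()))
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

The chaining of generators mirrors the edges of a topology: each bolt subscribes to the stream produced by the node before it.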
• 50. 04.12.2013 YARN Apps: Overview (repeated, highlighting Spark)
• 51. 04.12.2013 Spark: In a nutshell
• High-speed in-memory analytics over Hadoop and Hive
• Separate MapReduce-like engine
– Speedup of up to 100x
– On-disk queries 5-10x faster
• Compatible with Hadoop’s storage API
• Available as a standalone application
– https://github.com/mesos/spark
• Experimental support for YARN since 0.6
– http://spark.incubator.apache.org/docs/0.6.0/running-on-yarn.html
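The in-memory speedup comes from caching a dataset after its first materialization so later computations reuse it instead of re-reading from disk. A toy Python sketch of that idea (the `Dataset` class is invented for illustration; it is not Spark's RDD API, just the same lazy-transform-plus-cache pattern):

```python
class Dataset:
    """Toy RDD-like dataset: lazy transformations plus an optional cache."""
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self._keep = False
        self.computes = 0  # how often the underlying data was materialized

    def map(self, fn):
        # Lazy: nothing runs until collect() is called
        return Dataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._cache is not None:
            return self._cache
        self.computes += 1
        data = self._compute()
        if self._keep:
            self._cache = data
        return data

base = Dataset(lambda: list(range(5))).cache()
squared = base.map(lambda x: x * x)
squared.collect()
squared.collect()  # second pass reuses the cached base data
print(base.computes)  # 1
```

Despite two passes over the derived dataset, the base data is materialized only once; on real clusters this is what turns repeated iterative scans into in-memory lookups.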
• 53. 04.12.2013 YARN Apps: Overview (repeated, highlighting Apache Giraph)
• 54. 04.12.2013 Giraph: In a nutshell
• Giraph is a framework for processing semi-structured graph data on a massive scale
• Giraph is loosely based upon Google’s Pregel
• Giraph performs iterative calculations on top of an existing Hadoop cluster
• Available on GitHub
– https://github.com/apache/giraph
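The Pregel model Giraph is based on runs iterative "supersteps": in each one, every vertex processes messages from its neighbours, updates its value, and sends new messages, until nothing changes. A minimal Python sketch of one classic example, max-value propagation (the tiny graph and values are invented; this is the model, not the Giraph API):

```python
# Toy undirected graph as vertex -> neighbours, with starting values
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
value = {"a": 3, "b": 6, "c": 2}

def superstep(value):
    """One BSP superstep: vertices exchange values, each keeps the max seen."""
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(value[v])
    new = {v: max([value[v]] + inbox[v]) for v in graph}
    return new, new != value

changed, steps = True, 0
while changed:
    value, changed = superstep(value)
    steps += 1
print(value)  # every vertex converges to the global maximum, 6
```

Each superstep corresponds to one synchronized iteration over the whole graph; Giraph distributes the vertices and message exchange across the Hadoop cluster instead of a single loop.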
  • 55. 04.12.2013 Agenda • Why Hadoop 2? • HDFS 2 • YARN • YARN Apps • Write your own YARN App • Stinger Initiative & Hive
• 56. 04.12.2013 YARN: API & client libraries
• Application Client Protocol: client to RM interaction
– Library: YarnClient
– Application lifecycle control
– Access to cluster information
• Application Master Protocol: AM to RM interaction
– Library: AMRMClient / AMRMClientAsync
– Resource negotiation
– Heartbeat to the RM
• Container Management Protocol: AM to NM interaction
– Library: NMClient / NMClientAsync
– Launching allocated containers
– Stopping running containers
• Or use external frameworks like Weave, REEF or Spring
• 57. 04.12.2013 YARN: Application flow
• (Diagram: the app client submits the application to the ResourceManager via YarnClient (Application Client Protocol); the ApplicationMaster negotiates resources with the RM via AMRMClient (Application Master Protocol) and launches containers on the NodeManagers via NMClient (Container Management Protocol))
  • 58. 04.12.2013 YARN: More things to consider • Container / Tasks • Client • UI • Recovery • Container-to-AM communication • Application history
• 59. 04.12.2013 YARN: Testing / Debugging
• MiniYARNCluster
– Cluster within the process, in threads (HDFS, MR, HBase)
– Useful for regression tests
– Avoid for unit tests: too slow
• Unmanaged AM
– Support to run the AM outside of a YARN cluster for manual testing
– Debugger, heap dumps, etc.
• Logs
– Log aggregation support to push all logs into HDFS
– Accessible via CLI and UI
– Useful to enable all application logs by default
  • 60. 04.12.2013 Agenda • Why Hadoop 2? • HDFS 2 • YARN • YARN Apps • Write your own YARN App • Stinger Initiative & Hive
• 61. 04.12.2013 Hive: Current Focus Area
• Real-Time (0-5s): online systems, real-time analytics, CEP
• Interactive (5s – 1m): parameterized reports, drilldown, visualization, exploration
• Non-Interactive (1m – 1h): data preparation, incremental batch processing, dashboards / scorecards
• Batch (1h+): operational batch processing, enterprise reports, data mining
• Current Hive sweet spot: the non-interactive and batch end of the spectrum
• 62. 04.12.2013 Stinger: Extending the sweet spot
• Same spectrum as before – future Hive expansion pushes into the interactive (5s – 1m) range
• Improve latency & throughput
– Query engine improvements
– New “Optimized RCFile” column store
– Next-gen runtime (eliminates MapReduce latency)
• Extend deep analytical ability
– Analytics functions
– Improved SQL coverage
– Continued focus on core Hive use cases
• 65. 04.12.2013 Stinger: Tez revisited
SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state
• Hive on MapReduce: three jobs, with an I/O synchronization barrier between each pair of jobs
• Hive on Tez: a single job
• 66. 04.12.2013 Stinger: Tez Service
• MapReduce query startup is expensive
– Job-launch & task-launch latencies are fatal for short queries (in the order of 5s to 30s)
• Solution: Tez Service (= preallocated Application Master)
– Removes job-launch overhead (Application Master)
– Removes task-launch overhead (pre-warmed containers)
• Hive (or Pig) submits the query plan to the Tez Service
– A native Hadoop service, not ad-hoc
• 67. 04.12.2013 Stinger: Tez Performance
SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state

Phase             Existing Hive             Hive + Tez                Hive + Tez Service
Parse Query       0.5s                      0.5s                      0.5s
Create Plan       0.5s                      0.5s                      0.5s
Launch / Submit   20s (launch MapReduce)    20s (launch MapReduce)    0.5s (submit to Tez Service)
Process           10s                       2s                        2s
Total             31s                       23s                       3.5s

* No exact numbers, for illustration only
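The totals above are just the sums of the per-phase latencies; the point is that the fixed 20s job-launch overhead dominates short queries and the Tez Service removes it. A quick Python check of that arithmetic (using the slide's own illustrative numbers, which are not benchmarks):

```python
# Per-phase latencies from the slide, in seconds
phases = {
    "Hive on MapReduce":  {"parse": 0.5, "plan": 0.5, "launch MR": 20.0, "process": 10.0},
    "Hive + Tez":         {"parse": 0.5, "plan": 0.5, "launch MR": 20.0, "process": 2.0},
    "Hive + Tez Service": {"parse": 0.5, "plan": 0.5, "submit": 0.5, "process": 2.0},
}
totals = {name: sum(p.values()) for name, p in phases.items()}
print(totals)
# {'Hive on MapReduce': 31.0, 'Hive + Tez': 23.0, 'Hive + Tez Service': 3.5}
```

Note that Tez alone only shrinks the processing time (31s to 23s); it is the pre-warmed service that eliminates the launch overhead and gets the query under 4 seconds.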
• 68. 04.12.2013 Hive: Persistence Options
• Built-in formats
– ORCFile
– RCFile
– Avro
– Delimited Text
– Regular Expression
– S3 Logfile
– Typed Bytes
• 3rd-party add-ons
– JSON
– XML
• 69. 04.12.2013 Stinger: ORCFile in a nutshell
• Optimized Row Columnar file
– Columns stored separately
• Knows types
– Uses type-specific encoders
– Stores statistics (min, max, sum, count)
• Has a light-weight index
– Skips over blocks of rows that don’t matter
• Larger blocks
– 256 MB by default
– Has an index for block boundaries
• 70. 04.12.2013 Stinger: ORCFile Layout
• The columnar format arranges columns adjacent within the file for compression and fast access
• The large block size is well suited for HDFS
• 71. 04.12.2013 Stinger: ORCFile advantages
• High compression
– Many tricks used out-of-the-box to ensure high compression rates
– RLE, dictionary encoding, etc.
• High performance
– Inline indexes record value ranges within blocks of ORCFile data
– Filter pushdown allows efficient scanning during precise queries
• Flexible data model
– All Hive types, including maps, structs and unions
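Two of the tricks above fit in a few lines of Python: run-length encoding compresses repeated values, and per-block min/max statistics let a reader skip whole blocks during filter pushdown. This is a toy sketch of the ideas, not ORC's actual on-disk encoding; the column values and the block size of 5 are invented for illustration.

```python
def rle_encode(values):
    """Run-length encoding: collapse repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def block_stats(values, block_size):
    """Per-block min/max statistics, the light-weight index behind filter pushdown."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]

prices = [10, 10, 10, 42, 42, 7, 7, 7, 7, 99]
runs = rle_encode(prices)        # [[10, 3], [42, 2], [7, 4], [99, 1]]
stats = block_stats(prices, 5)   # [(10, 42), (7, 99)]

# For a query like WHERE price > 50, any block whose max is <= 50 can be skipped
skipped = [i for i, (lo, hi) in enumerate(stats) if hi <= 50]
print(runs, stats, skipped)
```

Here the first block's recorded maximum is 42, so a `price > 50` scan never has to decode it; real ORCFile combines this with dictionary encoding, type-specific encoders and block-boundary indexes.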
• 72. 04.12.2013 Stinger: ORCFile compression (compression comparison chart; data set from TPC-DS, measurements by Owen O’Malley, Hortonworks)
• 73. 04.12.2013 Stinger: More optimizations
• Join optimizations
– New join types added or improved in Hive 0.11
– More efficient query plan generation
• Vectorization
– Make the most of L1 and L2 caches
– Avoid branching whenever possible
• Optimized query planner
– Automatically determines optimal execution parameters
• Buffering
– Caches hot data in memory
• …
  • 74. 04.12.2013 Stinger: Overall Performance * Real numbers, but handle with care!
• 75. 04.12.2013 Hadoop 2: Summary
1. Scale
2. New programming models & services
3. Improved cluster utilization
4. Agility
5. Beyond Java
  • 76. 04.12.2013 One more thing… Let’s get started with Hadoop 2!
  • 78. 04.12.2013 Dude, where’s my Hadoop 2? * Table by Adam Muise, Hortonworks