Hackathon Bonn

Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop
YARN, Tez, Stinger
June 2014

Our Mission:
Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
Our Commitment
• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners

Driving Our Innovation Through Apache
[Chart: Total Net Lines Contributed to Apache Hadoop; values 147,933 lines, 614,041 lines and 449,768 lines; legend includes End Users]
[Chart: Total Number of Committers to Apache Hadoop, 63 total: Hortonworks 21, Yahoo 10, Cloudera 7, Facebook 5, IBM 3, LinkedIn 3, 10 Others]
Hortonworks' mission is
to power your modern data architecture by enabling
Hadoop to be an enterprise data platform that
deeply integrates with your data center technologies
Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
Zookeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48

Broad Ecosystem Integration
APPLICATIONS, DATA SYSTEM, SOURCES
RDBMS EDW MPP
Emerging Sources
(Sensor, Sentiment, Geo, Unstructured)
HANA
BusinessObjects BI
OPERATIONAL TOOLS
DEV & DATA TOOLS
Existing Sources
(CRM, ERP, Clickstream, Logs)
INFRASTRUCTURE

UDA
Diagram
Relying on Hortonworks…
Teradata Portfolio
for Hadoop
• Seamless data access
between Teradata and
Hadoop (SQL-H)
• Simple management &
monitoring with Viewpoint
integration
• Flexible deployment
options
HDInsight &
HDP for Windows
• Only Hadoop Distribution
for Windows Azure &
Windows Server
• Native integration with
SQL Server, Excel, and
System Center
• Extends Hadoop to .NET
community
Complete Portfolio for Hadoop
Appliances
Instant Access +
Infinite Scale
• SAP can assure their
customers they are
deploying an SAP HANA
+ Hadoop architecture
fully supported by SAP
• Enables analytics apps
(BOBJ) to interact with
Hadoop

HDP 2.1: Enterprise Hadoop Platform
Hortonworks
Data Platform (HDP)
• The ONLY 100% open source
and most current platform
• Integrates full range of
enterprise-ready services
• Certified and tested at scale
• Engineered for deep
ecosystem interoperability
[Diagram: Hortonworks Data Platform (HDP) component stack]
• Deployment targets: OS/VM, Cloud, Appliance
• Core services: HDFS, YARN, MapReduce, Tez
• Data services: Hive & HCatalog, Pig, HBase; load & extract via Sqoop, Flume, NFS, WebHDFS
• Operational services: Ambari, Oozie, Falcon*, Knox*
• Functions covered: storage, resource management, processing, data access, data movement, scheduling, cluster management, dataset management
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

Our Vision: Hadoop as Next-Gen Platform
HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing), Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)

The 1st Generation of Hadoop: Batch
HADOOP 1.0
Built for Web-Scale Batch Apps
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
[Diagram: separate single-application silos (BATCH, INTERACTIVE, ONLINE), each sitting on its own HDFS]

Hadoop MapReduce Classic
• JobTracker
–Manages cluster resources and job scheduling
• TaskTracker
–Per-node agent
–Manages tasks

YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4,…)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service

5 Key Benefits of YARN
1. Scale
2. New Programming Models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java

Concepts
• Application
–An application is a temporal job or a service submitted to YARN
–Examples
– MapReduce job (job)
– HBase cluster (service)
• Container
–Basic unit of allocation
–Fine-grained resource allocation across multiple resource
types (memory, cpu, disk, network, gpu etc.)
– container_0 = 2GB, 1CPU
– container_1 = 1GB, 6 CPU
–Replaces the fixed map/reduce slots
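The container idea above can be sketched in a few lines. This is a toy model, not the YARN API: the `Resource` and `Node` classes and the 4 GB / 8 vcore node size are assumptions made for illustration, and only memory and CPU are tracked. It shows two differently shaped requests (the `container_0` and `container_1` sizes from the slide) sharing one node, which fixed map/reduce slots could not express.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Resource:
    """A multi-dimensional resource vector (hypothetical, memory and CPU only)."""
    memory_mb: int
    vcores: int


@dataclass
class Node:
    """A worker node that hands out containers from its free resources."""
    free: Resource
    containers: List[Resource] = field(default_factory=list)

    def allocate(self, request: Resource) -> bool:
        # A request is granted only if every dimension fits.
        if request.memory_mb > self.free.memory_mb or request.vcores > self.free.vcores:
            return False
        self.free = Resource(self.free.memory_mb - request.memory_mb,
                             self.free.vcores - request.vcores)
        self.containers.append(request)
        return True


# Assumed node size: 4 GB of memory and 8 virtual cores.
node = Node(free=Resource(memory_mb=4096, vcores=8))
container_0 = Resource(memory_mb=2048, vcores=1)   # 2GB, 1 CPU
container_1 = Resource(memory_mb=1024, vcores=6)   # 1GB, 6 CPU
granted = [node.allocate(container_0), node.allocate(container_1)]
# A further 2 GB request no longer fits: only 1 GB of memory is left.
rejected = node.allocate(Resource(memory_mb=2048, vcores=1))
```

Both requests are granted because each is checked against what is actually free rather than against a fixed slot size.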

Design Centre
• Split up the two major functions of JobTracker
–Cluster resource management
–Application life-cycle management
• MapReduce becomes a user-land library
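A rough sketch of that split, under heavy simplification: the class names mirror YARN's roles, but the methods (`request`, `release`, `run`) are hypothetical and containers are reduced to a simple counter. The point is only that the cluster-wide component knows nothing about tasks, while the per-application component owns the job's life cycle.

```python
class ResourceManager:
    """Cluster-wide role: tracks capacity and hands out containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def request(self, wanted):
        # Grant as many containers as are free, up to the number wanted.
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-application role: drives one job from submission to completion."""

    def __init__(self, resource_manager, tasks):
        self.rm = resource_manager
        self.pending = list(tasks)
        self.done = []

    def run(self):
        while self.pending:
            granted = self.rm.request(len(self.pending))
            if granted == 0:
                raise RuntimeError("no capacity available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(task() for task in batch)  # run tasks in their containers
            self.rm.release(granted)
        return self.done


rm = ResourceManager(total_containers=2)
am = ApplicationMaster(rm, [lambda i=i: i * i for i in range(5)])
results = am.run()
```

Because the life-cycle logic lives in `ApplicationMaster`, a different framework can bring its own master without any change to `ResourceManager`.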

YARN Applications
• Data processing applications and services
–Online Serving – HOYA (HBase on YARN)
–Real-time event processing – Storm, S4, other commercial
platforms
–Interactive SQL – Tez (Generalization of MR)
–Machine Learning – MPI (OpenMPI, MPICH2)
–In-Memory: Spark
–Graph processing: Giraph
–Enabled by allowing the use of a paradigm-specific application
master
Run all on the same Hadoop cluster!

© Hortonworks Inc. 2012
YARN as OS for Data Lake
[Diagram: a ResourceManager (Scheduler) and four NodeManagers hosting containers from three workloads side by side]
• Batch: map1.1, map1.2, reduce1.1
• Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
• Real-Time: nimbus0, nimbus1, nimbus2

Multi-Tenant YARN
[Diagram: ResourceManager Scheduler with a hierarchy of queues under root]
• Top level: Adhoc 10%, DW 60%, Mrkting 30%
• Second level: Dev 10%, Reserved 20%, Prod 70%; Prod 80%, Dev 20%
• Third level: P0 70%, P1 30%

Multi-Tenancy with CapacityScheduler
• Queues
• Economics as queue-capacity
–Hierarchical Queues
• SLAs
–Preemption
• Resource Isolation
–Linux: cgroups
–MS Windows: Job Control
–Roadmap: Virtualization (Xen, KVM)
• Administration
–Queue ACLs
–Run-time re-configuration for queues
–Charge-back
[Diagram: Capacity Scheduler hierarchical queues under the ResourceManager Scheduler root]
• Top level: Adhoc 10%, DW 70%, Mrkting 20%
• Second level: Dev 10%, Reserved 20%, Prod 70%; Prod 80%, Dev 20%
• Third level: P0 70%, P1 30%
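With hierarchical queues, each percentage is relative to the parent queue, so a queue's share of the whole cluster is the product of the percentages along its path from root. The sketch below works that arithmetic out. It is not CapacityScheduler code: the function name is hypothetical, and the assignment of second- and third-level queues to parents (Dev/Reserved/Prod under DW, Prod/Dev under Mrkting, P0/P1 under DW's Prod) is an assumption, since the extracted diagram does not preserve the tree edges.

```python
def absolute_capacities(tree, parent_share=1.0, prefix="root"):
    """Return {queue path: fraction of the whole cluster}.

    `tree` maps a queue name to (percent of parent, child tree).
    """
    result = {}
    for name, (percent, children) in tree.items():
        share = parent_share * percent / 100.0
        path = f"{prefix}.{name}"
        result[path] = share
        result.update(absolute_capacities(children, share, path))
    return result


# Assumed parent/child layout; the percentages are the ones on the slide.
QUEUES = {
    "Adhoc": (10, {}),
    "DW": (70, {
        "Dev": (10, {}),
        "Reserved": (20, {}),
        "Prod": (70, {"P0": (70, {}), "P1": (30, {})}),
    }),
    "Mrkting": (20, {"Dev": (20, {}), "Prod": (80, {})}),
}

capacities = absolute_capacities(QUEUES)
for path, share in sorted(capacities.items()):
    print(f"{path}: {share:.1%}")
```

Under these assumptions, root.DW.Prod.P0 works out to roughly 34.3% of the cluster (70% of 70% of 70%), and the leaf queues together account for the full cluster.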

Tez (“Speed”)
• What is it?
–A data processing framework as an alternative to MapReduce
–A new incubation project in the ASF
• Who else is involved?
–22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo,
Microsoft
• Why does it matter?
–Widens the platform for Hadoop use cases
–Crucial to improving the performance of low-latency applications
–Core to the Stinger initiative
–Evidence of Hortonworks leading the community in the evolution
of Enterprise Hadoop

Moving Hadoop Beyond MapReduce
• Low level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
–Much lighter disk and network usage
• New base of MapReduce, Hive, Pig, Cascading etc.
• Hive and Pig jobs no longer need to move to the end
of the queue between steps in the pipeline

Tez - Core Idea
Task with pluggable Input, Processor & Output
YARN ApplicationMaster to run DAG of Tez Tasks
Tez Task - <Input, Processor, Output>
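The task triple and the DAG can be mimicked in plain Python. This is a conceptual sketch only, not the Tez API: `Task`, `run_dag`, and the two-vertex word count are hypothetical names and data invented for illustration, and everything runs in one process instead of in YARN containers. It requires Python 3.9+ for `graphlib`.

```python
from collections import Counter
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, Iterable, List


@dataclass
class Task:
    """One DAG vertex: a pluggable <input, processor, output> triple."""
    read: Callable[[Dict[str, list]], Iterable]   # input: pulls from upstream outputs
    process: Callable[[Iterable], Iterable]       # processor: the actual logic
    write: Callable[[Iterable], list]             # output: materializes the result


def run_dag(tasks: Dict[str, Task], deps: Dict[str, List[str]]) -> Dict[str, list]:
    """Run every task after its upstream tasks, as an application master would order them."""
    outputs: Dict[str, list] = {}
    for name in TopologicalSorter(deps).static_order():
        task = tasks[name]
        upstream = {d: outputs[d] for d in deps.get(name, [])}
        outputs[name] = task.write(task.process(task.read(upstream)))
    return outputs


tasks = {
    # Vertex 1: read lines, split into words, keep them in a list.
    "tokenize": Task(
        read=lambda upstream: ["a b", "b c"],
        process=lambda lines: (word for line in lines for word in line.split()),
        write=list,
    ),
    # Vertex 2: read the upstream words, count them, emit sorted pairs.
    "count": Task(
        read=lambda upstream: upstream["tokenize"],
        process=lambda words: Counter(words).items(),
        write=sorted,
    ),
}
deps = {"tokenize": [], "count": ["tokenize"]}  # vertex -> its upstream vertices

outputs = run_dag(tasks, deps)
```

Swapping only the `write` function (for example `list` for `sorted`) changes how a vertex hands data downstream without touching its processor, which is the flexibility the pluggable triple is meant to give.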

Building Blocks for Tasks
• MapReduce ‘Map’ Task: HDFS Input, Map Processor, Sorted Output
• MapReduce ‘Reduce’ Task: Shuffle Input, Reduce Processor, HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
• Special Pig/Hive ‘Map’ (Tez Task): HDFS Input, Map Processor, Pipeline Sorter Output
• Special Pig/Hive ‘Reduce’ (Tez Task): Shuffle Skip-merge Input, Reduce Processor, Sorted Output
• In-memory Map (Tez Task): HDFS Input, Map Processor, In-memory Sorted Output

Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Pig/Hive - MR: Job 1, Job 2, Job 3, separated by I/O synchronization barriers
Pig/Hive - Tez: Single Job
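The difference can be imitated with a toy model of the query above. This is not how Hive compiles or executes plans: the table contents, the function names, and the `hdfs` dictionary standing in for intermediate files are all assumptions for illustration. The three-job version materializes each intermediate result before the next step starts, while the single-DAG version streams rows through generators.

```python
from collections import defaultdict

# Tiny hypothetical tables standing in for a, b and c.
A = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "NY"}]
B = [{"id": 1}, {"id": 2}, {"id": 3}]
C = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 8.0}]


def join_ab(a_rows, b_rows):
    """JOIN b ON (a.id = b.id): emit one row per matching b row."""
    matches = defaultdict(int)
    for row in b_rows:
        matches[row["id"]] += 1
    for row in a_rows:
        for _ in range(matches[row["id"]]):
            yield dict(row)


def join_c(rows, c_rows):
    """JOIN c ON (a.itemId = c.itemId): attach each matching price."""
    prices = defaultdict(list)
    for row in c_rows:
        prices[row["itemId"]].append(row["price"])
    for row in rows:
        for price in prices[row["itemId"]]:
            yield {**row, "price": price}


def group_by_state(rows):
    """GROUP BY a.state with COUNT(*) and the average price."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc[row["state"]][0] += 1
        acc[row["state"]][1] += row["price"]
    return {state: (n, total / n) for state, (n, total) in acc.items()}


def run_as_three_jobs():
    hdfs = {}                                      # stand-in for intermediate files
    hdfs["job1"] = list(join_ab(A, B))             # job 1 writes, then a barrier
    hdfs["job2"] = list(join_c(hdfs["job1"], C))   # job 2 writes, then a barrier
    return group_by_state(hdfs["job2"]), len(hdfs)  # job 3 produces the result


def run_as_single_dag():
    # Rows stream from one step to the next; nothing is materialized in between.
    return group_by_state(join_c(join_ab(A, B), C)), 0


mr_result, mr_intermediate_writes = run_as_three_jobs()
dag_result, dag_intermediate_writes = run_as_single_dag()
```

Both versions compute the same answer; the only difference in this model is the number of intermediate results written out between steps (two versus none).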

Tez on YARN: Going Beyond Batch
Tez Optimizes Execution
New runtime engine for
more efficient data processing
Always-On Tez Service
Low latency processing for
all Hadoop data processing
Tez Task

SQL-in-Hadoop with Apache Hive
• Apache Hive is the standard for
SQL interaction with Hadoop
–Enterprise makes final purchasing
decision on two key characteristics:
'compatibility' with existing
investments (60%) and skills (20%)
–Most applications claim Hive
compatibility TODAY*
• Stinger Initiative: Simple Focus
–Performance
–SQL-Compatibility
–Scalability
*Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders,
SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
[Diagram: Business Analytics and Custom Apps issue SQL to Hive, which runs on Tez and MapReduce over YARN and HDFS inside Hadoop]
Improves existing tools & preserves investments

Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to drive the next generation of Hive
Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Hive 0.13, April 2014:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing

Hortonworks: The Value of “Open” for You
Validate & Try
1. Download the
Hortonworks Sandbox
2. Learn Hadoop using the
technical tutorials
3. Investigate a business
case using the step-by-step
business case scenarios
4. Validate YOUR business
case using your data in
the sandbox
Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so
that you are represented in the open source community
Avoid Vendor Lock-In
Hortonworks Data Platform remains as close to the open source trunk as
possible and is developed 100% in the open so you are never locked in
The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center
technologies so you can leverage existing skills and investments
Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to
ensure the reliability and stability you require for enterprise use
Support from the Experts
We provide the highest quality of support for deploying at scale. You are
supported by hundreds of years of Hadoop experience
Engage
1. Execute a Business Case
Discovery Workshop with
our architects
2. Build a business case for
Hadoop today
