The Codex of Business Writing Software for Real-World Solutions 2.pptx
Hackathon Bonn
1. Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop
YARN, Tez, Stinger
June 2014
2. Our Mission:
Our Commitment
Open Leadership
Drive innovation in the open exclusively via the Apache community-driven open source process
Enterprise Rigor
Engineer, test and certify Apache Hadoop with the enterprise in mind
Ecosystem Endorsement
Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
Enable your Modern Data Architecture by
Delivering Enterprise Apache Hadoop
3. Driving Our Innovation Through Apache
[Chart: Total net lines contributed to Apache Hadoop. Values shown: 147,933 lines; 614,041 lines; 449,768 lines. One segment is labeled End Users.]
[Chart: Total number of committers to Apache Hadoop, 63 total. Yahoo: 10; Cloudera: 7; Facebook: 5; IBM: 3; LinkedIn: 3; 10 Others; one unlabeled segment: 21.]
Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies
Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
ZooKeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48
4. Broad Ecosystem Integration
APPLICATIONS, DATA SYSTEM, SOURCES
RDBMS EDW MPP
Emerging Sources
(Sensor, Sentiment, Geo, Unstructured)
HANA
BusinessObjects BI
OPERATIONAL TOOLS
DEV & DATA TOOLS
Existing Sources
(CRM, ERP, Clickstream, Logs)
INFRASTRUCTURE
5. UDA
Diagram
Relying on Hortonworks…
Teradata Portfolio for Hadoop
• Seamless data access between Teradata and Hadoop (SQL-H)
• Simple management & monitoring with Viewpoint integration
• Flexible deployment options
HDInsight & HDP for Windows
• Only Hadoop Distribution for Windows Azure & Windows Server
• Native integration with SQL Server, Excel, and System Center
• Extends Hadoop to .NET community
Complete Portfolio for Hadoop
Appliances
Instant Access + Infinite Scale
• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
• Enables analytics apps (BOBJ) to interact with Hadoop
6. HDP 2.1: Enterprise Hadoop Platform
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and most current platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
[Diagram, shown in three builds: the Hortonworks Data Platform (HDP) stack, deployable on OS/VM, Cloud, or Appliance]
Layers: Core Services, Data Services, Operational Services
Functions: Storage, Resource Management, Process, Data Access, Data Movement, Schedule, Cluster Mgmt, Dataset Mgmt
Components: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Load & Extract, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
7. Our Vision: Hadoop as Next-Gen Platform
HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing), Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)
8. The 1st Generation of Hadoop: Batch
HADOOP 1.0
Built for Web-Scale Batch Apps
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
[Diagram: single-app silos labeled BATCH, INTERACTIVE, and ONLINE, with HDFS underneath]
10. YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
HDFS2 (Redundant, Reliable Storage)
YARN (Cluster Resource Management)
BATCH (MapReduce)
INTERACTIVE (Tez)
STREAMING (Storm, S4,…)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
ONLINE (HBase)
OTHER (Search) (Weave…)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
12. Concepts
• Application
–An application is a temporary job or a service submitted to YARN
–Examples
– MapReduce job (job)
– HBase cluster (service)
• Container
–Basic unit of allocation
–Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU etc.)
– container_0 = 2GB, 1 CPU
– container_1 = 1GB, 6 CPU
–Replaces the fixed map/reduce slots
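The container idea above can be sketched in a few lines of Python. This is an illustration only, not YARN's API: the `Resource` and `Node` classes, the 4GB/8-CPU node size, and the third request are assumptions made up for the example. It shows a node handing out containers of different shapes until its free capacity runs out, in contrast to fixed map/reduce slots.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource vector; only memory and CPU are modelled."""
    memory_gb: int
    vcores: int

    def fits_in(self, other: "Resource") -> bool:
        return self.memory_gb <= other.memory_gb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_gb - other.memory_gb, self.vcores - other.vcores)


class Node:
    """Hypothetical worker node that hands out containers of arbitrary size."""

    def __init__(self, capacity: Resource) -> None:
        self.free = capacity
        self.containers: Dict[str, Resource] = {}

    def allocate(self, request: Resource) -> Optional[str]:
        # Reject the request if any resource dimension does not fit.
        if not request.fits_in(self.free):
            return None
        container_id = f"container_{len(self.containers)}"
        self.containers[container_id] = request
        self.free = self.free.minus(request)
        return container_id


# Assumed node size: 4GB of memory and 8 CPUs.
node = Node(Resource(memory_gb=4, vcores=8))
first = node.allocate(Resource(memory_gb=2, vcores=1))   # container_0 = 2GB, 1 CPU
second = node.allocate(Resource(memory_gb=1, vcores=6))  # container_1 = 1GB, 6 CPU
third = node.allocate(Resource(memory_gb=2, vcores=1))   # needs 2GB, only 1GB is free
```

With fixed slots, both requests would have consumed one identically sized slot each. Here the two containers from the slide coexist on one node and the third request is refused only because memory is exhausted.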
13. Design Centre
• Split up the two major functions of JobTracker
–Cluster resource management
–Application life-cycle management
• MapReduce becomes a user-land library
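A minimal sketch of that split, assuming a toy model rather than YARN's real interfaces: the class and method names below (`grant`, `release`, `run`) are hypothetical. One object does nothing but cluster resource management, while a per-application object owns the life cycle of its own tasks.

```python
from typing import List


class ResourceManager:
    """Cluster resource management only: counts free containers, knows nothing about jobs."""

    def __init__(self, total_containers: int) -> None:
        self.free = total_containers

    def grant(self, wanted: int) -> int:
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count: int) -> None:
        self.free += count


class ApplicationMaster:
    """Application life-cycle management for exactly one application."""

    def __init__(self, name: str, tasks: List[str]) -> None:
        self.name = name
        self.pending = list(tasks)
        self.done: List[str] = []

    def run(self, rm: ResourceManager) -> None:
        while self.pending:
            granted = rm.grant(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(batch)  # pretend each task ran in its own container
            rm.release(granted)


rm = ResourceManager(total_containers=2)
mapreduce_job = ApplicationMaster("mapreduce-job", ["map-0", "map-1", "reduce-0"])
mapreduce_job.run(rm)
```

Because the resource manager never inspects tasks, a different kind of application master (for example one that keeps a long-running service alive) could reuse it unchanged.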
14. YARN Applications
• Data processing applications and services
–Online Serving – HOYA (HBase on YARN)
–Real-time event processing – Storm, S4, other commercial platforms
–Interactive SQL – Tez (Generalization of MR)
–Machine Learning – MPI (OpenMPI, MPICH2)
–In-Memory: Spark
–Graph processing: Giraph
–Enabled by allowing the use of paradigm-specific application masters
Run all on the same Hadoop cluster!
18. Tez (“Speed”)
• What is it?
–A data processing framework as an alternative to MapReduce
–A new incubation project in the ASF
• Who else is involved?
–22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
–Widens the platform for Hadoop use cases
–Crucial to improving the performance of low-latency applications
–Core to the Stinger initiative
–Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
19. Moving Hadoop Beyond MapReduce
• Low level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
–Much lighter disk and network usage
• New base of MapReduce, Hive, Pig, Cascading etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
20. Tez - Core Idea
Task with pluggable Input, Processor & Output
YARN ApplicationMaster to run DAG of Tez Tasks
[Diagram: a Task made up of Input, Processor, and Output]
Tez Task - <Input, Processor, Output>
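The <Input, Processor, Output> triple can be illustrated with a small Python sketch. Tez itself is a Java framework with its own interfaces; everything below (the `Task` class, the callable type aliases, the sample records) is a hypothetical stand-in chosen only to show how swapping any one of the three parts yields a different kind of task.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, int]

# Hypothetical stand-ins for the three pluggable roles.
Input = Callable[[], Iterable[Record]]
Processor = Callable[[Iterable[Record]], Iterable[Record]]
Output = Callable[[Iterable[Record]], None]


class Task:
    """A task is just the composition <Input, Processor, Output>."""

    def __init__(self, task_input: Input, processor: Processor, output: Output) -> None:
        self.task_input = task_input
        self.processor = processor
        self.output = output

    def run(self) -> None:
        self.output(self.processor(self.task_input()))


def list_input(records: List[Record]) -> Input:
    """Input that reads from an in-memory list (stands in for HDFS or shuffle input)."""
    return lambda: iter(records)


def sum_by_key(records: Iterable[Record]) -> Iterable[Record]:
    """Processor that adds up the values seen for each key."""
    totals: Dict[str, int] = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals.items()


class SortedListOutput:
    """Output that stores records sorted by key (stands in for a sorted output)."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def __call__(self, records: Iterable[Record]) -> None:
        self.records = sorted(records)


sink = SortedListOutput()
task = Task(list_input([("b", 1), ("a", 2), ("b", 3)]), sum_by_key, sink)
task.run()
```

Replacing `SortedListOutput` with an unsorted or in-memory variant, while keeping the same input and processor, mirrors how the slides build different task types from the same building blocks.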
21. Building Blocks for Tasks
MapReduce ‘Map’ Task - <HDFS Input, Map Processor, Sorted Output>
MapReduce ‘Reduce’ Task - <Shuffle Input, Reduce Processor, HDFS Output>
Intermediate ‘Reduce’ for Map-Reduce-Reduce - <Shuffle Input, Reduce Processor, Sorted Output>
Special Pig/Hive ‘Map’ (Tez Task) - <HDFS Input, Map Processor, Pipeline Sorter Output>
Special Pig/Hive ‘Reduce’ (Tez Task) - <Shuffle Skip-merge Input, Reduce Processor, Sorted Output>
In-memory Map (Tez Task) - <HDFS Input, Map Processor, In-memory Sorted Output>
22. Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Pig/Hive - MR: Job 1, Job 2, Job 3, separated by I/O synchronization barriers
Pig/Hive - Tez: Single Job
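The difference can be made concrete with a rough Python sketch. It rests on two assumptions taken from the slides, not from any real planner: on MapReduce each of the three stages of the query becomes its own job and every job except the last writes its result to HDFS for the next one, while on Tez all stages run inside one job with no intermediate HDFS output. The `Plan` type and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Plan:
    engine: str
    jobs: int
    intermediate_hdfs_writes: int


# The three stages of the example query.
STAGES = ["JOIN a, b ON id", "JOIN c ON itemId", "GROUP BY a.state"]


def plan_on_mapreduce(stages: List[str]) -> Plan:
    # Assumption: one job per stage; each job but the last hands off through HDFS.
    return Plan("MapReduce", len(stages), max(len(stages) - 1, 0))


def plan_on_tez(stages: List[str]) -> Plan:
    # Assumption: all stages become vertices of a single DAG job, with no HDFS hand-off.
    return Plan("Tez", 1 if stages else 0, 0)


mr_plan = plan_on_mapreduce(STAGES)
tez_plan = plan_on_tez(STAGES)
```

Each intermediate HDFS write in the MapReduce plan corresponds to one I/O synchronization barrier on the slide.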
23. Tez on YARN: Going Beyond Batch
Tez Optimizes Execution
New runtime engine for more efficient data processing
Always-On Tez Service
Low latency processing for all Hadoop data processing
24. SQL-in-Hadoop with Apache Hive
• Apache Hive is the standard for SQL interaction with Hadoop
–Enterprises make their final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
–Most applications claim Hive compatibility TODAY*
• Stinger Initiative: Simple Focus
–Performance
–SQL-Compatibility
–Scalability
*Claims publicly made by: Teradata, Microsoft, Oracle, MicroStrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
[Diagram: Business Analytics and Custom Apps send SQL to Hive, which runs on Tez and MapReduce on top of YARN and HDFS inside Hadoop]
Improves existing tools & preserves investments
25. Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to drive the next generation of Hive
Goals:
Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Hive 0.13, April 2014:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
26. Hortonworks: The Value of “Open” for You
Validate & Try
1. Download the Hortonworks Sandbox
2. Learn Hadoop using the technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business case using your data in the sandbox
Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so that you are represented in the open source community
Avoid Vendor Lock-In
Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in
The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments
Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use
Support from the Experts
We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience
Engage
1. Execute a Business Case Discovery Workshop with our architects
2. Build a business case for Hadoop today
Editor's Notes
Hello. Today I’m going to talk to you about Hortonworks and how we deliver an enterprise-ready Hadoop to enable your modern data architecture.
Founded just 2.5 years ago by members of the original Hadoop team at Yahoo, Hortonworks has emerged as the leader in open source Hadoop.
We are committed to ensuring Hadoop is an enterprise-viable data platform, ready for your modern data architecture.
Our team is probably the largest assembled team of Hadoop experts and active leaders in the community.
We make sure Hadoop meets all your enterprise requirements, such as operations, reliability, and security.
It also needs to be packaged and tested, and we do this.
It has to work with what you have.
Make Hadoop an enterprise data platform. Make the market function.
Innovate core platform, data, & operational services
Integrate deeply with enterprise ecosystem
Provide world-class enterprise support
Drive 100% open source software development and releases through the core Apache projects
Address enterprise needs in community projects
Establish Apache foundation projects as “the standard”
Promote open community vs. vendor control / lock-in
Enable the Hadoop market to function
Make it easy for enterprises to deploy at scale
Be the best at enabling deep ecosystem integration
Create a pull market with key strategic partners
Tez Approved as New Apache Incubator Project
Hortonworks Introduces Next-Generation Runtime for Improving Latency and Throughput of Hadoop Apps