This document discusses Hortonworks' approach to managing large volumes of diverse data. It presents the Hortonworks Data Platform (HDP) as a solution for consolidating siloed data into a central data lake on a single cluster. This allows different data types and workloads, such as batch, interactive, and real-time processing, to share services for security, governance and operations while preserving existing tools. HDP also enables new analytics use cases, such as real-time personalization and segmentation using diverse data sources.
4. The leaders of Hadoop’s development
We do Hadoop
• Community driven, Enterprise Focused
• Drive Innovation in the platform – We lead the roadmap
• 100% Open Source – Democratized Access to Data
5. We do Hadoop successfully.
> Develop Open Source Hadoop
> Distribute Hadoop with HDP
> Support
> Professional Services
> Training
6. Hortonworks Approach
1 Innovate the Core
Architect and build innovation at the core of Hadoop
• YARN: Data Operating System
• HDFS as the storage layer
• Key processing engines
2 Extend Hadoop as an Enterprise Data Platform
Extend Hadoop with enterprise capabilities for governance, security & operations
Apply enterprise software rigor to the open source development process
3 Enable the Ecosystem
Enable the leaders in the data center to easily adopt & extend their platforms
• Establish Hadoop as standard component of a modern data architecture
• Joint engineering
[Diagram: HDP 2.2 stack. Data Access engines: Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Batch (MapReduce). Data Management: YARN: Data Operating System on HDFS (Hadoop Distributed File System). Alongside: Governance & Integration, Security, Operations.]
7. (4) …all done completely in Open Source
Innovating within the community for the enterprise
• Open Source: fastest path to innovation for a platform technology
• Complete open source platform speeds enterprise and ecosystem adoption and minimizes lock in
• Enables the market to function much bigger much faster
[Diagram: HDP 2.2 stack. Data Access engines: Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Batch (MapReduce). Data Management: YARN: Data Operating System on HDFS (Hadoop Distributed File System). Alongside: Governance & Integration, Security, Operations.]
Driving our innovation through Apache Software Foundation Projects

Apache Project    Committers    PMC Members
Hadoop            27            20
Pig               5             5
Hive              16            4
Tez               15            15
HBase             6             4
Phoenix           4             4
Accumulo          2             2
Storm             3             2
Slider            10            10
Flume             1             0
Sqoop             1             1
Ambari            32            27
Oozie             3             2
Zookeeper         2             1
Knox              11            5
Argus             10            n/a
Falcon            5             3
TOTAL             153           105
15. The solution?
[Figure: data piled into separate systems labeled EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
16. Ummm…you dropped something
[Figure: data spilling outside the same systems: EDW, Yet Another EDW, Analytical DB, OLTP, Another EDW]
18. Data Silos. Your data silos are lonely places.
[Figure: isolated data stores labeled EDW, Accounts, Customers, Web Properties]
19. …Data likes to be together.
[Figure: data stores labeled EDW, Accounts, Customers, Web Properties, shown together]
20. Data likes to socialize too.
[Figure: data stores labeled Facebook, EDW, Accounts, Customers, Web Properties, Machine Data, Twitter, CDR, Weather Data]
21. New types of data don’t quite fit into your pristine view of the world.
[Figure: a store labeled My Little Data Empire, with Logs, Machine Data, and several unknown (?) data types outside it]
22. To resolve this, some people take hints from Lord Of The Rings...
24. …but that has its problems too.
[Figure: data forced through multiple ETL steps into a single Schema in an EDW]
25. What if the data was processed and stored centrally? What if you didn’t need to force it into a single schema?
We call it a Modern Data Architecture*
*AKA Data Lake
26. A Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
[Diagram: Modern Data Architecture]
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: RDBMS, EDW, MPP, alongside Hadoop nodes 1…N running Batch, Interactive and Real-Time workloads on YARN: Data Operating System and HDFS (Hadoop Distributed File System)
SOURCES: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
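One way to picture "not forcing data into a single schema" is schema-on-read: raw records are stored as they arrive, and each consumer projects the fields it needs at read time. The sketch below is a minimal illustration in plain Python, not HDP code; the record contents, the `source` field, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical raw records from different sources, stored as-is with no shared schema.
RAW_EVENTS = [
    '{"source": "clickstream", "customer_id": 7, "url": "/tea-cozies", "seconds": 1500}',
    '{"source": "crm", "customer_id": 7, "name": "Pat", "city": "Toronto"}',
    '{"source": "server_log", "host": "web01", "status": 500}',
]


def read_with_schema(raw_lines, source, fields):
    """Apply a schema at read time: keep only records from `source`,
    projecting the requested fields (missing fields become None)."""
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            yield {name: record.get(name) for name in fields}


# Two consumers read the same raw store with different schemas.
clicks = list(read_with_schema(RAW_EVENTS, "clickstream", ["customer_id", "url", "seconds"]))
customers = list(read_with_schema(RAW_EVENTS, "crm", ["customer_id", "name", "city", "segment"]))
```

The server log record is still stored even though no consumer has defined a schema for it yet, which is the property the data lake slides argue for.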
29. Your segmentation today.
• Male / Female
• Age: 25-30
• Town/City
• Middle Income Band
• Product Category Preferences
30. Your segmentation with better data.
• Male / Female
• Age: 27 but feels old
• GPS coordinates
• $65-68k per year
• Product recommendations per time of day and per weather
• Looking to start a business
• Tea Party Hippie
• Walking into Starbucks right now…
• A depressed Toronto Maple Leafs fan
• Products left in basket indicate drunk Amazon shopper
• Purchase history indicates a risk taker
• Thinking about a new house
• Unhappy with his cell phone plan
• Pregnant
• Spent 25 minutes looking at tea cozies
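Richer segmentation of this kind depends on joining attributes from several sources into one profile per customer. The following is a toy sketch in plain Python under stated assumptions: the three feeds, the shared customer id, and the "tea enthusiast" rule are invented for illustration and do not come from the deck.

```python
from collections import defaultdict

# Hypothetical per-source attribute feeds, keyed by a shared customer id.
demographics = {42: {"age": 27, "income_band": "middle"}}
location = {42: {"gps": (43.65, -79.38)}}
web_activity = {42: {"minutes_on_tea_cozies": 25}, 99: {"minutes_on_tea_cozies": 1}}


def build_profiles(*sources):
    """Merge attribute dicts from every source into one profile per customer."""
    profiles = defaultdict(dict)
    for source in sources:
        for customer_id, attributes in source.items():
            profiles[customer_id].update(attributes)
    return dict(profiles)


def segment(profile):
    """Toy rule: assign a hypothetical segment from browsing time."""
    if profile.get("minutes_on_tea_cozies", 0) >= 20:
        return "tea enthusiast"
    return "general"


profiles = build_profiles(demographics, location, web_activity)
```

Customer 99 appears only in the web activity feed and still gets a profile, so a new source can add signal without the other sources having to know about it.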
31. Pick up all of that data that was prohibitively expensive to store and use.
32. To approach these use cases you need an affordable platform that stores, processes, and analyzes the data.
33. Don’t wait for your data.
Batch is often too late to influence the person who is in your store or on your website right now.
34. Streaming Processing, Search, and Storage
[Diagram: Hortonworks Data Platform 2.2 streaming stack. Real-time data feeds enter through Apache Kafka. On YARN and HDFS: Real Time Stream Processing (Storm), Online Data Processing (HBase), Search (Solr), Slider, SQL (Hive), Streaming Ingest.]
Stream data into Hadoop and process it in near real-time
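The argument of slides 33 and 34, that per-event processing can react while a visitor is still present, can be sketched with a sliding-window counter that is updated on every event instead of in a nightly batch. This is a plain-Python illustration only; it does not use the Kafka or Storm APIs, and the feed, window size, and threshold are assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count events per key over the last `window_seconds`, updated per event."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps inside the window

    def add(self, key, timestamp):
        """Record one event and return the current in-window count for `key`."""
        timestamps = self.events[key]
        timestamps.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= timestamp - self.window_seconds:
            timestamps.popleft()
        return len(timestamps)


def process_stream(events, window_seconds=60, threshold=3):
    """Yield (timestamp, key) whenever a key reaches `threshold` events in-window."""
    counter = SlidingWindowCounter(window_seconds)
    for timestamp, key in events:
        if counter.add(key, timestamp) == threshold:
            yield timestamp, key


# Hypothetical page-view feed: (seconds since start, visitor id), in time order.
feed = [(0, "v1"), (10, "v2"), (20, "v1"), (30, "v1"), (200, "v2"), (210, "v1")]
alerts = list(process_stream(feed))
```

Here visitor `v1` triggers an alert at second 30, on the third page view inside one minute, which is the moment a recommendation could still influence the visit.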
36. What’s New in HDP 2.2
New and Improved YARN
Ready Engines
• Enterprise SQL at Hadoop Scale with
Stinger.next
• Enterprise Ready Spark on YARN
• Deep YARN integration for real-time
engines: HBase, Accumulo, Storm
• Enabling ISVs with a general SDK and API
for direct YARN integration
• Only solution to provide real-time to micro
batch for analyzing the internet of things
• Other engines/tools: Solr, Cascading
Continued Innovation of
Central Enterprise Services
• Centralized security administration
and policy enforcement
• Ease of use and operations agility
features to speed cluster
deployment
• 100% uptime target with cluster
rolling upgrades
Expanded Deployment Options
• Enhanced business continuity with
replication/archival across on-premises
and cloud storage tiers (Azure Blob, S3)
• Simultaneous ship of Windows and Linux
installs
• Expand Azure support beyond HDInsight
Azure to include HDP for Windows or
Linux in Azure VMs
HDP 2.2: Delivering Apache Hadoop for the Enterprise
37. Complete List of New Features in HDP 2.2
Apache Hadoop YARN
• Slide existing services onto YARN through ‘Slider’
• GA release of HBase, Accumulo, and Storm on
YARN
• Support long running services: handling of logs,
containers not killed when AM dies, secure token
renewal, YARN Labels for tagging nodes for specific
workloads
• Support for CPU Scheduling and CPU Resource
Isolation through CGroups
Apache Hadoop HDFS
• Heterogeneous storage: Support for archival
• Rolling Upgrade (This is an item that applies to the
entire HDP Stack. YARN, Hive, HBase, everything.
We now support comprehensive Rolling Upgrade
across the HDP Stack).
• Multi-NIC Support
• Heterogeneous storage: Support memory as a
storage tier (TP)
• HDFS Transparent Data Encryption (TP)
Apache Hive, Apache Pig, and Apache Tez
• Hive Cost Based Optimizer: Function Pushdown &
Join re-ordering support for other join types: star &
bushy.
• Hive SQL Enhancements including:
• ACID Support: Insert, Update, Delete
• Temporary Tables
• Metadata-only queries return instantly
• Pig on Tez
• Including DataFu for use with Pig
• Vectorized shuffle
• Tez Debug Tooling & UI
Hue
• Support for HiveServer 2
• Support for Resource Manager HA
Apache Spark
• Refreshed Tech Preview to Spark 1.1.0 (available
now)
• ORC File support & Hive 0.13 integration
• Planned for GA of Spark 1.2.0
• Operations integration via YARN ATS and Ambari
• Security: Authentication
Apache Solr
• Added Banana, a rich and flexible UI for visualizing
time series data indexed in Solr
Cascading
• Cascading 3.0 on Tez distributed with HDP
(coming soon)
Apache Falcon
• Authentication Integration
• Lineage – now GA. (it’s been a tech preview
feature…)
• Improve UI for pipeline management & editing: list,
detail, and create new (from existing elements)
• Replicate to Cloud – Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie
• Sqoop import support for Hive types via HCatalog
• Secure Windows cluster support: Sqoop, Flume,
Oozie
• Flume streaming support: sink to HCat on secure
cluster
• Oozie HA now supports secure clusters
• Oozie Rolling Upgrade
• Operational improvements for Oozie to better
support Falcon
• Capture workflow job logs in HDFS
• Don’t start new workflows for re-run
• Allow job property updates on running jobs
Apache HBase, Apache Phoenix, & Apache
Accumulo
• HBase & Accumulo on YARN via Slider
• HBase HA
• Replicas update in real-time
• Fully supports region split/merge
• Scan API now supports standby RegionServers
• HBase Block cache compression
• HBase optimizations for low latency
• Phoenix Robust Secondary Indexes
• Performance enhancements for bulk import into
Phoenix
• Hive over HBase Snapshots
• Hive Connector to Accumulo
• HBase & Accumulo wire-level encryption
• Accumulo multi-datacenter replication
Apache Storm
• Storm-on-YARN via Slider
• Ingest & notification for JMS (IBM MQ not
supported)
• Kafka bolt for Storm – supports sophisticated
chaining of topologies through Kafka
• Kerberos support
• Hive update support – Streaming Ingest
• Connector improvements for HBase and HDFS
• Deliver Kafka as a companion component
• Kafka install, start/stop via Ambari
• Security Authorization Integration with Ranger
Apache Slider
• Allow on-demand creation and running of different
versions of heterogeneous applications
• Allow users to configure different application
instances differently
• Manage operational lifecycle of application
instances
• Expand / shrink application instances
• Provide application registry for publish and
discovery
Apache Knox & Apache Ranger (Argus) & HDP
Security
• Apache Ranger – Support authorization and auditing
for Storm and Knox
• Introducing REST APIs for managing policies in
Apache Ranger
• Apache Ranger – Support native grant/revoke
permissions in Hive and HBase
• Apache Ranger – Support Oracle DB and storing of
audit logs in HDFS
• Apache Ranger to run on Windows environment
• Apache Knox to protect YARN RM
• Apache Knox support for HDFS HA
• Apache Ambari install, start/stop of Knox
Apache Ambari
• Support for HDP 2.2 Stack, including support for
Kafka, Knox and Slider
• Enhancements to Ambari Web configuration
management including: versioning, history and
revert, setting final properties and downloading client
configurations
• Launch and monitor HDFS rebalance
• Perform Capacity Scheduler queue refresh
• Configure High Availability for ResourceManager
• Ambari Administration framework for managing user
and group access to Ambari
• Ambari Views development framework for
customizing the Ambari Web user experience
• Ambari Stacks for extending Ambari to bring custom
Services under Ambari management
• Ambari Blueprints for automating cluster
deployments
• Performance improvements and enterprise usability
guardrails
38. Hortonworks Data Platform:
A comprehensive data management platform
[Diagram: Hortonworks Data Platform 2.2 component stack]
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig
• SQL: Hive
• Java, Scala: Cascading
• Stream: Storm
• Search: Solr
• NoSQL: HBase, Accumulo
• In-Memory: Spark
• Others: ISV Engines
• Layers shown beneath the engines: Tez, Slider
• YARN: Data Operating System (Cluster Resource Management)
• HDFS (Hadoop Distributed File System), nodes 1…N
GOVERNANCE
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS
• Resources: YARN
• Access: Hive, …
• Pipeline: Falcon
• Cluster: Knox
• Cluster: Ranger
OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
Deployment Choice: Linux, Windows, On-Premises, Cloud
• YARN is the architectural center of HDP
• Enables batch, interactive and real-time workloads
• Provides comprehensive enterprise capabilities
• The widest range of deployment options
• Delivered Completely in the OPEN