Enterprise Apache Hadoop: State of the Union
1. Hortonworks: We Do Hadoop
“State of the Union” Webinar
Shaun Connolly, VP Strategy
@shaunconnolly, @hortonworks
January 22, 2014
© Hortonworks Inc. 2014
2. Today’s Webinar
• Apache Hadoop & Hortonworks Overview
• Hadoop’s Role
• Hadoop Adoption: From Apps to Lake
• Enterprise Hadoop Technology Directions
3. Our Mission:
Enable your Modern Data Architecture by
Delivering Enterprise Apache Hadoop
Our Vision:
More than Half the World's Data Will Be Processed by Apache Hadoop
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Reseller Partners
Our Commitment
Open Leadership
Drive innovation in the open exclusively via the
Apache community-driven open source process
Enterprise Rigor
Engineer, test and certify Apache Hadoop with
the enterprise in mind
Ecosystem Endorsement
Focus on deep integration with existing data
center technologies and skills
4. Apache Community Process
Apache Software Foundation
Guiding Principles
• Release early & often
• Transparency, respect, meritocracy
Key Roles
• PMC Members
– Managing community projects
– Mentoring new incubator projects
• Committers
– Authoring, reviewing & editing code
• Release Managers
– Testing & releasing projects
Apache Community Projects: Apache Hadoop, Apache HBase, Apache Hive, Apache Pig, Apache Storm, Apache Falcon, Apache Ambari
Project cycle (diagram): Design & Develop, Test & Patch, Release
5. Hortonworks Process for Enterprise Hadoop
Upstream Community Projects: Apache Hadoop, Apache HBase, Apache Hive, Apache Pig, Apache Storm, Apache Falcon, Apache Ambari
Upstream cycle (diagram): Design & Develop, Test & Patch, Release
Downstream Enterprise Product: HDP 2.0
Downstream cycle (diagram): Integrate & Test, Package & Certify, Distribute
Flows between the two (diagram): Fixed Issues, Stable Project Releases
Certified at scale using the most advanced Hadoop test bed on the planet
• 1000’s of production nodes at Yahoo!
• Over 1500 unit & system tests
Virtuous cycle when development & fixed issues done
upstream & stable project releases flow downstream
6. Hadoop’s Role…
“Hadoop is becoming a more ‘normal’
software market” and the “Hadoop vendor
ecosystem [is] gaining critical mass”
Tony Baer, Ovum
7. A Traditional Approach Under Pressure
APPLICATIONS: Custom Applications, Business Analytics, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
• 2.8 ZB in 2012
• 85% from New Data Types
• 15x Machine Data by 2020
• 40 ZB by 2020
Source: IDC
8. Unlock Value in New Types of Data
1. Social
Understand how people are feeling and interacting –
right now
2. Clickstream
Capture and analyze website visitors’ data trails and
optimize your website
3. Sensor/Machine
Discover patterns in data streaming from remote
sensors and machines
4. Geographic
Analyze location-based data to manage operations
where they occur
5. Server Logs
Diagnose process failures and prevent security
breaches
6. Unstructured (txt, video, pictures, etc.)
Understand patterns in files across millions of web
pages, emails, and documents
+ Online archive
Data that was once purged or moved
to tape can be stored in Hadoop to
discover long term trends and
previously hidden value
9. A Modern Data Architecture Enabled
APPLICATIONS: Custom Applications, Business Analytics, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
• Complement Data Systems
• Right Workload Right Place
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
10. A Modern Data Architecture Applied
APPLICATIONS: BusinessObjects BI
DEV & DATA TOOLS
OPERATIONAL TOOLS
DATA SYSTEM: RDBMS, EDW, HANA, MPP
INFRASTRUCTURE
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
11. Major Vendors Have Embraced Hadoop
HDInsight & HDP for Windows
• Only Hadoop Distribution for Windows Azure & Windows Server
• Native integration with SQL Server, Excel, and System Center
• Extends Hadoop to .NET community
Teradata Portfolio for Hadoop
• Seamless data access between Teradata and Hadoop (SQL-H)
• Simple management & monitoring with Viewpoint integration
• Flexible deployment options
Instant Access + Infinite Scale
• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
• Enables analytics apps (BOBJ) to interact with Hadoop
12. Hadoop Adoption
“Hadoop’s momentum is unstoppable as its open
source roots grow wildly into enterprises. Its refreshingly
unique approach to data management is transforming how
companies store, process, analyze, and share big data”
--Mike Gualtieri, Forrester
13. Drivers of Hadoop Adoption
New Analytic Apps
New Types of Data
LOB Driven
Axes (diagram): SCALE, SCOPE
14. 20 Common Business Applications
Industry | Use Case | Type of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDRs) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360° View of the Customer | Clickstream, Text
Retail | Localized, Personalized Promotions | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
Government | Sentiment Analysis for Government Programs | Social
16. The Journey Towards a Data Lake
DATA LAKE
An architectural shift in the data center that uses Hadoop to deliver deep insight across a large, broad, diverse set of data at efficient scale
Applications along the journey (diagram):
• Risk Management, e.g., Fraud Reduction
• New Business, e.g., Data as a Product
• Customer Intimacy, e.g., 360 Degree View of the Customer
• Operational Excellence, e.g., Network Maintenance
Axes (diagram): DATA (TB’s to PB’s), VALUE
17. Drivers of the Data Lake
DATA LAKE
Data Access + Hadoop = BROAD INSIGHT
Access your data simultaneously in multiple ways
• Allows simultaneous access by and timely insights for all your users across all your data
• Enabled schema on read & enterprise-wide pool of data
Irrespective of the processing engine, analytical application or presentation
Data Management + Hadoop = EFFICIENT SCALE
Store and process all of your Corporate Data Assets
• Acquire all data in original format and store in one place, cost effectively and for an unlimited time
• Scale horizontally and to petabyte scale
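To make "schema on read" concrete, here is a minimal sketch in plain Python, not Hadoop code. It assumes the data is stored once in its original text form and that each reader supplies its own schema when it reads. The sample records, the schema names and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Raw data is stored unchanged; this string is a small in-memory stand-in
# for a file kept in its original format.
RAW_EVENTS = "2014-01-22,web,42\n2014-01-22,sensor,17\n"


def read_with_schema(raw_text, schema):
    """Apply (name, cast) pairs to each raw row at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


# Two hypothetical readers interpret the same stored bytes differently.
analytics_schema = [("day", str), ("source", str), ("count", int)]
audit_schema = [("day", str), ("source", str), ("count", str)]

typed = read_with_schema(RAW_EVENTS, analytics_schema)
untyped = read_with_schema(RAW_EVENTS, audit_schema)
total = sum(row["count"] for row in typed)
print(total)
```

The stored text never changes. Only the schema handed to the reader decides whether `count` is a number or a string, which is what lets several tools share one pool of data.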
18. Data Lake Transforms Your Architecture
APPLICATIONS: Custom Applications, Business Analytics, Packaged Applications
DATA LAKE
• BROAD INSIGHT (Data Access): Access your data simultaneously in multiple ways
• EFFICIENT SCALE (Data Management): Store and process all of your Corporate Data Assets
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
20. What’s Needed for Enterprise Hadoop?
1. Key Services
Platform, Operational and Data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
Platform (diagram):
• OPERATIONAL SERVICES: AMBARI (Cluster Mgmt), FALCON* (Dataset Mgmt), OOZIE (Schedule)
• DATA SERVICES: FLUME, SQOOP, NFS, WebHDFS (Data Movement); PIG, HIVE & HCATALOG, HBASE (Data Access)
• CORE SERVICES: MAP REDUCE, TEZ (Process); YARN (Resource Management); HDFS (Storage); KNOX* (Security)
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• OS/VM, Cloud, Appliance
21. What’s Needed for Enterprise Hadoop?
HORTONWORKS DATA PLATFORM (HDP)
1. Key Services
Platform, Operational and Data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
Platform (diagram):
• OPERATIONAL SERVICES: AMBARI (Cluster Mgmt), FALCON* (Dataset Mgmt), OOZIE (Schedule)
• DATA SERVICES: FLUME, SQOOP, NFS, WebHDFS (Load & Extract, Data Movement); PIG, HIVE, HCATALOG, HBASE (Data Access)
• CORE SERVICES: MAP REDUCE, TEZ (Process); YARN (Resource Management); HDFS (Storage); KNOX*
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
• OS/VM, Cloud, Appliance
22. Hadoop 2 & Beyond
details: hortonworks.com/labs
23. Hadoop 2: The Introduction of YARN
Store all data in one place, interact in multiple ways
1st Gen of Hadoop: Single Use System (Batch Apps)
• Classic Hadoop Apps
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
2nd Gen of Hadoop: Multi-Use Data Platform (Batch, Interactive, Online, Streaming, …)
• Batch: MapReduce
• Flexible Data Processing: Hive, Pig, others…
• Batch & Interactive: Tez
• Online Data Processing: HBase, Accumulo
• Stream Processing: Storm
• others …
• Efficient Cluster Resource Management & Shared Services (YARN)
• Redundant, Reliable Storage (HDFS)
24. Apache Hadoop YARN
The Data Operating System for Hadoop 2
Flexible
Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient
Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared
Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data Processing Engines Run Natively IN Hadoop
• BATCH: MapReduce
• INTERACTIVE: Tez
• ONLINE: HBase, Accumulo
• STREAMING: Storm
• IN-MEMORY: Spark
• OTHER: Open Source / Commercial
YARN: Cluster Resource Management
HDFS: Redundant, Reliable Storage
25. Apache Tez: Modern Execution Engine
Apache Tez is a modern & more efficient
alternative to MapReduce built on YARN
Supports BOTH Batch & Interactive workloads
– Used for Stinger initiative to enable interactive SQL for Apache Hive
– Hive and Pig will work on Tez
– Other solutions are considering Tez
Stack (diagram):
• Hive (SQL), Pig (data flow), MR (batch), OTHER (Open Source / Commercial)
• Tez (execution engine)
• YARN (cluster resource management)
• HDFS (redundant, reliable storage)
26. Batch AND Interactive SQL-IN-Hadoop
Apache Hive
Value Delivered
• The de facto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query
• Enables rapid insight over big data
SQL
Stinger Initiative
• Single engine for batch & interactive
• Preserves and transparently enhances
existing investments in use of Hive
– Ex. Hive-based solutions get 100x faster
• SQL compliance improves integration
with other data systems & tools
• New ORCFile reduces storage up to
70% while improving resource use,
scale, and throughput
Broad, community based effort to deliver the
next generation of Apache Hive
Speed
Improve Hive query performance by 100X to allow for interactive query times (seconds)
Scale
The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL
Support broadest range of SQL semantics for analytic applications against Hadoop
27. Speed: Delivering Interactive Query
TPC-DS Query 27: Pricing Analytics using Star Schema Join
• Hive 10: 1400s
• Hive 0.11 (Phase 1): 39s
• Trunk (Phase 3): 7.2s
• 190x Improvement
TPC-DS Query 82: Inventory Analytics Joining 2 Large Fact Tables
• Hive 10: 3200s
• Hive 0.11 (Phase 1): 65s
• Trunk (Phase 3): 14.9s
• 200x Improvement
All Results at Scale Factor 200 (Approximately 200GB Data)
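The improvement factors on this slide can be checked with simple arithmetic. The sketch below assumes that 1400s and 7.2s are the slowest and fastest times for Query 27, and 3200s and 14.9s those for Query 82. That pairing is inferred from the order of the figures, not stated explicitly.

```python
# Query times in seconds as read from the slide. The pairing of times to
# queries and Hive versions is an assumption based on their order.
timings = {
    "TPC-DS Query 27": {"Hive 10": 1400.0, "Trunk": 7.2},
    "TPC-DS Query 82": {"Hive 10": 3200.0, "Trunk": 14.9},
}


def speedup(baseline_s, improved_s):
    """Return how many times faster the improved run is."""
    return baseline_s / improved_s


factors = {
    query: round(speedup(t["Hive 10"], t["Trunk"]))
    for query, t in timings.items()
}
print(factors)
```

Under that assumption the ratios come out at roughly 194 and 215, in line with the rounded 190x and 200x quoted on the slide.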
28. SCALE: Interactive Query at Petabyte Scale
Sustained Query Times
Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale
Smaller Footprint
Better encoding with ORCFile in Apache Hive 0.12 reduces resource requirements for your cluster
• Larger Block Sizes
• Columnar format arranges columns adjacent within the file for compression & fast access
File Size Comparison Across Encoding Methods
Dataset: TPC-DS Scale 500 Dataset
• Encoded with Text: 585 GB (Original Size)
• Encoded with RCFile: 505 GB (14% Smaller)
• Encoded with Parquet (Impala): 221 GB (62% Smaller)
• Encoded with ORCFile (Hive 12): 131 GB (78% Smaller)
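The "% Smaller" figures follow directly from the sizes on the slide. This short sketch recomputes them, assuming the sizes map to the encodings in the order listed: 505 GB for RCFile, 221 GB for Parquet and 131 GB for ORCFile.

```python
ORIGINAL_GB = 585  # text-encoded dataset size from the slide

# Assumed mapping of sizes to encodings, following the slide's order.
encoded_gb = {"RCFile": 505, "Parquet": 221, "ORCFile": 131}


def percent_smaller(original, encoded):
    """Return the size reduction as a whole-number percentage."""
    return round(100 * (original - encoded) / original)


reductions = {
    name: percent_smaller(ORIGINAL_GB, size) for name, size in encoded_gb.items()
}
print(reductions)
```

The results, 14, 62 and 78 percent, match the figures quoted for each encoding.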
29. SQL: Enhancing SQL Semantics
SQL Compliance
Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Hive SQL Datatypes
• INT
• TINYINT/SMALLINT/BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
• TIMESTAMP
• BINARY
• DECIMAL
• ARRAY, MAP, STRUCT, UNION
• DATE
• VARCHAR
• CHAR
Hive SQL Semantics
• SELECT, INSERT
• GROUP BY, ORDER BY, SORT BY
• JOIN on explicit join key
• Inner, outer, cross and semi joins
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• Windowing Functions (OVER, RANK, etc)
• Custom Java UDFs
• Standard Aggregation (SUM, AVG, etc.)
• Advanced UDFs (ngram, Xpath, URL)
• Sub-queries for IN/NOT IN, HAVING
• Expanded JOIN Syntax
• INTERSECT / EXCEPT
Legend (diagram): Available; Hive 0.12 (HDP 2.0); Hive 13
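As an illustration of the windowing semantics listed above (OVER, RANK), here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, since no Hive cluster is assumed. The `sales` table and its columns are hypothetical, and Hive's own syntax and behaviour may differ in details.

```python
import sqlite3

# SQLite stands in for Hive here; window functions need SQLite 3.25 or later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "ann", 300), ("east", "bob", 200), ("west", "cy", 500)],
)

# Rank each rep within their region by amount, highest first.
ranked = conn.execute(
    "SELECT region, rep, amount, "
    "RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk "
    "FROM sales ORDER BY region, rnk"
).fetchall()
conn.close()

for row in ranked:
    print(row)
```

A window function adds the rank to every row without collapsing the rows, which a plain GROUP BY cannot do.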
30. Real-Time Streaming-IN-Hadoop
Apache Storm
A community-based effort to bring real-time processing to Hadoop
Goals:
HADOOP INTEGRATION
Making streaming a first-class component of a modern data architecture
ENTERPRISE CONNECTIVITY
Connecting Storm to the important streaming sources within the enterprise
IMPROVED MULTI-TENANCY
Increasing operations usability and enabling simple programming of new flows
Project Phases
Storm: Streaming in Hadoop (Coming Soon)
• Storm-on-YARN
• Installation with Ambari
• Ganglia & Nagios based monitoring
• Kafka, HBase, HDFS & Cassandra connectors
Storm: Enterprise Connectivity
• Notification and data persistence bolts: EDWs, RDBMS, JMS etc
• Data Ingest Spouts
• AD/LDAP plugin for authentication
• High Availability management w/ Ambari
Storm: Improved Multi-Tenancy
• Declarative “wiring”
• Hive update support
• Advanced scheduler
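The spouts and bolts mentioned above can be pictured as stages in a stream. Spouts emit data and bolts process it. The sketch below models that idea with plain Python generators. It is not the Storm API, and the function names and sample sentences are hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Hypothetical spout: emits a fixed sequence of sentences."""
    for sentence in ["storm on yarn", "storm on hadoop"]:
        yield sentence


def split_bolt(stream):
    """Hypothetical bolt: splits each sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Hypothetical bolt: keeps a running count per word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts


# "Wire" the stages together: spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
print(word_counts)
```

In a real deployment the stages would run in parallel across a cluster and the final bolt would persist results to a store such as an RDBMS or HBase. This sketch only shows the data flow.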
31. Simplified Data Processing for Hadoop
Apache Falcon
Create and implement reusable workflows for datasets to orchestrate movement and track lineage
Goals:
Acquisition & Processing Data
• Direct data to processing engines or formats
• Obfuscate or transform data
Replication & Retention Policy
• Replicate datasets
• Establish retention policies for datasets
Redirection & Extensions of Hadoop
• Redirect data to encrypt or decrypt
• Extract segments of data and redirect to other tools
Hortonworks Investment in Apache Falcon
Phase 1: (Q4 2013)
• Incubate Apache Falcon
• Dataset Replication
• Dataset Retention
• Falcon Tech Preview
Phase 2: (Coming Soon)
• Hive / HCatalog integration
• Basic Dashboard for Entity Viewing
• Kerberos security support
• Ambari integration for management
Phase 3 (Coming Soon)
• Advanced Dashboard for pipeline building
• Dataset lineage
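A retention policy of the kind described above amounts to a rule applied to each dated instance of a dataset. The following sketch shows that rule in plain Python. It is not Falcon's API or configuration format, and the `apply_retention` helper, the dates and the 10-day window are hypothetical.

```python
from datetime import date, timedelta


def apply_retention(instances, today, keep_days):
    """Split dataset instances into kept and evicted lists by age."""
    cutoff = today - timedelta(days=keep_days)
    kept = [d for d in instances if d >= cutoff]
    evicted = [d for d in instances if d < cutoff]
    return kept, evicted


# Hypothetical dated instances of one dataset.
instances = [date(2014, 1, 1), date(2014, 1, 15), date(2014, 1, 22)]
kept, evicted = apply_retention(instances, today=date(2014, 1, 22), keep_days=10)
print(kept, evicted)
```

Declaring the window once per dataset, rather than scripting deletions by hand, is what makes such a policy reusable across pipelines.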
32. Enterprise Hadoop Security Today
Authentication: Who am I/prove it? Control access to cluster.
• Kerberos in native Apache Hadoop
• Perimeter Security with Apache Knox Gateway
Authorization: Restrict access to explicit data
• Native in Apache Hadoop: MapReduce Access Control Lists, HDFS Permissions
• Cell level access control in Apache Accumulo
Audit: Understand who did what
• Native in Apache Hadoop: Process Execution audit trail
Data Protection: Encrypt data at rest & motion
• Wire encryption in native Apache Hadoop
• Orchestrated encryption with 3rd party tools
33. Hadoop Security – What’s Next?
Security in Enterprise Hadoop
Driving the next generation of
Hadoop security
Goals:
Flexible Authentication & Authorization
Improve authentication choices and provide more
granular access controls for the Hadoop platform,
services and data.
Improve Data Protection
Enhance Hadoop’s audit and data protection
capabilities to support broader enterprise
governance and compliance needs.
Work with Existing Systems
Integrate with existing enterprise security and
identity management systems in a consistent way.
Security Investments
Security Phase 1: (Delivered in HDP 2.0)
• Strong AuthN with Kerberos
• HBase, Hive, HDFS basic AuthZ
• Encryption with SSL for NN, JT, etc.
• Wire encryption with Shuffle, HDFS, JDBC
Security Phase 2: (Coming Soon)
• Knox: Hadoop Perimeter Security
• SQL-style Hive AuthZ (GRANT, REVOKE)
• ACLs for HDFS
• SSL support for Hive Server 2
• PAM support for Hive
Security Phase 3:
• Audit event correlation and Audit viewer
• NotOnlyKerberos – Support other Token-Based Authentication
• Data Encryption in HDFS, Hive & HBase
34. Operating Enterprise Hadoop at Scale
Apache Ambari is the only 100% open source
framework for provisioning, managing and
monitoring Apache Hadoop clusters
Architecture (diagram): AMBARI WEB, REST APIs, AMBARI SERVER, compute & storage nodes
PROVISION | MANAGE | MONITOR
Integration With Existing Operations Tools: Viewpoint, Others
COMING SOON!
Ambari Stacks: AMBARI-2714
Ambari Views: AMBARI-4234
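The REST APIs shown in the diagram are what allow existing operations tools to integrate with Ambari. As a rough sketch, the code below builds (but does not send) a request to list clusters. The `/api/v1/clusters` path and port 8080 are the commonly documented defaults but should be treated as assumptions for any given installation. The host name, credentials and `build_clusters_request` helper are hypothetical.

```python
import base64
import urllib.request


def build_clusters_request(host, user, password, port=8080):
    """Build an HTTP GET request for the cluster list with basic auth."""
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical host and credentials; nothing is sent over the network here.
req = build_clusters_request("ambari.example.com", "admin", "secret")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send the request to a running server.
```

Keeping request construction separate from sending makes the call easy to inspect or test without a live cluster.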
35. Recap
• Hadoop's role is becoming clear
• Major vendors have recognized Hadoop’s role and are
actively integrating it into their solutions
• Adoption path is consistent: from apps to lake
• Open source innovation continues unabated
– YARN opens up the platform, and as adoption deepens, the
community of committers is working to mature it even further
36. Try Hadoop Today… Get Involved
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
Amsterdam
April 2 - 3, 2014
REGISTER NOW
San Jose, CA
June 3 - 5, 2014
CALL FOR PAPERS OPEN