Introduction to Hadoop
- 1. Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Introduction to Hadoop
Eric Mizell – Director, Solution Engineering
Hortonworks. We do Hadoop.
- 5.
Quick Audience Poll
Which best describes how your org is using Hadoop?
A. We’re using Hadoop
B. We’re in the process of getting Hadoop integrated
C. We don’t have Hadoop installed
D. What’s Hadoop?
E. I don’t know
- 6.
Big Data, Hadoop, and the Modern Data Architecture
- 7.
Big Data Explosion
Big Data Market Trends & Projections
• 20%: % by which orgs leveraging modern info management systems outperform peers by 2015
• 1 Zettabyte (ZB) = 1 Billion TBs
• 15x: growth rate of machine generated data by 2020
• The US has 1/3 of the world’s data
• Big Data is 1 of 5 US GDP Game Changers: $325 billion incremental annual GDP from big data analytics in retail and manufacturing by 2020
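A quick check of the unit conversion on this slide. This is a minimal sketch that assumes decimal (SI) units, where 1 TB is 10^12 bytes and 1 ZB is 10^21 bytes; the constant names are illustrative and do not come from the deck.

```python
# Decimal (SI) storage units. Binary units (TiB, ZiB) would give a different ratio.
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21

# Number of terabytes that fit in one zettabyte.
tb_per_zb = BYTES_PER_ZB // BYTES_PER_TB
print(f"1 ZB = {tb_per_zb:,} TB")  # prints "1 ZB = 1,000,000,000 TB"
```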
- 8.
Existing Siloed Data Architectures Under Pressure
[Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications) over a DATA SYSTEM of siloed RDBMS, EDW, and MPP stores, fed by SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)]
Data growth: New Data Types, 85% (Source: IDC)
[New data types shown: OLTP, ERP, CRM Systems; Unstructured docs, emails; Clickstream; Server logs; Social/Web Data; Sensor, Machine Data; Geolocation]
• Can’t manage new data paradigm
• Constrains data to specific schema
• Siloed data
• Limited scalability
• Economically unfeasible
• Limited analytics
- 9.
Hadoop is Driving the New Data-driven Era of IT
[Figure: three eras of IT (1st Era, 2nd Era, 3rd Era) compared by Goal, Data, and Technology. Recoverable labels: Automation + Efficiency, Processing Power, Real-time Data Driven, Mainframe, RDBMS]
- 10.
Key Drivers of Hadoop
A. Unlock New Approach to Analytics
• Agile analytics via “Schema on Read” with ability to store all data in native format
• Create new apps from new types of data
B. Optimize Investments, Cut Costs
• Focus EDW on high value workloads
• Use commodity servers & storage to enable all data (original and historical) to be accessible for ongoing exploration
C. Enable a Modern Data Architecture
• Integrate new & existing data sets
• Make all data available for shared access and processing in multitenant infrastructure
• Batch, interactive & real-time use cases
• Integrated with existing tools & skills
[Diagram: OPERATIONS TOOLS (Provision, Manage & Monitor) and DEV & DATA TOOLS (Build & Test) alongside APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications), DATA SYSTEM REPOSITORIES (RDBMS, EDW, MPP), and Hadoop (YARN: Data Operating System running Batch, Interactive, and Real-Time workloads over HDFS: Hadoop Distributed File System). SOURCES: EXISTING Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured]
- 11.
Shift to Data-driven Means Treating Data like Capital
Hadoop enables organizations to cost effectively store and use all of the data available in a way that shifts the business from Reactive to Proactive:
• A shift in Advertising: from mass branding to 1x1 targeting
• A shift in Financial Services: from educated investing to automated algorithms
• A shift in Healthcare: from mass treatment to designer medicine
• A shift in Retail: from static branding to real-time personalization
• A shift in Manufacturing: from break then fix to repair before break
- 12.
Enterprise Goals for the Modern Data Architecture
✓ Centrally manage new and existing data
✓ Data needs flexibility and lands in Hadoop without schema
✓ Prepare data with no predetermined questions
✓ User self-service – no limit to questions
✓ Run batch, interactive & real time analytic applications on shared datasets
✓ Leverage new and existing data center infrastructure investments
✓ Scalable and affordable; low cost per TB
[Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications) over a DATA SYSTEM of RDBMS, EDW, and MPP stores (CRM, ERP, Other) plus Hadoop nodes 1 to N (YARN: Data Operating System running Batch, Interactive, and Real-Time workloads on HDFS, the Hadoop Distributed File System). SOURCES: EXISTING Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured]
- 13.
YARN and HDP Enable the Modern Data Architecture
YARN is the architectural center of Hadoop and HDP
• YARN enables a common data set across all applications
• Batch, interactive & real-time workloads
• Support multi-tenant access & processing
HDP enables Apache Hadoop to become an enterprise-viable data platform with centralized services
• Security
• Governance
• Operations
• Productization
Enabled broad ecosystem adoption
Hortonworks drove this innovation of Hadoop through YARN
Hortonworks Data Platform 2.2
[Diagram: HDP 2.2 components]
• BATCH, INTERACTIVE & REAL-TIME DATA ACCESS on YARN: Data Operating System (Cluster Resource Management): Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), In-Memory (Spark), Others (ISV Engines); Tez and Slider appear as layers between several engines and YARN
• Storage: HDFS (Hadoop Distributed File System)
• GOVERNANCE: Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• SECURITY: Authentication, Authorization, Audit, Data Protection (Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox)
• OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, On-Premises, Cloud
- 14.
Modern Data Architecture
Deep Partnerships
Hortonworks engages in deep engineered relationships with the leaders in the data center, such as Microsoft, HP, Teradata, SAS, SAP & Red Hat
Broad Partnerships
Over 600 partners work with us to certify their applications to work with Hadoop so they can extend big data to their users
[Diagram: OPERATIONAL TOOLS, DEV & DATA TOOLS, and INFRASTRUCTURE around APPLICATIONS (BusinessObjects BI), a DATA SYSTEM (RDBMS, EDW, HANA), and Hadoop nodes 1 to N (YARN: Data Operating System running Batch, Interactive, and Real-Time workloads on HDFS, the Hadoop Distributed File System). SOURCES: EXISTING Systems, Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured]
- 15.
Hadoop unlocks a new approach: Iterative Analytics
Current Reality
• Apply schema on write
• Dependent on IT
• Repeatable Process: SQL Only
• Steps: Determine list of questions; Design solutions; Collect structured data; Ask questions from list; Detect additional questions
✚ Augment w/ Hadoop
• Apply schema on read
• Support range of access patterns to data stored in HDFS: polymorphic access
• Iterate over structure; Transform and Analyze
• Right Engine, Right Job: batch, interactive, real-time, in-memory
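The difference between schema on write and schema on read can be sketched in a few lines. This is an illustration only, in plain Python rather than any Hadoop API: the sample events, the `read_with_schema` helper, and the assumed CSV column order are all hypothetical. Raw records are kept in their native format, and each new question applies its own schema when the data is read.

```python
import csv
import io
import json

# Raw events stored exactly as they arrived (native format).
# Nothing is validated or reshaped at write time.
RAW_EVENTS = [
    '{"user": "a1", "page": "/home", "ms": 120}',
    '{"user": "b2", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    'c3,/home,95',  # a different source wrote CSV instead of JSON
]

def read_with_schema(raw_line, fields):
    """Apply a schema at read time: parse the line, then project and cast fields."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        # Fall back to positional CSV columns (assumed order: user, page, ms).
        values = next(csv.reader(io.StringIO(raw_line)))
        record = dict(zip(("user", "page", "ms"), values))
    # Fields missing from a record become None instead of failing the read.
    return {name: cast(record[name]) if name in record else None
            for name, cast in fields.items()}

# Question 1: page latency. Only "page" and "ms" matter here.
latency = [read_with_schema(line, {"page": str, "ms": int}) for line in RAW_EVENTS]

# Question 2, asked later: coupon usage. Same raw data, different schema.
coupons = [read_with_schema(line, {"user": str, "coupon": str}) for line in RAW_EVENTS]

print(latency)
print(coupons)
```

With schema on write, the coupon field would have been dropped or rejected unless the table had been designed for it up front; here the second question needs no reload of the source data.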
- 16.
Hadoop delivers compelling economics
EDW Optimization
Current Reality
• EDW at capacity: some usage from low value workloads
• Older data archived, unavailable for ongoing exploration
• Source data often discarded
• EDW workload mix: OPERATIONS 50%, ANALYTICS 20%, ETL PROCESS 30%
✚ Augment w/ Hadoop
• Free up EDW resources from low value tasks
• Keep 100% of source data and historical data for ongoing exploration
• Mine data for value after loading it because of schema-on-read
• EDW workload mix: OPERATIONS 50%, ANALYTICS 50%
• Hadoop: Parse, Cleanse, Apply Structure, Transform
Commodity Compute & Storage
Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure
[Chart: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), axis from $0 to $180,000, comparing MPP, SAN, Engineered System, NAS, HADOOP, and Cloud Storage]
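The two workload breakdowns on this slide can be read as a simple reallocation. The sketch below assumes that reading: the first set of percentages is the current EDW mix, and moving ETL to Hadoop hands its whole share to analytics. The `offload` function is a hypothetical helper for illustration, not something from the deck.

```python
# Shares of EDW capacity (percent), taken from the slide's first breakdown.
current = {"operations": 50, "analytics": 20, "etl": 30}

def offload(shares, workload, beneficiary):
    """Move one workload off the EDW and give its freed share to another workload."""
    freed = shares[workload]
    result = {name: pct for name, pct in shares.items() if name != workload}
    result[beneficiary] += freed
    return result

augmented = offload(current, "etl", "analytics")
print(augmented)  # prints {'operations': 50, 'analytics': 50}
```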
- 17.
How to Get Started with Hadoop
- 18.
Try Hadoop Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Learn Hadoop
Build a Proof of Concept
Test New Functionality
- 19. © Hortonworks Inc. 2013
5 Reasons Hadoop is Kicking Cans and Taking Names
Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data.
Forrester believes that Hadoop will become must-have infrastructure for large enterprises.
Here are five reasons firms should adopt Hadoop today:
1. Build a data lake with the Hadoop file system (HDFS)
2. Enjoy cheap, quick processing with MapReduce
3. Data scientists can wrangle big data faster
4. Even the POC can make you money
5. The future of Hadoop is real-time and transactional
http://blogs.forrester.com/mike_gualtieri/13-10-22-5_reasons_hadoop_is_kicking_can_and_taking_names
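For readers new to the MapReduce model named in reason 2, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) as a word count. It is plain Python for illustration and is not Hadoop's MapReduce API; all function names are hypothetical. On a cluster, the map and reduce calls would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data", "Big clusters process data"])
print(counts)
```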
- 21.
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com