vBACD July 2012 - Apache Hadoop, Now and Beyond
1. Apache Hadoop & the Cloud
Jim Walker
Dir. Product Marketing, Hortonworks
Twitter @jaymce
July 10, 2012
© Hortonworks Inc. 2012
2. 1941
2012
3. Big data market segments
Hardware
• Storage
• Servers
• Networking
Software Distributions
• OSS Apache Hadoop
• Enterprise Distributions
• Non-Hadoop big data frameworks
ETL & Mgmnt
• Distributed file stores
• NoSQL databases
• Data integration
• Data quality & governance
Analytics
• Analytic application development platforms
• Advanced analytics applications
Applications
• Data visualization tools
• Business intelligence applications
Services
• Consulting
• Training
• Tech support
• Software maintenance
• Hardware maintenance
• Hosting
Next Generation Data Warehouse
• MPP columnar data warehouse appliances
• In-memory analytics engines
• Fast data loading
4. Big data market segments
(Same segment chart as slide 3, with four "cloud" labels overlaid on the segments.)
5. Analytics started with basic purchase history…
ERP (Megabytes): Purchase detail; Purchase record; Payment record
Axis: Increasing Data Variety and Complexity
Source: Created in conjunction with Teradata, Inc.
6. then we added customer information…
CRM (Gigabytes): Segmentation; Customer Touches; Support Contacts; Offer details
ERP (Megabytes): Purchase detail; Purchase record; Payment record
Axis: Increasing Data Variety and Complexity
Source: Created in conjunction with Teradata, Inc.
7. and the web started to impact…
WEB (Terabytes): Web logs; A/B testing; Behavioral Targeting; Dynamic Pricing; Search Marketing; Affiliate Networks; Dynamic Funnels; Offer history
CRM (Gigabytes): Segmentation; Customer Touches; Support Contacts; Offer details
ERP (Megabytes): Purchase detail; Purchase record; Payment record
Axis: Increasing Data Variety and Complexity
Source: Created in conjunction with Teradata, Inc.
8. Big data changes the game
Transactions + Interactions + Observations = BIG DATA
BIG DATA (Petabytes): Mobile Web; Sentiment; User Click Stream; SMS/MMS; Speech to Text; Social Interactions & Feeds; Spatial & GPS Coordinates; Sensors / RFID / Devices; Business Data Feeds; External Demographics; User Generated Content; HD Video, Audio, Images; Product/Service Logs
WEB (Terabytes): Web logs; A/B testing; Behavioral Targeting; Dynamic Pricing; Search Marketing; Affiliate Networks; Dynamic Funnels; Offer history
CRM (Gigabytes): Segmentation; Customer Touches; Support Contacts; Offer details
ERP (Megabytes): Purchase detail; Purchase record; Payment record
Axis: Increasing Data Variety and Complexity
Source: Created in conjunction with Teradata, Inc.
9. Next-gen data architecture drivers
Business Drivers
• Enable new business models & drive faster growth (20%+)
• Find insights for competitive advantage & optimal returns
Technical Drivers
• Data continues to grow exponentially
• Data is increasingly everywhere and in many formats
• Legacy solutions unfit for new requirements growth
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source
10. Apache Hadoop
Open Source Data Management Software
One of the best examples of open source
driving innovation and creating a market
• Foundation for big data solutions
• Enables a rational economics model
• Powers data-driven business
• Commodity hardware
• Loosely coupled, ship early/ship often
• Consists of many specialized sub-projects
11. Apache Hadoop & Cloud Makes Sense
• Broader access to Hadoop for end users, IT professionals, and developers
• Easy installation and configuration and simplified programming
• Enterprise-ready distribution with greater security, performance, ease of management and options for hybrid IT usage
• Integrate with everything via RESTful API
• Spin up a cluster on demand
• Ease management
12. 5 Reasons for Hadoop in the Cloud
People say "should you run Hadoop in the cloud?"
I say "it depends".
http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
13. 5 Reasons for Hadoop in the Cloud
1 If your data is stored in a cloud, local analysis
may make more sense… "work near the data"
2 For periodic processing (nightly, etc…)
it might make sense to just rent.
3 No upfront capital expense,
fund from success
4 Easier to expand a cluster;
no need to buy just find
5 Eliminate networking concerns
http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
14. What is Apache Hadoop?
1 PROCESSING – Map/Reduce
• Splits a task across processors “near” the data & assembles results
• 2004 Google white paper, "MapReduce: Simplified Data Processing on Large Clusters"
• Base of much new tech
2 STORAGE – Hadoop Distributed File System
• Distributed across “nodes”
• Natively redundant
• Name node tracks locations
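As a rough illustration of the Map/Reduce idea above, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the shuffle step only stands in for the grouping that a real framework performs across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map calls would run on the nodes that hold each block of input, which is the "near the data" point made on the slide.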
15. Apache Hadoop related projects
3 Hive
4 HBase
5 HCatalog
6 Pig
7 Oozie
8 Ambari
9 Sqoop
10 ZooKeeper
Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL (HQL).
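To make "project structure onto this data" a little more concrete, the sketch below applies a schema to raw tab-delimited lines at read time and then summarizes them. This is plain Python rather than Hive: the PageView fields, the sample rows and the function names are all assumptions made for illustration. In HiveQL the summary step would look roughly like SELECT url, SUM(seconds) FROM page_views GROUP BY url.

```python
from collections import namedtuple

# Hypothetical schema for raw log lines of the form: user<TAB>url<TAB>seconds
PageView = namedtuple("PageView", ["user", "url", "seconds"])


def project(raw_lines, delimiter="\t"):
    """Apply the schema to each raw line when it is read (schema on read)."""
    for line in raw_lines:
        user, url, seconds = line.rstrip("\n").split(delimiter)
        yield PageView(user, url, int(seconds))


def total_seconds_by_url(views):
    """Summarize: total time spent per URL, like a GROUP BY with SUM."""
    totals = {}
    for view in views:
        totals[view.url] = totals.get(view.url, 0) + view.seconds
    return totals


if __name__ == "__main__":
    raw = ["ann\t/home\t12", "bob\t/home\t8", "ann\t/cart\t30"]
    print(total_seconds_by_url(project(raw)))
```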
16. Apache Hadoop related projects
4 HBase
HBase is a non-relational database. It is column-oriented and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
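The "sparse data" point can be pictured with a toy in-memory table in which each row stores only the columns it actually has. This is a sketch of the data model only, not the HBase client API; the SparseTable class, its methods and the column names are hypothetical.

```python
class SparseTable:
    """Toy table: row key -> {column name -> value}; absent cells take no space."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value):
        """Insert a new cell or update an existing one."""
        self._rows.setdefault(row, {})[column] = value

    def get(self, row, column, default=None):
        """Return the cell value, or default if the row or column is absent."""
        return self._rows.get(row, {}).get(column, default)

    def delete(self, row, column):
        """Remove one cell; drop the row entirely once it has no cells left."""
        cells = self._rows.get(row)
        if cells is not None:
            cells.pop(column, None)
            if not cells:
                del self._rows[row]

    def stored_cells(self):
        """Count only the cells that were actually written."""
        return sum(len(cells) for cells in self._rows.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user1", "info:name", "Ann")
    table.put("user2", "info:email", "bob@example.com")
    print(table.stored_cells())
```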
17. Apache Hadoop related projects
5 HCatalog
HCatalog is a metadata management service for Apache Hadoop. It opens up the platform and allows interoperability across data processing tools such as Pig, MapReduce and Hive. It also provides a table abstraction so that users need not be concerned with where or how their data is stored.
Aster SQL-H interfaces with HCatalog
18. Apache Hadoop related projects
6 Pig
Apache Pig allows you to write complex MapReduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort, among others. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java and then call directly from the language.
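The three transformations named above (aggregate, join, sort) can be sketched in plain Python to show what such a pipeline computes. This is not Pig Latin and does not run on Hadoop; the sample data and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical inputs: (customer, amount) orders and a customer -> region lookup.
orders = [("alice", 30), ("bob", 20), ("alice", 15)]
regions = {"alice": "east", "bob": "west"}


def total_by_customer(rows):
    """Aggregate: sum the order amounts per customer."""
    totals = defaultdict(int)
    for customer, amount in rows:
        totals[customer] += amount
    return dict(totals)


def join_region(totals, region_of):
    """Join: attach each customer's region, keeping only customers present in both."""
    return [(c, region_of[c], total) for c, total in totals.items() if c in region_of]


def sort_by_total(rows):
    """Sort: largest total first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


result = sort_by_total(join_region(total_by_customer(orders), regions))

if __name__ == "__main__":
    print(result)
```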
19. Apache Hadoop related projects
7 Oozie
Oozie coordinates jobs written in multiple languages such as MapReduce, Pig and Hive. It is a workflow system that links these jobs and allows specification of order and dependencies between them.
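The idea of ordering jobs by their dependencies can be sketched with a topological sort. The example below uses Python's standard graphlib module (Python 3.9 or later), not Oozie itself, which defines workflows in its own XML format; the job names and the run_order helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "load_web_logs": set(),
    "load_orders": set(),
    "clean_logs_pig": {"load_web_logs"},
    "join_hive": {"clean_logs_pig", "load_orders"},
    "report_mapreduce": {"join_hive"},
}


def run_order(dependencies):
    """Return one valid execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in run_order(workflow):
        print("run", job)
```

If the dependencies contain a cycle, static_order raises graphlib.CycleError, which is one way a scheduler can reject an invalid workflow before running anything.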
20. Apache Hadoop related projects
8 Ambari
Apache Ambari operationalizes Hadoop. It provides a mechanism to monitor and manage a cluster. It also provisions nodes.
Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters.
21. Apache Hadoop related projects
9 Sqoop
Sqoop is a set of tools for moving data between Hadoop and non-Hadoop data stores such as traditional relational databases and data warehouses.
22. Apache Hadoop related projects
10 ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
23. Hadoop in Action
Figure: Website Interactions (Web Logs), Order DB (Data) and Customer DB (Data) feed the Big Data Refinery.
1 Web Log files via WebHDFS APIs
2 Customer & Order data via Talend & HCatalog for schema
3 Pre-processes, refines, and joins data via Talend, Pig, & HCatalog
4 Interfaces with HCatalog to analyze website visits by the type of end results
24. Hortonworks Vision & Role
We believe that by the end of 2015,
more than half the world's data will be
processed by Apache Hadoop.
1 Be diligent stewards of the open source core
2 Be tireless innovators beyond the core
3 Provide robust data platform services & open APIs
4 Enable the ecosystem at each layer of the stack
5 Make the platform enterprise-ready & easy to use
25. Balancing Innovation & Stability
Figure: technology adoption curve (relative % of customers over time), with "The CHASM" between the early adopters and the early majority.
• Innovators, technology enthusiasts
• Early adopters, visionaries
• Early majority, pragmatists
• Late majority, conservatives
• Laggards, skeptics
Before the chasm: Customers want technology & performance
After the chasm: Customers want solutions & convenience
Source: Geoffrey Moore - Crossing the Chasm
26. Enabling Hadoop as Enterprise Big Data Platform
Applications, Business Tools, Development Tools, Open APIs and access, Data Movement & Integration, Data Management Systems, Systems Management
Installation & Configuration, Administration, Monitoring, High Availability, Replication, Multi-tenancy, ..
Hortonworks Data Platform
DEVELOPER: Data Platform Services & Open APIs
Metadata, Indexing, Search, Security, Management, Data Extract & Load, APIs
27. Hortonworks Data Platform
The ONLY 100% open source data
platform for Hadoop
• Tightly aligned with core Apache code line
• All code committed back to open source
• Most complete Apache Hadoop platform
• Comprehensive management and monitoring
• Intuitive graphical data integration tools
• Centralized metadata services for easy data sharing
28. Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data source
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
Hortonworks Data Platform
Delivers enterprise grade functionality on a proven Apache Hadoop distribution to ease management, simplify use and ease integration into the enterprise
The only 100% open source data platform for Apache Hadoop
29. Apache Distribution Stack
Built on Hadoop 1.0 (a.k.a. 0.20.205)
• Proven at large scale enterprise implementations
• Most stable and reliable version of Hadoop to date
• First Apache line supporting security, HBase, WebHDFS
• Driven by core committers and architects at Hortonworks
Includes necessary components already integrated and tested together
Components shown: ZooKeeper, HCatalog, Ambari, HBase, Talend, Sqoop, Oozie, Core, Hive, Pig
Versions shown: 1.0.3 0.4.0 0.9.2 0.9.0+ 0.92.1+ 0.9.0+ 3.1.3 3.3.4 beta 5.1.1
Most stable versions of all components are chosen
Hortonworks Distribution
Tested, Hardened & Proven Distribution Reduces Risk
30. Management & Monitoring Svcs
Hortonworks Management Center
– View the health of cluster operations,
server utilization and performance levels
– Customizable dashboards
– APIs for integration into 3rd party
monitoring tools
– 100% open source management & monitoring, powered by Apache Ambari, Puppet, Nagios and Ganglia
– Simple wizard-based installation,
configuration & provisioning of any size
Hadoop cluster
Optimize performance for your Hadoop cluster
Simplify Installation and provisioning
31. Data Integration Services
• Intuitive graphical data
integration tools for HDFS,
Hive, HBase, HCatalog and Pig
• Oozie scheduling allows you to
manage and stage jobs
• Connectors for any database,
business application or system
• Integrated HCatalog storage
Bridge the gap between
legacy data & Hadoop
Simplify and speed development
32. Which is best for the cloud?
vs.
33. Metadata Services
Apache HCatalog provides flexible metadata
services across tools and external access
• Consistency of metadata and data models across tools
(MapReduce, Pig, HBase and Hive)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via REST API
HCatalog: Shared table and schema management opens the platform
From:
• Raw Hadoop data
• Inconsistent, unknown
• Tool specific access
To:
• Table access
• Aligned metadata
• REST API
34. Services Integration
Provides RESTful API as “front door” for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat-clients in gateway
• Insulation from interface changes release to release
Figure layers, top to bottom:
Existing & New Applications
RESTful Web Services: WebHDFS, HCatalog
MapReduce, Pig, Hive
HCatalog
HDFS, HBase, External Store
Opens Hadoop to integration with existing and new applications
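To give a feel for the "front door" idea, the sketch below builds WebHDFS request URLs in Python without sending them. The /webhdfs/v1 path prefix and the op query parameter (for example OPEN or LISTSTATUS) follow the WebHDFS REST convention; the host name, the file paths and the webhdfs_url helper are hypothetical, and port 50070 is assumed as the NameNode HTTP port of a default Hadoop 1.x setup.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS URL for one operation on an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /logs/web.log")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Hypothetical host and paths, shown only to illustrate the URL shape.
    print(webhdfs_url("namenode.example.com", "/logs/web/2012-07-10.log", "OPEN"))
    print(webhdfs_url("namenode.example.com", "/logs", "LISTSTATUS"))
```

Because the interface is plain HTTP, any language with an HTTP client can issue these requests, which is the "languages other than Java" point on the slide.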
35. Use cases: optimize outcomes at scale
Media optimize Content
Intelligence optimize Detection
Investment optimize Algorithms
Advertising optimize Performance
Fraud optimize Prevention
Regulation optimize Compliance
Retail / Wholesale optimize Inventory turns
Manufacturing optimize Supply chains
Healthcare optimize Patient outcomes
Education optimize Learning outcomes
Government optimize Citizen services
Source: Geoffrey Moore. Hadoop Summit 2012 keynote presentation.
36. Connecting Transactions + Interactions + Observations
Figure: data sources flow into the Big Data Refinery, which connects to Business Transactions & Interactions and to Business Intelligence & Analytics.
Data sources: Audio, Video, Images; Docs, Text, XML; Web Logs, Clicks; Social, Graph, Feeds; Sensors, Devices, RFID; Spatial, GPS; Events, Other
Business Transactions & Interactions: Web, Mobile, CRM, ERP, SCM, …
Business Intelligence & Analytics: Dashboards, Reports, Visualization, …
Callouts (numbered 1 to 6 on the slide):
• Classic ETL processing
• Store, aggregate, and transform multi-structured data to unlock value
• Share refined data & runtime models
• Data Discovery & Investigative Analytics
• Interactive data exploration
• Retain runtime models and historical data for ongoing refinement & analysis
• Retain historical data to unlock additional value
37. 5 Reasons for Hadoop in the Cloud
1 If your data is stored in a cloud, local analysis
may make more sense… "work near the data"
2 For periodic processing (nightly, etc…)
it might make sense to just rent.
3 No upfront capital expense,
fund from success
4 Easier to expand a cluster;
no need to buy just find
5 Eliminate networking concerns
http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
38. THANK YOU
Jim Walker
jim@hortonworks.com
@jaymce
1 Get Hortonworks Data Platform
hortonworks.com/download
2 Use the getting started guide
hortonworks.com/get-started
3 Learn more… get support
hortonworks.com/training hortonworks.com/support