Hadoop Trends
- 1. Trends and usage of Apache Hadoop
Eric Baldeschwieler
CEO Hortonworks
Twitter: @jeric14, @hortonworks
January 2012
© Hortonworks Inc. 2011 Page 1
- 2. Agenda
• Define terms
– What is Hadoop? Why does Hadoop matter?
• What drives Hadoop adoption?
• Observed Trends
Architecting the Future of Big Data
- 3. Hortonworks Vision
We believe that by 2015, more than
half the world's data will be
processed by Apache Hadoop
How to achieve that vision?
Enable ecosystem around
enterprise-viable platform.
- 4. What is Apache Hadoop?
• Solution for big data
– Deals with complexities of high
volume, velocity & variety of data
• Set of open source projects
• Transforms commodity hardware
into a service that:
– Stores petabytes of data reliably
– Allows huge distributed computations
• Key attributes:
– Redundant and reliable (no data loss)
– Extremely powerful
– Batch processing centric
– Easy to program distributed apps
– Runs on commodity hardware
One of the best examples of open source driving innovation and creating a market
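To make the "easy to program distributed apps" attribute concrete, here is a minimal single-process sketch of the MapReduce programming model in Python. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step that Hadoop performs across many machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict

def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(lines, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical two-line input; a real job would read files from HDFS.
word_counts = run_mapreduce(
    ["big data on Hadoop", "Hadoop stores big data"], map_fn, reduce_fn
)
```

The application author writes only the two small functions; in a real cluster the framework handles distribution, grouping, and retries.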
- 5. Hortonworks Data Platform (HDP)
Key Components of “Standard Hadoop” Open Source Stack
Core Apache Hadoop:
• HDFS (Hadoop Distributed File System)
• MapReduce (Distributed Programming Framework)
Related Hadoop Projects:
• Pig (Data Flow)
• Hive (SQL)
• HBase (Columnar NoSQL Store)
• ZooKeeper (Coordination)
• HCatalog (Table & Schema Management)
Open APIs for:
• Data Integration
• Data Movement
• App Job Management
• System Management
- 6. Big Data Trailblazers and Use Cases
• data analytics
• analyzing web logs
• advertising optimization
• machine learning
• mail anti-spam
• text mining
• web search
• content optimization
• customer trend analysis
• ad selection
• video & audio processing
• data mining
• user interest prediction
• social media
- 7. Yahoo!, Apache Hadoop & Hortonworks
http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop
• 2006: Yahoo! embraced Apache Hadoop, an open source platform, to crunch epic amounts of data using an army of dirt-cheap servers
• Hadoop at Yahoo!
– 40K+ Servers
– 170PB Storage
– 5M+ Monthly Jobs
– 1000+ Active Users
• 2011: Yahoo! spun off 22+ engineers into Hortonworks, a company focused on advancing open source Apache Hadoop for the broader market
- 8. What drives Hadoop adoption?
- 9. Market Drivers for Apache Hadoop
• Business drivers
– High-value projects that require use of more data
– Belief that there is great ROI in mastering big data over next 5 years
• Financial drivers
– Growing cost of data systems as percentage of IT spend
– Cost advantage of commodity hardware + open source
– Enables departmental-level big data strategies
• Technical drivers
– Existing solutions failing under growing requirements
– 3Vs - Volume, velocity, variety
– Proliferation of unstructured data
Gartner predicts 800% data growth
80-90% of data produced today is unstructured
- 10. Every Market has Big Data
Digital data is personal, everywhere, increasingly
accessible, and will continue to grow exponentially
Source: McKinsey & Company report. Big data: The next frontier for innovation, competition, and productivity. May 2011.
- 11. Broader Use Case Opportunities
Financial Services
• Detect/prevent fraud
• Model and manage risk
• Personalize banking/insurance products
• Compliance, Archival, …
Healthcare
• Patient monitoring
• Predictive modeling
• Compliance, Archival, text search
• Data driven research
Retail
• Behavior analysis
• Cross selling, recommendation engines
• Optimize pricing, placement, design
• Optimize inventory and distribution
Web / Social / Mobile
• Sentiment analysis
• Web log, image, and video analysis
• Personalization
• Billing, Reporting, Network Analysis
Manufacturing
• Simulation, Analysis, Design
• Improve service via product sensor data
• “Digital factory” for lean manufacturing
Government
• Detect/prevent fraud
• Security & Intelligence
• Support open data initiatives
- 12. Observed Trends
- 13. Trend: Agile Data
• The old way
– Operational systems keep only current records, short history
– Analytics systems keep only conformed / cleaned / digested data
– Unstructured data locked away in operational silos
– Archives offline
– Inflexible, new questions require system redesigns
• The new trend
– Keep raw data in Hadoop for a long time
– Able to produce a new analytics view on-demand
– Keep a new copy of data that was previously only in silos
– Can directly do new reports, experiments at low incremental cost
– New products / services can be added very quickly
– Agile outcome justifies new infrastructure
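The "keep raw data, produce a new analytics view on demand" idea can be sketched in a few lines of Python. This is an illustration only: the log format, the sample records, and the names `RAW_LOG`, `parse`, and `build_view` are assumptions, and a real deployment would read raw files from HDFS rather than an in-memory list.

```python
from collections import Counter

# Hypothetical raw events, kept unmodified: "timestamp<TAB>user<TAB>url".
RAW_LOG = [
    "2012-01-05T10:00:00\talice\t/home",
    "2012-01-05T10:01:00\tbob\t/news",
    "2012-01-06T09:30:00\talice\t/news",
]

def parse(line):
    """Interpret one raw line at read time; the stored data is never rewritten."""
    timestamp, user, url = line.split("\t")
    return {"day": timestamp[:10], "user": user, "url": url}

def build_view(raw_lines, key):
    """Derive a new aggregate view from raw data by counting events per `key`."""
    return Counter(parse(line)[key] for line in raw_lines)

views_per_url = build_view(RAW_LOG, "url")   # the original question
events_per_day = build_view(RAW_LOG, "day")  # a new question, no redesign needed
```

Because nothing was discarded when the data was stored, the second view costs one more pass over the raw log instead of a change to the storage schema.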
- 14. Traditional Enterprise Data Architecture
Data Silos
[Diagram of three separate groups of systems]
• Serving Applications: Web Serving, NoSQL, RDBMS, …
• Traditional Data Warehouses, BI & Analytics: Traditional ETL & Message buses, EDW, Data Marts, BI / Analytics
• Unstructured Systems: Serving Logs, Social Media, Sensor Data, Text Systems, …
- 15. Agile Data Architecture w/Hadoop
Connecting All of Your Big Data
[Diagram of the same three groups of systems, now connected]
• Serving Applications: Web Serving, NoSQL, RDBMS, …
• Traditional Data Warehouses, BI & Analytics: Traditional ETL & Message buses, EDW, Data Marts, BI / Analytics
• Unstructured Systems: Serving Logs, Social Media, Sensor Data, Text Systems, …
• Added with Hadoop: EsTsL (s = Store), Custom Analytics
- 16. Trend: Data driven development
• Limited runtime logic driven by huge lookup tables
• Data computed offline on Hadoop
– Machine learning, other expensive computation offline
– Personalization, classification, fraud, value analysis…
• Application development requires data science
– Huge amounts of actually observed data key to modern services
– Hadoop used as the science platform
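The split between expensive offline computation and limited runtime logic can be sketched as follows. This is a toy illustration in Python under stated assumptions: the click data, the "most-clicked category" rule, and the names `build_lookup_table` and `pick_homepage_module` are hypothetical, and the offline step stands in for a batch job that would run on Hadoop.

```python
from collections import Counter, defaultdict

def build_lookup_table(click_log):
    """Offline batch step: map each user to their most-clicked category."""
    counts = defaultdict(Counter)
    for user, category in click_log:
        counts[user][category] += 1
    return {user: c.most_common(1)[0][0] for user, c in counts.items()}

def pick_homepage_module(user, table, default="top_stories"):
    """Runtime step: a single dictionary lookup, no model evaluation."""
    return table.get(user, default)

# Hypothetical observed behavior: (user, clicked category).
clicks = [
    ("alice", "sports"),
    ("alice", "sports"),
    ("alice", "finance"),
    ("bob", "finance"),
]
table = build_lookup_table(clicks)
choice_for_alice = pick_homepage_module("alice", table)
choice_for_new_user = pick_homepage_module("carol", table)
```

The serving path stays trivial and fast; improving the product means recomputing the table from more observed data, not changing the runtime code.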
- 17. CASE STUDY: YAHOO! HOMEPAGE
• Serving Maps: Users - Interests
• Five Minute Production
• Weekly Categorization models
[Diagram of three stages]
• SCIENCE HADOOP CLUSTER » Machine learning to build ever better categorization models (labels: USER BEHAVIOR, CATEGORIZATION MODELS, weekly)
• PRODUCTION HADOOP CLUSTER » Identify user interests using Categorization models (labels: USER BEHAVIOR, SERVING MAPS, every 5 minutes)
• SERVING SYSTEMS » Build customized home pages with latest data, thousands / second (label: ENGAGED USERS)
Copyright Yahoo 2011
- 18. CASE STUDY: YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
• Recommended links: +79% clicks vs. randomly selected
• News Interests: +160% clicks vs. one size fits all
• Top Searches: +43% clicks vs. editor selected
- 19. Trend: Specialization of Data Systems
• Hadoop does not replace existing systems
– It adds new capabilities to the enterprise
– It can offload things that are not done efficiently in current systems
– Especially in scale out situations
• Specialization of traditional data components
– Use OLTP systems just for transactions
– Use OLAP systems for interactive analysis
• Hadoop has LOTS of bandwidth to storage and CPU
– Pull reporting out of OLTP systems
– Pull ELT out of OLAP systems
- 20. Hadoop and OLTP Systems
MPP Processing of Online Transactions
• Mission critical
• Manages transactions & serves reports
Hadoop used to Process Reports
• Free up 50+% processing power for transaction processing system
• Significant cost savings due to commodity nature of Hadoop
[Diagram labels: Web Sites, Transaction Processing Systems ($$$), Transaction Logs, Reports]
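As a rough illustration of moving reporting off the transaction system, the Python sketch below computes a report from a copy of the transaction log instead of querying the live database. The log format, the sample records, and the name `revenue_by_region` are assumptions made for this example; on a cluster the same aggregation would run as a batch job over log files.

```python
from collections import defaultdict

# Hypothetical transaction log lines exported from the OLTP system:
# "order_id,region,amount".
TRANSACTION_LOG = [
    "1001,EMEA,25.00",
    "1002,APAC,40.00",
    "1003,EMEA,10.50",
]

def revenue_by_region(log_lines):
    """Batch report computed from the log copy, not from the live database."""
    totals = defaultdict(float)
    for line in log_lines:
        _order_id, region, amount = line.split(",")
        totals[region] += float(amount)
    return dict(totals)

report = revenue_by_region(TRANSACTION_LOG)
```

The transaction system only has to write its log; the reporting load lands on separate commodity hardware.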
- 21. Hadoop and OLAP Systems
Hadoop (The Agile Data Zone)
• Fast loading, raw data staging, ELT & long-term archival
EDW
• Allow analysts to use tools they know
• Take advantage of huge ecosystem of BI and Analytics tooling
[Diagram labels: Web, Mobile, Social, Other logs, Hadoop, Online Archival, EDW]
- 22. Trend: Instrument Clouds of Things
• Clouds of things logging to Hadoop
– Websites
– Mobile phones, Enterprise devices…
• HDFS + Map-Reduce or HBase + Analysis
[Diagram: many "Things" logging to Hadoop]
- 23. Trend: Many POCs, Few Production Systems
• The problem
– Hadoop is still a young technology
– Hard to find knowledgeable staff
– Integration with existing systems
• Hadoop market is maturing at speed
– Emerging ecosystem of Hadoop platform solutions providers
– Apache Hadoop continues to get better
– Hadoop training and support available from several vendors
- 24. Growth in Hadoop Ecosystem
• Hardware vendors, Public Cloud (IAAS, PAAS)
– Storage, Appliances, Preloaded commodity boxes, cloud
• Data Systems
– All the major vendors announced Hadoop plans / products in 2011
• BI, Analytics and ETL
– Hadoop integrations emerging
• Dedicated Hadoop Applications
– Datameer, Karmasphere, Platfora, …
• Systems Integrators
– Regional and Global providers available
- 25. Hadoop Continues to Improve
Apache community, including Hortonworks, investing to improve Hadoop:
• Make Hadoop an Open, Extensible, and Enterprise Viable Platform
• Enable More Applications to Run on Apache Hadoop
Roadmap:
• “Hadoop.Now” (Hadoop 1.0)
– Most stable version ever
– HBase, security, WebHDFS
• “Hadoop.Next” (Hadoop 0.23)
– HA, Next-gen HDFS & MapReduce
– Extension & Integration APIs
• “Hadoop.Beyond”
– Platform actively evolving
- 26. Hortonworks – Approachable Hadoop
• Apache Hadoop Leadership
– Delivered every major release since 0.1
– Driving innovation across entire stack
– Experience managing world’s largest
deployment
– Access to Yahoo’s 1,000+ Hadoop users
and 40k+ nodes for testing, QA, etc.
• Business Focus
– Provide 100% open source product
– Hortonworks Data Platform
– Help customers and partners overcome Hadoop knowledge gaps
– Help organizations successfully develop and deploy solutions based on Hadoop
[Graphic labels: Expert Role-based Training; Full Lifecycle Support and Services; stages Evaluate, Pilot, Production]
- 27. Trend: Finding More Value Over Time
• Hadoop is usually brought in to solve a specific
problem
– Build search indexes for Yahoo
– Manage web site logs for Facebook
– Users using EC2 to do data processing at Amazon
– Simple reporting when existing tools don’t scale
• Once your data is in Hadoop more users find value
• Once you have Hadoop, folks add more data