Join Cloudera’s founder and Chief Scientist, Jeff Hammerbacher, as he describes ten common problems that are being solved with Apache Hadoop.
A replay of the webinar can be viewed here:
https://www1.gotomeeting.com/register/719074008
2. Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions
Copyright 2010 Cloudera Inc. All rights reserved 2
3. Today’s speaker - Jeff Hammerbacher
• hammer@cloudera.com
• Studied Mathematics at Harvard
• Worked as a Quant on Wall Street
• Conceived, built, and led Data team at Facebook
• Nearly 30 amazing engineers and data scientists
• Several open source projects and research papers
• Founder of Cloudera
• Chief Scientist
• Also, check out the book “Beautiful Data”
4. What is Hadoop?
• A scalable fault-tolerant distributed system for data storage
and processing (open source under the Apache license)
• Scalable data processing engine
• Hadoop Distributed File System (HDFS): self-healing high-bandwidth
clustered storage
• MapReduce: fault-tolerant distributed processing
• Key value
• Flexible -> store data without a schema and add it later as needed
• Affordable -> cost / TB at a fraction of traditional options
• Broadly adopted -> a large and active ecosystem
• Proven at scale -> dozens of petabyte-plus implementations in
production today
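To make the MapReduce bullet concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration, and on a real cluster each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across machines without changing the result.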
5. Cloudera’s Distribution for Hadoop, version 3
The industry’s leading Hadoop distribution
[Diagram of distribution components: Hue, Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, Zookeeper]
• Open source – 100% Apache licensed
• Simplified – Component versions & dependencies managed for you
• Integrated – All components & functions interoperate through standard APIs
• Reliable – Patched with fixes from future releases to improve stability
• Supported – Employs project founders and committers for >70% of components
6. How does Cloudera know which problems are Hadoop-able?
• Talking to 1000s of users
• Supporting 100s of implementations
• Experience putting Hadoop into production with
customers across a range of industries
7. Summary – 10 Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”
8. What is common across Hadoop-able problems?
Nature of the data
• Complex data
• Multiple data sources
• Lots of it
Nature of the analysis
• Batch processing
• Parallel execution
• Spread data over a cluster of servers
and take the computation to the data
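The idea of taking the computation to the data can be sketched as follows: each partition is summarized where it lives, and only the small summaries are combined. This is a single-process illustration in plain Python, not Hadoop code, and the names local_summary and global_mean are hypothetical.

```python
def local_summary(partition):
    """Runs next to the data: reduce one partition to a (sum, count) pair."""
    return sum(partition), len(partition)


def global_mean(partitions):
    """Combine the small per-partition summaries into one overall mean."""
    # On a real cluster each local_summary call would run on the node
    # that stores the partition; here they simply run one after another.
    summaries = [local_summary(p) for p in partitions]
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count


if __name__ == "__main__":
    print(global_mean([[1, 2, 3], [4, 5], [6]]))
```

Only two numbers per partition cross the network in this scheme, regardless of how large each partition is.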
9. What Analysis is Possible With Hadoop?
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
10. Benefits of Analyzing With Hadoop
• Previously impossible/impractical to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
11. Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions
12. 1. Modeling True Risk
13. 1. Modeling True Risk
Solution with Hadoop
• Source, parse and aggregate disparate data
sources to build comprehensive data picture
• e.g. credit card records, call recordings, chat
sessions, emails, banking activity
• Structure and analyze
• Sentiment analysis, graph creation, pattern
recognition
Typical Industry
• Financial Services (Banks, Insurance)
14. 2. Customer Churn Analysis
15. 2. Customer Churn Analysis
Solution with Hadoop
• Rapidly test and build behavioral model of customer
from disparate data sources
• Structure and analyze with Hadoop
• Traversing
• Graph creation
• Pattern recognition
Typical Industry
• Telecommunications, Financial Services
17. 3. Recommendation Engine
Solution with Hadoop
• Batch processing framework
• Allow execution in parallel over large datasets
• Collaborative filtering
• Collecting ‘taste’ information from many users
• Utilizing information to predict what similar
users like
Typical Industry
• Ecommerce, Manufacturing, Retail
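The collaborative filtering bullets above can be sketched in a few lines of Python. This is a deliberately small, in-memory illustration of one simple variant (weighting other users' items by the number of tastes they share with the target user). The function name recommend, the sample users, and the scoring rule are assumptions made for the example, not the method used by any particular customer.

```python
from collections import Counter


def recommend(target, tastes, top_n=3):
    """Suggest items liked by users who share tastes with `target`.

    `tastes` maps a user name to the set of items that user likes.
    Items from other users are weighted by their overlap with the target.
    """
    liked = tastes[target]
    scores = Counter()
    for user, items in tastes.items():
        if user == target:
            continue
        overlap = len(liked & items)  # similarity = number of shared items
        if overlap == 0:
            continue
        for item in items - liked:    # only suggest items the target lacks
            scores[item] += overlap
    # Sort by score (descending), then by name, so output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    sample = {
        "ann": {"book", "lamp"},
        "bob": {"book", "lamp", "desk"},
        "cat": {"book", "chair"},
        "dan": {"sofa"},
    }
    print(recommend("ann", sample))
```

At scale, the per-user overlap counts are the part that would be computed as a parallel batch job over the full set of taste records.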
18. 4. Ad Targeting
19. 4. Ad Targeting
Solution with Hadoop
• Data analysis can be conducted in parallel, reducing
processing times from days to hours
• With Hadoop, as data volumes grow the only
expansion cost is hardware
• Add more nodes without a degradation in
performance
Typical Industry
• Advertising
20. 5. Point of Sale Transaction Analysis
21. 5. Point of Sale Transaction Analysis
Solution with Hadoop
• Batch processing framework
• Allow execution in parallel over large datasets
• Pattern recognition
• Optimizing over multiple data sources
• Utilizing information to predict demand
Typical Industry
• Retail
22. 6. Analyzing Network Data to Predict Failure
23. 6. Analyzing Network Data to Predict Failure
Solution with Hadoop
• Take the computation to the data
• Expand the range of indexing techniques from simple
scans to more complex data mining
• Better understand how the network reacts to
fluctuations
• See how anomalies previously thought to be discrete may, in
fact, be interconnected
• Identify leading indicators of component failure
Typical Industry
• Utilities, Telecommunications,
Data Centers
27. 8. Trade Surveillance
Solution with Hadoop
• Batch processing framework
• Allow execution in parallel over large datasets
• Pattern recognition
• Detect trading anomalies and harmful behavior
Typical Industry
• Financial services
• Regulatory bodies
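As a rough illustration of batch anomaly detection over trade records, the sketch below groups trades by trader and flags any trade far larger than that trader's typical size. The median-multiple rule, the function name flag_large_trades, and the sample data are assumptions chosen for simplicity. Real surveillance systems use far richer patterns, and the slides do not specify one.

```python
from collections import defaultdict
from statistics import median


def flag_large_trades(trades, multiple=5.0):
    """Return trades whose size exceeds `multiple` times the trader's median.

    `trades` is an iterable of (trader_id, size) pairs.
    """
    by_trader = defaultdict(list)
    for trader, size in trades:              # group by key, like a shuffle
        by_trader[trader].append(size)

    flagged = []
    for trader, sizes in by_trader.items():  # one independent pass per trader
        typical = median(sizes)
        flagged.extend((trader, s) for s in sizes if s > multiple * typical)
    return flagged


if __name__ == "__main__":
    sample = [
        ("t1", 100), ("t1", 110), ("t1", 90),
        ("t1", 105), ("t1", 95), ("t1", 1000),
        ("t2", 500), ("t2", 520),
    ]
    print(flag_large_trades(sample))
```

Because each trader's history is examined independently, the per-trader step parallelizes naturally over large datasets.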
28. 9. Search Quality
29. 9. Search Quality
Solution with Hadoop
• Analyzing search attempts in conjunction with
structured data
• Pattern recognition
• Browsing pattern of users performing searches in
different categories
Typical Industry
• Web
• Ecommerce
31. 10. Data “Sandbox”
Solution with Hadoop
• With Hadoop, an organization can “dump” all this
data into an HDFS cluster
• Then use Hadoop to start trying out different
analyses on the data
• See patterns or relationships that allow the
organization to derive additional value from data
Typical Industry
• Common across all industries
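The sandbox workflow relies on storing raw data first and deciding on a structure later. A minimal sketch of that idea in plain Python, assuming the raw data is newline-delimited JSON, is shown below. The sample records, the field names, and the helper read_with_schema are hypothetical.

```python
import json

# Raw records are stored as-is; nothing is validated at write time.
RAW_EVENTS = [
    '{"user": "ann", "action": "search", "query": "lamp"}',
    '{"user": "bob", "action": "purchase", "amount": 40}',
    'not valid json',
]


def read_with_schema(raw_lines, fields):
    """Parse raw lines at read time, keeping only the requested fields.

    Lines that are not JSON objects are skipped; missing fields become None.
    """
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        yield {name: record.get(name) for name in fields}


if __name__ == "__main__":
    # Two different "schemas" applied to the same stored data.
    print(list(read_with_schema(RAW_EVENTS, ["user", "action"])))
    print(list(read_with_schema(RAW_EVENTS, ["user", "amount"])))
```

Each new analysis can pass a different field list without rewriting or reloading the stored data.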
32. Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions
33. Summary – 10 Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”
34. Who is Cloudera?
• Enterprise software & services company providing the industry’s
leading Hadoop-based data management platform
• Founding team came from large Web companies
• Products: Cloudera Enterprise & Cloudera’s Distribution for Hadoop
• All necessary packages, matched, tested and supported
• Tools to support production use of Hadoop
• The leading distribution for the enterprise
• Contributors and committers
• Fixing, patching and adding features
35. Hear More Examples @ Hadoop World 2010
http://www.cloudera.com/company/press-center/hadoop-world-nyc/
• 2nd annual event focused on practical
applications of Hadoop
• Date: October 12th 2010
• Location: Hilton New York
• Keynote from Tim O’Reilly, founder of O’Reilly Media
• Pre and post conference training
available for Hadoop and related projects
• 36 business and technical focused sessions
36. Questions?