Rhat OSS - Cloudera - Mike Olson - Hadoop Data Analytics In The Cloud
1. Hadoop
Data Analytics in the Cloud
Mike Olson
Chief Executive Officer
Friday, July 17, 2009
2. Hadoop History
▪ Doug Cutting worked on Nutch (web-scale crawler-based
search), 2002-2004
▪ Google published MapReduce paper in 2004
▪ Cutting adds DFS & MapReduce support to Nutch
▪ Joined by Mike Cafarella
▪ 2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
▪ Web-scale deployments in 2007, 2008 at Y!, Facebook, others
▪ Today: 22 committers to core project
▪ Related projects: HBase, Hive, Pig, Mahout, Hama and others
3. Why Hadoop?
▪ Large web properties invented MapReduce for large-scale,
reliable, inexpensive analytics
▪ Enterprises generally need these techniques
▪ Retail, financial services, oil and gas, health care, green
technologies and more
▪ Hardware trends driving toward long-term retention of valuable
source data
▪ New analytical tools are required
▪ Hadoop complements current-generation data warehousing and
analytical products
9. Where Does Data Come From?
Many Sources Provide Deeper Insight
▪ Simulations and Scientific/Experimental Data
▪ genome sequencing, medical imaging, wireless sensors
▪ Existing Databases
▪ product catalogs, historical sales data, transaction histories
▪ User Data
▪ web logs, website clicks, pictures, videos, bulletin-board posts, etc.
▪ System Generated Data
▪ Thousands of systems reporting status every second
▪ Data Comes in All Shapes, Sizes, Schemas and Structures
▪ Hadoop combines many sources regardless of format and structure
13. Hadoop Technical Overview: HDFS
Storing Data: Distributed Over Many Machines
Commodity Servers
Files are broken into blocks and distributed across all
servers. Replication protects data from hardware failure.
HDFS: Hadoop Distributed File System
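
The block-and-replica idea above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the round-robin placement, and the sizes are all assumptions made for illustration (real HDFS placement is rack-aware and considerably more involved).

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of `file_size` bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, servers, replication=3):
    """Assign each block to `replication` distinct servers, round-robin.

    Hypothetical placement policy chosen for simplicity only.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_servers):
    """Blocks that still have at least one replica on a live server."""
    return {
        block
        for block, hosts in placement.items()
        if any(host not in failed_servers for host in hosts)
    }


blocks = split_into_blocks(file_size=200, block_size=64)
placement = place_replicas(len(blocks), ["s1", "s2", "s3", "s4"])
survivors = readable_blocks(placement, failed_servers={"s1", "s2"})
```

With three replicas spread over four servers, losing any two servers still leaves every block readable in this model, which is the sense in which replication protects against hardware failure.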
17. Hadoop Technical Overview: MapReduce
Processing Data: Leveraging Data Locality
Data elements processed locally, in parallel
Reliable computation implicitly managed by Hadoop
MapReduce
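
As a rough illustration of local, parallel processing, the sketch below runs a word count as map, shuffle, and reduce steps over in-memory "blocks", and assigns each map task to a server that already holds the block when one is idle. It is a single-process toy under stated assumptions: `assign_map_tasks`, the server names, and the fallback rule are hypothetical and do not reflect Hadoop's actual scheduler.

```python
from collections import defaultdict


def assign_map_tasks(block_hosts, idle_servers):
    """Prefer an idle server that stores the block; otherwise use any idle server."""
    assignment = {}
    for block_id, hosts in block_hosts.items():
        local = [server for server in hosts if server in idle_servers]
        assignment[block_id] = local[0] if local else sorted(idle_servers)[0]
    return assignment


def map_phase(block_text):
    """Emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block_text.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key across every map output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(blocks):
    """Run map over each block independently, then shuffle and reduce."""
    return reduce_phase(shuffle(map_phase(block) for block in blocks))


counts = word_count(["a b a", "b c"])
tasks = assign_map_tasks({0: ["s1", "s2"], 1: ["s3"]}, idle_servers={"s2", "s4"})
```

Each call to `map_phase` touches only its own block, so in a real cluster the map tasks could run at the same time on the machines that store those blocks; only the grouped intermediate pairs need to move over the network.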
20. Hadoop Technical Overview: Reliability
Fault Tolerance: Handled with Software
Data loss prevented through automatic replication and rebalancing
Computation is restarted automatically without user intervention
Software Fault Tolerance
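
The two behaviours named above, re-replicating data after a server is lost and restarting failed computation elsewhere, might be modelled as follows. This is a minimal sketch under assumptions: `re_replicate`, `run_with_retries`, the retry limit, and the server names are hypothetical and are not Hadoop APIs.

```python
def re_replicate(placement, failed, live_servers, replication=3):
    """Drop the failed server from each block and copy to other live servers
    until the block is back at the target replica count."""
    repaired = {}
    for block, hosts in placement.items():
        survivors = [host for host in hosts if host != failed]
        candidates = [
            server
            for server in live_servers
            if server != failed and server not in survivors
        ]
        needed = max(0, replication - len(survivors))
        repaired[block] = survivors + candidates[:needed]
    return repaired


def run_with_retries(task, servers, max_attempts=4):
    """Run `task` on successive servers until one attempt succeeds.

    Returns (result, attempt_number). The limit of 4 is illustrative.
    """
    errors = []
    for attempt, server in enumerate(servers[:max_attempts], start=1):
        try:
            return task(server), attempt
        except RuntimeError as exc:
            errors.append((server, str(exc)))
    raise RuntimeError(f"task failed on every attempt: {errors}")


def flaky_task(server):
    """Hypothetical task that fails on server 's1' and succeeds elsewhere."""
    if server == "s1":
        raise RuntimeError("disk failure")
    return f"ok on {server}"


placement = {0: ["s1", "s2", "s3"], 1: ["s2", "s3", "s4"]}
repaired = re_replicate(placement, failed="s2", live_servers=["s1", "s3", "s4", "s5"])
result, attempts = run_with_retries(flaky_task, ["s1", "s2", "s3"])
```

In both functions the recovery happens inside the framework code, which mirrors the slide's point that the user does not intervene when a machine fails.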
21. Cloud Deployment Options for Hadoop
▪ In your data center
• Acquire, provision, administer servers
• Choose a virtualization infrastructure?
▪ On dedicated, hosted services
• Scale up or down by coordinating with your MSP
▪ On dynamic web services (AWS and others)
• Spin up, use, shut down a cluster
▪ Issues:
• Data persistence and location, organizational control
22. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.