Large Scale Data with Hadoop Galen Riley and Josh Patterson Presented at DevChatt 2010
Agenda Thinking at Scale Hadoop Architecture Distributed File System MapReduce Programming Model Examples
Data is Big The Data Deluge (2/25/2010) “Eighteen months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold.” http://www.economist.com/opinion/displaystory.cfm?story_id=15579717
Data is Big Sensor data collection 128 sensors 37 GB/day 10 bytes/sample, 30 per second Increasing 10x by 2012 http://jpatterson.floe.tv/index.php/2009/10/29/the-smartgrid-goes-open-source
Disks are Slow Disk Seek, Data Transfer Reading Files Disk seek for every access Buffered reads, locality: still seeking every disk page
Disks are Slow 10ms seek, 10MB/s transfer 1TB file, 100-byte records, 10KB pages: 10B entries, 1B pages, 1GB of updates Seek for each update, 1000 days Seek for each page, 100 days Transfer entire TB, 1 day
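A rough back-of-envelope sketch of how these orders of magnitude fall out of the stated drive parameters, assuming "seek for each update" means one seek per 100-byte record:

// Back-of-envelope check of the "Disks are Slow" numbers (a sketch, with assumptions noted).
public class DiskMath {
    public static void main(String[] args) {
        double seekSec = 0.010;            // 10 ms per seek
        double xferBytesPerSec = 10e6;     // 10 MB/s transfer
        double fileBytes = 1e12;           // 1 TB file
        double records = 1e10;             // 10B entries of 100 bytes
        double pages = 1e9;                // 1B pages of 10 KB
        double day = 86400.0;

        // One seek per record: ~1,160 days (the "1000 days" order of magnitude)
        System.out.printf("seek per record: %.0f days%n", records * seekSec / day);
        // One seek per page: ~116 days (the "100 days" order of magnitude)
        System.out.printf("seek per page:   %.0f days%n", pages * seekSec / day);
        // Stream the entire terabyte at transfer speed: ~1.2 days
        System.out.printf("full transfer:   %.1f days%n", fileBytes / xferBytesPerSec / day);
    }
}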
Disks are Slow IDE drive – 75 MB/sec, 10ms seek SATA drive – 300MB/s, 8.5ms seek SSD – 800MB/s, 2 ms “seek” (1TB = $4k!) 
// Sidetrack Observation: transfer speed improves at a greater rate than seek speed Improvement by treating disks like tapes Seek as little as possible in favor of sequential reads: operate at transfer speed http://weblogs.java.net/blog/2008/03/18/disks-have-become-tapes
An Idea: Parallelism 1 drive – 75 MB/sec 16 days for 100TB 1000 drives – 75 GB/sec 22 minutes for 100TB
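The same arithmetic, sketched for the scan times above (taking 100TB as 10^14 bytes):

// Scan-time scaling with drive count (a sketch of the arithmetic behind the slide).
public class ScanTime {
    public static void main(String[] args) {
        double totalBytes = 100e12;             // 100 TB
        double perDriveBytesPerSec = 75e6;      // 75 MB/s per drive
        System.out.printf("1 drive:     %.1f days%n",
                totalBytes / perDriveBytesPerSec / 86400);          // ~15.4 days
        System.out.printf("1000 drives: %.1f minutes%n",
                totalBytes / (1000 * perDriveBytesPerSec) / 60);    // ~22 minutes
    }
}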
A Problem: Parallelism is Hard Issues Synchronization Deadlock Limited bandwidth Timing issues Apples v. Oranges, but… MPI Data distribution, communication between nodes done manually by the programmer Considerable effort achieving parallelism compared to actual processing
A Problem: Reliability Computers are complicated Hard drive Power supply Overheating
A Problem: Reliability 1 Machine 3 years mean time between failures 1000 Machines 1 day mean time between failures
Requirements Backup; Reliable: partial failure, graceful decline rather than full halt; Data recoverability: if a node fails, another picks up its workload; Node recoverability: a fixed node can rejoin the group without a full group restart; Scalability: adding resources adds load capacity; Easy to use
Hadoop: Robust, Cheap, Reliable Apache project, open source Designed for commodity hardware Can lose whole nodes and not lose data Includes MapReduce programming model
Why Commodity Hardware? Single large computer systems are expensive and proprietary High initial costs, plus lock-in with vendor Existing methods do not work at petabyte-scale Solution: Scale “out” instead of “up”
Hadoop Distributed File System Throughput Good, Latency Bad Data Coherency Write-once, read-many access model  Files are broken up into blocks Typically 64MB or 128MB block size Each replicated on multiple DataNodes on write Intelligent Client Client can find location of blocks Client accesses data directly from DataNode
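A minimal sketch of that intelligent-client behavior using the org.apache.hadoop.fs API: the client asks the NameNode for block locations, then streams block data directly from the DataNodes. The path and configuration values here are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up the default filesystem (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The NameNode returns block locations; block data is then read
        // directly from the DataNodes that hold the replicas.
        Path path = new Path("/example/input.txt");   // placeholder path
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}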
Source: http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile&do=get&target=hdfs_dhruba.pdf
HDFS: Performance Robust in the face of multiple machine failures through aggressive replication of data blocks High Performance Checksum of 100 TB in 10 minutes, ~166 GB/sec Built to house petabytes of data
MapReduce Simple programming model that abstracts parallel programming complications away from data processing logic Made popular at Google, drives their processing systems, used on 1000s of computers in various clusters Hadoop provides an open source version of MR
MapReduce Data Flow
Using MapReduce MapReduce is a programming model for efficient distributed computing It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output Input | Map | Shuffle & Sort | Reduce | Output Efficiency comes from streaming through data (reducing seeks) and pipelining A good fit for a lot of applications: log processing, web index building
Hadoop In The Field Yahoo Facebook Twitter Commercial support available from Cloudera
Hadoop In Your Backyard openPDC project at TVA, http://openpdc.codeplex.com Cluster is currently: 20 nodes, 200TB of physical drive space Used for: cheap, redundant storage and time series data mining
Examples – Word Count Hello, World! Map Input: foo foo bar Output all words in a dataset as: { key, value } {“foo”, 1}, {“foo”, 1}, {“bar”, 1} Reduce Input: {“foo”, (1, 1)}, {“bar”, (1)} Output: {“foo”, 2}, {“bar”, 1}
Word Count: Mapper

// Needs: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Tokenize the line and emit {word, 1} for every token.
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count: Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
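A driver that wires these two classes together might look roughly like the following, using the old org.apache.hadoop.mapred API the examples are written in; the input and output paths are command-line arguments, and MapClass and Reduce are assumed to be nested inside the WordCount class.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    // MapClass and Reduce from the previous two slides are assumed to be
    // nested classes of WordCount.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);   // optional local aggregation on the map side
        conf.setReducerClass(Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output directory (must not exist)

        JobClient.runJob(conf);
    }
}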
Examples – Stock Analysis Input dataset:
Symbol,Date,Open,High,Low,Close
GOOG,2010-03-19,555.23,568.00,557.28,560.00
YHOO,2010-03-19,16.62,16.81,16.34,16.44
GOOG,2010-03-18,564.72,568.44,562.96,566.40
YHOO,2010-03-18,16.46,16.57,16.32,16.56
Interested in biggest delta for each stock
Examples – Stock Analysis Map Output 	{“GOOG”, 10.72}, 	{“YHOO”, 0.47}, 	{“GOOG”, 5.48}, 	{“YHOO”, 0.25} Reduce Input: {“GOOG”, (10.72, 5.48)},{“YHOO”, (0.47, 0.25)} Output:{“GOOG”, 10.72},{“YHOO”, 0.47}
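One plausible mapper and reducer for this example, sketched in the same old mapred API, assuming the delta is High minus Low and the input CSV is well formed with no header row (imports mirror the word count example, plus org.apache.hadoop.io.DoubleWritable):

public static class DeltaMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, DoubleWritable> output,
                  Reporter reporter) throws IOException {
    // Symbol,Date,Open,High,Low,Close
    String[] fields = value.toString().split(",");
    String symbol = fields[0];
    double high = Double.parseDouble(fields[3]);
    double low = Double.parseDouble(fields[4]);
    output.collect(new Text(symbol), new DoubleWritable(high - low));
  }
}

public static class MaxDeltaReducer extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {

  public void reduce(Text key, Iterator<DoubleWritable> values,
                     OutputCollector<Text, DoubleWritable> output,
                     Reporter reporter) throws IOException {
    // Keep the largest daily delta seen for this symbol.
    double max = Double.NEGATIVE_INFINITY;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new DoubleWritable(max));
  }
}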
Examples – Time Series Analysis Map: {pointId, Timestamp + 30s of data} Reduce: Data mining! Classify samples based on training dataset Output samples that fall into interesting categories, index in database
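The time series job is only outlined on the slide; one plausible shape for it, with the record layout and the classifier left as hypothetical placeholders, might be:

public static class WindowMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    // Hypothetical record layout: pointId, windowTimestamp, 30s of samples...
    String[] fields = value.toString().split(",", 3);
    String windowKey = fields[0] + ":" + fields[1];   // pointId plus window start
    output.collect(new Text(windowKey), new Text(fields[2]));
  }
}

public static class ClassifyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    while (values.hasNext()) {
      String window = values.next().toString();
      String label = classify(window);          // hypothetical classifier trained offline
      if (!"normal".equals(label)) {
        output.collect(key, new Text(label));   // only emit the "interesting" windows
      }
    }
  }

  private String classify(String window) {
    return "normal";   // placeholder; a real job would apply the trained model here
  }
}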
Other Stuff Compatibility with Amazon Elastic Compute Cloud (EC2) Hadoop Streaming: MapReduce with anything that uses stdin/stdout HBase, distributed column-store database Pig, data analysis (transforms, filters, etc.) Hive, data warehousing infrastructure Mahout, machine learning algorithms
Parting Thoughts “We don't have better algorithms than anyone else. We just have more data.” Peter Norvig Artificial Intelligence: A Modern Approach Chief scientist at Google
Contact Galen Riley http://galenriley.com @TotallyGreat Josh Patterson http://jpatterson.floe.tv @jpatanooga


Editor's notes

  1. So everyone knows what data processing is, but what do we mean by “scale” ?
  2. Simply- Data. Is. Big.… So this is the trend. The amount of data we can collect is increasing exponentially, and most companies aren’t capable of handling it. Patterson likes to call this “the data tsunami.” Let’s talk about a real example of this…
  3. Okay, so data is big. No big deal, I’ve got a processor with four cores that will chew through anything. However, the speed of my application is constrained by the speed at which I can get data. It is not going to fit in memory, so it’s going to be living on a hard disk. This brings us to problem number 2.
  4. Hard drive speed comes from two numbers. Disk seek time: the time to move the read head on a drive to where the data is stored. Data transfer: the speed that I can get information off the disk. Hard drives are wonderful because they are random access devices, and I can get data anywhere off the disk any time I want it just by seeking and reading. Seeking takes a while, though. Fortunately, I can take advantage of locality and read a page of data into a buffer. Let's look at an example…
  5. This example has nice round numbers. I have a fictional hard drive that has a 10ms seek time and a 10 MB/second transfer speed. On it, there's 1TB of data, made up of 100-byte records in 10K pages. That's 10 billion entries over a billion pages, and I want to update 10% of this data set – a gig. … So seeks are slow, but I can transfer the whole file in a single day. Again, my application is always bound by the slowest piece in the pipe, so I get the most benefit by speeding that part up – the disk.
  6. Here are some real drives. I grabbed the specs that are advertised on Newegg. … Solid state drives are expensive though, $4k per terabyte. I can't afford to buy a new one every month for my sensor collection.
  7. … With this observation in mind, let's consider treating a hard disk (a random access device) like tape (a sequential device). … So we get closer to 1 day instead of a thousand.
  8. I bet a lot of you know where this is going already – parallelism! So let's say I've got one of those IDE drives from earlier. I can sequentially process 100TB of data in 16 days. Alright, let's get a thousand of them and run them in parallel – 75 gigs per second, and I'm done in 22 minutes. Alright, parallel processing solves our problem, let's call it a day!
  9. There’s an issue, though. Parallelism is really really hard.…And that’s not the only problem, either.
  10. Even if you have an OS that you know won’t crash, and code that won’t kill it either, reliability is still an issue because hardware fails.
  11. So let’s buy some expensive fault-tolerant hardware…
  12. A system that is robust in the face of machine failure. A platform that allows multiple groups to collaborate. A solution that scales linearly with respect to cost. A vision that will not lock us into a single vendor over time.
  13. Alright, now we're going to do some examples. I find it is most useful to look at what happens to the data instead of what the code looks like. Let's review MR, but think about the data. So I have a big file that I want to process. It is split up in blocks and spread over my cluster. I start a job and Hadoop initiates a bunch of map tasks; this processing occurs where the data exists already. The mapper reads in a part of the file, and emits several key/value pairs. These are collected, sorted into buckets based on key, and each bucket goes to a reduce task. Each reducer processes a bucket and outputs the result. Of course, I can chain these steps together. Word count is the ‘Hello World’ of MapReduce. I'm interested in the frequency of words in a dataset. … Word count isn't entirely silly, by the way. Consider the suggestions that pop up when you start a Google search. What you see is a list of search strings that people use frequently. Think of it as phrase count instead of word count.
  14. I've got some code here, but I'm going to skip going over it in detail. The slides will be available if you want to pore over it. We talked about MR being accessible for a programmer when compared to an MPI approach, and this is the entire map class for word count.
  15. Here's another example to illustrate that my map process can do more than just read data in and push it back out. Here's a file with information about stock prices – the ticker symbol, a date, the open price, the high and low prices for the day, and what it closed at. Since we're talking about big data sets here, I want you to imagine that it's got every stock for the last 50 years and there's not enough room on my slide to include it all. I'm interested in volatility or something, so I want the biggest change in price for a particular stock. Let's look at the data.
  16. My mapper reads in a record, filters out the information I’m not interested in (date and open/close prices), and emits the delta for each day.
  17. I think that collecting data without doing anything interesting with it is a big sin. So, here's a business case for someone in the room, perhaps. Say you want to grep through some server logs that you've been collecting forever but never got around to doing anything with. Amazon EC2 supports Hadoop, so you can run your job without having to buy any hardware at all. And a list of stuff that is built on top of Hadoop. … You don't have to write your jobs in Java. I know that I love Python, and I bet you do too. … We'll be contributing some of our time series stuff to the Mahout project.
  18. So let’s conclude with a quote from Peter Norvig that I think justifies our entire presentation.