Distributed Computing with
Apache Hadoop
Technology Overview
Konstantin V. Shvachko
14 July 2011
Contents
• Why life is interesting in Distributed Computing
• Computational shift: New Data Domain
• Data is more important than Algorithms
• Hadoop as a technology
• Ecosystem of Hadoop tools
2
New Data Domain
• Simple calculations can be performed by humans
• Devices are needed to process larger computations
• Large computations assume a large data domain
• Domain of numbers – the only one until recently
– Crunching numbers from ancient times
– Computers served the same purpose
– Strict rules
• Growth of the Internet provided a new vast domain
– Word data: human generated texts
– Digital data: photo, video, sound
– Fuzzy rules. Errors & deviations are part of the study
– Started to process texts
– Barely touching digital data
3
Words vs. Numbers
• In 1997 IBM built Deep Blue supercomputer
– Played chess against the champion G. Kasparov
– Human race was defeated
– Strict rules for Chess
– Fast, deep analysis of the current state
– Still numbers
4
• In 2011 IBM built Watson computer to
play Jeopardy
– Questions and hints in human terms
– Analysis of texts from libraries and the Internet
– Human champions defeated
The Library of Babel
• Jorge Luis Borges' "The Library of Babel"
– Vast storage universe
– Composed of all possible manuscripts
uniformly formatted as 410-page books.
– Most are meaningless sequences of symbols
– The rest forms a complete and indestructible knowledge system
– Stores any text written or to be written
– Provides solutions to all problems in the world
– Just find the right book.
• Hard-copy size is larger than the visible universe
– a data domain worth discovering
• What is the size of the electronic version?
• The Internet collection is a subset of The Library of Babel
5
New Type of Algorithms
• Scalability is more important than efficiency
– Classic and Distributed sorting
– In-place sorting updates common state
• More hardware vs. development time
– 20% improvements in efficiency are not important
– Can add more nodes instead
• Data is more important than algorithms
– Hard to collect data: historical data covers 6 months to 1 year
• Example: Natural language processing
– Effects of training data size on classification accuracy
– Accuracy increases linearly with the size of the training data
– Machine learning algorithms converge as the training data grows
6
Big Data
• Computations that need the power of many computers
– Large datasets: hundreds of TBs, PBs
– Or use of thousands of CPUs in parallel
– Or both
• Cluster as a computer
– Big Data management, storage and analytics
7
Big Data: Examples
• Search Webmap as of 2008 @ Y!
– Raw disk used 5 PB
– 1500 nodes
• High-energy physics LHC Collider:
– PBs of events
– 1 PB of data per sec, most filtered out
• The 2 quadrillionth (10^15) digit of π is 0
– Tsz-Wo (Nicholas) Sze
– 12 days of cluster time, 208 years of CPU time
– No data, pure CPU workload
8
Big Data: More Examples
• eHarmony
– Soul matching
• Banking
– Fraud detection
• Processing of astronomy data
– Image Stacking and Mosaicing
9
What is Hadoop
• Hadoop is an ecosystem of tools for processing
“Big Data”
• Hadoop is an open source project
10
Hadoop: Architecture Principles
• Linear scalability: more nodes can do more work in the same time
– Linear on data size
– Linear on compute resources
• Move computation to data
– Minimize expensive data transfers
– Data are large, programs are small
• Reliability and Availability: Failures are common
– 1 drive fails every 3 years
• Probability of failing today ≈ 1/1000
– How many drives fail per day on a 1,000-node cluster with 10 drives per node? (see the estimate below)
• Simple computational model
– hides complexity in efficient execution framework
• Sequential data processing (avoid random reads)
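A rough answer to the question above, assuming independent drive failures: 1,000 nodes × 10 drives = 10,000 drives, and 10,000 × 1/1000 ≈ 10 drive failures every day, so the cluster must treat failure as a routine event rather than an exception.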
11
Hadoop Success Factors
• Apache Hadoop won the 2011 MediaGuardian Innovation Award
– Recognition for its influence on technological innovation
– Other nominees: iPad, WikiLeaks
1. Scalability
2. Open source & commodity software
3. Just works
12
Hadoop Family
HDFS Distributed file system
MapReduce Distributed computation
Zookeeper Distributed coordination
HBase Column store
Pig Dataflow language, SQL
Hive Data warehouse, SQL
Oozie Complex job workflow
Avro Data Serialization
13
Hadoop Core
• A reliable, scalable, high performance distributed computing system
• Reliable storage layer
– The Hadoop Distributed File System (HDFS)
– With more sophisticated layers on top
• MapReduce – distributed computation framework
• Hadoop scales computation capacity, storage capacity, and I/O bandwidth
by adding commodity servers.
• Divide-and-conquer using lots of commodity hardware
14
MapReduce
• MapReduce – distributed computation framework
– Invented by Google researchers
• Two stages of a MR job
– map: (k1, v1) → {(k2, v2)}
– reduce: (k2, {v2}) → {(k3, v3)}
• Map – a truly distributed stage
Reduce – an aggregation, may not be distributed
• Shuffle – sort and merge
– transition from Map to Reduce
– invisible to user
• Combiners & Partitioners
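A minimal sketch of these two stages, written as the classic word-count job against the org.apache.hadoop.mapreduce API; class names and paths are illustrative and not part of the original slides. The reducer doubles as a combiner, which pre-aggregates map output locally before the shuffle.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (offset, line) -> {(word, 1)}
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: (word, {1, 1, ...}) -> (word, count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner runs the reduce logic map-side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with hadoop jar, the framework runs one map task per input split and sorts/merges the intermediate (word, 1) pairs during the shuffle before the reduce stage.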
MapReduce Workflow
Where MapReduce cannot help
• MapReduce solves about 95% of practical problems
– Not a tool for everything
• Batch processing vs. real-time
– Throughput vs. Latency
• Simultaneous update of common state
• Inter-communication between tasks of a job
• Coordinated execution
• Use of other computational models
– MPI
– Dryad
17
Hadoop Distributed File System
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Lots of fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store block replicas as files on local drives
• Blocks are replicated on 3 DataNodes for redundancy
18
HDFS Read
• To read a block, the client requests the list of replica locations from the
NameNode
• Then it pulls data from a replica on one of the DataNodes
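A minimal client-side sketch of the read path using the public FileSystem API (the path is illustrative); the NameNode lookup and the DataNode streaming described above happen inside open() and read().

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // HDFS client instance
    Path file = new Path("/data/example.txt");     // illustrative path

    try (FSDataInputStream in = fs.open(file);     // NameNode returns the block locations
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) { // bytes stream from a nearby DataNode
        System.out.println(line);
      }
    }
  }
}
```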
19
HDFS Write
• To write a block of a file, the client requests a list of candidate DataNodes
from the NameNode, and organizes a write pipeline.
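The corresponding write-side sketch, again with an illustrative path: create() obtains the candidate DataNodes from the NameNode, and the output stream pushes data down the replication pipeline described above.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/output.txt");            // illustrative path

    try (FSDataOutputStream out = fs.create(file, true)) {  // true = overwrite if present
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
      // close() flushes the remaining packets through the pipeline
      // and completes the file at the NameNode.
    }
  }
}
```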
20
Replica Location Awareness
• MapReduce schedules a task assigned to process block B to a DataNode
serving a replica of B
• Local access to data
21
Name Node
• NameNode keeps 3 types of information
– Hierarchical namespace
– Block manager: block to data-node mapping
– List of DataNodes
• The durability of the name space is maintained by a write-ahead journal and
checkpoints
– A BackupNode creates periodic checkpoints
– A journal transaction is guaranteed to be persisted before replying to the client
– Block locations are not persisted, but rather discovered from DataNodes at startup via block reports.
22
Data Nodes
• DataNodes register with the NameNode, and provide periodic block reports
that list the block replicas on hand
• DataNodes send heartbeats to the NameNode
– Heartbeat responses give instructions for managing replicas
• If no heartbeat is received during a 10-minute interval, the node is presumed to be lost, and the replicas hosted by that node to be unavailable
– NameNode schedules re-replication of lost replicas
23
Quiz:
What Is the Common Attribute?
24
Hadoop Size
• Y! cluster
– 70 million files, 80 million blocks
– 15 PB capacity
– 4000+ nodes. 24,000 clients
– 50 GB heap for NN
• Data warehouse Hadoop cluster at Facebook
– 55 million files, 80 million blocks. Estimate 200 million objects (files + blocks)
– 2000 nodes. 21 PB capacity, 30,000 clients
– 108 GB heap for NN should allow for 400 million objects
• Analytics Cluster at eBay
– 768 nodes
– Each node: 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU
– Cluster size is 18 PB.
– Runs 26,000 MapReduce tasks simultaneously
25
Limitations of the Implementation
• “HDFS Scalability: The limits to growth” USENIX ;login:
• Single master architecture: a constraining resource
• Limit to the number of namespace objects
– 100 million objects; 25 PB of data
– Block-to-file ratio is shrinking: 2 → 1.5 → 1.2
• Limits for linear performance growth
– Linear increase in # of workers puts a higher workload on the single NameNode
– A single NameNode cannot support 100,000 clients
• The Hadoop MapReduce framework reached its scalability limit at 40,000 clients
– Corresponds to a 4,000-node cluster with 10 MapReduce slots per node
26
Benchmarks
• DFSIO
– Read: 66 MB/s
– Write: 40 MB/s
• Observed on busy cluster
– Read: 1.02 MB/s
– Write: 1.09 MB/s
• Sort (“Very carefully tuned user application”)
Bytes (TB)   Nodes   Maps     Reduces   Time       HDFS I/O Aggregate (GB/s)   HDFS I/O Per Node (MB/s)
1            1460    8,000    2,700     62 s       32                          22.1
1000         3558    80,000   20,000    58,500 s   34.2                        9.35
27
ZooKeeper
• A distributed coordination service for distributed apps
– Event coordination and notification
– Leader election
– Distributed locking
• ZooKeeper can help build HA systems
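A minimal leader-election sketch with the ZooKeeper Java client (the connection string and znode paths are illustrative): each candidate creates an ephemeral sequential znode, and the one holding the lowest sequence number acts as the leader.

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    // 30-second session timeout; the no-op lambda is the connection Watcher.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

    // The /election parent znode is assumed to exist already.
    // The ephemeral znode disappears automatically if this client's session dies.
    String me = zk.create("/election/candidate-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // The candidate whose znode has the smallest sequence number is the leader.
    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);
    boolean leader = me.endsWith(candidates.get(0));
    System.out.println(leader ? "I am the leader" : "Following " + candidates.get(0));

    zk.close();
  }
}
```

In a real deployment each follower would also set a watch on the znode immediately preceding its own, so that a leader failure (its ephemeral znode vanishing) triggers a new election without a thundering herd.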
28
HBase
• Distributed table store on top of HDFS
– An implementation of Google’s BigTable
• A big table is Big Data: it cannot be stored on a single node
• Tables: big, sparse, loosely structured.
– Consist of rows with unique row keys
– Have an arbitrary number of columns, grouped into a small number of column families
– Dynamic column creation
• Table is partitioned into regions
– Horizontally across rows; vertically across column families
• HBase provides structured yet flexible access to data
• Near real-time data processing
29
HBase Functionality
• HBaseAdmin: administrative functions
– Create, delete, list tables
– Create, update, delete columns, families
– Split, compact, flush
• HTable: access table data
– Result HTable.get(Get g) // get cells of a row
– void HTable.put(Put p) // update a row
– void HTable.put(Put[] p) // batch update of rows
– void HTable.delete(Delete d) // delete cells/row
– ResultScanner getScanner(family) // scan col family
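A small usage sketch of the calls listed above, written against the HBase 0.90-era client API that the slide names; the table, row, and column names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webtable");       // illustrative table name

    // Update a row: one cell in the "content" column family.
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
        Bytes.toBytes("<html>...</html>"));
    table.put(put);

    // Read the row back.
    Result row = table.get(new Get(Bytes.toBytes("row-1")));
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"))));

    // Scan the "content" column family.
    ResultScanner scanner = table.getScanner(Bytes.toBytes("content"));
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}
```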
HBase Architecture
31
Pig
• A language on top of MapReduce, designed to simplify it
• Pig speaks Pig Latin
• SQL-like language
• Pig programs are translated into a
series of MapReduce jobs
32
Hive
• Serves the same purpose as Pig
• Closely follows SQL standards
• Keeps metadata about Hive tables in a MySQL RDBMS
Oozie
• Workflow actions are arranged as a Directed Acyclic Graph (DAG)
– Multiple steps: MR, Pig, Hive, Java, data mover, ...
• Coordinator jobs (time/data driven workflow jobs)
– A workflow job is scheduled at a regular frequency
– The workflow job is started when all inputs are available
34
The Future: Next Generation MapReduce
• “Apache Hadoop: The scalability update” USENIX ;login:
• Next Generation MapReduce
– Separation of JobTracker functions
1. Job scheduling and resource allocation
• Fundamentally centralized
2. Job monitoring and job life-cycle coordination
• Delegate coordination of different jobs to other nodes
– Dynamic partitioning of cluster resources: no fixed slots
• HDFS Federation
– Independent NameNodes sharing a common pool of DataNodes
– Cluster is a family of volumes with shared block storage layer
– User sees volumes as isolated file systems
– ViewFS: the client-side mount table
– Federated approach provides a static partitioning of the federated namespace
35
The End
36