This document discusses large-scale computing with MapReduce. It provides background on the growth of digital data, noting that by 2020 there will be over 5,200 GB of data for every person on Earth. It introduces MapReduce as a programming model for processing large datasets in a distributed manner, describing the key aspects of the Map and Reduce functions. Examples of MapReduce jobs are also provided, such as counting URL access frequencies and generating a reverse web-link graph.
2. • By 2020, there will be 5,200 GB of data for every person
on Earth.
• Over the next eight years, the amount of digital data produced
will exceed 40 zettabytes, the equivalent of 5,200 GB of data
for every person.
• The data recorded by each of the big experiments at the
Large Hadron Collider (LHC) at CERN in Geneva is enough
to fill around 100,000 DVDs every year.
• Sources: Facebook, Google, etc.
Data Explosion 2
3. • Big Data in fields: sport, finance, banking, science,
marketing, journalism, medicine, education.
Data Explosion 3
4. • Downloading a large amount of web pages
• Creating indexes
• Retrieving the most related pages
A case study of Google 4
5. • Single-thread performance doesn't matter.
• Throughput is more important than peak performance.
• Stuff breaks:
• 1 server can run for many years, but a large cluster of
servers may lose 10 a day.
• "Ultra-reliable" hardware doesn't really help.
• Software needs to be fault tolerant.
• Commodity machines at a lower price are better.
Large Data Set 5
6.              Traditional RDBMS            MapReduce
   Data Size    Gigabytes                    Petabytes
   Access       Interactive and batch        Batch
   Updates      Read and write many times    Write once, read many times
   Structure    Static schema                Dynamic schema
   Integrity    High                         Low
   Scaling      Nonlinear                    Linear
Streaming Data 6
8. • Map:
• Produces a set of intermediate key/value pairs.
• Reduce:
• Merges the intermediate values associated with the same key
to deliver the results.
Functional MapReduce 8
9. • map(String key, String value):
// key: document name
// value: document contents
for each word w in value:
EmitIntermediate(w, "1");
• reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
Functional MapReduce 9
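A minimal, runnable Python sketch of the same word count (not from the slides; the documents dict is illustrative, and a plain dict simulates the shuffle a real MapReduce runtime performs):

from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    return [(w, "1") for w in value.split()]

def reduce_fn(key, values):
    # key: a word; values: a list of counts
    return str(sum(int(v) for v in values))

documents = {"doc1": "hello world", "doc2": "hello mapreduce"}
intermediate = defaultdict(list)
for name, contents in documents.items():
    for k, v in map_fn(name, contents):
        intermediate[k].append(v)  # simulated shuffle: group by key

for word, counts in intermediate.items():
    print(word, reduce_fn(word, counts))  # hello 2, world 1, mapreduce 1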
10. • Parallel map over input.
• Parallel grouping of intermediate data.
• Parallel map over groups.
• Parallel reduction per group.
Discover Parallelism in MapReduce 10
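To make these phases concrete, a toy sketch using Python's multiprocessing.Pool (the helpers map_doc and reduce_group are illustrative, not from the slides; only the map and reduction phases are parallelized here, and the grouping is done in a single pass):

from collections import defaultdict
from multiprocessing import Pool

def map_doc(doc):
    return [(w, 1) for w in doc.split()]

def reduce_group(item):
    word, counts = item
    return (word, sum(counts))

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    with Pool() as pool:
        mapped = pool.map(map_doc, docs)   # parallel map over input
        groups = defaultdict(list)         # grouping of intermediate data
        for pairs in mapped:
            for k, v in pairs:
                groups[k].append(v)
        # parallel reduction per group
        print(pool.map(reduce_group, list(groups.items())))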
13. • One master, many workers.
• Input data is split into M map tasks (typically 64 MB each).
• The reduce phase is partitioned into R reduce tasks.
• Tasks are assigned to workers dynamically.
• Often M = 200,000; R = 4,000; workers = 2,000.
• The master assigns each map task to a free worker,
considering the locality of the data to the worker.
• The worker reads the task input (often from local disk).
• The worker produces R local files of intermediate k/v pairs.
• The master assigns each reduce task to a free worker.
• The worker reads intermediate k/v pairs from the map workers.
• The worker sorts them and applies the user's Reduce op to
produce the output.
MapReduce: Job Scheduling 13
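How intermediate pairs find their reduce task: the default partitioner in the MapReduce paper is hash(key) mod R, so every pair with a given key lands on the same reducer. A small sketch (the pairs are illustrative):

R = 4  # number of reduce tasks

def partition(key, r=R):
    return hash(key) % r  # default partitioning function from the paper

buckets = {i: [] for i in range(R)}
for k, v in [("cat", 1), ("dog", 1), ("cat", 1)]:
    buckets[partition(k)].append((k, v))
print(buckets)  # both ("cat", 1) pairs end up in the same bucket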
15. • On worker failure:
• Detect failure via periodic heartbeats.
• Re-execute completed and in-progress map tasks.
• Re-execute in-progress reduce tasks.
• Task completion is committed through the master.
• On master failure:
• State is checkpointed to GFS; a new master recovers and
continues.
MapReduce: Fault Tolerance 15
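A toy sketch of heartbeat-driven failure detection (the names last_seen, TIMEOUT and tasks_of are illustrative, not any Hadoop API): the master records each worker's last heartbeat and re-queues a worker's tasks once it goes silent for too long.

import time

TIMEOUT = 10.0   # seconds of silence before a worker is declared dead
last_seen = {}   # worker id -> time of last heartbeat
tasks_of = {"w1": ["map-3"], "w2": ["map-7", "reduce-1"]}
pending = []     # tasks awaiting re-execution

def heartbeat(worker):
    last_seen[worker] = time.time()

def check_workers():
    now = time.time()
    for worker, seen in list(last_seen.items()):
        if now - seen > TIMEOUT:
            pending.extend(tasks_of.pop(worker, []))  # re-execute its tasks
            del last_seen[worker]

heartbeat("w1"); heartbeat("w2")
check_workers()  # both alive, nothing re-queued
print(pending)   # []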
16. • Master scheduling:
• Asks GFS for the locations of the replicas of the input file blocks.
• Map task inputs are typically 64 MB splits (== the GFS block size).
• Map tasks are scheduled so that a GFS input block replica is on
the same machine or the same rack.
MapReduce: Locality Optimization 16
17. • Optional secondary keys for ordering.
• Compression of intermediate data.
• Combiner: useful for saving network bandwidth
• User-defined counters.
MapReduce: Other Refinements 17
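Why a combiner saves bandwidth, in a toy sketch (the pairs are illustrative): the map worker pre-aggregates its output locally with the same logic as the reducer, so fewer pairs cross the network.

from collections import Counter

mapped = [("the", 1), ("the", 1), ("fox", 1), ("the", 1)]  # map output on one worker

# Combiner: local summation before anything is sent over the network.
# Counting keys equals summing the values here, since every value is 1.
combined = list(Counter(k for k, _ in mapped).items())
print(combined)  # [('the', 3), ('fox', 1)] -- 2 pairs shipped instead of 4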
18. • Distributed Grep
• The map function emits a line if it matches a given pattern. The reduce
function is an identity function that just copies the supplied intermediate
data to the output.
• Count of URL Access Frequency
• The input is logs of web page requests. The map function outputs <URL, 1>.
The reduce function adds together all values for the same URL and emits a
<URL, total count> pair.
• Reverse Web-Link Graph
• The map function outputs (target, source) pairs for each link to a target
URL found in a page named source. The reduce function concatenates the list
of all source URLs associated with a given target URL and emits the pair
(target, list(source)).
• Term-Vector per Host
• A term vector summarizes the most important words that occur in a document
or a set of documents as a list of (word, frequency) pairs.
MapReduce: examples 18
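The reverse web-link graph example, sketched in runnable Python (the page data is illustrative; a grouping dict again stands in for the shuffle):

from collections import defaultdict

pages = {"a.html": ["b.html", "c.html"],  # a.html links to b.html and c.html
         "b.html": ["c.html"]}

intermediate = defaultdict(list)
for source, links in pages.items():
    for target in links:
        intermediate[target].append(source)  # map: emit (target, source)

for target, sources in intermediate.items():
    print(target, sources)                   # reduce: (target, list(source))
# b.html ['a.html']
# c.html ['a.html', 'b.html']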
19. • The MapReduce runtime library [8] provides:
• Automatic parallelization.
• Load balancing.
• Network and disk transfer optimization.
• Handling of machine failures.
• Robustness.
MapReduce: runtime library 19
20. • Economy: a cluster of commodity computers.
• Usability: a simple interface for submitting computing jobs;
all the distributed computing is carried out behind the scenes,
so users need not deal with those issues.
• Reliability: fault tolerant.
Hadoop: an open-source library 20
22. • Built-in backup became a necessity.
• A built-in automated recovery mechanism.
• Running things in parallel (distributed programming).
• Easy to administer.
• Something that is cost-effective.
What was required? 22
24. The file system component of Hadoop. Its design assumptions:
• Streaming data access
• Hardware failure is the norm
• Commodity hardware
• Moving data is expensive
HDFS 24
25. • Scalable
• Fault tolerant
• Distributed file system
• Data storage
• Cost-effective processing
Hadoop 25
26. [Diagram: the user interacts with an HDFS client, which talks to a
single NameNode (master) and to multiple DataNodes (slaves).]
HDFS Core Architecture 26
27. • Only one NameNode.
• Selects DataNodes to create replicas.
• Image: the in-memory file system metadata.
• Checkpoint: the persistent record of the image on disk.
• Journal: the write-ahead log of changes to the image.
• CheckpointNode / BackupNode.
NameNode 27
28. • Variable block size (default is 128 MB).
• Replicas at multiple locations (default 3).
• The namespace of all blocks is stored in the NameNode.
• Handshake with the NameNode at startup.
• Storage ID: identifies a DataNode.
• Block reports of replicas are sent every hour.
• Heartbeat: signals normal operation of the DataNode.
DataNode 28
29. • A backup of the state of the file system.
• Protects against data loss during a software upgrade.
• The DataNode copies its storage directories and hard-links
the blocks into the copy.
[Diagram: heartbeat exchange between DataNode and NameNode during a snapshot.]
Snapshot 29
30. • Data in a file cannot be modified once saved (append only).
• Only one client can have write access to a file at a time.
• Soft limit and hard limit (on the writer's lease).
• Bytes are sent in a pipeline to the DataNodes (in the form of packets).
• Optimized for batch-processing systems.
Reads and Writes 30
31. • Two rules:
1. No DataNode contains more than one replica of the same block.
2. No rack contains more than two replicas of the same block.
• Placement of replicas plays a vital role.
• Block reports give the number of replicas.
• Replication priority queue.
Replica Management 31
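A toy checker for the two placement rules above (the (node, rack) data model is illustrative, not HDFS code):

from collections import Counter

def placement_ok(replicas):
    # replicas: list of (node, rack) locations of one block's replicas
    node_counts = Counter(node for node, _ in replicas)
    rack_counts = Counter(rack for _, rack in replicas)
    rule1 = all(c == 1 for c in node_counts.values())    # one replica per DataNode
    rule2 = all(c <= 2 for c in rack_counts.values())    # at most two per rack
    return rule1 and rule2

print(placement_ok([("n1", "r1"), ("n2", "r1"), ("n3", "r2")]))  # True
print(placement_ok([("n1", "r1"), ("n1", "r1"), ("n2", "r2")]))  # False: two replicas on n1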
32. • Follows the POSIX permission model (read, write and execute).
• The latest version uses Kerberos authentication.
• Not designed to operate over untrusted networks.
• Security features are still very weak, but being worked on.
Security 32
33. • Yahoo
• Facebook
• Twitter
• eBay
• LinkedIn
• Amazon (A9)
Who all use Hadoop? 33
34. • Yahoo played a vital role in the development of Hadoop.
• Initially used it for indexing web crawl results.
• Also used to block spam entering the mail servers, for
filters, content optimization, etc.
34
35. • When Facebook first started, it used a commercial RDBMS.
• It needed infrastructure to handle such huge data.
• With Hadoop, days of processing turned into hours.
• Uses: log processing, recommendation systems, data
warehousing and archiving.
35
36. • Uses LZO compression to store data.
• Used for analyzing and collecting information.
• Uses the Scala programming language along with Hadoop.
• Data: tweets, log information, etc.
36
37. • Huge data.
• Uses Teradata and Hadoop together to store data.
• Uses Hadoop to understand customer needs.
• Data: search queries, server logs, click-throughs, etc.
37
38. • Uses Hadoop to analyze data.
• Builds new data products like:
• People you may know
• Jobs matching your skills
• Profile visitors, etc.
38
39. • Amazon A9
• The New York Times
• IBM
• Last.fm
• Veoh
• And the list goes on...
Other Applications 39
40. • Optimized for high data throughput at the expense of latency.
• Single point of failure and limited NameNode memory.
• No modification of data in a file.
• Hadoop is not a substitute for a database.
• Consumes immense power.
Where Hadoop does not work 40
44. • Pig is a large-scale data analysis platform based on Hadoop.
• Provides an SQL-like language called Pig Latin.
• Converts Pig Latin data requests into a series of optimized
MapReduce computations.
• Makes complex, massive-scale parallel data computation simple.
• Provides simple operations and a programming interface.
Pig description 44
46. • Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Google
• IBM
• New York Times
• PowerSet (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
Who uses Pig? 46
47. • Ad-hoc analysis
• Runs on a cluster computing architecture
• Operations with SQL-like syntax
• Open source code
Pig characteristics 47
49. • Connect to the local Hadoop cluster.
• Run Pig via a Pig script, the Grunt shell, or the embedded method.
Pig usage 49
50. records = LOAD 'first.txt' AS (itemname: chararray, price: int, quality: int);
filter_records = FILTER records BY price != 999 AND quality == 0;
group_records = GROUP filter_records BY itemname;
max_price = FOREACH group_records GENERATE group, MAX(filter_records.price);
DUMP max_price;
Pig usage 50
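For comparison, what this script computes, sketched in plain Python (the records are illustrative): filter out sentinel prices and bad quality, group by item, take the maximum price per group.

records = [("apple", 120, 0), ("apple", 999, 0), ("pear", 80, 1), ("apple", 90, 0)]
filtered = [r for r in records if r[1] != 999 and r[2] == 0]   # FILTER
max_price = {}
for itemname, price, quality in filtered:                      # GROUP + MAX
    max_price[itemname] = max(max_price.get(itemname, price), price)
print(max_price)  # {'apple': 120}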
51.  SQL                                     Pig
     SQL is a declarative query language     Pig Latin is a data-flow
                                             programming language
     An RDBMS stores data in a strictly      Pig handles data in a looser schema,
     defined schema                          which can be defined at run time
     Simple data structures                  Pig supports complex nested data
     Supports transactions, indexes and      Does not support transactions,
     random reads                            indexes or random reads
Pig and SQL comparison 51
52. • Programs consist of a series of statements.
• Operations and commands are case-insensitive.
• Aliases and function names are case-sensitive.
• A statement can span multiple lines of the program.
Pig Latin 52
53. • Hive is a data warehouse tool.
• Maps structured data files onto database tables.
• Provides complete SQL-style queries.
• Converts SQL statements into MapReduce tasks for execution.
Hive 53
55. • Data stored in HDFS is divided into blocks.
• Blocks are distributed across multiple machines.
Hive File System 55
56. • Pig is a programming language that simplifies common
Hadoop tasks.
• Hive plays the role of the data warehouse in Hadoop.
• Pig's use of the Hadoop Java APIs can significantly reduce
the amount of code.
• Pig attracts a large number of software developers.
About Pig and Hive 56
57. • HBase is a distributed, open-source, column-oriented database.
• A distributed storage system for structured data.
• Bigtable-like capabilities.
• A subproject of the Apache Hadoop project.
• Suitable for unstructured data storage.
• HBase is column-based rather than row-based.
• Suits workloads requiring random access and real-time
reads and writes.
HBase 57
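A conceptual sketch of the column-oriented data model using nested Python dicts (row key -> "family:qualifier" -> value); this illustrates the model only and is not the HBase API:

table = {
    "row1": {"info:name": "Alice", "info:city": "Geneva"},
    "row2": {"info:name": "Bob"},  # sparse rows: absent columns cost nothing
}
print(table["row1"]["info:name"])  # random access by row key and column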
59. • Hadoop's distributed coordination service.
• Provides simple operations and additional abstractions,
such as ordering and notifications.
• Used to implement many coordination data structures and
protocols.
• Provides generic coordination modes and methods in an
open-source shared repository.
• High performance: write-dominated benchmarks exceed
10,000 ops per second, and throughput is several times higher
when the workload is mainly reads.
ZooKeeper 59
60. • Aims to enable efficient data exchange between relational
databases and Hadoop.
• Includes tools to view database tables and other useful gadgets.
• Supports JDBC-compliant databases, such as DB2 and MySQL.
Sqoop 60
62. • Data analysis. Retrieved from http://public.web.cern.ch/public/en/research/DataAnalysis-en.html
• James Gallagher. DNA sequencing of MRSA used to stop outbreak. http://www.bbc.co.uk/news/health-20314024
• Stephen Shankland. (2009). Google uncloaks once-secret server. Retrieved from http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf
• J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, 6th Symposium on Operating Systems Design and Implementation, sponsored by USENIX in cooperation with ACM SIGOPS, pages 137-150, 2004.
• Ralf Lämmel. (2007). Google's MapReduce programming model - Revisited. Science of Computer Programming, Volume 68, Issue 3, October 2007.
• Lucas Mearian. By 2020, there will be 5,200 GB of data for every person on Earth. http://www.computerworld.com/s/article/9234563/By_2020_there_will_be_5_200_GB_of_data_for_every_person_on_Earth
• Tom White. Hadoop: The Definitive Guide. http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA648&dq=hadoop&hl=en&sa=X&ei=6mfKUPW7Je3U2QWtzIDgCg&ved=0CDcQ6AEwAA
• Ilan Horn. Introduction to MapReduce, an Abstraction for Large-Scale Computation. http://www.slideshare.net/rantav/introduction-to-map-reduce#btnNext
References 62
63. • A brief view of the platform. Retrieved from http://hadooper.blogspot.com/
• Hadoop. Retrieved from http://pig.apache.org/
• Applications and organizations using Hadoop. Retrieved from http://wiki.apache.org/hadoop/PoweredBy
• Installing and Running Pig. Retrieved from http://ofps.oreilly.com/titles/9781449302641/running_pig.html
• Gates, Alan. Programming Pig. 1st ed. O'Reilly Media, 2009. 11-50. Print.
• What is Hive? Retrieved from http://hive.apache.org/docs/r0.8.1/
• Hive vs. Pig. Retrieved from http://www.larsgeorge.com/2009/10/hive-vs-pig.html
• George, Lars. HBase: The Definitive Guide. 1st ed. O'Reilly Media, 2011. 212-215. Print.
• White, Tom. Hadoop: The Definitive Guide. 1st ed. O'Reilly Media, 2009. 312-368. Print.
References 63