Data-Intensive Computing for Text Analysis (CS395T / INF385T / LIN386M)
University of Texas at Austin, Fall 2011
Lecture 2, September 1, 2011
Jason Baldridge and Matt Lease
https://sites.google.com/a/utcompling.com/dicta-f11/
1. Data-Intensive Computing for Text Analysis
CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011
Lecture 2
September 1, 2011
Jason Baldridge, Department of Linguistics, University of Texas at Austin
(jasonbaldridge at gmail dot com)
Matt Lease, School of Information, University of Texas at Austin
(ml at ischool dot utexas dot edu)
2. Acknowledgments
Course design and slides derived from
Jimmy Lin’s cloud computing courses at
the University of Maryland, College Park
Some figures courtesy of
• Chuck Lam’s Hadoop In Action (2011)
• Tom White’s Hadoop: The Definitive Guide,
2nd Edition (2010)
6. “Big Ideas”
Scale “out”, not “up”
Limits of SMP and large shared-memory machines
Move processing to the data
Clusters have limited bandwidth
Process data sequentially, avoid random access
Seeks are expensive, disk throughput is reasonable
Seamless scalability
From the mythical man-month to the tradable machine-hour
7. Typical Large-Data Problem
Iterate over a large number of records
Compute something of interest from each
Shuffle and sort intermediate results
Aggregate intermediate results
Generate final output
Key idea: provide a functional abstraction for two of these
operations: per-record computation (map) and aggregation (reduce)
(Dean and Ghemawat, OSDI 2004)
9. MapReduce “Runtime”
Handles scheduling
Assigns workers to map and reduce tasks
Handles “data distribution”
Moves processes to data
Handles synchronization
Gathers, sorts, and shuffles intermediate data
Handles errors and faults
Detects worker failures and restarts their tasks
Built on a distributed file system
10. MapReduce
Programmers specify two functions
map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
Note correspondence of types map output → reduce input
Data Flow
Input → “input splits”: each a sequence of logical (K1,V1) “records”
Map
• Each split is processed by a single map worker
• map invoked iteratively: once per record in the split
• For each record processed, map may emit 0-N (K2,V2) pairs
Reduce
• reduce invoked iteratively for each ( K2, list(V2) ) intermediate value
• For each such pair processed, reduce may emit 0-N (K3,V3) pairs
Each reducer’s output written to a persistent file in HDFS
12. Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 30
Data Flow
Input → “input splits”: each a sequence of logical (K1,V1) “records”
For each split, for each record, do map(K1,V1) (multiple calls)
Each map call may emit any number of (K2,V2) pairs (0-N)
Run-time
Groups all values with the same key into ( K2, list(V2) )
Determines which reducer will process this
Copies data across network as needed for reducer
Ensures intra-node sort of keys processed by each reducer
• No guarantee by default of inter-node total sort across reducers
13. “Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )
reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);
Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);
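The pseudocode above can be tried without a Hadoop cluster. The sketch below simulates the map → shuffle/sort → reduce pipeline for word count in plain Java; the class and method names are illustrative, not the Hadoop API.

```java
import java.util.*;

// Minimal local simulation of MapReduce word count (not the Hadoop API).
public class LocalWordCount {

    // map(docid, text) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split("\\s+"))
            out.add(Map.entry(w, 1));           // emit (w, 1)
        return out;
    }

    // reduce(term, values) -> sum of the values for that term
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Drives map over each "split", shuffles by key, then reduces.
    public static Map<String, Integer> run(List<String> splits) {
        // Shuffle and sort: group all values with the same key
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits)
            for (Map.Entry<String, Integer> kv : map(split))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        // Reduce each (key, list-of-values) group
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c")));  // {a=2, b=2, c=1}
    }
}
```

Note the type correspondence from the signatures above: map emits (String, Integer) pairs, and reduce consumes a (String, list of Integer) group.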
14. [Figure: word count data flow, courtesy of Chuck Lam's Hadoop In
Action (2011), pp. 45, 52. Four map tasks emit (a 1), (b 2), (c 3),
(c 6), (a 5), (c 2), (b 7), (c 8); shuffle and sort aggregates values
by key into a → (1, 5), b → (2, 7), c → (2, 3, 6, 8); three reduce
tasks then produce the final (key, value) outputs]
15. Partition
Given: map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]
Each distinct key (with associated values) sent to a single reducer
• Same reduce node may process multiple keys in separate reduce() calls
Balances workload across reducers: roughly equal number of keys to each
• Default: simple hash of the key, e.g., hash(k’) mod N (# reducers)
Customizable
• Some keys require more computation than others
• e.g. value skew, or key-specific computation performed
• For skew, sampling can dynamically estimate distribution & set partition
• Secondary/Tertiary sorting (e.g. bigrams or arbitrary n-grams)?
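The default partition function above can be sketched in plain Java. This mirrors (but does not use) Hadoop's default hash partitioner; masking the sign bit keeps the index non-negative for keys whose hashCode is negative.

```java
// Default-style partitioner: assigns each key to one of N reducers.
public class HashPartition {
    // hash(key) mod N, with the sign bit masked so the index is never negative
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int n = 4;
        for (String k : new String[] {"apple", "banana", "cherry"})
            System.out.println(k + " -> reducer " + partition(k, n));
    }
}
```

Because the function is deterministic, every occurrence of a key lands on the same reducer, which is what guarantees each distinct key is reduced in one place.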
16. Secondary Sorting (Lin 57, White 241)
How to output sorted bigrams (1st word, then list of 2nds)?
What if we use word1 as the key, word2 as the value?
What if we use <first>--<second> as the key?
Pattern
Create a composite key of (first, second)
Define a Key Comparator based on both words
• This will produce the sort order we want (aa ab ac ba bb bc ca cb…)
Define a partition function based only on first word
• All bigrams with the same first word go to same reducer
• How do you know when the first word changes across invocations?
Preserve state in the reducer across invocations
• Will be called separately for each bigram, but we want to remember
the current first word across bigrams seen
Hadoop also provides Group Comparator
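The pattern can be illustrated in plain Java (this is a local sketch of the idea, not Hadoop's WritableComparator / Partitioner classes): a composite (first, second) key, a key comparator over both words, a partition function over the first word only, and reducer-side state to detect when the first word changes.

```java
import java.util.*;

// Plain-Java sketch of the secondary-sort pattern: composite (first, second)
// keys, full sort on both fields, partition on the first field only.
public class SecondarySortSketch {

    static class Bigram {
        final String first, second;
        Bigram(String f, String s) { first = f; second = s; }
    }

    // Key comparator: order by first word, then by second word
    static final Comparator<Bigram> KEY_ORDER =
        Comparator.comparing((Bigram b) -> b.first).thenComparing(b -> b.second);

    // Partition function: depends only on the first word, so all bigrams
    // sharing a first word reach the same reducer
    static int partition(Bigram b, int numReducers) {
        return (b.first.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<Bigram> bigrams = new ArrayList<>(List.of(
            new Bigram("b", "c"), new Bigram("a", "c"),
            new Bigram("a", "b"), new Bigram("b", "a")));
        bigrams.sort(KEY_ORDER);
        // Reducer-side: remember the current first word across invocations
        String current = null;
        for (Bigram b : bigrams) {
            if (!b.first.equals(current)) {
                current = b.first;                 // first word changed
                System.out.println(current + ":");
            }
            System.out.println("  " + b.second);
        }
    }
}
```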
17. Combine
Given: map ( K1, V1 ) → list ( K2, V2 )
reduce ( K2, list(V2) ) → list ( K3, V3)
combine ( K2, list(V2) ) → list ( K2, V2 )
Optional optimization
Local aggregation to reduce network traffic
No guarantee it will be used, nor how many times it will be called
Semantics of program cannot depend on its use
Signature: same input as reduce, same output as map
Combine may be run repeatedly on its own output
Lin: an associative & commutative reducer can double as the combiner
• See next slide
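A small sketch of why the 0..N-calls rule matters (names are illustrative): for an associative and commutative reduce such as summation, running a combiner on any local subset of values before the shuffle never changes the final reduced result.

```java
import java.util.*;

// Sketch: a combiner must be safe to run 0..N times. For an associative,
// commutative reduce (here: sum), local pre-aggregation of any subset of
// values leaves the final reduced result unchanged.
public class CombinerSketch {

    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Combine has reduce's input shape but map's output type:
    // it collapses a local list of values into a single partial value.
    static List<Integer> combine(List<Integer> values) {
        return List.of(reduce(values));
    }

    public static void main(String[] args) {
        List<Integer> mapper1 = List.of(3, 6);   // e.g. (c,3), (c,6) on one node
        List<Integer> mapper2 = List.of(2, 8);

        // No combiner: the reducer sees all four values
        List<Integer> noCombine = new ArrayList<>(mapper1);
        noCombine.addAll(mapper2);

        // Combiner run once per mapper: the reducer sees (9, 10)
        List<Integer> combined = new ArrayList<>(combine(mapper1));
        combined.addAll(combine(mapper2));

        System.out.println(reduce(noCombine));   // 19
        System.out.println(reduce(combined));    // 19
    }
}
```

For a non-associative operation such as averaging, the two paths would disagree, which is exactly why program semantics must not depend on the combiner running.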
18. Functional Properties
Associative: f( a, f(b,c) ) = f( f(a,b), c )
Grouping of operations doesn’t matter
YES: Addition, multiplication, concatenation
NO: division, subtraction, NAND
NAND(1, NAND(1,0)) = 0 != 1 = NAND( NAND(1,0), 0 )
Commutative: f(a,b) = f(b,a)
Ordering of arguments doesn’t matter
YES: addition, multiplication, NAND
NO: division, subtraction, concatenation
Concatenate("a","b") != Concatenate("b","a")
Distributive
White (p. 32) and Lam (p. 84) mention with regard to combiners
But really, go with associative + commutative in Lin (pp. 20, 27)
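The properties above can be checked directly; the snippet below confirms the slide's examples, including the NAND counterexample to associativity.

```java
// Quick checks: addition is associative and commutative, subtraction is
// neither, and NAND is commutative but not associative.
public class PropertyChecks {
    static int nand(int a, int b) { return (a == 1 && b == 1) ? 0 : 1; }

    public static void main(String[] args) {
        int a = 7, b = 3, c = 2;
        System.out.println((a + (b + c)) == ((a + b) + c));             // true
        System.out.println((a - (b - c)) == ((a - b) - c));             // false
        System.out.println(nand(1, nand(1, 0)) == nand(nand(1, 0), 0)); // false
        System.out.println(nand(1, 0) == nand(0, 1));                   // true
    }
}
```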
19. [Figure: the word count flow of slide 14 with combine and partition
steps added. Map outputs (a 1), (b 2), (c 3), (c 6), (a 5), (c 2),
(b 7), (c 8); a combiner locally aggregates (c 3) and (c 6) into (c 9)
on one mapper; each mapper's output is then partitioned, and shuffle
and sort aggregates values by key into a → (1, 5), b → (2, 7),
c → (2, 9, 8) for the three reducers]
20. [Figure: MapReduce execution overview, adapted from (Dean and
Ghemawat, OSDI 2004). (1) The user program submits a job to the master;
(2) the master schedules map and reduce tasks onto workers; (3) map
workers read their input splits; (4) map output is written to
intermediate files on local disk; (5) reduce workers remote-read the
intermediate data; (6) reduce workers write the output files. Overall
flow: input files → map phase → intermediate files (on local disk) →
reduce phase → output files]
21. Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178
Shuffle and 2 Sorts
As map emits values, local sorting
runs in tandem (1st sort)
Combine is optionally called
0..N times for local aggregation
on sorted (K2, list(V2)) tuples (more sorting of output)
Partition determines which (logical) reducer Rj each key will go to
Node’s TaskTracker tells JobTracker it has keys for Rj
JobTracker determines node to run Rj based on data locality
When local map/combine/sort finishes, sends data to Rj’s node
Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)
For each (K, list(V)) tuple in merged output, call reduce(…)
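The reduce-side "2nd sort" is a merge rather than a full re-sort: each mapper's output arrives already key-sorted, so the reducer only interleaves the sorted runs. A plain-Java sketch with a priority queue (illustrative, not Hadoop's actual merge code):

```java
import java.util.*;

// Sketch of the reduce-side merge: each mapper delivers a key-sorted run,
// and the reducer merges the runs with a priority queue instead of re-sorting.
public class MergeSketch {

    public static List<String> merge(List<List<String>> sortedRuns) {
        // Heap entries are {run index, position within run}, ordered by the
        // key currently at that position
        Comparator<int[]> byKey =
            Comparator.comparing(e -> sortedRuns.get(e[0]).get(e[1]));
        PriorityQueue<int[]> heap = new PriorityQueue<>(byKey);
        for (int r = 0; r < sortedRuns.size(); r++)
            if (!sortedRuns.get(r).isEmpty())
                heap.add(new int[] {r, 0});

        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();               // smallest current key
            out.add(sortedRuns.get(top[0]).get(top[1]));
            if (top[1] + 1 < sortedRuns.get(top[0]).size())
                heap.add(new int[] {top[0], top[1] + 1});  // advance that run
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> runs = List.of(
            List.of("a", "c", "d"),   // from mapper 1, already sorted
            List.of("b", "c", "e")    // from mapper 2, already sorted
        );
        System.out.println(merge(runs));  // [a, b, c, c, d, e]
    }
}
```

Merging k sorted runs this way costs O(n log k) comparisons, which is why pre-sorting on the map side pays off.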
22. Distributed File System
Don’t move data… move computation to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local
Why?
Not enough RAM to hold all the data in memory
Disk access is slow, but disk throughput is reasonable
A distributed file system is the answer
GFS (Google File System) for Google’s MapReduce
HDFS (Hadoop Distributed File System) for Hadoop
23. GFS: Assumptions
Commodity hardware over “exotic” hardware
Scale “out”, not “up”
High component failure rates
Inexpensive commodity components fail all the time
“Modest” number of huge files
Multi-gigabyte files are common, if not encouraged
Files are write-once, mostly appended to
Perhaps concurrently
Large streaming reads over random access
High sustained throughput over low latency
GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
24. GFS: Design Decisions
Files stored as chunks
Fixed size (64MB)
Reliability through replication
Each chunk replicated across 3+ chunkservers
Single master to coordinate access, keep metadata
Simple centralized management
No data caching
Little benefit due to large datasets, streaming reads
Simplify the API
Push some of the issues onto the client (e.g., data layout)
HDFS = GFS clone (same basic ideas)
25. Basic Cluster Components
1 “Manager” node (can be split onto 2 nodes)
Namenode (NN)
Jobtracker (JT)
1-N “Worker” nodes
Tasktracker (TT)
Datanode (DN)
Optional Secondary Namenode
Periodically merges the Namenode’s edit log into its namespace image
(checkpointing); not a hot standby, but speeds recovery after failure
27. Namenode Responsibilities
Managing the file system namespace:
Holds file/directory structure, metadata, file-to-block mapping,
access permissions, etc.
Coordinating file operations:
Directs clients to datanodes for reads and writes
No data is moved through the namenode
Maintaining overall health:
Periodic communication with the datanodes
Block re-replication and rebalancing
Garbage collection
28. Putting everything together…
[Figure: a namenode (running the namenode daemon) and a job submission
node (running the jobtracker), above a row of slave nodes; each slave
node runs a tasktracker and a datanode daemon on top of its local
Linux file system]
29. Anatomy of a Job
MapReduce program in Hadoop = Hadoop job
Jobs are divided into map and reduce tasks (+ more!)
An instance of running a task is called a task attempt
Multiple jobs can be composed into a workflow
Job submission process
Client (i.e., driver program) creates a job, configures it, and
submits it to job tracker
JobClient computes input splits (on client end)
Job data (jar, configuration XML) are sent to JobTracker
JobTracker puts job data in shared location, enqueues tasks
TaskTrackers poll for tasks
Off to the races…
30. Why have 1 API when you can have 2?
White pp. 25-27, Lam pp. 77-80
Hadoop 0.19 and earlier had “old API”
Hadoop 0.21 and forward has “new API”
Hadoop 0.20 has both!
Old API most stable, but deprecated
Current books use old API predominantly, but discuss changes
• Example code using new API available online from publisher
Some old API classes/methods not yet ported to new API
Cloud9 uses both, and you can too
32. New API
org.apache.hadoop.mapred now deprecated; instead use
org.apache.hadoop.mapreduce &
org.apache.hadoop.mapreduce.lib
Mapper, Reducer now abstract classes, not interfaces
Use Context instead of OutputCollector and Reporter
Context.write(), not OutputCollector.collect()
Reduce takes value list as Iterable, not Iterator
Can use Java’s foreach syntax for iterating
Can throw InterruptedException as well as IOException
JobConf & JobClient replaced by Configuration & Job