2. Why Data?
Get insights to offer a better product
“More data usually beats better algorithms”
Get insights to make better decisions
Avoid “guesstimates”
Gain a competitive advantage
3. What Is Challenging?
Store data reliably
Analyze data quickly
In a cost-effective way
Use an expressive, high-level language
4. Fundamental Ideas
A big system of machines, not a big machine
Failures will happen
Move computation to data, not data to computation
Write complex code only once, but write it right
A system of multiple animals
5. Apache Hadoop
Open-source Java software
Storing and processing of very large data sets
Runs on clusters of commodity machines
A simple programming model
6. Apache Hadoop
Two main components:
HDFS - a distributed file system
MapReduce – a distributed processing layer
Many other tools belong to “Apache Hadoop Ecosystem”
8. The Purpose Of HDFS
Store large datasets in a distributed, scalable and fault-tolerant way
High throughput
Very large files
Streaming reads and writes (no edits)
Write once, read many times
It is like a big truck to move heavy stuff (not a Ferrari)
9. HDFS Mis-Usage
Do NOT use, if you have
Low-latency requests
Random reads and writes
Lots of small files
Then it is better to consider
RDBMSs,
File servers,
HBase or Cassandra...
10. Splitting Files And Replicating Blocks
Split a very large file into smaller (but still large) blocks
Store them redundantly on a set of machines
11. Splitting Files Into Blocks
The default block size is 64MB
Today, 128MB or 256MB is recommended
Minimizes the overhead of a disk seek operation (less than 1%)
A file is just “sliced” into chunks after each 64MB (or so)
It does NOT matter whether it is text/binary, compressed or not
It does matter later (when reading the data)
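The slicing above is pure arithmetic over the byte stream. A minimal sketch (file and block sizes here are illustrative, not from a real cluster):

```java
import java.util.Arrays;

// Sketch: how a file is "sliced" into fixed-size blocks, ignoring its content.
public class BlockSplit {
    // returns the size of each block for a file of the given size
    static long[] blockSizes(long fileSize, long blockSize) {
        int full = (int) (fileSize / blockSize);
        long rest = fileSize % blockSize;
        long[] blocks = new long[full + (rest > 0 ? 1 : 0)];
        Arrays.fill(blocks, 0, full, blockSize);
        if (rest > 0) blocks[blocks.length - 1] = rest;  // last block may be smaller
        return blocks;
    }
    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // a 200 MB file with a 64 MB block size -> blocks of 64, 64, 64, 8 MB
        for (long b : blockSizes(200 * mb, 64 * mb)) System.out.println(b / mb);
    }
}
```

Note the last block occupies only as much space as it needs; it is not padded to the full block size.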
12. Replicating Blocks
The default replication factor is 3
It can be changed per file or directory
It can be increased for “hot” datasets (temporarily or permanently)
Trade-off between
Reliability, availability, performance
Disk space
13. Master And Slaves
The Master node keeps and manages all metadata information
The Slave nodes store blocks of data and serve them to the client
Master node (called NameNode)
Slave nodes (called DataNodes)
14. Classical* HDFS Cluster
*no NameNode HA, no HDFS Replication
NameNode: manages metadata
Secondary NameNode: does some “house-keeping” operations for the NameNode
DataNodes: store and retrieve blocks of data
15. HDFS NameNode
Performs all the metadata-related operations
Keeps information in RAM (for fast lookup)
The filesystem tree
Metadata for all files/directories (e.g. ownership, permissions)
Names and locations of blocks
Metadata (not all) is additionally stored on disk(s) (for reliability)
The filesystem snapshot (fsimage) + editlog (edits) files
16. HDFS DataNode
Stores and retrieves blocks of data
Data is stored as regular files on a local filesystem (e.g. ext4)
e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to the NN to say that it is still alive
Sends a block report to the NN periodically
17. HDFS Secondary NameNode
NOT a failover NameNode
Periodically merges a prior snapshot (fsimage) and editlog(s) (edits)
Fetches current fsimage and edits files from the NameNode
Applies edits to fsimage to create the up-to-date fsimage
Then sends the up-to-date fsimage back to the NameNode
The frequency of this operation is configurable
Reduces the NameNode startup time
Prevents the edits file from growing too large
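The checkpoint idea can be sketched in a few lines. This is a toy model only: the real fsimage/edits formats are binary and far richer, and the structures and operation names here are invented for illustration.

```java
import java.util.*;

public class Checkpoint {
    // apply an edit log to a snapshot to produce the up-to-date snapshot
    static Map<String, String> merge(Map<String, String> fsimage, List<String[]> edits) {
        Map<String, String> updated = new HashMap<>(fsimage);
        for (String[] edit : edits) {
            switch (edit[0]) {
                case "ADD":    updated.put(edit[1], edit[2]); break;  // path -> metadata
                case "DELETE": updated.remove(edit[1]);       break;
            }
        }
        return updated;  // this becomes the new fsimage; edits can be truncated
    }
    public static void main(String[] args) {
        Map<String, String> fsimage = new HashMap<>(Map.of("/a.txt", "owner=kawaa"));
        List<String[]> edits = List.of(
            new String[]{"ADD", "/b.txt", "owner=kawaa"},
            new String[]{"DELETE", "/a.txt"});
        System.out.println(merge(fsimage, edits).keySet());  // [/b.txt]
    }
}
```

Replaying a short edit log against a recent snapshot is exactly why the NameNode starts up faster after a checkpoint.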
18. Exemplary HDFS Commands
hadoop fs -ls -R /user/kawaa
hadoop fs -cat /toplist/2013-05-15/poland.txt | less
hadoop fs -put logs.txt /incoming/logs/user
hadoop fs -count /toplist
hadoop fs -chown kawaa:kawaa /toplist/2013-05-15/poland.avro
It is distributed, but it gives you a beautiful abstraction!
19. Reading A File From HDFS
Block data is never sent through the NameNode
The NameNode redirects a client to an appropriate DataNode
The NameNode chooses a DataNode that is as “close” as possible
$ hadoop fs -cat /toplist/2013-05-15/poland.txt
Block locations come from the NameNode
The data itself flows from the DataNodes to the client
20. HDFS Network Topology
Network topology defined by an administrator in a supplied script
Converts an IP address into a path to a rack (e.g. /dc1/rack1)
A path is used to calculate distance between nodes
Image source: “Hadoop: The Definitive Guide” by Tom White
21. HDFS Block Placement
Pluggable (default in BlockPlacementPolicyDefault.java)
1st replica on the same node where the writer is located
Otherwise a “random” (but not too “busy” or almost full) node is used
2nd and 3rd replicas on two different nodes in a different rack
The rest are placed on random nodes
No DataNode with more than one replica of any block
No rack with more than two replicas of the same block (if possible)
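A simplified sketch of the three-replica rules above (node and rack names are invented; the real policy also weighs load, free space, and randomization):

```java
import java.util.*;

public class Placement {
    record Node(String name, String rack) {}

    // choose 3 replica locations for a block written from the given node
    static List<Node> place(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                                    // 1st: writer's node
        List<Node> offRack = cluster.stream()
            .filter(n -> !n.rack().equals(writer.rack())).toList();
        Node second = offRack.get(0);                            // 2nd: a different rack
        replicas.add(second);
        replicas.add(offRack.stream()                            // 3rd: same rack as 2nd,
            .filter(n -> n.rack().equals(second.rack())          //      different node
                      && !n.equals(second))
            .findFirst().orElseThrow());
        return replicas;
    }
    public static void main(String[] args) {
        List<Node> cluster = List.of(
            new Node("dn1", "rack1"), new Node("dn2", "rack1"),
            new Node("dn3", "rack2"), new Node("dn4", "rack2"));
        place(cluster.get(0), cluster)
            .forEach(n -> System.out.println(n.name() + " on " + n.rack()));
    }
}
```

The result honors both constraints: no DataNode holds two replicas, and no rack holds more than two.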
22. HDFS Balancer
Moves blocks from over-utilized DNs to under-utilized DNs
Stops when HDFS is balanced: the utilization of every DN differs from the utilization of the cluster by no more than a given threshold
Maintains the block placement policy
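The stop condition can be sketched directly (utilizations here are fractions of used space, and the sample numbers are invented):

```java
import java.util.Arrays;

public class BalancerCheck {
    // balanced iff every DN's utilization is within `threshold` of the cluster's
    static boolean isBalanced(double[] nodeUtil, double threshold) {
        double cluster = Arrays.stream(nodeUtil).average().orElse(0);
        for (double u : nodeUtil)
            if (Math.abs(u - cluster) > threshold) return false;
        return true;
    }
    public static void main(String[] args) {
        // one DN at 90% while the cluster average is 65% -> not balanced
        System.out.println(isBalanced(new double[]{0.50, 0.55, 0.90}, 0.10));  // false
        System.out.println(isBalanced(new double[]{0.50, 0.55, 0.60}, 0.10));  // true
    }
}
```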
25. HDFS Block
Question
Why does a block itself NOT know
which file it belongs to?
Answer
Design decision → simplicity, performance
Filename, permissions, ownership etc might change
It would require updating all block replicas that belong to a file
27. HDFS Metadata
Question
Why does the NN NOT store information
about block locations on disk?
Answer
Design decision → simplicity
They are sent by DataNodes as block reports periodically
Locations of block replicas may change over time
A change in IP address or hostname of DataNode
Balancing the cluster (moving blocks between nodes)
Moving disks between servers (e.g. failure of a motherboard)
29. HDFS Replication
Question
How many files represent a block replica in HDFS?
Answer
Actually, two files:
The first file for the data itself
The second file for the block’s metadata
Checksums for the block data (by default less than 1% of the actual data)
The block’s generation stamp
30. HDFS Block Placement
Question
Why does the default block placement strategy NOT take disk
space utilization (%) into account?
It only checks whether a node
a) has enough disk space to write a block, and
b) does not serve too many clients ...
31. HDFS Block Placement
Question
Why does the default block placement strategy NOT take disk
space utilization (%) into account?
Answer
Some DataNodes might become overloaded by incoming data
e.g. a newly added node to the cluster
33. HDFS And Local File System
Runs on top of a native file system (e.g. ext3, ext4, xfs)
HDFS is simply a Java application that uses a native file system
34. HDFS Data Integrity
HDFS detects corrupted blocks
When writing
Client computes the checksums for each block
Client sends checksums to a DN together with data
When reading
Client verifies the checksums when reading a block
If verification fails, the NN is notified about the corrupt replica
Then the client fetches a different replica from another DN
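The verification step can be sketched with a CRC32 checksum. This is a simplification: HDFS actually checksums every small chunk of a block (512 bytes by default), not the block as a whole.

```java
import java.util.zip.CRC32;

public class BlockChecksum {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();
        long stored = checksum(block);          // computed by the writer, stored by the DN
        block[0] ^= 0x1;                        // simulate a corrupted replica on disk
        boolean ok = checksum(block) == stored; // the reader's verification
        System.out.println(ok ? "replica ok" : "corrupt replica, notify NN");
    }
}
```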
35. HDFS NameNode Scalability
Stats based on Yahoo! Clusters
An average file ≈ 1.5 blocks (block size = 128 MB)
An average file ≈ 600 bytes in RAM (1 file and 2 blocks objects)
100M files ≈ 60 GB of metadata
1 GB of metadata ≈ 1 PB physical storage (but usually less*)
*Sadly, based on practical observations, the block to file ratio tends to
decrease during the lifetime of the cluster
Dekel Tankel, Yahoo!
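A back-of-the-envelope check of the numbers above (600 bytes per file is the slide's observed average, not an HDFS constant; the slide's "60 GB" uses decimal gigabytes):

```java
public class NameNodeRam {
    // RAM needed for metadata, in binary GiB
    static double metadataGb(long files, long bytesPerFile) {
        return files * bytesPerFile / (1024.0 * 1024 * 1024);
    }
    public static void main(String[] args) {
        // 100M files x 600 bytes = 6e10 bytes ~ 56 GiB (60 GB decimal)
        System.out.printf("100M files ~ %.1f GiB of metadata%n",
                          metadataGb(100_000_000L, 600));
    }
}
```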
36. HDFS NameNode Performance
Read/write operations throughput limited by one machine
~120K read ops/sec
~6K write ops/sec
MapReduce tasks are also HDFS clients
Internal load increases as the cluster grows
More block reports and heartbeats from DataNodes
More MapReduce tasks sending requests
Bigger snapshots transferred from/to Secondary NameNode
37. HDFS Main Limitations
Single NameNode
Keeps all metadata in RAM
Performs all metadata operations
Becomes a single point of failure (SPOF)
38. HDFS Main Improvements
Introduce multiple NameNodes in the form of:
HDFS Federation
HDFS High Availability (HA)
Find more: http://slidesha.re/15zZlet
41. Usually it means some kind of disk failure
$ ls /disk/hd12/
ls: reading directory /disk/hd12/: Input/output error
org.apache.hadoop.util.DiskChecker$DiskErrorException:
Too many failed volumes - current valid volumes: 11,
volumes configured: 12, volumes failed: 1, Volume failures tolerated: 0
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
Increase dfs.datanode.failed.volumes.tolerated
to avoid expensive block replication when a disk fails
(and just monitor failed disks)
44. Problem
A user cannot run resource-intensive Hive queries
It happened immediately after expanding the cluster
45. Description
The queries are valid
The queries are resource-intensive
The queries run successfully on a small dataset
But they fail on a large dataset
Surprisingly, they run successfully under other user accounts
The user has the right permissions to HDFS directories and Hive tables
46. The NameNode is throwing thousands of warnings and exceptions
14,592 times in only 8 minutes (4,768/min at peak)
47. Normally
Hadoop is a very trusting elephant
The username comes from the client machine (and is not verified)
The groupname is resolved on the NameNode server
Using the shell command ''id -Gn <username>''
If a user does not have an account on the NameNode server
The ExitCodeException exception is thrown
48. Possible Fixes
Create a user account on the NameNode server (dirty, insecure)
Use AD/LDAP for user-group resolution
hadoop.security.group.mapping.ldap.* settings
If you also need full authentication, deploy Kerberos
49. Our Fix
We decided to use LDAP for user-group resolution
However, LDAP settings in Hadoop did not work for us
Because posixGroup is not a supported filter group class in
hadoop.security.group.mapping.ldap.search.filter.group
We found a workaround using nsswitch.conf
50. Lesson Learned
Know who is going to use your cluster
Know who is abusing the cluster (HDFS access and MR jobs)
Parse the NameNode logs regularly
Look for FATAL, ERROR, Exception messages
Especially before and after expanding the cluster
52. MapReduce Model
Programming model inspired by functional programming
map() and reduce() functions processing <key, value> pairs
Useful for processing very large datasets in a distributed way
Simple, but very expressive
55. MapReduce Job
Input data is divided into splits and converted into <key, value> pairs
Invokes the map() function multiple times
Invokes the reduce() function multiple times
Keys are sorted, values are not (but could be)
56. MapReduce Example: ArtistCount
Artist, Song, Timestamp, User
Key is the offset of the line from the beginning of the file
We could specify which artist goes to which reducer
(HashPartitioner is the default one)
57. MapReduce Example: ArtistCount
map(Integer key, EndSong value, Context context):
    context.write(value.artist, 1)

reduce(String key, Iterator<Integer> values, Context context):
    int count = 0
    for each v in values:
        count += v
    context.write(key, count)

Pseudo-code in a non-existing language ;)
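The pseudo-code can be rendered as runnable Java in a single process. This only mimics the map → shuffle → reduce flow; a real job would use Hadoop's Mapper/Reducer classes, and the EndSong records and sample data here are invented:

```java
import java.util.*;
import java.util.stream.*;

public class ArtistCount {
    // map: emit <artist, 1>; shuffle: group by key (sorted); reduce: sum per artist
    static Map<String, Long> count(List<String[]> endSongs) {
        return endSongs.stream()
            .collect(Collectors.groupingBy(r -> r[0], TreeMap::new, Collectors.counting()));
    }
    public static void main(String[] args) {
        // input records: Artist, Song, Timestamp, User
        List<String[]> endSongs = List.of(
            new String[]{"Coldplay", "Yellow",  "2013-05-15", "kawaa"},
            new String[]{"Coldplay", "Clocks",  "2013-05-15", "kawaa"},
            new String[]{"Adele",    "Skyfall", "2013-05-15", "user2"});
        count(endSongs).forEach((artist, c) -> System.out.println(artist + "\t" + c));
        // Adele    1
        // Coldplay 2
    }
}
```

The TreeMap plays the role of the sort phase: reduce sees keys in sorted order, while the order of values within a key is not guaranteed.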
59. Data Locality in HDFS and MapReduce
By default, three replicas should be available somewhere on the cluster
Ideally, the Mapper code is sent to a node that has a replica of this block
60. MapReduce Implementation
Batch processing system
Automatic parallelization and distribution of computation
Fault-tolerance
Deals with all messy details related to distributed processing
Relatively easy to use for programmers
Java API, Streaming (Python, Perl, Ruby …)
Apache Pig, Apache Hive, (S)Crunch
Status and monitoring tools
62. JobTracker Responsibilities
Manages the computational resources
Available TaskTrackers, map and reduce slots
Schedules all user jobs
Schedules all tasks that belong to a job
Monitors tasks executions
Restarts failed tasks and speculatively runs slow ones
Calculates job counters totals
63. TaskTracker Responsibilities
Runs map and reduce tasks
Reports to JobTracker
Heartbeats saying that it is still alive
Number of free map and reduce slots
Task progress, status, counters etc
65. MapReduce Job Submission
The job resources are copied with a higher replication factor (by default, 10)
Image source: “Hadoop: The Definitive Guide” by Tom White
Tasks are started in a separate JVM to isolate user code from Hadoop code
66. MapReduce: Sort And Shuffle Phase
Map phase
Reduce phase
Other map tasks
Image source: “Hadoop: The Definitive Guide” by Tom White
Other reduce tasks
67. MapReduce: Partitioner
Specifies which Reducer should get a given <key, value> pair
Aim for an even distribution of the intermediate data
Skewed data may overload a single reducer
And make the job run much longer
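The default HashPartitioner boils down to one line: the non-negative hash of the key modulo the number of reducers. A minimal sketch (sample keys invented):

```java
public class HashPartitionerSketch {
    // mirrors the default HashPartitioner logic
    static int getPartition(String key, int numReduceTasks) {
        // masking with Integer.MAX_VALUE keeps the hash non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
    public static void main(String[] args) {
        int reducers = 3;
        for (String artist : new String[]{"Coldplay", "Adele", "Radiohead"})
            System.out.println(artist + " -> reducer " + getPartition(artist, reducers));
    }
}
```

All pairs with the same key land on the same reducer; a custom Partitioner replaces only this function, which is how skewed keys can be spread more evenly.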
68. Speculative Execution
Scheduling a redundant copy of the remaining, long-running task
The output from the one that finishes first is used
The other one is killed, since it is no longer needed
An optimization, not a feature to make jobs run more reliably
69. Speculative Execution
Enable it, if tasks often experience “external” problems, e.g. hardware
degradation (disk, network card), system problems, memory
unavailability...
Otherwise
Speculative execution can reduce overall throughput
Redundant tasks run at a similar speed to non-redundant ones
Might help one job, while all the others have to wait longer for slots
Redundantly running reduce tasks will
transfer all intermediate data over the network
write their output redundantly (for a moment) to a directory in HDFS
71. Java API
Very customizable
Input/Output Format, Record Reader/Writer,
Partitioner, Writable, Comparator …
Unit testing with MRUnit
HPROF profiler can give a lot of insights
Reuse objects (especially keys and values) when possible
Split String efficiently e.g. StringUtils instead of String.split
More about the Hadoop Java API:
http://slidesha.re/1c50IPk
72. MapReduce Job Configuration
Tons of configuration parameters
Input split size (~implies the number of map tasks)
Number of reduce tasks
Available memory for tasks
Compression settings
Combiner
Partitioner
and more...
74. MapReduce Input <Key, Value> Pairs
Question
Why is each line in a text file, by default, converted to
<offset, line> instead of <line_number, line>?
75. MapReduce Input <Key, Value> Pairs
Question
Why is each line in a text file, by default, converted to
<offset, line> instead of <line_number, line>?
Answer
If your lines are not fixed-width,
you need to read the file from the beginning to the end
to find the line_number of each line (thus it is not parallelizable).
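A small sketch of why offsets are the cheap choice: a reader can compute them while scanning only its own split, since each line's offset is just the running byte count (sample lines invented):

```java
public class LineOffsets {
    // byte offset of each line, assuming '\n' line endings
    static long[] offsets(String[] lines) {
        long[] offs = new long[lines.length];
        long offset = 0;
        for (int i = 0; i < lines.length; i++) {
            offs[i] = offset;
            offset += lines[i].getBytes().length + 1;  // +1 for the '\n'
        }
        return offs;
    }
    public static void main(String[] args) {
        String[] lines = "first line\nsecond line\nthird line".split("\n");
        long[] offs = offsets(lines);
        for (int i = 0; i < lines.length; i++)
            System.out.println("<" + offs[i] + ", \"" + lines[i] + "\">");
        // <0, "first line">  <11, "second line">  <23, "third line">
    }
}
```

A line_number key, by contrast, depends on every line that came before the split, so it cannot be computed independently per split.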
77. “I noticed that a bottleneck seems to be coming from the map tasks.
Is there any reason that we can't open any of the allocated reduce slots
to map tasks?”
Regards,
Chris
How to hard-code the number of map and reduce slots efficiently?
78. This may change again soon ...
We initially started with 60/40
But today we are closer to something like 70/30
Time Spent In Occupied Slots
Occupied Map And Reduce Slots
79. We are currently introducing a new feature to Luigi
Automatic settings of
Maximum input split size (~implies the number of map tasks)
Number of reduce tasks
More settings soon (e.g. size of map output buffer)
The goal is each task running 5-15 minutes on average
Because even a perfect manual setting
may become outdated as the input size grows
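The idea behind such auto-tuning can be sketched as simple arithmetic: pick a split size so that each map task runs close to a target duration. The throughput and target numbers below are invented for illustration; they are not Luigi's actual logic.

```java
public class AutoSplitSize {
    // split size so that one task reading at the given rate runs targetSeconds
    static long splitSizeBytes(long bytesPerSecPerTask, long targetSeconds) {
        return bytesPerSecPerTask * targetSeconds;
    }
    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long throughput = 10 * mb;                         // assume ~10 MB/s per map task
        long split = splitSizeBytes(throughput, 10 * 60);  // aim for ~10-minute tasks
        System.out.println(split / mb + " MB per split");  // 6000 MB per split
        long inputSize = 600L * 1024 * mb;                 // a 600 GB input
        System.out.println(inputSize / split + " map tasks");  // 102 map tasks
    }
}
```

Because the split size is derived from measured throughput and a target duration, it adapts as the input grows, unlike a hard-coded setting.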
80. The current PoC ;)

type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec
old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec       | 5hrs, 20mins, 1sec
new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec        | 1hrs, 18mins, 29sec

It should help in extreme cases
short-lived maps
short-lived and long-lived reduces
83. Only 1 task failed, but 2x more tasks were killed than were completed
85. Logs show that the JobTracker gets a request to kill the tasks
Who actually can send a kill request?
User (using e.g. mapred job -kill-task)
JobTracker (a speculative duplicate, or when a whole job fails)
Fair Scheduler
Diplomatically, it's called “preemption”
86. Key Observations
Killed tasks came from ad-hoc and resource-intensive Hive queries
Tasks are usually killed quickly after they start
Surviving tasks are running fine for long time
Hive queries are running in their own Fair Scheduler's pool
87. Eureka!
FairScheduler has a license to kill!
It preempts the newest tasks in an over-share pool
to forcibly make some room for starving pools
88. Hive pool was running over its minimum and fair shares
Other pools were running under their minimum and fair shares
As a result,
Fair Scheduler was (legally) killing Hive tasks from time to time
Fair Scheduler can kill to be KIND...
89. Possible Fixes
Disable the preemption
Tune minimum shares based on your workload
Tune preemption timeouts based on your workload
Limit the number of map/reduce tasks in a pool
Limit the number of jobs in a pool
Switch to Capacity Scheduler
90. Lessons Learned
A scheduler should NOT be treated as a ''black box''
It is so easy to implement a long-running Hive query