Apache Hadoop
In Theory And Practice
Adam Kawa

Data Engineer @ Spotify
Why Data?
Get insights to offer a better product
“More data usually beats better algorithms”
Get insights to make better decisions
Avoid “guesstimates”
Take a competitive advantage
What Is Challenging?
Store data reliably
Analyze data quickly
In a cost-effective way
Use an expressive, high-level language
Fundamental Ideas
A big system of machines, not a big machine
Failures will happen
Move computation to data, not data to computation
Write complex code only once, but right

A system of multiple animals
Apache Hadoop
Open-source Java software
Storing and processing of very large data sets
Clusters of commodity machines
A simple programming model
Apache Hadoop
Two main components:
HDFS - a distributed file system
MapReduce – a distributed processing layer

Many other tools belong to “Apache Hadoop Ecosystem”
Component: HDFS
The Purpose Of HDFS
Store large datasets in a distributed, scalable and fault-tolerant way
High throughput
Very large files
Streaming reads and writes (no edits)
Write once, read many times
It is like a big truck to move heavy stuff (not a Ferrari)
HDFS Mis-Usage
Do NOT use, if you have
Low-latency requests
Random reads and writes
Lots of small files
Then better to consider
RDBMSs,
File servers,
HBase or Cassandra...
Splitting Files And Replicating Blocks
Split a very large file into smaller (but still large) blocks
Store them redundantly on a set of machines
Splitting Files Into Blocks

The default block size is 64MB
Minimizes the overhead of a disk seek operation (less than 1%)
Today, 128MB or 256MB is recommended
A file is just “sliced” into chunks after each 64MB (or so)
It does NOT matter whether it is text/binary, compressed or not
It does matter later (when reading the data)
Replicating Blocks
The default replication factor is 3
It can be changed per file or directory (see the sketch below)
It can be increased for “hot” datasets (temporarily or permanently)
Trade-off between
Reliability, availability, performance
Disk space
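A minimal sketch of changing the replication factor programmatically, assuming the Java FileSystem API and a hypothetical path (the same can be done from the shell with hadoop fs -setrep):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Temporarily keep 5 replicas of a "hot" dataset (the path is hypothetical)
    fs.setReplication(new Path("/toplist/2013-05-15/poland.txt"), (short) 5);
  }
}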
Master And Slaves
The Master node keeps and manages all metadata information
The Slave nodes store blocks of data and serve them to the client
Master node (called NameNode)

Slave nodes (called DataNodes)
Classical* HDFS Cluster
*no NameNode HA, no HDFS Federation
NameNode: manages metadata
Secondary NameNode: does some “house-keeping” operations for the NameNode
DataNodes: store and retrieve blocks of data
HDFS NameNode
Performs all the metadata-related operations
Keeps information in RAM (for fast lookup)
The filesystem tree
Metadata for all files/directories (e.g. ownership, permissions)
Names and locations of blocks
Metadata (not all) is additionally stored on disk(s) (for reliability)
The filesystem snapshot (fsimage) + editlog (edits) files
HDFS DataNode
Stores and retrieves blocks of data
Data is stored as regular files on a local filesystem (e.g. ext4)
e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to the NN to say that it is still alive
Sends a block report to the NN periodically
HDFS Secondary NameNode
NOT a failover NameNode
Periodically merges a prior snapshot (fsimage) and editlog(s) (edits)
Fetches current fsimage and edits files from the NameNode
Applies edits to fsimage to create the up-to-date fsimage
Then sends the up-to-date fsimage back to the NameNode
We can configure frequency of this operation
Reduces the NameNode startup time
Prevents edits from becoming too large
Exemplary HDFS Commands
hadoop fs -ls -R /user/kawaa
hadoop fs -cat /toplist/2013-05-15/poland.txt | less
hadoop fs -put logs.txt /incoming/logs/user
hadoop fs -count /toplist
hadoop fs -chown kawaa:kawaa /toplist/2013-05-15/poland.avro

It is distributed, but it gives you a beautiful abstraction!
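The same abstraction is also available programmatically through the Java FileSystem API; a minimal sketch, reusing the hypothetical paths from the commands above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListAndCat {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Like 'hadoop fs -ls /user/kawaa'
    for (FileStatus status : fs.listStatus(new Path("/user/kawaa"))) {
      System.out.println(status.getPath());
    }
    // Like 'hadoop fs -cat /toplist/2013-05-15/poland.txt'
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/toplist/2013-05-15/poland.txt"))));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}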
Reading A File From HDFS
Block data is never sent through the NameNode
The NameNode redirects a client to an appropriate DataNode
The NameNode chooses a DataNode that is as “close” as possible
$ hadoop fs -cat /toplist/2013-05-15/poland.txt

Block locations come from the NameNode; lots of data comes from the DataNodes to the client
HDFS Network Topology
Network topology defined by an administrator in a supplied script
Converts an IP address into a path to a rack (e.g. /dc1/rack1)
A path is used to calculate distance between nodes

Image source: “Hadoop: The Definitive Guide” by Tom White
HDFS Block Placement
Pluggable (default in BlockPlacementPolicyDefault.java)
1st replica on the same node where a writer is located
Otherwise “random” (but not too “busy” or almost full) node is used
2nd and the 3rd replicas on two different nodes in a different rack
The rest are placed on random nodes
No DataNode with more than one replica of any block
No rack with more than two replicas of the same block (if possible)
HDFS Balancer
Moves blocks from over-utilized DNs to under-utilized DNs
Stops when HDFS is balanced: the utilization of every DN differs
from the utilization of the cluster by no more than a given threshold
Maintains the block placement policy
Questions: HDFS
HDFS Block
Question
Why a block itself does NOT know
which file it belongs to?
HDFS Block
Question
Why a block itself does NOT know
which file it belongs to?
Answer
Design decision → simplicity, performance
Filename, permissions, ownership etc might change
It would require updating all block replicas that belong to a file
HDFS Metadata
Question
Why NN does NOT store information
about block locations on disks?
HDFS Metadata
Question
Why NN does NOT store information
about block locations on disks?
Answer
Design decision → simplicity
They are sent by DataNodes as block reports periodically
Locations of block replicas may change over time
A change in IP address or hostname of DataNode
Balancing the cluster (moving blocks between nodes)
Moving disks between servers (e.g. failure of a motherboard)
HDFS Replication
Question
How many files represent a block replica in HDFS?
HDFS Replication
Question
How many files represent a block replica in HDFS?
Answer
Actually, two files:
The first file for data itself
The second file for block’s metadata
Checksums for the block data (by default less than 1% of the actual data)
The block’s generation stamp
HDFS Block Placement
Question
Why does the default block placement strategy NOT take the disk
space utilization (%) into account?

It only checks if a node
a) has enough disk space to write a block, and
b) does not serve too many clients ...
HDFS Block Placement
Question
Why does the default block placement strategy NOT take the disk
space utilization (%) into account?
Answer
Some DataNodes might become overloaded by incoming data
e.g. a newly added node to the cluster
Facts: HDFS
HDFS And Local File System
Runs on top of a native file system (e.g. ext3, ext4, xfs)
HDFS is simply a Java application that uses a native file system
HDFS Data Integrity
HDFS detects corrupted blocks
When writing
Client computes the checksums for each block
Client sends checksums to a DN together with data
When reading
Client verifies the checksums when reading a block
If verification fails, NN is notified about the corrupt replica
Then a DN fetches a different replica from another DN
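Checksum verification happens transparently in the HDFS client when reading; a minimal sketch that only makes the mechanism visible (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/toplist/2013-05-15/poland.txt");
    // Reads done via fs.open(path) verify checksums automatically;
    // fs.setVerifyChecksum(false) would skip the client-side verification.
    FileChecksum checksum = fs.getFileChecksum(path);
    System.out.println(path + " => " + checksum);
  }
}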
HDFS NameNode Scalability
Stats based on Yahoo! Clusters
An average file ≈ 1.5 blocks (block size = 128 MB)
An average file ≈ 600 bytes in RAM (1 file and 2 blocks objects)
100M files ≈ 60 GB of metadata
1 GB of metadata ≈ 1 PB physical storage (but usually less*)
*Sadly, based on practical observations, the block to file ratio tends to
decrease during the lifetime of the cluster
Dekel Tankel, Yahoo!
HDFS NameNode Performance
Read/write operations throughput limited by one machine
~120K read ops/sec
~6K write ops/sec
MapReduce tasks are also HDFS clients
Internal load increases as the cluster grows
More block reports and heartbeats from DataNodes
More MapReduce tasks sending requests
Bigger snapshots transferred from/to Secondary NameNode
HDFS Main Limitations
Single NameNode
Keeps all metadata in RAM
Performs all metadata operations
Becomes a single point of failure (SPOF)
HDFS Main Improvements
Introduce multiple NameNodes in form of:
HDFS Federation
HDFS High Availability (HA)
Find more: http://slidesha.re/15zZlet
In practice: HDFS
Problem
A DataNode cannot start on a server for some reason
Usually it means some kind of disk failure
$ ls /disk/hd12/
ls: reading directory /disk/hd12/: Input/output error
org.apache.hadoop.util.DiskChecker$DiskErrorException:
Too many failed volumes - current valid volumes: 11,
volumes configured: 12, volumes failed: 1, Volume failures tolerated: 0
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

Increase dfs.datanode.failed.volumes.tolerated
to avoid expensive block replication when a disk fails
(and just monitor failed disks)
It was exciting to see stuff breaking!
In practice: HDFS
Problem

A user can not run resource-intensive Hive queries
It happened immediately after expanding the cluster
Description
The queries are valid
The queries are resource-intensive
The queries run successfully on a small dataset
But they fail on a large dataset
Surprisingly, they run successfully through other user accounts
The user has the right permissions to HDFS directories and Hive tables
The NameNode is throwing thousands of warnings and exceptions
14592 times during only 8 minutes (4768/min at the peak)
Normally
Hadoop is a very trusty elephant
The username comes from the client machine (and is not verified)
The groupname is resolved on the NameNode server
Using the shell command ''id -Gn <username>''
If a user does not have an account on the NameNode server
The ExitCodeException exception is thrown
Possible Fixes
Create a user account on the NameNode server (dirty, insecure)
Use AD/LDAP for user-group resolution
hadoop.security.group.mapping.ldap.* settings
If you also need full authentication, deploy Kerberos
Our Fix
We decided to use LDAP for user-group resolution
However, LDAP settings in Hadoop did not work for us
Because posixGroup is not a supported filter group class in
hadoop.security.group.mapping.ldap.search.filter.group

We found a workaround using nsswitch.conf
Lesson Learned
Know who is going to use your cluster
Know who is abusing the cluster (HDFS access and MR jobs)
Parse the NameNode logs regularly
Look for FATAL, ERROR, Exception messages
Especially before and after expanding the cluster
Component: MapReduce
MapReduce Model
Programming model inspired by functional programming
map() and reduce() functions processing <key, value> pairs
Useful for processing very large datasets in a distributed way
Simple, but very expressive
Map And Reduce Functions
Map And Reduce Functions: Counting Words
MapReduce Job
Input data is divided into splits and converted into <key, value> pairs
Invokes the map() function multiple times
Invokes the reduce() function multiple times
Keys are sorted, values are not (but could be)
MapReduce Example: ArtistCount
Artist, Song, Timestamp, User

The key is the offset of the line from the beginning of the file
We could specify which artist goes to which reducer
(HashPartitioner is the default one)
MapReduce Example: ArtistCount
map(Integer key, EndSong value, Context context):
    context.write(value.artist, 1)

reduce(String key, Iterator<Integer> values, Context context):
    int count = 0
    for each v in values:
        count += v
    context.write(key, count)
Pseudo-code in
non-existing language ;)
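A sketch of the same job in the Java MapReduce API; since the EndSong record type from the slide is internal, this version assumes plain tab-separated text lines (artist, song, timestamp, user):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArtistCount {

  public static class ArtistMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text artist = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumes lines like: artist <TAB> song <TAB> timestamp <TAB> user
      String[] fields = value.toString().split("\t", -1);
      if (fields.length > 0 && !fields[0].isEmpty()) {
        artist.set(fields[0]);
        context.write(artist, ONE);
      }
    }
  }

  public static class ArtistReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable value : values) {
        count += value.get();
      }
      context.write(key, new IntWritable(count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "artist-count");
    job.setJarByClass(ArtistCount.class);
    job.setMapperClass(ArtistMapper.class);
    job.setCombinerClass(ArtistReducer.class); // counting is associative and commutative
    job.setReducerClass(ArtistReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}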
MapReduce Combiner
Make sure that the Combiner combines quickly and reduces the data enough
(otherwise it only adds overhead)
Data Locality in HDFS and MapReduce
By default, three replicas should be available somewhere on the cluster
Ideally, Mapper code is sent to a node that has the replica of this block
MapReduce Implementation
Batch processing system
Automatic parallelization and distribution of computation
Fault-tolerance
Deals with all messy details related to distributed processing
Relatively easy to use for programmers
Java API, Streaming (Python, Perl, Ruby …)
Apache Pig, Apache Hive, (S)Crunch
Status and monitoring tools
“Classical” MapReduce Daemons
JobTracker: keeps track of TTs, schedules jobs and task executions
TaskTracker: runs map and reduce tasks, reports to the JobTracker
JobTracker Responsibilities
Manages the computational resources
Available TaskTrackers, map and reduce slots
Schedules all user jobs
Schedules all tasks that belong to a job
Monitors tasks executions
Restarts failed and speculatively runs slow tasks
Calculates job counters totals
TaskTracker Responsibilities
Runs map and reduce tasks
Reports to JobTracker
Heartbeats saying that it is still alive
Number of free map and reduce slots
Task progress, status, counters etc
Apache Hadoop Cluster
It can consist of 1, 5, 100, or 4000 nodes
MapReduce Job Submission
Job resources (e.g. the job JAR) are copied with a higher replication factor (by default, 10)

Image source: “Hadoop: The Definitive Guide” by Tom White

Tasks are started in a separate JVM to isolate user code from Hadoop code
MapReduce: Sort And Shuffle Phase
Map phase

Reduce phase

Other map tasks
Image source: “Hadoop: The Definitive Guide” by Tom White

Other reduce tasks
MapReduce: Partitioner
Specifies which Reducer should get a given <key, value> pair
Aim for an even distribution of the intermediate data
Skewed data may overload a single reducer
And make a job run much longer
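A minimal sketch of a custom Partitioner for such a case; the class and the routing rule are hypothetical, and it would be plugged in with job.setPartitionerClass(ArtistPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes one known "hot" artist to a dedicated reducer and hashes the rest
// over the remaining reducers (a hypothetical way to fight skew).
public class ArtistPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions > 1 && "Coldplay".equals(key.toString())) {
      return numPartitions - 1; // the last reducer handles only the hot key
    }
    int buckets = (numPartitions > 1) ? numPartitions - 1 : 1;
    return (key.hashCode() & Integer.MAX_VALUE) % buckets;
  }
}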
Speculative Execution
Scheduling a redundant copy of the remaining, long-running task
The output from the one that finishes first is used
The other one is killed, since it is no longer needed
An optimization, not a feature to make jobs run more reliably
Speculative Execution
Enable if tasks often experience “external” problems, e.g. hardware
degradation (disk, network card), system problems, memory
unavailability...

Otherwise

Speculative execution can reduce overall throughput
Redundant tasks run with similar speed as non-redundant ones
Might help one job, all the others have to wait longer for slots
Redundantly running reduce tasks will
transfer over the network all intermediate data
write their output redundantly (for a moment) to a directory in HDFS
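A sketch of toggling it per job; the property names below are the Hadoop 1.x ones used at the time (newer releases renamed them to mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // keep for maps
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // avoid redundant shuffle and HDFS writes
    Job job = new Job(conf, "speculation-demo");
    // ... set mapper, reducer, input/output paths as usual ...
  }
}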
Facts: MapReduce
Java API
Very customizable
Input/Output Format, Record Reader/Writer,
Partitioner, Writable, Comparator …

Unit testing with MRUnit
HPROF profiler can give a lot of insights
Reuse objects (especially keys and values) when possible
Split String efficiently e.g. StringUtils instead of String.split
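A small sketch of the last two tips (object reuse and efficient splitting), assuming tab-separated input lines:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class ReuseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Created once per task, not once per record
  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // StringUtils.split avoids the regex machinery of String.split
    String[] fields = StringUtils.split(value.toString(), '\t');
    if (fields.length > 0) {
      outKey.set(fields[0]);
      context.write(outKey, ONE);
    }
  }
}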

More about the Hadoop Java API: http://slidesha.re/1c50IPk
MapReduce Job Configuration
Tons of configuration parameters
Input split size (~implies the number of map tasks)
Number of reduce tasks
Available memory for tasks
Compression settings
Combiner
Partitioner
and more...
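A sketch of setting a few of these per job; the values are purely illustrative, the property names are the Hadoop 1.x ones, and the Snappy codec is assumed to be available:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class JobConfigDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
        SnappyCodec.class, CompressionCodec.class);
    Job job = new Job(conf, "job-config-demo");
    // ~Implies the number of map tasks: cap each input split at 512 MB
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    // Number of reduce tasks
    job.setNumReduceTasks(200);
    // ... mapper, reducer, combiner, partitioner, input/output paths as usual ...
  }
}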
Questions: MapReduce
MapReduce Input <Key, Value> Pairs
Question
Why is each line in a text file, by default, converted to
<offset, line> instead of <line_number, line>?
MapReduce Input <Key, Value> Pairs
Question
Why is each line in a text file, by default, converted to
<offset, line> instead of <line_number, line>?
Answer
If your lines are not fixed-width,
you need to read the file from the beginning to the end
to find the line_number of each line (thus it cannot be parallelized).
In practice: MapReduce
“I noticed that a bottleneck seems to be coming from the map tasks.
Is there any reason that we can't open any of the allocated reduce slots
to map tasks?”
Regards,
Chris

How to hard-code the number of map and reduce slots efficiently?
This may change again soon ...

We initially started with 60/40
But today we are closer to something like 70/30
Time Spent In Occupied Slots

Occupied Map And Reduce Slots
We are currently introducing a new feature to Luigi
Automatic settings of
Maximum input split size (~implies the number of map tasks)
Number of reduce tasks
More settings soon (e.g. size of map output buffer)
The goal is each task running 5-15 minutes on average,
because even a perfect manual setting may become outdated
as the input size grows
The current PoC ;)
type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec

type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec       | 5hrs, 20mins, 1sec
new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec        | 1hrs, 18mins, 29sec

It should help in extreme cases
short-living maps
short-living and long-living reduces
In practice: MapReduce
Problem
Surprisingly, Hive queries are running extremely long
Thousands of tasks are constantly being killed
Only 1 task failed, but 2x more tasks were killed than were completed
Logs show that the JobTracker gets a request to kill the tasks
Who actually can send a kill request?
User (using e.g. mapred job -kill-task)
JobTracker (a speculative duplicate, or when a whole job fails)
Fair Scheduler
Diplomatically, it's called “preemption”
Key Observations
Killed tasks came from ad-hoc and resource-intensive Hive queries
Tasks are usually killed quickly after they start
Surviving tasks are running fine for a long time
Hive queries are running in their own Fair Scheduler's pool
Eureka!
FairScheduler has a license to kill!
Preempts the newest tasks in an over-share pool to forcibly make some room for starving pools
Hive pool was running over its minimum and fair shares
Other pools were running under their minimum and fair shares
So
Fair Scheduler was (legally) killing Hive tasks from time to time

Fair Scheduler can kill to be KIND...
Possible Fixes
Disable the preemption
Tune minimum shares based on your workload
Tune preemption timeouts based on your workload
Limit the number of map/reduce tasks in a pool
Limit the number of jobs in a pool
Switch to Capacity Scheduler
Lessons Learned

A scheduler should NOT be considered a ''black box''
It is so easy to implement a long-running Hive query
More about Hadoop Adventures: http://slidesha.re/1ctbTHT
In reality: Hadoop is fun!
Questions?
Stockholm and Sweden
Want to join the band?
Check out spotify.com/jobs or @Spotifyjobs
for more information
kawaa@spotify.com
HakunaMapData.com
Thank you!
