Hadoop Troubleshooting 101
Kate Ting, Cloudera
Ariel Rabkin, Cloudera/UC Berkeley
How to Destroy Your Cluster with 7 Misconfigurations
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
Ticket Breakdown

• No single issue accounted for more than 2% of tickets

• First symptom ≠ root cause

• Bad configuration goes undetected
Breakdown of Tickets by Component
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
What are Misconfigurations?

• Any diagnostic ticket requiring a change to Hadoop or to
  OS config files

• Comprise 35% of tickets

• e.g. resource-allocation: memory, file-handles, disk-space
Problem Tickets by Category and Time
Why Care About Misconfigurations?
The life of an over-subscribed MR/HBase
   cluster is nasty, brutish, and short.
            (with apologies to Thomas Hobbes)
Oversubscription of MR heap caused swap.

Swap caused RegionServer to time out and die.

Dead RegionServer caused MR tasks to fail until MR job died.

Whodunit?

  A Faulty MR Config Killed HBase
Who Are We?

• Kate Ting
  o Customer Operations Engineer, Cloudera
  o 9 months troubleshooting 100+ Hadoop clusters
  o kate@cloudera.com

• Ariel Rabkin
  o Customer Operations Tools Team, Intern, Cloudera
  o UC Berkeley CS PhD candidate
  o Dissertation on diagnosing configuration bugs
  o asrabkin@eecs.berkeley.edu
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
1. Task Out Of Memory Error

FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child : java.lang.OutOfMemoryError: Java heap space
     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:781)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
     at org.apache.hadoop.mapred.Child.main(Child.java:170)
1. Task Out Of Memory Error
• What does it mean?
  o Memory leak in task code

• What causes this?
  o MR task heap sizes will not fit in the node's available RAM

• How can it be resolved?
   o Set io.sort.mb < mapred.child.java.opts (see the example config below)
       e.g. 512M for io.sort.mb and 1G for mapred.child.java.opts
       io.sort.mb = size of the map-side sort buffer; a smaller buffer => more spills
   o Use fewer mappers & reducers
       mappers = # of cores on the node
           if not using HBase
           if # disks = # cores
       4:3 mapper:reducer ratio
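As a minimal sketch of that first fix (property names as used on this slide; the values are just the example figures above, not recommendations for every workload), the mapred-site.xml entries might look like:

<!-- mapred-site.xml (illustrative values only) -->
<property>
  <name>io.sort.mb</name>
  <value>512</value>            <!-- map-side sort buffer in MB; must fit inside the task heap -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1g</value>         <!-- per-task JVM heap; keep io.sort.mb well below this -->
</property>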
Total RAM = (Mappers + Reducers) * Child Task Heap
            + DN heap
            + TT heap
            + 3GB
            + RS heap
            + Other Services' heap
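A worked example with assumed, purely illustrative numbers: on a 48GB worker node running 10 map slots and 6 reduce slots with 1GB child heaps, the budget is (10 + 6) * 1GB + 1GB DN heap + 1GB TT heap + 3GB + 12GB RS heap + 2GB other services = 35GB, which fits with headroom; raising the child heaps to 2GB would push the total to 51GB and over-subscribe the node.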
2. JobTracker Out of Memory Error
ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:122)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:653)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3965)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2. JobTracker Out of Memory Error

• What does it mean?
  o Total JT memory usage > allocated RAM

• What causes this?
  o Tasks too small
  o Too much job history
2. JobTracker Out of Memory Error

• How can it be resolved?
  o Verify with a live-heap histogram: sudo -u mapred jmap -J-d64
    -histo:live <PID of JT>
  o Increase the JT's heap space
  o Separate the JT and NN
  o Decrease
    mapred.jobtracker.completeuserjobs.maximum to 5 (see the example below)
      JT cleans up memory more often, reducing the RAM it needs
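A hedged sketch of that last setting (the property name is from the slide; whether you also raise the JT heap, and by how much, depends on your cluster):

<!-- mapred-site.xml (illustrative) -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>5</value>   <!-- keep fewer completed jobs per user in JT memory -->
</property>

The JT heap itself is typically raised via the -Xmx flag in the JobTracker's JVM options (e.g. HADOOP_JOBTRACKER_OPTS in hadoop-env.sh on MRv1 deployments).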
3. Unable to Create New Native Thread

ERROR mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:234)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:127)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
3. Unable to Create New Native Thread
• What does it mean?
  o DNs show up as dead even though processes are still
    running on those machines

• What causes this?
  o The nproc default of 4200 processes is too low

• How can it be resolved?
  o Adjust /etc/security/limits.conf (see the example below):
      hbase soft/hard nproc 50000
      hdfs soft/hard nproc 50000
      mapred soft/hard nproc 50000
      Restart DN, TT, JT, NN
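In actual limits.conf syntax, soft and hard limits are separate entries (or "-" to set both at once); a minimal sketch using the slide's values:

# /etc/security/limits.conf (illustrative; "-" sets both soft and hard limits)
hbase   -   nproc   50000
hdfs    -   nproc   50000
mapred  -   nproc   50000

After the daemons are restarted under the new limits, a spot check such as running ulimit -u as the hdfs or mapred user should report the raised value.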
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
4. Too Many Fetch-Failures



INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task:
4. Too Many Fetch-Failures

• What does it mean?
  o Reducer fetch operations fail to retrieve mapper outputs
  o Too many fetch failures occur on a particular TT
     o => blacklisted TT

• What causes this?
  o DNS issues
  o Not enough http threads on the mapper side for the
    reducers
  o JVM bug
4. Too Many Fetch-Failures
• How can it be resolved? (see the example config below)
   o mapred.reduce.slowstart.completed.maps = 0.80
       Allows reducers from other jobs to run while a big job waits on
        mappers
   o tasktracker.http.threads = 80
       Specifies # threads used by the TT to serve map output to
        reducers
   o mapred.reduce.parallel.copies = SQRT(NodeCount), with a floor
     of 10
       Specifies # parallel copies used by reducers to fetch map
        output
   o Stop using Jetty 6.1.26, which is fetch-failure prone; upgrade to
     CDH3u2 (MAPREDUCE-2980, MAPREDUCE-3184)
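A sketch of the corresponding mapred-site.xml entries (property names from the slide; the parallel-copies value of 10 assumes a cluster of up to roughly 100 nodes, per the SQRT rule above):

<!-- mapred-site.xml (illustrative) -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>  <!-- reducers start only after 80% of maps finish -->
</property>
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>    <!-- TT threads serving map output to reducers -->
</property>
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>    <!-- SQRT(NodeCount), floor of 10 -->
</property>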
5. Not Able to Place Enough Replicas



WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas
5. Not Able to Place Enough Replicas



• What does it mean?
  o NN is unable to choose some DNs
5. Not Able to Place Enough Replicas
•   What causes this?
    o dfs replication > # avail DNs
        o # avail DNs is low due to low disk space
        o mapred.submit.replication default of 10 is too high
    o NN is unable to satisfy the block placement policy
        o If # racks > 2, a block has to exist in at least 2 racks
    o DN being decommissioned
    o DN has too much load on it
    o Block-size unreasonably large
    o Not enough xcievers threads
        Default 256 threads that DN can manage is too low
        Note the accidental misspelling
5. Not Able to Place Enough Replicas
• How can it be resolved? (see the example below)
  o Set dfs.datanode.max.xcievers = 4096, then restart the DN
  o Look for nodes down (or rack down)
  o Check disk space
     o Log dir may be filling up or a runaway task may fill
       up an entire disk
  o Rebalance
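A minimal sketch of the fix and the follow-up checks (the property name, misspelling included, is from the slide; the commands are the standard CDH3-era HDFS tools):

<!-- hdfs-site.xml (illustrative) -->
<property>
  <name>dfs.datanode.max.xcievers</name>  <!-- yes, the property name really is misspelled -->
  <value>4096</value>
</property>

hadoop dfsadmin -report   # spot dead/decommissioned DNs and low remaining disk space
hadoop balancer           # redistribute blocks once space and nodes are back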
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Exhaustion
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Exhaustion
     • Fetch Failures
     • Replicas
  • Data Space Exhaustion
     • No File
     • Too Many Files
• Cloudera Manager
6. No Such File or Directory

ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:496)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapred.TaskTracker.initializeDirectories(TaskTracker.java:666)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:734)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1431)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3521)
Total Storage = MR space + DFS space
(diagram: each node's disks are shared between MapReduce local space and DFS blocks)
6. No Such File or Directory
• What does it mean?
  o TT failing to start or jobs are failing
• What causes this?
  o Diskspace filling up on the TT disk
  o Permissions on the directories that TT writes to
• How can it be resolved? (see the example below)
    Reserve space for non-DFS use such as the job logging dir:
     dfs.datanode.du.reserved = 10% of the total volume
    Permissions should be 755 and the owner should be
     mapred for userlogs and mapred.local.dir
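A hedged sketch (dfs.datanode.du.reserved is set in bytes per volume; the 1 TB volume size, the local-dir path, and the hadoop group below are assumed placeholders, not values from the slides):

<!-- hdfs-site.xml (illustrative) -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>107374182400</value>  <!-- ~100 GB, i.e. ~10% of an assumed 1 TB volume -->
</property>

# assumed example path for mapred.local.dir
chown -R mapred:hadoop /data/1/mapred/local
chmod 755 /data/1/mapred/local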
7. Too Many Open Files

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(<ip address>:50010,
storageID=<storage id>, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Too many open files
      at sun.nio.ch.IOUtil.initPipe(Native Method)
      at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
      at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
      at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
      at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
      at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
      at java.io.DataInputStream.readShort(DataInputStream.java:295)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:97)
7. Too Many Open Files
• What does it mean?
  o Hitting open file handles limit of user account running
    Hadoop

• What causes this?
  o Nofile default of 1024 files is too low

• How can it be resolved?
  o Adjust /etc/security/limits.conf (see the example below):
      hdfs - nofile 32768
      mapred - nofile 32768
      hbase - nofile 32768
      Restart DN, TT, JT, NN
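The slide's entries are already valid limits.conf syntax ("-" sets both soft and hard limits); spelled out as the file would look, with an assumed verification step:

# /etc/security/limits.conf (illustrative)
hdfs    -   nofile  32768
mapred  -   nofile  32768
hbase   -   nofile  32768

After restarting the daemons, checking /proc/<DataNode PID>/limits ("Max open files") confirms the new limit took effect.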
Agenda
• Ticket Breakdown
• What are Misconfigurations?
  • Memory Mismanagement
     • TT OOME
     • JT OOME
     • Native Threads
  • Thread Mismanagement
     • Fetch Failures
     • Replicas
  • Disk Mismanagement
     • No File
     • Too Many Files
• Cloudera Manager
Additional Resources
• Avoiding Common Hadoop Administration Issues:
  http://www.cloudera.com/blog/2010/08/avoiding-common-hadoop-administration-issues/

• Tips for Improving MapReduce Performance:
  http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/

• Basic Hardware Recommendations:
  http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

• Cloudera Knowledge Base:
  http://ccp.cloudera.com/display/KB/Knowledge+Base
Acknowledgements

Adam Warrington
Amr Awadallah
Angus Klein
Brock Noland
Clint Heath
Eric Sammer
Esteban Gutierrez
Jeff Bean
Joey Echeverria
Laura Gonzalez
Linden Hillenbrand
Omer Trajman
Patrick Angeles
Takeaways

• Configuration is up to you. Bugs are up to
  the community.

• Misconfigurations are hard to diagnose.

• Get it right the first time with Cloudera
  Manager.


Editor's Notes

  1. You are sitting here because you know that misconfigurations and bugs break the most clusters. Fixing bugs is up to the community. Fixing misconfigurations is up to you. I guarantee that if you wait an hour before walking out, you will leave with the know-how to get your configuration right the first time.
  2. Hadoop errors show up far from where you configured them, so you don't know which log files to analyze.
  3. If your cluster is broken, it is probably a misconfiguration. This is a hard problem because the error messages are not tightly tied to the root cause.
  4. @tlipcon: if sum(max heap size) > physical RAM - 3GB, go directly to jail. Do not pass go. Do not collect $200.
  5. Why not fix the defaults? Because they are not static defaults: they are not all Hadoop defaults but also OS defaults.
  6. Resource allocation is hard because the place where you are misallocating is far from where you ran out. E.g., fetch failures could be due to the system, the config, or a bug, but the error message doesn't tell you which, nor which misconfiguration caused it; fetch failures also don't show up on the node where the setting lives.
  7. One large cluster doesn't necessarily show all the different dependencies; Cloudera has seen a lot of diverse operational clusters. Part of what Cloudera does is provide tools to help diagnose and understand how Hadoop operates.
  8. Not biased or anything, but you can get it right the first time with Cloudera Manager.