70a monitoring & troubleshooting
- 2. Monitoring & Troubleshooting
Agenda
• Cluster Monitoring Tools
• Troubleshooting MapReduce Jobs
• Troubleshooting Scenarios
• Working with MapR Support
• Things to Avoid
© 2012 MapR Technologies Troubleshooting 2
- 3. Monitoring & Troubleshooting
Objectives
At the end of this module you will be able to:
• Identify the tools you can use to monitor your cluster
• Explain how MapR central logging can help you monitor MapReduce jobs
• Describe several common troubleshooting scenarios and how to resolve issues based on these scenarios
• List the tools you can use to work with MapR Support
- 5. Monitoring Tools
Built-In Tools
– MapR Control System
– MapR Metrics
3rd Party Tools
– Nagios
– Ganglia
- 6. MapR Control System
MapR Control System
– Dashboard with cluster overview
• Node health
• MapR-FS and available disks
• Resource utilization
– bandwidth
– disk space
– CPU
• MapReduce job status
• Alarms
- 8. MapR Metrics
MapR Metrics
– View performance information about Hadoop jobs
• Predict cluster usage
• Measure which jobs consume resources
• Troubleshoot failures & performance issues
– Metrics provided on
• Cumulative CPU/memory usage
• # of running/failed tasks/attempts
• Speed of input, output, and shuffle
• Duration of task attempts
• Data read, written, or shuffled
• Memory in use
• Number of records skipped/spilled
- 10. 3rd Party Tools
Nagios
– Configuration script generator
Ganglia
– The CLDB reports the metrics
– MapRGangliaContext
– gmond is needed only on the CLDB node
- 11. MapR Service Logs
/opt/mapr/logs
For example:
– CLDB
– Warden
– FileServer (mfs)
– NFS
- 12. Troubleshooting
MapReduce Jobs
- 13. Central Logging
MapR 2.0 introduces central logging
– Log files written to “local” volume on MapR-FS
• replication factor = 1
– I/O confined to node
– /var/mapr/local/<host>/logs/mapred/userlogs
– Configurable via JobTracker variable
• mapr.localvolumes.path
- 14. Central Logging
New CLI for MapReduce logs
maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [-jobconf <pathToJobXml>]
– Creates a job-centric view of all logs on all involved TaskTracker nodes
– Creates the following structure under <maprfsDir> for each <jobid> matching <jobPattern>
• <jobid>/hosts/<host>/
– symbolic links to log directories of tasks executed for <jobid> on <host>
• <jobid>/mappers/
– symbolic links to log directories of all map task attempts for <jobid> across the cluster
• <jobid>/reducers/
– symbolic links to log directories of all reduce task attempts for <jobid> across the cluster
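Once the links exist, the layout described above can be explored with ordinary file tools. The following is a minimal sketch, assuming the view directory is reachable as a normal filesystem path (for example through an NFS mount of the cluster). The function name count_job_logs and its arguments are hypothetical and not part of maprcli.

```shell
# Hypothetical helper: count the entries in a job-centric log view created by
# `maprcli job linklogs`. Assumes <maprfsDir> is visible as a regular path
# on this host (for example via an NFS mount of the cluster).
count_job_logs() {
    view_dir=$1   # <maprfsDir> as seen from this host
    job_id=$2     # one <jobid> directory under it
    for kind in hosts mappers reducers; do
        dir="$view_dir/$job_id/$kind"
        if [ -d "$dir" ]; then
            # Count direct children (host directories or symbolic links)
            n=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
        else
            n=0
        fi
        echo "$kind $n"
    done
}
```

A job whose mappers count is far below the number of map attempts reported for it is a hint that some TaskTracker logs were not linked.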
- 16. Troubleshooting Scenarios
Slow nodes
Out of memory
Out of disk space
Time skew
No ZooKeeper quorum
Contention for resources
Requirements not met
- 17. Identifying Slow Nodes
Before installation:
– Use dd to benchmark read/write speed
• dd bs=4M if=/dev/zero of=/dev/sd<x>
– Compare performance across nodes to test network throughput:
• dd bs=4M if=/dev/zero | sudo ssh root@node 'dd bs=4M of=/dev/foo'
After installation:
– Look at task starting and completion times
– Look in system logs for memory or CPU problems
– Look at the performance of writes to the local volume (where intermediate data goes)
Slow disks are identified based on a threshold in mfs.conf
– The real cause may be a slow NIC
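Comparing task start and completion times per host can be done with a few lines of awk. This is a rough sketch only: the input format (one task attempt per line as host, start time, and end time in epoch seconds) and the function name slow_hosts are assumptions made for illustration, not a MapR log format.

```shell
# Hypothetical input on stdin, one task attempt per line:
#   <host> <start_epoch> <end_epoch>
# Prints hosts whose mean attempt duration exceeds `factor` times the
# mean duration over all attempts, together with that host's mean.
slow_hosts() {
    factor=${1:-1.5}   # assumed threshold; tune for your workload
    awk -v f="$factor" '
        NF == 3 {
            d = $3 - $2
            sum[$1] += d; cnt[$1]++
            total += d; n++
        }
        END {
            if (n == 0) exit
            overall = total / n
            for (h in sum)
                if (sum[h] / cnt[h] > f * overall)
                    printf "%s %.1f\n", h, sum[h] / cnt[h]
        }' | sort
}
```

A host that shows up repeatedly across unrelated jobs is a better candidate for a hardware check than one that is slow for a single job.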
- 18. Out of Memory
Make sure there is enough swap space
See if a memory-intensive job is running
Use ulimit to make sure there are no limits on the number of file descriptors, resource usage, and the number of processes
Garbage collection can result in out-of-memory errors
- 19. Out of Disk Space
MapR logs go to /opt/mapr/logs
– If this partition is too small, space can run out
– Set up a cron job to clean out old logs
– Move to a larger partition
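A cleanup job of the kind mentioned above might look like the following sketch. The retention period, the function name clean_old_logs, and the script path in the crontab comment are assumptions, not MapR defaults. Check which logs are still needed before deleting anything.

```shell
# Hypothetical helper: delete regular files under a log directory that
# have not been modified for more than N days, printing each one removed.
clean_old_logs() {
    log_dir=$1
    max_age_days=${2:-30}   # assumed retention period
    [ -d "$log_dir" ] || return 1
    find "$log_dir" -type f -mtime +"$max_age_days" -print -delete
}

# Example crontab entry, assuming the function is wrapped in a script saved
# at the hypothetical path /usr/local/bin/clean_mapr_logs.sh:
# 0 3 * * * /usr/local/bin/clean_mapr_logs.sh
```

Running it first with -delete removed from the find command shows what would be deleted without touching anything.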
- 20. Time Skew
NTP is your friend
The maximum allowed clock difference between nodes is 20 seconds
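A quick way to check the limit above is to compare epoch timestamps from two nodes. This is a minimal sketch: the function name skew_ok is hypothetical, and the ssh call in the usage comment assumes passwordless ssh to a placeholder host named node2.

```shell
# Hypothetical helper: succeed if two epoch timestamps differ by no more
# than MAX_SKEW seconds.
MAX_SKEW=20   # seconds, the limit stated on this slide

skew_ok() {
    local_ts=$1
    remote_ts=$2
    diff=$((local_ts - remote_ts))
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))   # absolute value
    fi
    [ "$diff" -le "$MAX_SKEW" ]
}

# Usage (node2 is a placeholder host name):
# skew_ok "$(date +%s)" "$(ssh node2 date +%s)" || echo "clock skew too large"
```

The ssh round trip adds to the measured difference, so a result close to the limit deserves a more precise check with the NTP tools.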
- 21. No ZooKeeper Quorum
Not enough ZooKeepers running
configure.sh run improperly
– Different ZooKeeper or CLDB nodes specified
Network problem
– Hostname resolution
– Physical connection down
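ZooKeeper needs a strict majority of the configured ensemble to be running, so the first cause above comes down to simple arithmetic. The helper names below are hypothetical, and how the number of running ZooKeepers is obtained is left to the operator.

```shell
# Smallest number of running ZooKeepers that forms a quorum for an
# ensemble of the given size (strict majority).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum <configured> <running>: succeed if enough servers are up.
has_quorum() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}
```

For example, an ensemble of three tolerates one failed ZooKeeper and an ensemble of five tolerates two.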
- 22. Contention for Resources
Make sure there’s no limit on file descriptors or processes
Make sure the service layout follows good guidelines
– Don’t run ZooKeeper with CLDB or JobTracker
– Fewer task slots when running TaskTracker with CLDB or ZooKeeper
– Avoid running the active JobTracker on the primary CLDB node
Don’t run other random things on cluster nodes
Don’t mix distributions
- 23. Requirements Not Met
Use Sun Java JDK
Same users/groups with same UID/GID numbers on all nodes
Proper licensing
Host resolution between all nodes
– DNS or /etc/hosts
Keyless ssh between all nodes for the root user
All necessary ports open
– Watch out for iptables and selinux
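The UID/GID requirement above can be spot-checked by comparing passwd-format listings from two nodes. This is a sketch under assumptions: the function name uid_gid_mismatches and the file names are hypothetical, and the listings are assumed to have been collected beforehand (for example with getent passwd on each node).

```shell
# Compare two passwd-format files (name:password:uid:gid:...) and print
# every user present in both whose UID or GID differs between them.
# Note: the first file must not be empty.
uid_gid_mismatches() {
    awk -F: '
        NR == FNR { ids[$1] = $3 ":" $4; next }
        ($1 in ids) && ids[$1] != ($3 ":" $4) {
            print $1 ": " ids[$1] " vs " $3 ":" $4
        }' "$1" "$2"
}

# Usage (file names are placeholders):
# uid_gid_mismatches node1.passwd node2.passwd
```

Empty output means every user found in both listings has matching IDs. Users missing from one node are not reported by this sketch.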
- 25. Working with MapR Support
mapr-support-collect and mapr-support-dump
fsck and gfsck
- 27. Things to Avoid
Remove ZooKeeper data manually
Format disks (unless you are sure)
Run configure.sh incorrectly
Use dd on an installed node
Modify configuration files
– Without a good reason
– Inconsistently