4. Use Starfish to…
• Visualize: understand what’s happening
• Optimize: speed up performance
• Strategize: size requirements intelligently
5. Visualize
• See how MapReduce apps are performing
• Understand bottlenecks in Hadoop
• Find misconfigured Hadoop parameters
• Learn to develop better MapReduce apps
6. Optimize
• Tune Hadoop easily with automatic health checks and
recommendations
• Find optimal parameter settings for MapReduce
applications in Java, Streaming, Hive, Pig, and other
languages
7. Strategize
• Make intelligent resource allocation choices for
Hadoop
• Find optimal EC2 instances for workloads
• Meet time and cost budgets with ease
8. Starfish Installation Instructions
• If you have downloaded the Starfish binaries, simply
extract the package into a directory of your choice.
tar -xzf starfish-0.3.0.tar.gz
9. BTrace Installation Instructions (1)
In order to profile the execution of a MapReduce job in a Hadoop
cluster, you must first install the pre-compiled BTrace scripts and jars
(included in Starfish).
1. Set the following global profiling parameters in bin/config.sh:
• SLAVES_BTRACE_DIR: the BTrace installation directory on the slave
nodes. Please specify the full path and ensure you have the
appropriate write permissions. The path will be created if it doesn't
exist.
• CLUSTER_NAME: a descriptive name for the cluster (like test,
production, etc.). Do not include spaces or special characters in the
name.
• PROFILER_OUTPUT_DIR: the local directory in which to place the
collected logs and profile files. Please specify the full path and ensure
you have the appropriate write permissions. The path will be created if
it doesn't exist.
10. BTrace Installation Instructions (2)
vi /starfish-0.3.0/bin/config.sh
# The BTrace install directory on the slave machines
# Specify a FULL path! This setting is required!
# Example: SLAVES_BTRACE_DIR=/root/btrace
SLAVES_BTRACE_DIR=/opt/btrace
# A descriptive name for the cluster, like test, production, etc.
# No spaces or special characters in the name. This setting is required!
CLUSTER_NAME=etu
# The local directory to place the output files
# If left blank, it defaults to the working directory (not recommended)
PROFILER_OUTPUT_DIR=/opt/btrace
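As a minimal sketch, the same three settings can be appended without an interactive editor and then sourced to check that they parse. The CONFIG path and the values below are examples, not Starfish defaults:

```shell
# Sketch: append the three required settings non-interactively.
# CONFIG and the values are examples, not Starfish defaults.
CONFIG=config.sh        # stand-in for starfish-0.3.0/bin/config.sh
cat >> "$CONFIG" <<'EOF'
SLAVES_BTRACE_DIR=/opt/btrace
CLUSTER_NAME=etu
PROFILER_OUTPUT_DIR=/opt/btrace
EOF
. "./$CONFIG"           # source the file to verify the values parse
echo "$CLUSTER_NAME"    # prints: etu
```

Sourcing the file is a quick sanity check that none of the values contain stray spaces or unbalanced quotes before the profiler reads them.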
11. BTrace Installation Instructions (3)
2. Install BTrace using the provided bin/install_btrace.sh
script from the master node in the cluster. The sole input
to the script is the path to a file containing the hostnames
or IP addresses of the slave nodes in the cluster.
./bin/install_btrace.sh slave.txt
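The slaves file is plain text with one hostname or IP address per line. A minimal sketch of creating it (the hostnames and address below are placeholders):

```shell
# Write a slaves file: one hostname or IP address per line
# (the names and address are placeholders for your cluster's slaves)
printf '%s\n' slave1.example.com slave2.example.com 192.168.1.30 > slave.txt
wc -l < slave.txt    # 3 entries, one per slave node
```

install_btrace.sh then copies the BTrace jars to SLAVES_BTRACE_DIR on each listed node.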
18. Job Optimization
Job Optimization on a Live Hadoop Cluster
./bin/optimize mode job_id hadoop jar jarFile args...
The expected parameters are identical to the parameters required by
${HADOOP_HOME}/bin/hadoop. The examples below optimize a WordCount
MapReduce program.
Print on the console the configuration parameter settings suggested by the Cost-based
Optimizer for a WordCount MapReduce job:
./bin/optimize recommend job_2010030839_0000 hadoop jar
contrib/examples/hadoop-starfish-examples.jar wordcount
/input/path /output/path
Execute a WordCount MapReduce job using the configuration parameter settings
automatically suggested by the Cost-based Optimizer:
./bin/optimize run job_2010030839_0000 hadoop jar
contrib/examples/hadoop-starfish-examples.jar wordcount
/input/path /output/path
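Since the recommend and run invocations differ only in the mode argument, a small helper can assemble the command line. The function below is illustrative, not part of Starfish:

```shell
# Illustrative helper: assemble an optimize command line for a given mode.
# "recommend" prints the suggested settings; "run" executes the job with them.
build_optimize_cmd() {
  mode="$1"; job_id="$2"; shift 2
  echo "./bin/optimize $mode $job_id hadoop jar $*"
}

# Everything after the jar file is passed through untouched, exactly as
# ${HADOOP_HOME}/bin/hadoop would expect it.
build_optimize_cmd recommend job_2010030839_0000 \
  contrib/examples/hadoop-starfish-examples.jar wordcount /input/path /output/path
```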