Technology Evangelist. Love Art and Technology. Swathi is a DZone MVB and is not an employee of DZone and has posted 15 posts at DZone.

Starfish : A Hadoop Performance Tuning Tool

It's been a long time since I've blogged... a lapse of 3-4 months or so... :( Well, I thought of writing about an awesome tool for performance tuning in Hadoop called “Starfish”.

What is Starfish?
Starfish is a Self-tuning System For Big Data Analytics. It's an open source project hosted at GitHub.
Github Link: (If 404, not sure why!?)

What is the need for Starfish?
Need for Performance!!

What does it do and what are its components?
It enables Hadoop users and applications to get good performance automatically.
It has three main components.
1. Profiler
2. What-if Engine
3. Optimizer

1. Job Profile / Profiler :

  1. A profile is a concise statistical summary of an MR job's execution.
  2. Profiling is based on dataflow and cost estimation for an MR job.
  3. Dataflow estimation considers the number of bytes of <K,V> pairs processed during a job’s execution.
  4. Cost estimation considers execution time at the level of tasks, and of the phases within tasks, for an MR job execution (basically, the resource usage and execution time).
  5. The performance models consider the above two together with the configuration parameters associated with the MR job.
  6. Space of configuration choices:
    • Number of map tasks
    • Number of reduce tasks
    • Partitioning of map outputs to reduce tasks
    • Memory allocation to task-level buffers
    • Multiphase external sorting in the tasks
    • Whether output data from tasks should be compressed
    • Whether combine function should be used ...
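
To make the configuration space above a bit more concrete, here is a minimal sketch that maps some of those choices to Hadoop 0.20-era job properties and renders them as `-D` options. The mapping and the helper name `to_cli_flags` are assumptions for illustration (they are not taken from Starfish), the values are placeholders rather than recommendations, and the `-D` options only take effect for jobs that parse generic options.

```python
# Illustrative mapping from the configuration choices listed above to
# Hadoop 0.20-era job properties. Values are placeholders, not tuning advice.
CONFIG_SPACE = {
    "mapred.map.tasks": 8,                 # number of map tasks (only a hint to Hadoop)
    "mapred.reduce.tasks": 4,              # number of reduce tasks
    "io.sort.mb": 100,                     # map-side sort buffer size, in MB
    "io.sort.factor": 10,                  # streams merged at once during external sort
    "mapred.compress.map.output": False,   # compress intermediate map output
    "mapred.output.compress": False,       # compress final job output
}


def to_cli_flags(config):
    """Render settings as generic '-D property=value' options (hypothetical helper)."""
    flags = []
    for key in sorted(config):
        value = config[key]
        if isinstance(value, bool):
            # Hadoop expects lowercase true/false for boolean properties.
            value = str(value).lower()
        flags.extend(["-D", f"{key}={value}"])
    return flags


print(" ".join(to_cli_flags(CONFIG_SPACE)))
```

Even this small subset already spans a large number of combinations, which is why searching it by hand is impractical.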

job j = < program p, data d, resources r, configuration c >
Thus, performance is a function of the job j:
perf = F(p,d,r,c)
A job profile is generated either by the Profiler through measurement or by the What-if Engine through estimation.

2. What-if Engine:
The What-if Engine uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution to predict the performance of a MapReduce job before it is executed on a Hadoop cluster.
It estimates perf using properties of p, d, r, and c.
i.e. Given the profile for job j = <p, d1, r1, c1>,
 estimate the profile for job j' = <p, d2, r2, c2>.
It has white-box models consisting of a detailed set of equations for Hadoop. Their inputs are:
  • Input data properties
  • Dataflow statistics
  • Configuration parameters
⇒ Calculate the dataflow in each task phase of a map task
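
The shape of a what-if question (a measured profile in, an estimate for a hypothetical job out) can be sketched with a toy model. This is not Starfish's model: the `MapProfile` class, the function name, and the two simplifying assumptions (map task time scales linearly with split size, and tasks run in full waves across the available slots) are all made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MapProfile:
    """Hypothetical, heavily reduced stand-in for a measured job profile."""
    split_bytes: int          # c1: input split size used in the profiled run
    map_task_seconds: float   # measured average running time of one map task


def estimate_map_phase_seconds(profile, input_bytes, map_slots, split_bytes):
    """Toy what-if estimate for a hypothetical job <p, d2, r2, c2>.

    Assumes map task time scales linearly with split size and that map
    tasks run in full waves over the available slots. Real white-box
    models work phase by phase and are far more detailed.
    """
    seconds_per_byte = profile.map_task_seconds / profile.split_bytes
    task_seconds = seconds_per_byte * split_bytes
    num_tasks = math.ceil(input_bytes / split_bytes)
    waves = math.ceil(num_tasks / map_slots)
    return waves * task_seconds


MB = 1024 * 1024
measured = MapProfile(split_bytes=64 * MB, map_task_seconds=20.0)

# What if the same program ran on 640 MB of input with 4 map slots?
estimate = estimate_map_phase_seconds(
    measured, input_bytes=640 * MB, map_slots=4, split_bytes=64 * MB
)
print(f"estimated map phase: {estimate:.0f} s")
```

The point is only the interface: nothing is executed on the cluster, and the answer is derived from a profile that already exists.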

3. Optimizer:
It finds the optimal configuration settings to use for executing a MapReduce job. It recommends these settings and can also run the job with them.

Normal Execution:
Program : WordCount
Data Size : 4.45GB
Time taken to complete the job : 8m 5s

Starfish Profiling and Optimized Execution:
Program : WordCount
Data Size: 4.45GB
Time taken to complete the job : 4m 59s

Executed on a cluster of 1 master and 3 slave nodes
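
To put the two timings above side by side, here is a small calculation of the speedup and the share of time saved. It uses only the numbers reported above; the helper name `to_seconds` is just for this sketch.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


baseline = to_seconds(8, 5)    # normal execution: 8m 5s
tuned = to_seconds(4, 59)      # Starfish-optimized execution: 4m 59s

speedup = baseline / tuned
time_saved = (baseline - tuned) / baseline

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
# prints: speedup: 1.62x, time saved: 38.4%
```

This is a single run of a single program on a small cluster, so treat the figure as an indication rather than a benchmark.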

What’s achieved?

  • Perform in-depth job analysis with profiles
  • Predict the behavior of hypothetical job executions
  • Optimize arbitrary MapReduce programs

Installation
It’s pretty easy to install.
  • Prerequisites :
    • Hadoop Cluster of 0.20.2 or should be up and running. Tested for Cloudera Distributions.
    • Java JDK should be installed.
  • Compile the source code
    • Compile the entire source code and create the jar files:


    • Execute all available JUnit tests and verify the code was compiled successfully:

    ant test

    • Generate the javadoc documentation in docs/api:

    ant javadoc

Ensure that the JAVA_HOME and HADOOP_HOME environment variables are set in ~/.bashrc.
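
A quick way to confirm that both variables are visible in the current shell is a small check like the one below. It is a convenience sketch, not part of Starfish, and the helper name `missing_vars` is made up for this example.

```python
import os

REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME")


def missing_vars(environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


missing = missing_vars(os.environ)
if missing:
    print("Not set: " + ", ".join(missing))
else:
    print("JAVA_HOME and HADOOP_HOME are set.")
```

Run it from a fresh login shell so that the settings in ~/.bashrc have been picked up.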

  • BTrace Installation in the Slave Nodes

After compilation, the newly created btrace directory contains all the classes and jars. These must be shipped to the slave nodes.

  • Create a file (in Master node) “slaves_list.txt”

This file should contain the IP addresses or hostnames of the slave nodes. If you use hostnames, make sure the Master node can resolve them, i.e. that its /etc/hosts maps each slave's IP address to its hostname.

Example :

$vi slaves_list.txt




  • Set the global profile parameters in bin/

  • SLAVES_BTRACE_DIR: BTrace installation directory at the slave nodes. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
  • CLUSTER_NAME: A descriptive name for the cluster. Do not include spaces or special characters in the name.
  • PROFILER_OUTPUT_DIR: The local directory to place the collected logs and profile files. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
  • Run the script

bin/ <absolute_path_slaves_list.txt>

  • This will copy the btrace jars into the SLAVES_BTRACE_DIR of the slave nodes.
That is all there is to the installation.

Execution is followed by
The link is a great source to get started with both installation and execution. The documentation is equally great!
Happy Learning! :)

Published at DZone with permission of Swathi Venkatachala, author and DZone MVB.
