Big Data/BI Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 492 posts at DZone. You can read more from them at their website. View Full User Profile

Mahout: Parallelising the Creation of DecisionTrees

  • submit to reddit

A couple of months ago I wrote a blog post describing our use of Mahout random forests for the Kaggle Digit Recogniser Problem and after seeing how long it took to create forests with 500+ trees I wanted to see if this could be sped up by parallelising the process.

From looking at the DecisionTree it seemed like it should be possible to create lots of small forests and then combine them together.

After unsuccessfully trying to achieve this by directly using DecisionForest I decided to just copy all the code from that class into my own version which allowed me to achieve this.

The code to build up the forest ends up looking like this:

List<Node> trees = new ArrayList<Node>();
MultiDecisionForest forest = MultiDecisionForest.load(new Configuration(), new Path("/path/to/mahout-tree"));
MultiDecisionForest forest = new MultiDecisionForest(trees);

We can then use forest to classify values in a test data set and it seems to work reasonably well.

I wanted to try and avoid putting any threading code in so I made use of GNU parallel which is available on Mac OS X with a brew install parallel and on Ubuntu by adding the following repository to/etc/apt/sources.list

deb oneiric main 
deb-src oneiric main

…followed by a apt-get update and apt-get install parallel.

I then wrote a script to parallelise the creation of the forests:

startTime=`date '+%s'`
seq 1 ${numberOfRuns} | parallel -P 8 "./"
endTime=`date '+%s'`
echo "Started: ${start}"
echo "Finished: ${end}"
echo "Took: " $(expr $endTime - $startTime)

java -Xmx1024m -cp target/machinenursery-1.0.0-SNAPSHOT-standalone.jar

It should be possible to achieve this by using the parallel option in xargs but unfortunately I wasn’t able to achieve the same success with that command.

hadn’t come across the seq command until today but it works quite well here for allowing us to specify how many times we want to call the script.

I was probably able to achieve about a 30% speed increase when running this on my Air. There was a greater increase running on a high CPU AWS instance although for some reason some of the jobs seemed to get killed and I couldn’t figure out why.

Sadly even with a new classifier with a massive number of trees I didn’t see an improvement over the Weka random forest using AdaBoost which I wrote about a month ago. We had an accuracy of 96.282% here compared to 96.529% with the Weka version.

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)