
Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences as a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham.

Neo4j: The Batch Inserter and the Sunk Cost Fallacy

09.25.2012

About a year and a half ago I wrote about the sunk cost fallacy which is defined like so:

The Misconception: You make rational decisions based on the future value of objects, investments and experiences.

The Truth: Your decisions are tainted by the emotional investments you accumulate, and the more you invest in something the harder it becomes to abandon it.

Over the past few weeks Ashok and I have been exploring one of our clients' data by modelling it in a Neo4j graph and seeing what interesting things the traversals reveal.

We needed to import around 800,000 nodes with ~2 million relationships, and because I find that the feedback loop in Ruby is much quicker than in Java, I suggested that we write the data loading code using the neo4j.rb gem.

Initially we just loaded a small subset of the data so that we could get a rough feel for it and check that we were creating relationships between nodes that actually made sense.

It took a couple of minutes to load everything but that was quick enough.

Eventually, however, we wanted to load the full data set and realised that this approach wasn’t really going to scale very well.

The first version created every node/relationship within its own transaction and took around an hour to load everything.

To speed that up we batched up the nodes and only committed a transaction every 10,000 nodes, which brought the time down to around 20 minutes: not bad, but not amazing.
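The change is just a matter of moving the commit out of the per-node loop. Here's a minimal, runnable sketch of the pattern in plain Ruby; `FakeTransaction` is a made-up stand-in so the sketch runs anywhere, whereas our real code went through neo4j.rb's transaction API:

```ruby
# FakeTransaction is a hypothetical stand-in for a real database transaction;
# it just counts commits so the batching behaviour is observable.
class FakeTransaction
  attr_reader :commits, :pending

  def initialize
    @commits = 0
    @pending = 0
  end

  def record_write
    @pending += 1
  end

  def commit
    @commits += 1
    @pending = 0
  end
end

BATCH_SIZE = 10_000

def load_nodes(rows, tx, batch_size: BATCH_SIZE)
  rows.each_with_index do |_row, index|
    # ... create the node and its relationships for this row here ...
    tx.record_write
    # Commit once per batch instead of once per node.
    tx.commit if (index + 1) % batch_size == 0
  end
  tx.commit if tx.pending > 0 # flush the final partial batch
end

tx = FakeTransaction.new
load_nodes(Array.new(25_000), tx)
puts tx.commits # => 3 (two full batches of 10,000 plus one partial commit)
```

The win comes from amortising the transaction overhead: 25,000 writes cost 3 commits instead of 25,000.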

At one stage Ashok suggested we should try out the batch inserter API but having spent quite a few hours getting the Ruby version into shape I really didn’t want to let it go – the sunk cost fallacy in full flow!

A couple of days later we got some new data to load on top of the initial graph and Ashok suggested we use the batch inserter just for that bit of data.

Since that didn’t involve deleting any of the code we’d already written I was more keen to try that out.

This time we were adding around 200 nodes but another 1 million relationships to the existing graph, and the end-to-end time for this bit of code to run was 24 seconds.

Having finally been convinced that the batch inserter was way better than anything else, I spent a couple of hours earlier this week moving all our Ruby code over, and it now takes just under 2 minutes to load the whole graph.

To learn how to write code for the batch inserter we followed the examples from BatchInsertExampleTest, which covered everything that we wanted to do.
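The shape of that API is simple: creating a node returns a numeric id, relationships are wired up between ids, and everything is flushed to the store on shutdown, skipping the transaction machinery entirely. A toy in-memory Ruby model of that shape, to show the calling pattern; `ToyBatchInserter` is my invention, not the real class (the real one is Neo4j's Java `BatchInserter`, which BatchInsertExampleTest exercises):

```ruby
# ToyBatchInserter is a hypothetical in-memory model of the batch inserter's
# API shape: id-returning node creation, id-based relationships, flush on shutdown.
class ToyBatchInserter
  def initialize
    @nodes = {}
    @relationships = []
    @next_id = 0
  end

  # Mirrors createNode: takes a property map, returns the new node's id.
  def create_node(properties)
    id = @next_id
    @next_id += 1
    @nodes[id] = properties
    id
  end

  # Mirrors createRelationship: connects two node ids with a typed edge.
  def create_relationship(from_id, to_id, type, properties = {})
    @relationships << { from: from_id, to: to_id, type: type, properties: properties }
  end

  # The real inserter writes the store files in one go here.
  def shutdown
    [@nodes.size, @relationships.size]
  end
end

inserter = ToyBatchInserter.new
mark  = inserter.create_node("name" => "Mark")
ashok = inserter.create_node("name" => "Ashok")
inserter.create_relationship(mark, ashok, :PAIRS_WITH)
nodes, rels = inserter.shutdown
puts "#{nodes} nodes, #{rels} relationships"
```

Because nothing is transactional, nothing can be rolled back, which is why the real batch inserter is only suitable for initial loads like this one rather than for a live database.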

Hopefully the next time I come across such a situation I’ll be better able to judge when I’m holding onto something even when I should just let go!

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)