
Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences as a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham.

Neo4j: Loading Data - REST API vs Batch Import

03.01.2013

A couple of weeks ago, when I first started playing around with my football data set, I was loading all the data into Neo4j using the REST API via neography, which took around 4 minutes.

The data set consisted of just over 250 matches, which translated into 8,000 nodes and 30,000 relationships, so it's very small by any measure.
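The REST-based load followed a find-or-create pattern, hitting a Lucene index before creating anything. A minimal sketch of that pattern (not the original script; `neo` stands in for a `Neography::Rest`-style client, and each call maps to one HTTP round trip, which is what makes this approach slow):

```ruby
# Find-or-create via the index: every call here is one HTTP request
# against the Neo4j server.
def find_or_create_team(neo, name)
  existing = neo.get_node_index("teams", "name", name)   # Lucene index lookup
  return existing.first if existing

  node = neo.create_node("name" => name)                 # create the node...
  neo.add_node_to_index("teams", "name", name, node)     # ...and index it
  node
end
```

With thousands of nodes and relationships, those per-call round trips add up quickly.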

Ashok and I were discussing how that could be made quicker, and the first thing we tried was storing inserted nodes in an in-memory hash map and looking them up from there rather than doing a Lucene index lookup each time.
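In outline the change looked like this (again a sketch rather than the original script, with `neo` as a hypothetical neography-style client; the cache is just a plain Ruby hash keyed by team name):

```ruby
# Keep already-inserted nodes in an in-memory hash so that repeat
# appearances of a team skip the Lucene lookup (and its HTTP round trip).
def find_or_create_team_cached(neo, cache, name)
  return cache[name] if cache.key?(name)                # cache hit: no server call

  node = neo.create_node("name" => name)
  neo.add_node_to_index("teams", "name", name, node)    # still indexed for later queries
  cache[name] = node
end
```

Since every node created in the run goes into the hash, only the first appearance of each team costs an HTTP call.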

These were the timings for different numbers of matches when I did that:

--------------------------------------------------------------------
| Matches | Cache-Hits | Cache-Misses | Lucene        | In memory  |
--------------------------------------------------------------------
| 25      | 501        | 325          | 26.692s       | 22.877s    |
| 50      | 1275       | 373          | 50.491s       | 38.304s    |
| 263     | 8016       | 480          | 4m 11.031s    | 2m 49.951s |
--------------------------------------------------------------------

For the full data set it was about 30% faster, which was a nice improvement but still left me waiting around longer than I wanted!

I’ve previously used the batch inserter and I was planning to use that again to get a significant improvement in loading time, until Ashok pointed out Michael Hunger’s batch-import tool, which seemed worth a try.

I had to add an extra step to the pipeline to put all the nodes and relationships into CSV files and then pass those files to the batch-import JAR.
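The extra step amounts to dumping the graph as tab-separated files. The column conventions below (a header row of property names for nodes, start/end/type columns for relationships, with a node's row number doubling as its id) follow my reading of the tool's README, so treat this as a sketch:

```ruby
require "csv"

# Write tab-separated files in the shape batch-import expects.
teams = ["Arsenal", "Chelsea"]

CSV.open("nodes.csv", "w", col_sep: "\t") do |csv|
  csv << ["name"]                 # header row: property names
  teams.each { |t| csv << [t] }
end

CSV.open("rels.csv", "w", col_sep: "\t") do |csv|
  csv << ["start", "end", "type"]
  csv << [1, 2, "played"]         # nodes referenced by their row position
end

# Both files then get handed to the batch-import JAR, along the lines of:
#   java -server -Xmx1G -jar batch-import.jar target/graph.db nodes.csv rels.csv
```

Generating the files is cheap (well under a second even for the full data set, as the timings below show), so the conversion step doesn't eat into the gains.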

There was a massive improvement in the load time using it. These were the timings:

-------------------------------------------------------------------
| Matches | Lucene        | In memory  | Batch Import             |
-------------------------------------------------------------------
| 25      | 26.692s       | 22.877s    | 0.378s + 0.921s = 1.299s |
| 50      | 50.491s       | 38.304s    | 0.392s + 1.025s = 1.417s |
| 263     | 4m 11.031s    | 2m 49.951s | 0.524s + 1.239s = 1.763s |
-------------------------------------------------------------------

(The two numbers represent the time taken to generate the CSV files and the time to import them, respectively.)

From my brief skim of the code, it seems to read in the files and route them through the batch inserter API, so I imagine similar results could be had by calling that API directly.

I know this isn’t an entirely fair comparison, given that you probably shouldn’t be using the REST API for bulk inserts, but since I’ve done it a couple of times I thought it’d be interesting to measure anyway!





Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)