
Max De Marzi is a seasoned web developer. He started building websites in 1996 and has worked with Ruby on Rails since 2006. The web forced Max to wear many hats and master a wide range of technologies. He can be a system admin, database developer, graphic designer, back-end engineer, and data scientist in the course of one afternoon. Max is a graph database enthusiast. He built the Neography Ruby gem, a REST API wrapper for the Neo4j graph database. He is addicted to learning new things, loves a challenge, and enjoys finding pragmatic solutions. Max is very easy to work with, focuses under pressure, and has the patience of a rock. Max is a DZone MVB (not an employee of DZone) and has posted 60 posts at DZone. You can read more from him at his website.

Permission Resolution with Neo4j - Part 2

03.25.2013


Let’s try tackling something a little bigger. In Part 1 we created a small graph to test our permission resolution graph algorithm and it worked like a charm on our dozen or so nodes and edges. I don’t have fast hands, so instead of typing out a million node graph, we’ll build a graph generator and use the batch importer to load it into Neo4j. What I want to create is a set of files to feed to the batch-importer.

A nodes.csv file (which is actually tab-separated; the "c" is just there to make sure you were paying attention) looks like the following:

unique_id       type
9a984170-71cc-0130-92a0-20c9d042eca9    user   
9a984450-71cc-0130-92a0-20c9d042eca9    user   
9a984550-71cc-0130-92a0-20c9d042eca9    user
...
a67769a0-71cc-0130-92a0-20c9d042eca9    doc    
a6776a40-71cc-0130-92a0-20c9d042eca9    doc

The node ids are not set explicitly above; instead, each line number becomes the id of the node created in our graph. The unique_id and type are properties of the nodes. A rels.csv file is also needed, and it looks like the following:
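The neo_generate.rb file that does the actual writing isn't shown in this post, but a minimal sketch of a node file writer might look like this (the generate_nodes name matches the Rakefile below; everything else here is my assumption):

```ruby
require 'csv'
require 'securerandom'

# Append tab-separated rows to nodes.csv: one row per node.
# The row's position in the file (its line number) becomes the
# node id in the graph, so we never write an id column.
def generate_nodes(type, range)
  CSV.open("nodes.csv", "a", col_sep: "\t") do |csv|
    (range["start"]..range["end"]).each do
      csv << [SecureRandom.uuid, type]
    end
  end
end

# Header row first, then a tiny example: 3 users and 2 docs.
CSV.open("nodes.csv", "w", col_sep: "\t") { |csv| csv << ["unique_id", "type"] }
generate_nodes("user", "start" => 1, "end" => 3)
generate_nodes("doc",  "start" => 4, "end" => 5)
```

The real version loops over far larger ranges, but the shape of the output is the same.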

start   end     type    flags
1       3003    IS_MEMBER_OF           
1       3060    IS_MEMBER_OF           
2       3032    IS_MEMBER_OF   
...
754949  272265  IS_CHILD_OF            
825621  283395  IS_CHILD_OF

In this case the start and end columns are the node ids these relationships connect, via a type (required) and some properties (if any). To make these files I built a quick Rakefile that generates two sets of them: one graph will have a million nodes, the other 10 million, and we will see how well Neo4j scales in this regard. Will a 10x increase in the number of documents in the graph make our algorithm 10x slower?

require 'neography/tasks'
require './neo_generate.rb'
 
namespace :neo4j do
  task :create do
    %x[rm *.csv]
    create_graph
  end
 
  task :create_bigger do
    %x[rm *.csv]
    create_bigger_graph
  end
   
  task :load do
    %x[rm -rf neo4j/data/graph.db]
    load_graph
  end
end

The create_graph and create_bigger_graph methods are almost identical; the only real difference is how many nodes they end up creating:

def create_graph
  create_node_properties
  create_nodes
  create_nodes_index
  create_relationship_properties
  create_relationships
end 
	
def create_bigger_graph
  create_node_properties
  create_more_nodes
  create_nodes_index
  create_relationship_properties
  create_relationships
end 

We are going to set the first 3000 nodes to be users, the next 100 to be groups, and the next 1 million to be documents.

def create_nodes
  @nodes = {
    "user"  => { "start" =>    1, "end" =>    3000 },
    "group" => { "start" => 3001, "end" =>    3100 },
    "doc"   => { "start" => 3101, "end" => 1003100 }
  }

  @nodes.each { |type, range| generate_nodes(type, range) }
end

rake neo4j:create
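The create_relationships step this task calls isn't shown in the post; a rough sketch of how it might write rels.csv (random group memberships, with the helper name, id ranges, and per-user count all being my assumptions) could look like:

```ruby
require 'csv'

# Append tab-separated rows to a rels.csv file. start and end are
# node ids (line positions in nodes.csv), type is required, and the
# flags property is left empty here.
def generate_relationships(file, rel_type, from_range, to_range, per_node)
  CSV.open(file, "a", col_sep: "\t") do |csv|
    from_range.each do |from|
      # connect each source node to a few distinct random targets
      to_range.to_a.sample(per_node).each do |to|
        csv << [from, to, rel_type, nil]
      end
    end
  end
end

# Header row, then each of the 3000 users joins 2 of the 100 groups.
CSV.open("rels.csv", "w", col_sep: "\t") { |csv| csv << ["start", "end", "type", "flags"] }
generate_relationships("rels.csv", "IS_MEMBER_OF", 1..3000, 3001..3100, 2)
```

The IS_CHILD_OF document hierarchy would be written the same way with the doc id range.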

The csv files created for the 1 Million node graph aren’t very large:

-rw-r--r--   1 maxdemarzi  staff    47M Mar 18 02:37 documents_index.csv
-rw-r--r--   1 maxdemarzi  staff    40M Mar 18 02:37 nodes.csv
-rw-r--r--   1 maxdemarzi  staff   259M Mar 18 02:46 rels.csv
-rw-r--r--   1 maxdemarzi  staff   140K Mar 18 02:37 users_index.csv
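The two index files map node ids back to the unique_id property so the batch-importer can build the Lucene indexes named Users and Documents. Their exact layout isn't shown in the post; here is a guess at how create_nodes_index might write one (the id/unique_id header names are an assumption):

```ruby
require 'csv'

# Write a tab-separated index file: each row pairs a node id (the
# node's line position in nodes.csv) with the property value to index.
def create_index_file(file, ids_to_uuids)
  CSV.open(file, "w", col_sep: "\t") do |csv|
    csv << ["id", "unique_id"]
    ids_to_uuids.each { |node_id, uuid| csv << [node_id, uuid] }
  end
end

# One-row example using a uuid from the nodes.csv sample above.
create_index_file("users_index.csv", 1 => "9a984170-71cc-0130-92a0-20c9d042eca9")
```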

Now let’s load these in:

rake neo4j:load
java -server -Xmx4G -jar ./batch-import-jar-with-dependencies.jar neo4j/data/graph.db nodes.csv rels.csv node_index Users exact users_index.csv node_index Documents exact documents_index.csv

This produces the following output:

Using Existing Configuration File
..........
Importing 1003100 Nodes took 2 seconds
.................................................................................................... 19508 ms for 10000000
.........
Importing 10976303 Relationships took 21 seconds
 
Importing 3000 Done inserting into Users Index took 0 seconds
..........
Importing 1000000 Done inserting into Documents Index took 7 seconds
 
Total import time: 34 seconds

Not bad for 1 million nodes and 10 million relationships:

rake neo4j:start

Once we start neo4j and take a look at the web admin, we can see our graph:

[Image: Load Small Graph — the 1 million node graph in the web admin]

For our bigger graph, we just add another zero and create 10 Million documents.

def create_more_nodes
  @nodes = {
    "user"  => { "start" =>    1, "end" =>     3000 },
    "group" => { "start" => 3001, "end" =>     3100 },
    "doc"   => { "start" => 3101, "end" => 10003100 }
  }

  @nodes.each { |type, range| generate_nodes(type, range) }
end

We’ll run a different rake task which will overwrite the smaller csv files.

rake neo4j:create_bigger

The csv files created for the 10 Million node graph are just a tad bigger than those for the 1 Million node graph:

-rw-r--r--  1 maxdemarzi  staff   476M Mar 19 00:33 documents_index.csv
-rw-r--r--  1 maxdemarzi  staff   401M Mar 19 00:31 nodes.csv
-rw-r--r--  1 maxdemarzi  staff   510M Mar 19 05:16 rels.csv
-rw-r--r--  1 maxdemarzi  staff   140K Mar 19 00:33 users_index.csv

Let’s stop the neo4j server and load the bigger graph instead:

rake neo4j:stop
rake neo4j:load
java -server -Xmx4G -jar ./batch-import-jar-with-dependencies.jar neo4j/data/graph.db nodes.csv rels.csv node_index Users exact users_index.csv node_index Documents exact documents_index.csv

I wonder how long this will take:

Using Existing Configuration File
.................................................................................................... 14052 ms for 10000000
 
Importing 10003100 Nodes took 14 seconds
.................................................................................................... 19242 ms for 10000000
..................................................................................................
Importing 19812750 Relationships took 37 seconds
 
Importing 3000 Done inserting into Users Index took 0 seconds
.................................................................................................... 64223 ms for 10000000
 
Importing 10000000 Done inserting into Documents Index took 64 seconds
 
Total import time: 135 seconds 

That’s not bad either. Just over two minutes.

rake neo4j:start

[Image: the 10 million node graph in the web admin]

Alright. Now we have two bigger graphs we can play with. Stay tuned for the next part where I’ll add two Gatling performance tests to the mix.


Published at DZone with permission of Max De Marzi, author and DZone MVB. (source)
