Max De Marzi is a seasoned web developer. He started building websites in 1996 and has worked with Ruby on Rails since 2006. The web forced Max to wear many hats and master a wide range of technologies. He can be a system admin, database developer, graphic designer, back-end engineer and data scientist in the course of one afternoon. Max is a graph database enthusiast. He built the Neography Ruby Gem, a REST API wrapper to the Neo4j Graph Database. He is addicted to learning new things, loves a challenge, and enjoys finding pragmatic solutions. Max is very easy to work with, focuses under pressure and has the patience of a rock. Max is a DZone MVB and is not an employee of DZone.

Determining "Graphiness" Using a Neo4j Graph Generator

07.05.2012

In the US Air Guitar Championships, competitors use their talents to fret on an “invisible” guitar to rock a live crowd and deliver a performance that transcends the imitation of a real guitar and becomes an art form in and of itself. The key factor that determines the winner is having the elusive quality of “Airness”. When considering using Neo4j in a project, one of the key considerations is having a domain model that lends itself to a graph representation. In other words, does your data have “Graphiness”? However, it didn’t dawn on me until recently that when starting a proof of concept, you probably don’t have that data (or enough of it), or maybe your security guys won’t let you within 100 miles of the company production data with this newfangled NoSQL thingamajig.

So in order to validate our ideas and build a proof of concept, we’ll need to generate sample data and test our algorithms (aka Cypher and Gremlin queries) against it. I will show you how to build a rudimentary graph generator; you’ll have to tweak it to match your domain, but it’s a start. We’re also going to use the Batch Importer to quickly load our data into Neo4j.

If you recall, I’ve had three blog posts about the Batch Importer. In the first one, I showed you how to install the Batch Importer, in the second one, I showed you how to use data in your relational database to generate the csv files to create your graph, and just recently I showed you how to quickly index your data.

The Batch Importer expects a series of tab separated files for input. So let’s generate these files. We will create a graph with 6 node types. Here they are with the amount of each we are going to create:

  # Nodes
  #   Users             21,000
  #   Companies          4,000
  #   Activity            1.2M
  #   Item                1.3M 
  #   Entity              3.5M
  #   Tags              20,000

We’ll link these together with a set of relationships:

  # Relationships
  #  Users            -[:belongs_to]-> Companies
  #  User             -[:performs]->   Activity
  #  Activity         -[:belongs]->    Item
  #  Item             -[:references]-> Entity
  #  Item             -[:tagged]->     Tags
  #

To make this example easier, we’ll create just two indexes: a fulltext node index called “vertices” and an exact relationship index called “edges”. In a real project you’ll probably want to create multiple indexes, one for each type of node or relationship.

I want to make running this straightforward, so we’ll do it in a series of rake commands:

rake neo4j:install
rake neo4j:create
rake neo4j:load
rake neo4j:start

If you’ve been following my blog, you know what install and start do, but we need to build the methods that will handle create and load. We can whip up a quick Rakefile for these:

require 'neography/tasks'
require './neo_generate.rb'

namespace :neo4j do
  task :create do
    create_graph
  end
  task :load do
    load_graph
  end
end

Now we can start with create_graph. If you recall, the Batch Importer is looking for a series of tab separated files: one which contains the nodes, another for the relationships, and optionally other files for each index you want to create. Each file has a header with some properties, so our create_graph method will look like this:

  def create_graph
    create_node_properties
    create_nodes
    create_nodes_index
    create_relationship_properties
    create_relationships
    create_relationships_index
  end  

I’m going to arbitrarily decide here that each one of my nodes will have two properties, and we’ll call these property1 and property2 because I am super creative when it comes to naming things.

  def create_node_properties
    @node_properties = ["type", "property1", "property2"]
    generate_node_properties(@node_properties)  
  end

Did I say two? I meant three. Just for my own sanity I like to give nodes a type property and put the type of node that they are, so we’ll include “type” as the first property.

  #  Recreate nodes.csv and set the node properties 
  #  
  def generate_node_properties(properties)
    File.open("nodes.csv", "w") do |file|
      file.puts properties.join("\t")
    end
  end

With our header out of the way, we can turn our attention to actually creating these nodes. We’ll use a hash which will have the type of node, the start id and end id of the nodes, and some properties. But what properties should our nodes have? What should their values be? This is a bit tricky. The simplest solution is to just generate gobbledygook with random strings:

  # Generate random lowercase text of a given length 
  # 
  # Args
  #  length - Integer (default = 8)
  #
  def generate_text(length=8)
    chars = 'abcdefghjkmnpqrstuvwxyz'
    key = ''
    length.times { |i| key << chars[rand(chars.length)] }
    key
  end

Another possibility is to use one of the Random Data Generator Gems like Forgery to create more intelligent and specific random data (like female first names for example). We are going to take the easy way out this time and just give each node two random properties, except for user and company which will get properties from Forgery.

  def create_nodes
    # Define Node Property Values
    node_values    = [lambda { generate_text              }, lambda { generate_text              }]
    user_values    = [lambda { Forgery::Name.full_name    }, lambda { Forgery::Personal.language }]
    company_values = [lambda { Forgery::Name.company_name }, lambda { Forgery::Name.industry     }]
    
   @nodes = {"user"     => { "start" => 1,
                             "end"   => 21000, 
                             "props" => user_values},
             "company"  => { "start" => 21001,
                             "end"   => 25000, 
                             "props" => company_values},
             "activity" => { "start" => 25001,
                             "end"   => 1225000, 
                             "props" => node_values},
             "item"     => { "start" => 1225001,
                             "end"   => 2525000, 
                             "props" => node_values},
             "entity"   => { "start" => 2525001,
                             "end"   => 6025000, 
                             "props" => node_values},
             "tag"      => { "start" => 6025001,
                             "end"   => 6045000, 
                             "props" => node_values}
    }
    
    # Write nodes to file
    @nodes.each{ |node| generate_nodes(node[0], node[1])}
  end
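Those start and end ids are cumulative offsets computed by hand, which is easy to get wrong. As a sketch (a hypothetical helper, not part of the original script), you could derive them from the counts instead:

```ruby
# Derive contiguous 1-based "start"/"end" id ranges from node counts,
# so the cumulative offsets don't have to be computed by hand.
def id_ranges(counts)
  ranges = {}
  start  = 1
  counts.each do |type, count|
    ranges[type] = { "start" => start, "end" => start + count - 1 }
    start += count
  end
  ranges
end

ranges = id_ranges("user" => 21_000, "company" => 4_000, "activity" => 1_200_000)
# ranges["company"] matches the hand-computed values above: start 21001, end 25000
```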

Great. Now, to finally generate these nodes, we’ll write the type of the node to nodes.csv and call our lambdas so each node gets a different random string.

  # Generate nodes given a type and hash
  #
  def generate_nodes(type, hash)
    puts "Generating #{(1 + hash["end"] - hash["start"])} #{type} nodes..."
    nodes = File.open("nodes.csv", "a")

    (1 + hash["end"] - hash["start"]).times do |t|
        properties = [type] + hash["props"].collect{|l| l.call}
        nodes.puts properties.join("\t")
    end
    nodes.close
  end

Our nodes.csv file will look like this once it’s done:

type    property1       property2
user    Helen Harvey    Kashmiri
user    Sean Matthews   Afrikaans
user    William Harper  Haitian Creole
user    Bruce Hill      Macedonian
user    Chris Riley     Swahili

With nodes out of the way, it’s time for relationships. We’ll keep it simple and say each relationship also has two properties.

  def create_relationship_properties
    @rel_properties = ["property1", "property2"]
    generate_rel_properties(@rel_properties)
  end

I meant three properties. Once again I’m adding type, but this is different from the node type above: each relationship in Neo4j MUST have a type; it is not an optional property. The “\t” you see below puts tabs between each field (sorry if I didn’t mention this earlier and you were wondering what the heck that was).

  #  Recreate rels.csv and set the relationship properties 
  #  
  def generate_rel_properties(properties)
    File.open("rels.csv", "w") do |file|
      header = ["start", "end", "type"] + properties
      file.puts header.join("\t")
    end
  end

I showed you how to create nice fake data for the nodes, so we’ll keep it simple here and just use bland random 8 character strings. The number field sets how many of each relationship will be created; each entry also specifies its (required) relationship type and some properties. You’ll also notice I have this “connection” key, which is either :sequential or :random. I’ll explain that in a bit.

  def create_relationships
    # Define Relationship Property Values
    rel_values = [lambda { generate_text }, lambda { generate_text }]

    rels = {"user_to_company"  => { "from"  => @nodes["user"],
                                    "to"     => @nodes["company"],
                                    "number" => 21000,
                                    "type"   => "belongs_to",
                                    "props"  => rel_values,
                                    "connection" => :sequential },
            "user_to_activity" => { "from"  => @nodes["user"],
                                    "to"    => @nodes["activity"],
                                    "number" => 1200000,
                                    "type"   => "performs",
                                    "props"  => rel_values,
                                    "connection" => :random },
            "activity_to_item" => { "from"  => @nodes["activity"],
                                    "to"    => @nodes["item"],
                                    "number" => 3000000,
                                    "type"   => "belongs",
                                    "props"  => rel_values,
                                    "connection" => :random },
            "item_to_entity"   => { "from"  => @nodes["item"],
                                    "to"    => @nodes["entity"],
                                    "number" => 6000000,
                                    "type"   => "references",
                                    "props"  => rel_values,
                                    "connection" => :random },
            "item_to_tag"      => { "from"  => @nodes["item"],
                                    "to"    => @nodes["tag"],
                                    "number" => 250000,
                                    "type"   => "tagged",
                                    "props"  => rel_values,
                                    "connection" => :random }                                   
    }
  
    # Write relationships to file
    rels.each{ |rel| generate_rels(rel[1])}  
  end

I am using the “connection” key to decide how to connect these nodes together: either generating random connections between nodes, or generating sequential connections (as in each “from” node connects to one “to” node until there are no more nodes, and if there are more connections than nodes, we loop around).
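To make that wrap-around concrete, here is a tiny standalone demo (toy numbers, not the generator’s real ranges): 5 “from” ids connecting to 3 “to” ids, so the “to” side loops around via modulo after three relationships:

```ruby
# Toy demo of the :sequential connection: "to" ids wrap around via modulo
# once there are more relationships than "to" nodes.
from_start, from_size = 1, 5    # "from" ids 1..5
to_start,   to_size   = 10, 3   # "to" ids 10..12
pairs = 6.times.map { |t| [from_start + (t % from_size), to_start + (t % to_size)] }
# => [[1, 10], [2, 11], [3, 12], [4, 10], [5, 11], [1, 12]]
```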

Feel free to combine the two or create new connection types (clustered for example).

  def generate_rels(hash)
    puts "Generating #{hash["number"]} #{hash["type"]} relationships..."
    File.open("rels.csv", "a") do |file|
    
      case hash["connection"]
        when :random      
          hash["number"].times do |t|
            file.puts "#{rand(hash["from"]["start"]..hash["from"]["end"])}\t#{rand(hash["to"]["start"]..hash["to"]["end"])}\t#{hash["type"]}\t#{hash["props"].collect{|l| l.call}.join("\t")}" 
          end
        when :sequential
          from_size = 1 + hash["from"]["end"] - hash["from"]["start"]
          to_size = 1 + hash["to"]["end"] - hash["to"]["start"]
          hash["number"].times do |t|
            file.puts "#{hash["from"]["start"] + (t % from_size)}\t#{hash["to"]["start"]  + (t % to_size)}\t#{hash["type"]}\t#{hash["props"].collect{|l| l.call}.join("\t")}" 
          end
      end
    end
  end
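As a sketch of the “clustered” connection type mentioned above (hypothetical, and not part of the generator; assumes you want a few popular hub nodes), one option is to reuse a single randomly chosen “to” node for every batch of relationships:

```ruby
# Hypothetical :clustered connection: every `cluster_size` relationships share
# the same randomly chosen "to" node, so a handful of "to" nodes become hubs
# while the "from" node is still picked at random each time.
def clustered_pairs(from_range, to_range, number, cluster_size = 10)
  hub = nil
  number.times.map do |t|
    hub = rand(to_range) if (t % cluster_size).zero?  # pick a new hub per batch
    [rand(from_range), hub]
  end
end

pairs = clustered_pairs(1..21_000, 25_001..1_225_000, 100)
# 100 pairs, but at most 10 distinct "to" nodes
```

You could wire this in as an extra `when :clustered` branch in the case statement above, writing each pair plus the type and properties to rels.csv as the other branches do.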

Our rels.csv file will look like this once it’s done:

start   end     type    property1       property2
1       21001   belongs_to      sjqwkvag        vpxahvcr
2       21002   belongs_to      pfxnxznu        vrprnpky
3       21003   belongs_to      gcyxumgy        nrxepdzb
4       21004   belongs_to      aayyejkw        xpenqebd
5       21005   belongs_to      hvhjexas        kmyqucmn

To create our node index, we will simply open nodes.csv and output it, adding the node id as the first column. Michael is working on using the nodes.csv headers as a way to tell the Batch Importer to index the nodes, but until that work is done, this will work.

  def create_nodes_index
    puts "Generating Node Index..."
    nodes = File.open("nodes.csv", "r")
    nodes_index = File.open("nodes_index.csv","w")
    counter = 0
    
    while (line = nodes.gets)
      nodes_index.puts "#{counter}\t#{line}"
      counter += 1
    end
    
    nodes.close
    nodes_index.close
  end

Therefore nodes_index.csv will look like:

0       type    property1       property2
1       user    Helen Harvey    Kashmiri
2       user    Sean Matthews   Afrikaans
3       user    William Harper  Haitian Creole
4       user    Bruce Hill      Macedonian
5       user    Chris Riley     Swahili

We’ll do something similar with the relationships, but skip the starting and ending nodes as well as the relationship type.

  def create_relationships_index
    puts "Generating Relationship Index..."
    rels = File.open("rels.csv", "r")
    rels_index = File.open("rels_index.csv","w")
    counter = -1
    
    while (line = rels.gets)
      rels_index.puts "#{counter}\t#{line.split("\t")[3..-1].join("\t")}"
      counter += 1
    end
    
    rels.close
    rels_index.close
  end

Our rels_index.csv file will look like:

-1      property1       property2
0       nwjsbmgg        gnsnefrf
1       szqqygra        maumqtnp
2       pdtamztw        uvcserrp
3       wewdtztx        bkezsmva
4       gynprabv        eszjgmfs
5       drcaxsse        ungxbzzm

Let’s run neo4j:create to generate these files. Now would be a good time for a quick stretch, bio-break, etc. as this could take a couple of minutes.

Generating 21000 user nodes...
Generating 4000 company nodes...
Generating 1200000 activity nodes...
Generating 1300000 item nodes...
Generating 3500000 entity nodes...
Generating 20000 tag nodes...
Generating Node Index...
Generating 21000 belongs_to relationships...
Generating 1200000 performs relationships...
Generating 3000000 belongs relationships...
Generating 6000000 references relationships...
Generating 250000 tagged relationships...
Generating Relationship Index...

Welcome back. Now that we have these four csv files generated, we need to actually run the Batch Importer to get them into Neo4j. We’ll run rake neo4j:load to make this happen, which as you remember calls the load_graph method. It looks like this:

  # Execute the command needed to import the generated files
  #
  def load_graph
    puts "Running the following:"
    command ="java -server -Xmx4G -jar ../batch-import/target/batch-import-jar-with-dependencies.jar neo4j/data/graph.db nodes.csv rels.csv node_index vertices fulltext nodes_index.csv rel_index edges exact rels_index.csv" 
    puts command
    exec command    
  end

The batch importer will now do its thing:

............................................................
Importing 6045000 Nodes took 39 seconds 
....................................................................................................377369 ms for 10000000
....
Importing 10471000 Relationships took 476 seconds 
............................................................
Importing 6045000 Nodes into vertices Index took 226 seconds 
....................................................................................................261031 ms for 10000000
....
Importing 10471000 Relationships into edges Index took 266 seconds 
1153 seconds 

Now we can run rake neo4j:start to see our graph in Neo4j.

Let’s jump into the Console and make sure our data is there:

START me = node:vertices(type="user") 
RETURN me 
LIMIT 5

Success!

==> +-------------------------------------------------------------------------------+
==> | me                                                                            |
==> +-------------------------------------------------------------------------------+
==> | Node[1]{property2->"Kashmiri",property1->"Helen Harvey",type->"user"}         |
==> | Node[2]{property2->"Afrikaans",property1->"Sean Matthews",type->"user"}       |
==> | Node[3]{property2->"Haitian Creole",property1->"William Harper",type->"user"} |
==> | Node[4]{property2->"Macedonian",property1->"Bruce Hill",type->"user"}         |
==> | Node[5]{property2->"Swahili",property1->"Chris Riley",type->"user"}           |
==> +-------------------------------------------------------------------------------+
==> 5 rows, 111 ms


Published at DZone with permission of Max De Marzi, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)