NoSQL Zone is brought to you in partnership with:

Aleksa Vukotic is a keen lean and agile advocate, author, trainer, software architect and developer, with experience of leading roles in the successful deliver of business critical projects. Turning to technology, Aleksa has been involve with the Spring Framework since early days, becoming an expert in enterprise Java development with Spring, along with other open source technologies. In addition to his Java EE expertise, Aleksa is often utilising his problem solving skills to tackle the most complex problems arising within the projects. More recently, the focus of Aleksa's technology interest has been the application of NoSQL storage solutions for solving complex data problems. Aleksa is Senior Consultant and Data Management Practice lead at Open Credo, based in London, UK. Aleksa is a DZone MVB and is not an employee of DZone and has posted 5 posts at DZone. You can read more from them at their website. View Full User Profile

Dealing with Super-Nodes and Indexed Relationships in Neo4j, Part 2

03.19.2012
| 3291 views |
  • submit to reddit

In the previous post we compared the performance of fetching relationships from densely populated nodes using Neo4j native store and using lucene index. We've seen that we can fetch the small subset of relationships from a super-node (containing ~1M relationships in total)  directly from the Lucene index, the performance of the first run (cold-caches) is better then using the Neo store directly. The subsequent runs with caches warmed up show comparable performace, slightly in favour of direct Neo store fetching, sue to low level cache optimizations.

To conclude our research, in this blog entry we're going to take a look at the performance of fetching large numbers of relationships from the super-nodes.

Let's remind ourselves with the data set we are using:

  • 5 nodes representing cities/towns (for example “London”,  “Manchester”, “Edinburgh”, “Cambridge” and “Oxford”)
  • Each city node has “IS_IN” relationship to a country node UK, (every city is related to a country node with IS_IN relationship, and we have only one country node - UK)
  • We have 1M users, and each of the users have a “HAS_VISITED” relationship to each of the five cities. I know this is not a likely scenario, but in a large data set (say among all Facebook users) you will find more then 5 cities that more then 100K users have visited
  • To summarise, we have 1,000,006 nodes (1M users+5 cities + 1 UK country node) and 5, 000,005 relationships (1M users-HAS)_VISITED-5 cities + 5 cities IS_IN UK).
In the previous blog entry we were fetching IS_IN relationships from each city super-node, with every city having exactly one relationship of that type (of 1,000,001, or 0.0001%).
In this example. we're going to load all HAS_VISITED relationships for a city node, which means that we're interested in loading 99.9999% relationships on the given node. We're going to compare the use of the Neo4j API to traverse through relationships, and the IndexedRelationshipExapnder we introduced in the previous blog entry.
To fetch all HAS_VISITED relationships using Neo API, we're going to use similar code as before:

Listing 1: Using Neo4j API to load relationships
Iterable<Relationships> relationships = city.getRelationships(DynamicRelationshipType.withName("IS_IN"));
for (Relationship r : relationships) {
   cnt++;
}
To fetch relationships from the Lucene index, we are going to use Neo's Traversal API with our IndexedRelationshipExpander
Listing 2: Using IndexedRelationhipExpander (implemented in previous post) to load relationships from Lucene index
RelationshipType relationshiptype = DynamicRelationshipType.withName("HAS_VISITED");
Direction direction = Direction.INCOMING;
TraversalDescription traversal =
    Traversal.description()
    .evaluator(Evaluators.atDepth(1))
    .expand(
        IndexedRelationhipExpander.create(this.embeddedGraphDatabase, relationshipType, direction)
    );
traversal.traverse(city);
Table 1 shows the performance comparison of the standard expander and our new indexed expander when fetching all relationships of type HAS_VISITED from the city node. Remember, each city node will have 1M HAS_VISITED relationships.
Table 1: Performance comparison of the two approaches, by loading all relationships from super-nodes       As you can see, the first run using the standard expander takes just over 5 seconds, compared with almost 15 seconds using Lucene. Subsequent runs with warmed-up caches show even more variation, just over 1 second using the standard expander and around 8 seconds using the indexed relationships expander. This shows that loading large result sets from the Lucene index is considerably slower than loading relationships from the Neo4j store directly. This is probably expected, as Neo4j isn't designed to be used as a Lucene store, but the graph database.

In reality (or at least in the use cases I came across), you rarely want to traverse through huge number of dense relationships on super nodes (such as HAS_VISITED relationship in our example). If you load a small subset of relationships from the super-node, the indexed relationship expander will work well. When you want to retrieve a larger a number of relationships without retrieving them all in one a go, a better approach is to order the hits from the Lucene index in the expander (this is quick in lucene), and only return a configured maximum number of relationship that you can handle in a timely manner. Why load 1M of users that have visited the given city, when you can only display 10 on the screen – load 10 instead!

We implemented OrderedIndexedRelationshipExpander to illustrate how we can use the power of Lucene to perform sorting and return only top N results: Listing 3: Relationship expander that orders and pages the Lucene search results
public class OrderedIndexedRelationhipExpander implements RelationshipExpander{
 //The rest of the class is same as in the
 //IndexedRelationshipExpander
 //Omitted for clarity
public static final int DEFAULT_MAX_HITS = 100000;

    @Override
    public Iterable<Relationship> expand(Node node) {

        QueryContext queryContext = new QueryContext(this.relationshipType.name())
          .sort(Sort.INDEXORDER)
      .top(DEFAULT_MAX_HITS); #1
        Iterable<Relationship> hits = null;
        ReadableRelationshipIndex autoIndex = this.graphDatabaseService.index().getRelationshipAutoIndexer().getAutoIndex();
        switch (this.direction){
            case OUTGOING:
                hits = autoIndex.query("type",queryContext, node, null);
                break;
            case INCOMING:
                hits = autoIndex.query("type", queryContext, null, node);
                break;
            case BOTH:
                IndexHits<Relationship> out = autoIndex.query("type", queryContext, node, null);
                IndexHits<Relationship> in = autoIndex.query("type", queryContext, null, node);
                hits = Iterables.concat(out, in);
                break;
        }

        return hits;
    }

}
The only change we made from the previous example is line (#1), where we create QueryContext using the search term, sort order and the number of top hits to return. Everything else is exactly the same as in the IndexedRelationshipExpander.

Let’s now repeat the excersize, traversing the HAS_VISITED relationships on the super-node. Remember, we’re returning 100K top results – still a large enough collection.

Listing 4: Using OrderedIndexedRelationshipExpander to load top 100K relationships

RelationshipType relationshiptype = DynamicRelationshipType.withName("HAS_VISITED");
Direction direction = Direction.BOTH;
TraversalDescription traversal =
    Traversal.description()
    .evaluator(Evaluators.atDepth(1))
    .expand(
        OrderedIndexedRelationhipExpander.create(this.embeddedGraphDatabase, relationshipType, direction)
    );
traversal.traverse(city);

Table 2 shows the measured performance results:

Table 2: OrderedIndexedRelationshipExpander performance results

  

The first-pass hit is under 3 seconds, with warm caches performance stabilizes around 500ms, which is a huge improvement.

We knew that Neo4j sometimes struggles with super nodes traversal. We also knew that Neo4j has great Lucene search engine capabilities built in the server engine.
In this blog we demonstrated that using the Neo store has better performance then Lucene index when fetching large number of relationships from super-nodes. However, the performance was not ideal (over 1 second).

Using some practical common sense, and using Lucene engine for ordering and paging of search results, we managed to increase the performance by more the double.

Warm caches results are still in favour of Neo store node/relationship retrieval - if you can live with slow first pass, then you should consider using the power of Neo engine. If, however, you can take advantage of paging, and if the first request performance is important for your use case, you may consider looking up the relationships in the Lucene index.

This approach is tested for nodes with large number of relationships. Neo engine has been optimized to handle traversals for the "standard" nodes (with up to 1000 relationships for example), so if your domain can be modeled in a graph without any super-nodes, you should try to use only the power of Neo graph engine.

Published at DZone with permission of Aleksa Vukotic, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)