Dealing with Super-Nodes and Indexed Relationships in Neo4j, Part 2
In the previous post we compared the performance of fetching relationships from densely populated nodes using Neo4j native store and using lucene index. We've seen that we can fetch the small subset of relationships from a super-node (containing ~1M relationships in total) directly from the Lucene index, the performance of the first run (cold-caches) is better then using the Neo store directly. The subsequent runs with caches warmed up show comparable performace, slightly in favour of direct Neo store fetching, sue to low level cache optimizations.
To conclude our research, in this blog entry we're going to take a look
at the performance of fetching large numbers of relationships from the
super-nodes.
Let's remind ourselves with the data set we are using:
- 5 nodes representing cities/towns (for example “London”, “Manchester”, “Edinburgh”, “Cambridge” and “Oxford”)
- Each city node has “IS_IN” relationship to a country node UK, (every city is related to a country node with IS_IN relationship, and we have only one country node - UK)
- We have 1M users, and each of the users have a “HAS_VISITED” relationship to each of the five cities. I know this is not a likely scenario, but in a large data set (say among all Facebook users) you will find more then 5 cities that more then 100K users have visited
- To summarise, we have 1,000,006 nodes (1M users+5 cities + 1 UK country node) and 5, 000,005 relationships (1M users-HAS)_VISITED-5 cities + 5 cities IS_IN UK).
In this example. we're going to load all HAS_VISITED relationships for a city node, which means that we're interested in loading 99.9999% relationships on the given node. We're going to compare the use of the Neo4j API to traverse through relationships, and the IndexedRelationshipExapnder we introduced in the previous blog entry.
To fetch all HAS_VISITED relationships using Neo API, we're going to use similar code as before:
Listing 1: Using Neo4j API to load relationships
Iterable<Relationships> relationships = city.getRelationships(DynamicRelationshipType.withName("IS_IN"));
for (Relationship r : relationships) {
cnt++;
}
To fetch relationships from the Lucene index, we are going to use Neo's Traversal API with our IndexedRelationshipExpanderListing 2: Using IndexedRelationhipExpander (implemented in previous post) to load relationships from Lucene index
RelationshipType relationshiptype = DynamicRelationshipType.withName("HAS_VISITED");
Direction direction = Direction.INCOMING;
TraversalDescription traversal =
Traversal.description()
.evaluator(Evaluators.atDepth(1))
.expand(
IndexedRelationhipExpander.create(this.embeddedGraphDatabase, relationshipType, direction)
);
traversal.traverse(city);
Table 1 shows the performance comparison of the standard expander and
our new indexed expander when fetching all relationships of type
HAS_VISITED from the city node. Remember, each city node will have 1M
HAS_VISITED relationships.Table 1: Performance comparison of the two approaches, by loading all relationships from super-nodes
As you can see, the first run using the standard expander takes just
over 5 seconds, compared with almost 15 seconds using Lucene. Subsequent
runs with warmed-up caches show even more variation, just over 1 second
using the standard expander and around 8 seconds using the indexed
relationships expander. This shows that loading large result sets from
the Lucene index is considerably slower than loading relationships from
the Neo4j store directly. This is probably expected, as Neo4j isn't
designed to be used as a Lucene store, but the graph database.In reality (or at least in the use cases I came across), you rarely want to traverse through huge number of dense relationships on super nodes (such as HAS_VISITED relationship in our example). If you load a small subset of relationships from the super-node, the indexed relationship expander will work well. When you want to retrieve a larger a number of relationships without retrieving them all in one a go, a better approach is to order the hits from the Lucene index in the expander (this is quick in lucene), and only return a configured maximum number of relationship that you can handle in a timely manner. Why load 1M of users that have visited the given city, when you can only display 10 on the screen – load 10 instead!
We implemented OrderedIndexedRelationshipExpander to illustrate how we can use the power of Lucene to perform sorting and return only top N results: Listing 3: Relationship expander that orders and pages the Lucene search results
public class OrderedIndexedRelationhipExpander implements RelationshipExpander{
//The rest of the class is same as in the
//IndexedRelationshipExpander
//Omitted for clarity
public static final int DEFAULT_MAX_HITS = 100000;
@Override
public Iterable<Relationship> expand(Node node) {
QueryContext queryContext = new QueryContext(this.relationshipType.name())
.sort(Sort.INDEXORDER)
.top(DEFAULT_MAX_HITS); #1
Iterable<Relationship> hits = null;
ReadableRelationshipIndex autoIndex = this.graphDatabaseService.index().getRelationshipAutoIndexer().getAutoIndex();
switch (this.direction){
case OUTGOING:
hits = autoIndex.query("type",queryContext, node, null);
break;
case INCOMING:
hits = autoIndex.query("type", queryContext, null, node);
break;
case BOTH:
IndexHits<Relationship> out = autoIndex.query("type", queryContext, node, null);
IndexHits<Relationship> in = autoIndex.query("type", queryContext, null, node);
hits = Iterables.concat(out, in);
break;
}
return hits;
}
}
The only change we made from the previous example is line (#1), where we create QueryContext using the search term, sort order and the number of top hits to return. Everything else is exactly the same as in the IndexedRelationshipExpander.
Let’s now repeat the excersize, traversing the HAS_VISITED relationships
on the super-node. Remember, we’re returning 100K top results – still a
large enough collection.
Listing 4: Using OrderedIndexedRelationshipExpander to load top 100K relationships
RelationshipType relationshiptype = DynamicRelationshipType.withName("HAS_VISITED");
Direction direction = Direction.BOTH;
TraversalDescription traversal =
Traversal.description()
.evaluator(Evaluators.atDepth(1))
.expand(
OrderedIndexedRelationhipExpander.create(this.embeddedGraphDatabase, relationshipType, direction)
);
traversal.traverse(city);Table 2 shows the measured performance results:
Table 2: OrderedIndexedRelationshipExpander performance results
The first-pass hit is under 3 seconds, with warm caches performance stabilizes around 500ms, which is a huge improvement.
We knew that Neo4j sometimes struggles with super nodes traversal. We
also knew that Neo4j has great Lucene search engine capabilities built
in the server engine.
In this blog we demonstrated that using the Neo store has better
performance then Lucene index when fetching large number of
relationships from super-nodes. However, the performance was not ideal
(over 1 second).
Warm caches results are still in favour of Neo store node/relationship retrieval - if you can live with slow first pass, then you should consider using the power of Neo engine. If, however, you can take advantage of paging, and if the first request performance is important for your use case, you may consider looking up the relationships in the Lucene index.
This approach is tested for nodes with large number of relationships. Neo engine has been optimized to handle traversals for the "standard" nodes (with up to 1000 relationships for example), so if your domain can be modeled in a graph without any super-nodes, you should try to use only the power of Neo graph engine.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





