NoSQL Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 524 posts at DZone. You can read more from them at their website. View Full User Profile

Neo4j/Cypher: Redundant Relationships

04.17.2013
| 2845 views |
  • submit to reddit

Last week I was writing a query to find the top scorers in the Premier League so far this season alongside the number of games they’ve played in which initially read like this:

START player = node:players('name:*')
MATCH player-[:started|as_sub]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5

+------------------------------------+
| player.name        | games | goals |
+------------------------------------+
| "Luis Suárez"      | 30    | 22    |
| "Robin Van Persie" | 30    | 19    |
| "Gareth Bale"      | 27    | 17    |
| "Michu"            | 29    | 16    |
| "Demba Ba"         | 28    | 15    |
+------------------------------------+
5 rows
1 ms

I modelled whether a player started a game or came on as a substitute with separate relationship types ‘started’ and ‘as_sub’ but in this query we’re not interested in that, we just want to know whether they played.

In the world of relational database design we tend to try and avoid redundancy but with graphs this isn’t such a big deal so I thought I may as well add a ‘played’ relationship whenever a ‘as_sub’ or ‘started’ one was being created.

We can then simplify the above query to read like this:

START player = node:players('name:*')
MATCH player-[:played]-playedLike-[:in]-game-[r?:scored_in]-player
WITH player, COUNT(DISTINCT game) AS games, COLLECT(r) AS allGoals
RETURN player.name, games, LENGTH(allGoals) AS goals
ORDER BY goals DESC
LIMIT 5

+------------------------------------+
| player.name        | games | goals |
+------------------------------------+
| "Luis Suárez"      | 30    | 22    |
| "Robin Van Persie" | 30    | 19    |
| "Gareth Bale"      | 27    | 17    |
| "Michu"            | 29    | 16    |
| "Demba Ba"         | 28    | 15    |
+------------------------------------+
5 rows
0 ms

When I’m querying I often forget that I modelled starting/substitute separately and think the data has screwed up and it’s always because I’ve forgotten to include the ‘as_sub’ relationship.

Having the ‘played’ relationship means that no longer happens which is cool.

I have a reasonably small data set so I haven’t seen any performance problems from creating this redundancy.

However, since the maximum number of relationships going out from a player would be unlikely to be more than 1000s I don’t think it will become one either.

As always I’d be interested in thoughts from others who have come across similar problems or can see something that I’ve missed.




Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)