Big Data/Analytics Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 553 posts at DZone. You can read more from them at their website. View Full User Profile

Aggregating Relationships within a Path

06.28.2013
| 1660 views |
  • submit to reddit

I recently came across an interesting use case of paths in a graph where we wanted to calculate the frequency of communication between two people by showing how frequently each emailed the other.

The model looked like this:

Emails

which we can create with the following cypher statements:

CREATE (email1 { name: 'Email 1', title: 'Some stuff' })
CREATE (email2 { name: 'Email 2', title: "Absolutely irrelevant" })
CREATE (email3 { name: 'Email 3', title: "Something else" })
CREATE (person1 { name: 'Mark' })
CREATE (person2 { name: 'Jim' })
CREATE (person3 { name: 'Alistair' })
CREATE (person1)-[:SENT]->(email1)
CREATE (person2)-[:RECEIVED]->(email1)
CREATE (person3)-[:RECEIVED]->(email1)
CREATE (person1)-[:SENT]->(email2)
CREATE (person2)-[:RECEIVED]->(email2)
CREATE (person2)-[:SENT]->(email3)
CREATE (person1)-[:RECEIVED]->(email3)

We want to return a list containing pairs of people and how many times they emailed each other, so in this case we want to return a table showing the following:

+-------------------------------------------+
| Person 1 | Person 2 | P1 -> P2 | P2 -> P1 |
|-------------------------------------------|
| Alistair | Mark     | 0        | 1        |
| Jim      | Mark     | 1        | 2        |
+-------------------------------------------+

I started out with the following query which finds all the emails and then traverses out to find the sender and receiver and aggregates them together:

START email = node:node_auto_index('name:"Email 1" name:"Email 2" name: "Email 3"') 
MATCH sender-[:SENT]->email<-[:RECEIVED]-receiver 
RETURN sender.name, receiver.name, COUNT(email) AS count

which returns the following:

==> +-------------------------------------+
==> | sender.name | receiver.name | count |
==> +-------------------------------------+
==> | "Mark"      | "Jim"         | 2     |
==> | "Jim"       | "Mark"        | 1     |
==> | "Mark"      | "Alistair"    | 1     |
==> +-------------------------------------+

It gives us all the pairs that have sent emails to each other but doesn’t group them together which is a bit annoying.

Having realised that it was quite difficult to get that grouping of people if I started from the emails I changed the query to start from people instead and then traverse in on the emails they’d written to each other:

START personA = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"'), 
      personB = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"') 
MATCH personA-[s:SENT]->email<-[r:RECEIVED]-personB 
WHERE personA <> personB 
 
WITH personA, personB, LENGTH(COLLECT(s)) AS AToB 
MATCH personB-[s?:SENT]->email<-[r:RECEIVED]-personA 
RETURN personA.name, personB.name, AToB, LENGTH(COLLECT(s)) AS BToA

Here we start with our list of people and then find all the combinations where personA has sent an email that’s been received by personB.

If we execute just the first half of that we can see the email sent/received combinations:

START personA = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"'), 
      personB = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"') 
MATCH personA-[s:SENT]->email<-[r:RECEIVED]-personB 
WHERE personA <> personB 
RETURN personA, personB, email
==> +---------------------------------------------------------------------------------------------------------+
==> | personA              | personB                  | email                                                 |
==> +---------------------------------------------------------------------------------------------------------+
==> | Node[4]{name:"Mark"} | Node[5]{name:"Jim"}      | Node[1]{name:"Email 1",title:"Some stuff"}            |
==> | Node[4]{name:"Mark"} | Node[5]{name:"Jim"}      | Node[2]{name:"Email 2",title:"Absolutely irrelevant"} |
==> | Node[5]{name:"Jim"}  | Node[4]{name:"Mark"}     | Node[3]{name:"Email 3",title:"Something else"}        |
==> | Node[4]{name:"Mark"} | Node[6]{name:"Alistair"} | Node[1]{name:"Email 1",title:"Some stuff"}            |
==> +---------------------------------------------------------------------------------------------------------+

In the second half of the query we find the inverse relationships but we also want to indicate if one person hasn’t sent any emails to the other one (e.g. Alistair to Mark) which is why I brought in the optional ‘SENT’ relationship. If we run that we get the following:

==> +-------------------------------------------+
==> | personA.name | personB.name | AToB | BToA |
==> +-------------------------------------------+
==> | "Jim"        | "Mark"       | 1    | 2    |
==> | "Mark"       | "Alistair"   | 1    | 0    |
==> | "Mark"       | "Jim"        | 2    | 1    |
==> +-------------------------------------------+

It’s a bit better in that we’ve now got the emails sent by pairs of people on the same row but we still have duplication e.g. Jim/Mark and Mark/Jim show as separate rows even though they show the same information.

We also won’t see a row if Person A hasn’t sent an email to PersonB even if Person B has emailed Person A.

At this stage I was a bit stuck but Max pointed out that we could use the all shortest paths algorithm to solve the problem.

We started out with the following query:

START personA = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"'), 
      personB = node:node_auto_index('name:"Mark" name: "Jim" name: "Alistair"') 
WITH personA,personB 
MATCH p = AllShortestPaths(personA-[:SENT|RECEIVED*..2]-personB)
RETURN p

which returns:

==> +------------------------------------------------------------------------------------------------------------------------------+
==> | p                                                                                                                            |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[4]{name:"Mark"}]                                                                                                       |
==> | [Node[4]{name:"Mark"},:SENT[0] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[1] {},Node[5]{name:"Jim"}]            |
==> | [Node[4]{name:"Mark"},:SENT[3] {},Node[2]{name:"Email 2",title:"Absolutely irrelevant"},:RECEIVED[4] {},Node[5]{name:"Jim"}] |
==> | [Node[4]{name:"Mark"},:RECEIVED[6] {},Node[3]{name:"Email 3",title:"Something else"},:SENT[5] {},Node[5]{name:"Jim"}]        |
==> | [Node[4]{name:"Mark"},:SENT[0] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[2] {},Node[6]{name:"Alistair"}]       |
==> | [Node[5]{name:"Jim"},:RECEIVED[1] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]            |
==> | [Node[5]{name:"Jim"},:RECEIVED[4] {},Node[2]{name:"Email 2",title:"Absolutely irrelevant"},:SENT[3] {},Node[4]{name:"Mark"}] |
==> | [Node[5]{name:"Jim"},:SENT[5] {},Node[3]{name:"Email 3",title:"Something else"},:RECEIVED[6] {},Node[4]{name:"Mark"}]        |
==> | [Node[5]{name:"Jim"}]                                                                                                        |
==> | [Node[5]{name:"Jim"},:RECEIVED[1] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[2] {},Node[6]{name:"Alistair"}]    |
==> | [Node[6]{name:"Alistair"},:RECEIVED[2] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]       |
==> | [Node[6]{name:"Alistair"},:RECEIVED[2] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[1] {},Node[5]{name:"Jim"}]    |
==> | [Node[6]{name:"Alistair"}]                                                                                                   |
==> +------------------------------------------------------------------------------------------------------------------------------+

Since we’re only really interested in the emails that people have sent to each other we’ll narrow down the result set to only include those paths:

START personA = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"'), 
      personB = node:node_auto_index('name:"Mark" name: "Jim" name: "Alistair"') 
WITH personA,personB 
MATCH p = AllShortestPaths(personA-[:SENT|RECEIVED*..2]-personB) 
WHERE ANY (x IN relationships(p) WHERE TYPE(x)= 'SENT') 
RETURN p
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | p                                                                                                                            |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[4]{name:"Mark"},:SENT[0] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[1] {},Node[5]{name:"Jim"}]            |
==> | [Node[4]{name:"Mark"},:SENT[3] {},Node[2]{name:"Email 2",title:"Absolutely irrelevant"},:RECEIVED[4] {},Node[5]{name:"Jim"}] |
==> | [Node[4]{name:"Mark"},:RECEIVED[6] {},Node[3]{name:"Email 3",title:"Something else"},:SENT[5] {},Node[5]{name:"Jim"}]        |
==> | [Node[4]{name:"Mark"},:SENT[0] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[2] {},Node[6]{name:"Alistair"}]       |
==> | [Node[5]{name:"Jim"},:RECEIVED[1] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]            |
==> | [Node[5]{name:"Jim"},:RECEIVED[4] {},Node[2]{name:"Email 2",title:"Absolutely irrelevant"},:SENT[3] {},Node[4]{name:"Mark"}] |
==> | [Node[5]{name:"Jim"},:SENT[5] {},Node[3]{name:"Email 3",title:"Something else"},:RECEIVED[6] {},Node[4]{name:"Mark"}]        |
==> | [Node[6]{name:"Alistair"},:RECEIVED[2] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]       |
==> +------------------------------------------------------------------------------------------------------------------------------+

There’s currently some duplication because we have paths going both ways between people included. e.g.

[Node[4]{name:"Mark"},:SENT[0] {},Node[1]{name:"Email 1",title:"Some stuff"},:RECEIVED[1] {},Node[5]{name:"Jim"}]

and:

[Node[5]{name:"Jim"},:RECEIVED[1] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]

refer to the same thing.

We’ll filter by keeping the starting node which has a higher node id – we could use any property to do this comparison but id will do:

START personA = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"'), 
      personB = node:node_auto_index('name:"Mark" name: "Jim" name: "Alistair"') 
WITH personA,personB 
MATCH p = AllShortestPaths(personA-[:SENT|RECEIVED*..2]-personB) 
WHERE ANY (x IN relationships(p) WHERE TYPE(x)= 'SENT') 
AND ID(head(nodes(p))) > ID(head((tail(tail(nodes(p)))))) 
RETURN p
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | p                                                                                                                            |
==> +------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[5]{name:"Jim"},:RECEIVED[1] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]            |
==> | [Node[5]{name:"Jim"},:RECEIVED[4] {},Node[2]{name:"Email 2",title:"Absolutely irrelevant"},:SENT[3] {},Node[4]{name:"Mark"}] |
==> | [Node[5]{name:"Jim"},:SENT[5] {},Node[3]{name:"Email 3",title:"Something else"},:RECEIVED[6] {},Node[4]{name:"Mark"}]        |
==> | [Node[6]{name:"Alistair"},:RECEIVED[2] {},Node[1]{name:"Email 1",title:"Some stuff"},:SENT[0] {},Node[4]{name:"Mark"}]       |
==> +------------------------------------------------------------------------------------------------------------------------------+

Now we just need to do a bit more manipulation of these paths and we have exactly what we need:

START personA = node:node_auto_index('name:"Mark" name:"Jim" name:"Alistair"'), 
      personB = node:node_auto_index('name:"Mark" name: "Jim" name: "Alistair"') 
WITH personA,personB 
MATCH p = AllShortestPaths(personA-[:SENT|RECEIVED*..2]-personB) 
WHERE ANY (x IN relationships(p) WHERE TYPE(x)= 'SENT') 
AND ID(head(nodes(p))) > ID(head((tail(tail(nodes(p)))))) 
RETURN HEAD(NODES(p)) AS personA, 
       HEAD((TAIL(TAIL(NODES(p))))) AS personB, 
       LENGTH(FILTER(y IN COLLECT(HEAD(RELS(p))): TYPE(y)= 'SENT')) as AToB,
       LENGTH(FILTER(y IN COLLECT(HEAD(TAIL(RELS(p)))): TYPE(y)= 'SENT')) as BToA

What we’ve done here is count the number of ‘SENT’ relationships from person A’s side of the path and then do the same for person B’s side of the path.

There’s currently no way to do slicing of a collection in cypher otherwise we wouldn’t need those nested calls to TAIL!

We do however get the result we want:

==> +---------------------------------------------------------------+
==> | personA                  | personB              | AToB | BToA |
==> +---------------------------------------------------------------+
==> | Node[6]{name:"Alistair"} | Node[4]{name:"Mark"} | 0    | 1    |
==> | Node[5]{name:"Jim"}      | Node[4]{name:"Mark"} | 1    | 2    |
==> +---------------------------------------------------------------+

I have no idea how well this query would perform for any significantly sized data set but it’s an interesting query nonetheless.

Published at DZone with permission of Mark Needham, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)