Gmail Email Analysis with Neo4j - and Spreadsheets
Curator's Note: The content of this article was written by Rik Van Bruggen on the Neo4j blog.
bunch of different graphistas have pointed out to me in recent months
that there is something funny about Graphs and email. Specifically,
about graphs and email analysis. From my work in previous years at
security companies, I know that Email Forensics is actually big
business. Figuring out who emails whom, about what topics, with what
frequency, at what times - is important. Especially when the proverbial
sh*t hits the fan and fraud comes to light - like in the Enron case.
How do I get insight into email traffic? How do I know what was
communicated to who? And how do I get that insight, without spending a
So a couple of days ago I came across an article or two that features Gmail Meter. This is of course not one of these powerful enterprise-ready forensics tools, but it is interesting. Written by Romain Vialard, it provides you with a really straightforward way to get all kinds of stats and data about your use of Gmail. It does so using a Google Apps Script, that is available to anyone using Google Docs’ spreadsheet functionality. In this blog post, we’ll take a look at how we can actually use Romain’s output, and generate a neo4j database that will allow us to visually and graph-ically explore the email traffic in our gmail inboxes - without doing doing any coding of course. Because I don’t know how to do that - but I do do spreadsheets, as you know by now.
Using Gmail Meter to create the dataset
The first thing you need to do to get going is to get Gmail Meter installed. To do that, you just create a Google Doc Spreadsheet, and insert Gmail Meter from the Script Gallery.
You will need to give it permission to analyse your mailbox (and potentially remove that when you want to uninstall it), but that’s easy. More instructions can be found over here should you need that. But really it’s dead easy. Once you have given it permission that, the Apps Scripts starts churning away at your mailbox.
The end result is a Google spreadsheet with two tabs:
Importing the Gmail Meter data into Neo4j
For the graph import that I will illustrate here, we will only be using the first sheet. Basically just getting to grips with
The way I have done it - which is probably not the only way to do it, but still - is to add two sheets to the workbook coming from Gmail Meter.
Note that - as mentioned above - in the Graph sheet we have, additional to the nodes and relationships, also added the number of emails to the “EMAIL” relationship as a property/weight. This will come in handy later when visualising the email traffic. In a larger graph you could actually use these weights for pathfinding algorithms as well - in case you would want to find out the volume of email traffic between two persons at different places in the graph, through other people. You see that I am using the exact same techniques of my spreadsheet import blog post to generate the Cypher statements required. All I need to do after that is to put all statements into the “Cypher” worksheet, wrap it with a transaction - and copy paste that into the neo4j shell of my empty neo4j database. That will execute the queries, insert all nodes, relationships and properties, and voila - we have our Gmail Meter Graph!
Exploring the Gmail Meter Graph
Once we have generated the database, we can explore it in all the ways that we can with Neo4j’s traditional visualisation/querying tools. First thing I did of course is to look at it with Webadmin. That gives you some ideas already, but things get a lot more interesting when you can visualise the weights (= numbers of emails between two persons).
To do that, I plugged another visualisation tool on top of the database to get a feel for the weight of certain relationships. Our friends at Linkurio.us actually have some very neat (and above all: simple AND powerful) ways to do this - as you can see below it immediately gave me an idea of where the traffic is coming from and going to. Obviously we can explore the network, and also query it with Cypher for interesting relationships:
START mymail=node:node_auto_index(name="firstname.lastname@example.org") MATCH mymail-[email:EMAIL]->otherperson RETURN id(otherperson), email.number, otherperson.name;
START mymail=node:node_auto_index(name="email@example.com") MATCH mymail<-[email:EMAIL]-otherperson WHERE email.number > 4 RETURN id(otherperson), email.number, otherperson.name;
START mymail=node:node_auto_index(name="firstname.lastname@example.org") MATCH mymail<-[email:EMAIL]-otherperson, mymail-[email2:EMAIL]->otherperson WHERE email.number > 2 AND email2.number > 2 RETURN id(otherperson), email.number as From, email2.number as To,
I hope you understand that - because we are only looking at data from one and only one mailbox, the dataset’s power is quite limited. But I am hoping you get the point that the graph exploration of this dataset is great. I did a little experiment where I actually put my professional email data and my personal email data (both use Gmail) into one neo4j database - and that was really interesting.
For those of you wanting more detailed info on this topic, I would encourage you to take a look at the wonderful Graph Databases book, that has a specific chapter about email analysis.
Hoping this was useful - it sure was a learning experience (again) for me.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)