Max De Marzi is a seasoned web developer. He started building websites in 1996 and has worked with Ruby on Rails since 2006. The web forced Max to wear many hats and master a wide range of technologies. He can be a system admin, database developer, graphic designer, back-end engineer and data scientist in the course of one afternoon. Max is a graph database enthusiast. He built the Neography Ruby Gem, a REST API wrapper for the Neo4j Graph Database. He is addicted to learning new things, and loves a challenge and finding pragmatic solutions. Max is very easy to work with, focuses under pressure and has the patience of a rock. Max is a DZone MVB, is not an employee of DZone, and has posted 57 posts at DZone. You can read more from him at his website.

My Entry for the HCIR Challenge

07.25.2012

A tweet from RiparianData caught my eye the other day:

RiparianData @RiparianData

This year's challenge: expert and expertise discovery (useful in, say in job candidate selection) ripar.in/LJ0xOj @dtunkelang

9 Jul 12

I built getvouched.com with this idea of “expert and expertise discovery”, using skill-based vouching adjusted by the distance from searcher to target to determine rank. So I dug in and found out that Human-Computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.

The HCIR challenge for this year's symposium includes “hiring,” “assembling a conference program,” and “finding people to deliver patent research or expert testimony” as summarized by Patrick Durusau.

I was late to the party (as the deadline to get access to the Mendeley data had passed) but William Gunn and Daniel Tunkelang were kind enough to grant me access.

I got the data via Dropbox. It is mostly tab-separated, with one exception: a JSON dump of publications.

I needed to import this into Neo4j, so I followed the examples from Batch Importer Part 2 and Batch Importer Part 3 to do some ETL, but first I needed to load the data into PostgreSQL so I could match up the two formats. I’ve outlined how I did this on the HCIR GitHub repo.
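To give a feel for what that ETL step involves, here is a minimal Python sketch that turns tab-separated rows into a node file and a relationship file with sequential node ids. Everything specific in it is an assumption made up for illustration: the column names (`profile_id`, `name`, `group_id`), the sample rows, and the output layout. It also matches ids with an in-memory dictionary instead of PostgreSQL, so it is not the process documented in the repo, just the general shape of the problem.

```python
import csv
import io

# Hypothetical sample input: one row per (profile, group) membership.
# The real Mendeley files have their own columns; these are placeholders.
RAW = (
    "profile_id\tname\tgroup_id\n"
    "P1\tAlice Example\tG1\n"
    "P1\tAlice Example\tG2\n"
    "P2\tBob Example\tG1\n"
)


def build_import_rows(tsv_text):
    """Assign sequential node ids and emit node and relationship rows."""
    node_ids = {}   # (type, source key) -> sequential node id
    nodes = []      # rows: [node_id, type, name]
    rels = []       # rows: [start_id, end_id, relationship type]

    def node_id(kind, key, name):
        # Reuse the id if this entity was already seen, otherwise create it.
        if (kind, key) not in node_ids:
            node_ids[(kind, key)] = len(node_ids) + 1
            nodes.append([node_ids[(kind, key)], kind, name])
        return node_ids[(kind, key)]

    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        profile = node_id("profile", row["profile_id"], row["name"])
        group = node_id("group", row["group_id"], row["group_id"])
        rels.append([profile, group, "member_of"])
    return nodes, rels


def to_tsv(header, rows):
    """Serialize rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


nodes, rels = build_import_rows(RAW)
print(to_tsv(["node_id", "type", "name"], nodes))
print(to_tsv(["start", "end", "type"], rels))
```

The important part is the id lookup: every source key has to map to exactly one node id so that relationships coming from different files end up pointing at the same node.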

What I ended up with was this graph:

publication -[:by_discipline]->      discipline 
publication -[:by_country]->         country
publication -[:by_academic_status]-> academic_status
publication -[:authored_by]->        author
publication -[:published_in]->       journal
author      -[:has_profile]->        profile
profile     -[:interested_in]->      discipline      
profile     -[:member_of]->          group
profile     -[:knows]->              profile

I also created a “vertices” full text index and an “edges” full text index to make my life easier. Just to make sure it imported ok I tested with:

START authors = node:vertices('type:author')
RETURN authors.name
LIMIT 3;

Looking good:

==> +------------------+
==> | authors.name     |
==> +------------------+
==> | "Dominik Papies" |
==> | "Felix Eggers"   |
==> | "Nils Wlömert"   |
==> +------------------+

I wonder who the most prolific author is?

START authors = node:vertices('type:author') 
MATCH authors <-[:authored_by]- publication
RETURN authors.name, count(publication) AS cnt
ORDER BY cnt DESC
LIMIT 5;

“Timothy E Hewett” has authored the most publications in our sample data set.

==> +--------------------------+
==> | authors.name       | cnt |
==> +--------------------------+
==> | "Timothy E Hewett" | 339 |
==> | "Gregory D Myer"   | 226 |
==> | "Kevin R Ford"     | 202 |
==> | "Felix Gugerli"    | 144 |
==> | "K Darowicki"      | 143 |
==> +--------------------------+

I wonder how many co-authors he has?

START author = node:vertices('type:author AND name:"Timothy E Hewett"') 
MATCH author <-[:authored_by]- publication -[:authored_by]-> co_authors
RETURN count(DISTINCT co_authors);

That’s a ton of co-authors.

==> +----------------------------+
==> | count(DISTINCT co_authors) |
==> +----------------------------+
==> | 280                        |
==> +----------------------------+

Let’s pick one author from above and focus in on them.

START me = node:vertices('name:"Felix Eggers"')
RETURN me;

Looks like we have him as an author, and we have his profile as well.

==> +-------------------------------------------------------------------+
==> | me                                                                |
==> +-------------------------------------------------------------------+
==> | Node[17]{name:"Felix Eggers",type:"author",node_id:"17"}          |
==> | Node[400573]{name:"Felix Eggers",type:"profile",node_id:"400573"} |
==> +-------------------------------------------------------------------+

So let’s say that Felix is trying to hire someone like him or assemble a conference program on a research topic he is interested in. We can try to find people who are like Felix in a number of different ways:

By Contacts:

We can start with the simple case: who does Felix know?

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5; 

5 out of the 7 authors Felix knows:

==> +-------------------+
==> | profiles.name     |
==> +-------------------+
==> | "Jens Hogreve"    |
==> | "Mathias Lin"     |
==> | "Fabian Eggers"   |
==> | "Tillmann Wagner" |
==> | "Andreas Neus"    |
==> +-------------------+

Felix doesn’t know a whole lot of other authors, so let’s expand his network one more level.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> () -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5; 

5 out of the 16 contacts his contacts know:

==> +------------------------+
==> | profiles.name          |
==> +------------------------+
==> | "Victor Henning"       |
==> | "Jens Hogreve"         |
==> | "Charles Hofacker"     |
==> | "Stephanie Feiereisen" |
==> | "Alexander Stich"      |
==> +------------------------+
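Notice that Jens Hogreve shows up at both levels: the two-hop pattern does not filter out people Felix already knows directly. As a rough sketch of the filtered version, here is the same traversal in plain Python over a toy `knows` mapping. The names and the `second_level_contacts` helper are made up for illustration and are not part of the Mendeley data or of Neo4j.

```python
# Toy "knows" graph: profile -> set of profiles it knows (all names are made up).
KNOWS = {
    "me": {"ann", "bob"},
    "ann": {"bob", "cat", "me"},
    "bob": {"dan"},
    "cat": set(),
    "dan": set(),
}


def second_level_contacts(knows, start):
    """Profiles two :knows hops away that are not already direct contacts."""
    direct = knows.get(start, set())
    two_hops = set()
    for contact in direct:
        two_hops |= knows.get(contact, set())
    # Drop the starting profile and anyone reachable in a single hop.
    return two_hops - direct - {start}


print(sorted(second_level_contacts(KNOWS, "me")))  # ['cat', 'dan']
```

Here `bob` is reachable in two hops through `ann`, but he is dropped because he is already a direct contact.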

Members of the same groups:

Let’s see what research groups Felix is in, and who else is in those groups.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT group.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

We find Jeremy and Michael are in the same group as Felix.

==> +-----------------------------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT group.name)                          |
==> +-----------------------------------------------------------------------------+
==> | "Jeremy Chen"       | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> | "Michael Waltinger" | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> +-----------------------------------------------------------------------------+

Are they in any other groups that can help us expand Felix’s network?

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members 
         -[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

Some of those folks are in a ton of groups, so let’s just count them to make the results easier to display.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members 
         -[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COUNT( DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

==> +-----------------------------------------------------------+
==> | other_profiles.name   | COUNT( DISTINCT other_group.name) |
==> +-----------------------------------------------------------+
==> | "ABDUL SALAM YUSSIF"  | 12                                |
==> | "Nicholas Overton"    | 9                                 |
==> | "Ashley Cooke"        | 7                                 |
==> | "Joe Reevy"           | 6                                 |
==> | "Moeez Khademhoseiny" | 6                                 |
==> +-----------------------------------------------------------+
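To make the shape of that traversal easier to follow, here is a small Python sketch of the same idea: go from my groups to my group-mates, from them to their other groups, and from those groups to the other members, counting the distinct linking groups per person. The membership data and the `profiles_via_teammates` helper are hypothetical. The sketch also skips groups I already belong to, which is a simplification rather than an exact translation of the query.

```python
from collections import defaultdict

# Toy membership data: profile -> groups it belongs to (names are made up).
MEMBER_OF = {
    "me": {"g1"},
    "tom": {"g1", "g2", "g3"},
    "uma": {"g2", "g3"},
    "vic": {"g3"},
    "wes": {"g4"},
}


def profiles_via_teammates(member_of, start):
    """Count, per profile, the distinct groups that link it to start's group-mates."""
    # Invert the mapping so we can go from a group to its members.
    members = defaultdict(set)
    for profile, groups in member_of.items():
        for group in groups:
            members[group].add(profile)

    my_groups = member_of[start]
    teammates = set().union(*(members[g] for g in my_groups)) - {start}

    linking_groups = defaultdict(set)
    for mate in teammates:
        for other_group in member_of[mate] - my_groups:
            for other in members[other_group] - {start, mate}:
                linking_groups[other].add(other_group)

    counts = {name: len(groups) for name, groups in linking_groups.items()}
    # Highest count first, ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(profiles_via_teammates(MEMBER_OF, "me"))  # [('uma', 2), ('vic', 1)]
```

In the toy data `uma` ranks first because two of `tom`'s other groups lead to her, while `wes` never appears since nothing connects his group to mine.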

Co-Authors:

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- publication -[:authored_by]-> co_authors 
RETURN DISTINCT co_authors.name; 

These folks co-authored a publication with Felix, so they must like working together, and share similar research interests.

==> +--------------------------+
==> | co_authors.name          |
==> +--------------------------+
==> | "Victor Henning"         |
==> | "Thorsten Hennig-Thurau" |
==> | "Dominik Papies"         |
==> | "Fabian Eggers"          |
==> | "Henrik Sattler"         |
==> | "Mark B Houston"         |
==> | "Nils Wlömert"           |
==> +--------------------------+

That’s not a ton of people. Let’s try his 2nd level co-author network:

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors 
         <-[:authored_by]- their_publications -[:authored_by]-> their_co_authors
WHERE me <> their_co_authors 
  AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> their_co_authors)
RETURN DISTINCT their_co_authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5; 

We are excluding Felix and his co-authors from the result. I found 18, but here are the top 5:

==> +-----------------------------+
==> | their_co_authors.name | cnt |
==> +-----------------------------+
==> | "Jan Reichelt"        | 27  |
==> | "Jason J Hoyt"        | 21  |
==> | "James Hammerton"     | 15  |
==> | "Kris Jack"           | 15  |
==> | "Dan Harvey"          | 15  |
==> +-----------------------------+
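The cnt column here is a count of paths, not of publications or people, so someone who can be reached through several publications and several co-authors floats to the top. A minimal Python sketch of that path counting, assuming a toy publication-to-authors mapping (all names and the `second_level_coauthors` helper are made up), looks like this. It excludes every direct co-author, which is the intent described above.

```python
from collections import Counter

# Toy data: publication -> authors (all names are made up).
AUTHORED_BY = {
    "p1": ["me", "ann"],
    "p2": ["me", "ann", "bob"],
    "p3": ["ann", "cat"],
    "p4": ["bob", "cat", "dan"],
}


def second_level_coauthors(authored_by, start):
    """Count paths start -> my pub -> co-author -> their pub -> their co-author."""
    my_pubs = [p for p, authors in authored_by.items() if start in authors]
    direct = {a for p in my_pubs for a in authored_by[p]} - {start}

    paths = Counter()
    for my_pub in my_pubs:
        for co_author in authored_by[my_pub]:
            if co_author == start:
                continue
            for their_pub, authors in authored_by.items():
                # A path may not reuse the publication it arrived through.
                if their_pub == my_pub or co_author not in authors:
                    continue
                for candidate in authors:
                    # Skip the co-author, the start node, and direct co-authors.
                    if candidate in (co_author, start) or candidate in direct:
                        continue
                    paths[candidate] += 1
    return paths.most_common()


print(second_level_coauthors(AUTHORED_BY, "me"))  # [('cat', 3), ('dan', 1)]
```

In the toy data `cat` gets a count of 3 even though she only appears on two publications, because `ann` is reachable through two of my publications and each of those routes counts as its own path.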

In the same Journal:

We can also take a look at authors who appeared in the same Journal as Felix, since Journals are usually topic specific and curated for high quality content.

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:published_in]-> journal 
         <-[:published_in]- other_publications -[:authored_by]-> authors 
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5; 

I found 35 authors who were published in the same journals, but here are the top 5:

==> +--------------------------------+
==> | authors.name             | cnt |
==> +--------------------------------+
==> | "Thorsten Hennig-Thurau" | 7   |
==> | "Victor Henning"         | 4   |
==> | "Henrik Sattler"         | 4   |
==> | "Tillmann Wagner"        | 4   |
==> | "Richard J Lutz"         | 2   |
==> +--------------------------------+

We can go to a 2nd level here by using his co-authors:

START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors 
         <-[:authored_by]- their_publications -[:published_in]-> journal 
         <-[:published_in]- other_publications -[:authored_by]-> authors 
WHERE me <> authors 
  AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> authors)
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5; 

That query returns 19.5k authors; here are the top 5:

==> +-------------------------+
==> | authors.name      | cnt |
==> +-------------------------+
==> | "Thomas Cochrane" | 444 |
==> | "Amanda Peters"   | 360 |
==> | "David Jones"     | 336 |
==> | "DJ Riddell"      | 312 |
==> | "J Lavoué"        | 300 |
==> +-------------------------+

This list represents authors who have appeared in the same journals as his co-authors, ordered by the number of paths that exist to them.

Interested in the same Disciplines:

We can actually go multiple ways here.

From his profile we can go to disciplines and find other profiles who are into the same disciplines.

START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:interested_in]-> disciplines <-[:interested_in]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT disciplines.name)
ORDER BY COUNT(*) DESC
LIMIT 5; 

That’s going to return a ton of people who are also into Business Administration. Here are 5 of them:

==> +----------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT disciplines.name) |
==> +----------------------------------------------------------+
==> | "John Smith"        | ["Business Administration"]        |
==> | "Andreas Müller"    | ["Business Administration"]        |
==> | "abc abc"           | ["Business Administration"]        |
==> | "abc def"           | ["Business Administration"]        |
==> | "Luis Farinha"      | ["Business Administration"]        |
==> +----------------------------------------------------------+

Since we know Felix is interested in Business Administration, we can also go from disciplines to publications, to other authors who may not have a profile in the system.

START me = node:vertices('type:discipline AND name:"Business Administration"')
MATCH me -[:by_discipline]- publications -[:authored_by]- author
RETURN author.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;

==> +---------------------------+
==> | author.name         | cnt |
==> +---------------------------+
==> | "Null Mancas Matei" | 11  |
==> | "Joanne Dyer"       | 8   |
==> | "Nicholas J Turro"  | 8   |
==> | "Steffen Jockusch"  | 4   |
==> | "Angel A Martí"     | 4   |
==> +---------------------------+

Anyway, that was just a bit of exploring of the data with Neo4j and Cypher. I’ll try to build a website that makes use of these queries before the August 31st deadline. Leave a comment if you have any ideas or want to help.

Published at DZone with permission of Max De Marzi, author and DZone MVB. (source)
