My Entry for the HCIR Challenge
A tweet from RiparianData caught my eye the other day:
RiparianData @RiparianData
This year's #HCIR challenge: expert and expertise discovery (useful in, say in job candidate selection) ripar.in/LJ0xOj @dtunkelang
9 Jul 12
I built getvouched.com around this idea of “expert and expertise discovery,” using skill-based vouching adjusted by the distance from searcher to target as a way to rank people. So I dug in and found out that Human-Computer Information Retrieval (HCIR) combines research from the fields of human-computer interaction (HCI) and information retrieval (IR), placing an emphasis on human involvement in search activities.
The HCIR challenge for this year’s symposium includes “hiring,” “assembling a conference program,” and “finding people to deliver patent research or expert testimony,” as summarized by Patrick Durusau.
I was late to the party (as the deadline to get access to the Mendeley data had passed) but William Gunn and Daniel Tunkelang were kind enough to grant me access.
I got the data via Dropbox. It is mostly tab-separated data, with one exception: a JSON dump of publications.
I needed to import this into Neo4j, so I followed the examples from Batch Importer Part 2 and Batch Importer Part 3 to do some ETL, but first I needed to load the data into PostgreSQL so I could match up the two formats. I’ve outlined how I did this on the HCIR GitHub repo.
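To give a feel for this kind of ETL step, here is a minimal Python sketch that turns tab-separated rows into node rows and relationship rows with integer ids. The two-column layout (publication, author), the sample rows, and the function name `build_rows` are assumptions made for illustration; they are not the actual Mendeley file layout or the scripts in the repo.

```python
import csv
import io

def build_rows(tsv_text):
    """Turn 'publication<TAB>author' lines into node and relationship rows."""
    node_ids = {}   # (type, name) -> integer id, assigned on first appearance
    rels = []       # (start_id, end_id, relationship_type)

    def node_id(node_type, name):
        key = (node_type, name)
        if key not in node_ids:
            node_ids[key] = len(node_ids) + 1
        return node_ids[key]

    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for publication, author in reader:
        start = node_id("publication", publication)
        end = node_id("author", author)
        rels.append((start, end, "authored_by"))

    nodes = [(i, node_type, name) for (node_type, name), i in node_ids.items()]
    return nodes, rels

# Hypothetical sample input, not real challenge data.
sample = "Paper A\tAlice\nPaper A\tBob\nPaper B\tAlice\n"
nodes, rels = build_rows(sample)
```

Each distinct (type, name) pair gets one id, so an author who appears on several publications becomes a single node with several `authored_by` relationships pointing at it.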
What I ended up with was this graph:
publication -[:by_discipline]-> discipline
publication -[:by_country]-> country
publication -[:by_academic_status]-> academic_status
publication -[:authored_by]-> author
publication -[:published_in]-> journal
author -[:has_profile]-> profile
profile -[:interested_in]-> discipline
profile -[:member_of]-> group
profile -[:knows]-> profile
I also created a “vertices” full text index and an “edges” full text index to make my life easier. Just to make sure it imported ok I tested with:
START authors = node:vertices('type:author')
RETURN authors.name
LIMIT 3;
Looking good:
==> +------------------+
==> | authors.name     |
==> +------------------+
==> | "Dominik Papies" |
==> | "Felix Eggers"   |
==> | "Nils Wlömert"   |
==> +------------------+
I wonder who the most prolific author is?
START authors = node:vertices('type:author')
MATCH authors <-[:authored_by]- publication
RETURN authors.name, count(publication) AS cnt
ORDER BY cnt DESC
LIMIT 5;
“Timothy E Hewett” has authored the most publications in our sample data set.
==> +--------------------------+
==> | authors.name       | cnt |
==> +--------------------------+
==> | "Timothy E Hewett" | 339 |
==> | "Gregory D Myer"   | 226 |
==> | "Kevin R Ford"     | 202 |
==> | "Felix Gugerli"    | 144 |
==> | "K Darowicki"      | 143 |
==> +--------------------------+
I wonder how many co-authors he has?
START author = node:vertices('type:author AND name:"Timothy E Hewett"')
MATCH author <-[:authored_by]- publication -[:authored_by]-> co_authors
RETURN count(DISTINCT co_authors);
That’s a ton of co-authors.
==> +----------------------------+
==> | count(DISTINCT co_authors) |
==> +----------------------------+
==> | 280                        |
==> +----------------------------+
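For readers less familiar with Cypher, the two aggregations above amount to counting `authored_by` relationships per author and then collecting the distinct people who share a publication with one author. A minimal Python sketch of the same idea follows; the publication ids and author names are made up and are not the challenge data.

```python
from collections import Counter

# Hypothetical publication -> authors mapping (not the challenge data).
publications = {
    "p1": {"alice", "bob"},
    "p2": {"alice"},
    "p3": {"alice", "carol"},
    "p4": {"bob"},
}

def publication_counts(pubs):
    """Number of publications per author, like count(publication) grouped by author."""
    return Counter(author for authors in pubs.values() for author in authors)

def co_authors(pubs, author):
    """Distinct people who share at least one publication with `author`."""
    found = set()
    for authors in pubs.values():
        if author in authors:
            found |= authors
    return found - {author}

top_author, top_count = publication_counts(publications).most_common(1)[0]
partners = co_authors(publications, top_author)
```

Using a set for `partners` plays the role of `count(DISTINCT co_authors)`: someone who co-authored several papers with the same person is still counted once.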
Let’s pick one author from above and focus in on them.
START me = node:vertices('name:"Felix Eggers"')
RETURN me;
Looks like we have him as an author, and we have his profile as well.
==> +-------------------------------------------------------------------+
==> | me |
==> +-------------------------------------------------------------------+
==> | Node[17]{name:"Felix Eggers",type:"author",node_id:"17"} |
==> | Node[400573]{name:"Felix Eggers",type:"profile",node_id:"400573"} |
==> +-------------------------------------------------------------------+
So let’s say that Felix is trying to hire someone like him or assemble a conference program of a research topic he is interested in. We can try to find people who are like Felix a number of different ways:
By Contacts:
We can start with the simple case: who does Felix know?
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5;
5 out of the 7 authors Felix knows:
==> +-------------------+
==> | profiles.name     |
==> +-------------------+
==> | "Jens Hogreve"    |
==> | "Mathias Lin"     |
==> | "Fabian Eggers"   |
==> | "Tillmann Wagner" |
==> | "Andreas Neus"    |
==> +-------------------+
Felix doesn’t know a whole lot of other authors, so let’s expand his network one more level.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:knows]-> () -[:knows]-> profiles
RETURN DISTINCT profiles.name
LIMIT 5;
5 out of the 16 contacts his contacts know:
==> +------------------------+
==> | profiles.name          |
==> +------------------------+
==> | "Victor Henning"       |
==> | "Jens Hogreve"         |
==> | "Charles Hofacker"     |
==> | "Stephanie Feiereisen" |
==> | "Alexander Stich"      |
==> +------------------------+
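Note that the two-hop pattern does not filter out people Felix already knows directly (Jens Hogreve shows up in both lists). Here is a minimal Python sketch of a two-hop traversal with an optional filter for that; the `knows` graph, the names in it, and the `exclude_known` flag are all hypothetical.

```python
# Hypothetical directed "knows" edges between profiles.
knows = {
    "me": {"ann", "ben"},
    "ann": {"ben", "cat"},
    "ben": {"dan", "me"},
    "cat": set(),
    "dan": set(),
}

def second_level(graph, start, exclude_known=False):
    """Profiles reachable by following exactly two 'knows' edges from start."""
    found = set()
    for contact in graph.get(start, ()):
        found.update(graph.get(contact, ()))
    if exclude_known:
        # Drop the starting profile and anyone who is already a direct contact.
        found -= graph.get(start, set()) | {start}
    return found

everyone = second_level(knows, "me")
new_only = second_level(knows, "me", exclude_known=True)
```

With the filter on, only genuinely new people remain, which is usually what you want when the goal is to expand someone's network.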
Members of the same groups:
Let’s see what research groups Felix is in, and who else is in those groups.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
We find Jeremy and Michael are in the same group as Felix.
==> +-----------------------------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT group.name)                          |
==> +-----------------------------------------------------------------------------+
==> | "Jeremy Chen"       | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> | "Michael Waltinger" | ["Conjoint Analysis and Discrete Choice Experiments"] |
==> +-----------------------------------------------------------------------------+
Are they in any other groups that can help us expand Felix’s network?
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members
-[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
Some of those folks are in a ton of groups, let’s just count them so it will be easier to display.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:member_of]-> group <-[:member_of]- team_members
-[:member_of]-> other_group <-[:member_of]- other_profiles
RETURN DISTINCT other_profiles.name, COUNT( DISTINCT other_group.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
==> +-----------------------------------------------------------+
==> | other_profiles.name   | COUNT( DISTINCT other_group.name) |
==> +-----------------------------------------------------------+
==> | "ABDUL SALAM YUSSIF"  | 12                                |
==> | "Nicholas Overton"    | 9                                 |
==> | "Ashley Cooke"        | 7                                 |
==> | "Joe Reevy"           | 6                                 |
==> | "Moeez Khademhoseiny" | 6                                 |
==> +-----------------------------------------------------------+
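The first group query in this section boils down to a set intersection: for every other profile, keep the groups it has in common with the starting profile. A minimal Python sketch, using a made-up `member_of` mapping and hypothetical group names:

```python
# Hypothetical member_of edges: profile -> set of group names.
member_of = {
    "me": {"g1", "g2"},
    "ann": {"g1"},
    "ben": {"g1", "g2"},
    "cat": {"g3"},
}

def shared_groups(memberships, me):
    """Pair each other profile with the sorted groups it shares with `me`,
    ordered by number of shared groups (most first), then by name."""
    mine = memberships[me]
    shared = {}
    for profile, groups in memberships.items():
        common = mine & groups
        if profile != me and common:
            shared[profile] = sorted(common)
    return sorted(shared.items(), key=lambda item: (-len(item[1]), item[0]))

result = shared_groups(member_of, "me")
```

Ranking by the number of shared groups is one reasonable choice here; it favors people whose interests overlap the most.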
Co-Authors:
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- publication -[:authored_by]-> co_authors
RETURN DISTINCT co_authors.name;
These folks co-authored a publication with Felix, so they must like working together, and share similar research interests.
==> +--------------------------+
==> | co_authors.name          |
==> +--------------------------+
==> | "Victor Henning"         |
==> | "Thorsten Hennig-Thurau" |
==> | "Dominik Papies"         |
==> | "Fabian Eggers"          |
==> | "Henrik Sattler"         |
==> | "Mark B Houston"         |
==> | "Nils Wlömert"           |
==> +--------------------------+
That’s not a ton of people, let’s try his 2nd level co-author network:
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors
<-[:authored_by]- their_publications -[:authored_by]-> their_co_authors
WHERE me <> their_co_authors
AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> their_co_authors)
RETURN DISTINCT their_co_authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
We are excluding Felix and his co-authors from the result. I found 18, but here are the top 5:
==> +-----------------------------+
==> | their_co_authors.name | cnt |
==> +-----------------------------+
==> | "Jan Reichelt"        | 27  |
==> | "Jason J Hoyt"        | 21  |
==> | "James Hammerton"     | 15  |
==> | "Kris Jack"           | 15  |
==> | "Dan Harvey"          | 15  |
==> +-----------------------------+
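Here is a minimal Python sketch of the second-level co-author idea. It follows the stated intent (skip the starting author and all of their direct co-authors) rather than translating the Cypher line by line, and the publication ids and author names are hypothetical.

```python
from collections import Counter

# Hypothetical publication -> authors mapping.
publications = {
    "p1": {"me", "ann"},
    "p2": {"ann", "ben"},
    "p3": {"ann", "ben", "cat"},
    "p4": {"me", "cat"},
}

def second_level_co_authors(pubs, me):
    """Count paths me -> publication -> co-author -> publication -> other author,
    skipping `me` and anyone who already shares a publication with `me`."""
    my_pubs = [p for p, authors in pubs.items() if me in authors]
    direct = set().union(*(pubs[p] for p in my_pubs)) - {me}
    counts = Counter()
    for my_pub in my_pubs:
        for co_author in pubs[my_pub] - {me}:
            for their_pub, authors in pubs.items():
                if their_pub == my_pub or co_author not in authors:
                    continue
                for other in authors - {co_author}:
                    if other != me and other not in direct:
                        counts[other] += 1
    return counts

counts = second_level_co_authors(publications, "me")
```

In this toy data only `ben` is a second-level co-author, and he is reached three times: twice through `ann` and once through `cat`. That is why the counts in results like the one above can be larger than the number of papers involved.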
In the same Journal:
We can also take a look at authors who appeared in the same Journal as Felix, since Journals are usually topic-specific and curated for high-quality content.
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:published_in]-> journal
<-[:published_in]- other_publications -[:authored_by]-> authors
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
I found 35 authors who were published in the same journals, but here are the top 5:
==> +--------------------------------+
==> | authors.name             | cnt |
==> +--------------------------------+
==> | "Thorsten Hennig-Thurau" | 7   |
==> | "Victor Henning"         | 4   |
==> | "Henrik Sattler"         | 4   |
==> | "Tillmann Wagner"        | 4   |
==> | "Richard J Lutz"         | 2   |
==> +--------------------------------+
We can go to a 2nd level here by using his co-authors:
START me = node:vertices('type:author AND name:"Felix Eggers"')
MATCH me <-[:authored_by]- my_publications -[:authored_by]-> co_authors
<-[:authored_by]- their_publications -[:published_in]-> journal
<-[:published_in]- other_publications -[:authored_by]-> authors
WHERE me <> authors
AND NOT(me <-[:authored_by]- my_publications -[:authored_by]-> authors)
RETURN DISTINCT authors.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
That query returns 19.5k authors; here are the top 5:
==> +-------------------------+
==> | authors.name      | cnt |
==> +-------------------------+
==> | "Thomas Cochrane" | 444 |
==> | "Amanda Peters"   | 360 |
==> | "David Jones"     | 336 |
==> | "DJ Riddell"      | 312 |
==> | "J Lavoué"        | 300 |
==> +-------------------------+
This list represents authors who have appeared in the same journals as his co-authors ordered by the number of paths that exist to them.
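Counting paths rather than distinct publications means the numbers can grow multiplicatively: every publication a co-author has in a journal pairs up with every publication another author has in that same journal. A minimal Python sketch with made-up data (the publication ids, the journal name, and the author names are all hypothetical) shows an author with only two publications ending up with a count of four.

```python
from collections import Counter

# Hypothetical data: publication -> (journal, authors).
publications = {
    "p1": ("J1", {"me", "ann"}),
    "p2": ("J1", {"ann"}),
    "p3": ("J1", {"ann"}),
    "p4": ("J1", {"zed"}),
    "p5": ("J1", {"zed"}),
}

def paths_to_authors(pubs, me):
    """Count paths me -> my pub -> co-author -> their pub -> journal
    -> other pub -> author, for authors who never co-authored with `me`."""
    my_pubs = [p for p, (_journal, authors) in pubs.items() if me in authors]
    direct = set().union(*(pubs[p][1] for p in my_pubs)) - {me}
    counts = Counter()
    for my_pub in my_pubs:
        for co_author in pubs[my_pub][1] - {me}:
            for their_pub, (journal, authors) in pubs.items():
                if their_pub == my_pub or co_author not in authors:
                    continue
                for other_pub, (other_journal, other_authors) in pubs.items():
                    if other_pub == their_pub or other_journal != journal:
                        continue
                    for author in other_authors - direct - {me}:
                        counts[author] += 1
    return counts

counts = paths_to_authors(publications, "me")
```

Here `ann` has two other publications in the journal and `zed` has two as well, giving 2 x 2 = 4 paths. If the goal were a count of distinct publications or journals instead, the pairs would need to be deduplicated before counting.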
Interested in the same Disciplines:
We can actually go multiple ways here.
From his profile we can go to disciplines and find other profiles who are into the same disciplines.
START me = node:vertices('type:profile AND name:"Felix Eggers"')
MATCH me -[:interested_in]-> disciplines <-[:interested_in]- other_profiles
RETURN DISTINCT other_profiles.name, COLLECT(DISTINCT disciplines.name)
ORDER BY COUNT(*) DESC
LIMIT 5;
That’s going to return a ton of people who are also into Business Administration; here are 5 of them:
==> +----------------------------------------------------------+
==> | other_profiles.name | COLLECT(DISTINCT disciplines.name) |
==> +----------------------------------------------------------+
==> | "John Smith"        | ["Business Administration"]        |
==> | "Andreas Müller"    | ["Business Administration"]        |
==> | "abc abc"           | ["Business Administration"]        |
==> | "abc def"           | ["Business Administration"]        |
==> | "Luis Farinha"      | ["Business Administration"]        |
==> +----------------------------------------------------------+
Since we know Felix is interested in Business Administration, we can also go from disciplines to publications, to other authors who may not have a profile in the system.
START me = node:vertices('type:discipline AND name:"Business Administration"')
MATCH me -[:by_discipline]- publications -[:authored_by]- author
RETURN author.name, COUNT(*) AS cnt
ORDER BY cnt DESC
LIMIT 5;
==> +---------------------------+
==> | author.name         | cnt |
==> +---------------------------+
==> | "Null Mancas Matei" | 11  |
==> | "Joanne Dyer"       | 8   |
==> | "Nicholas J Turro"  | 8   |
==> | "Steffen Jockusch"  | 4   |
==> | "Angel A Martí"     | 4   |
==> +---------------------------+
Anyway, that was just a bit of exploring of the data with Neo4j and Cypher. I’ll try to build a website that makes use of these queries before the August 31st deadline. Leave a comment if you have any ideas or want to help.





