
Andreas Kollegger is a leading speaker and writer on graph databases and Neo4j and the bridge between community and developer efforts. He works actively in the community, speaking around the world and promoting the larger Neo4j ecosystem of projects. Author of Fair Trade Software, and the lead for Neo4j in the cloud, Andreas plays a valuable role for progressive happenings within Neo4j. Andreas is a DZone MVB.

Graph Databases in Life Sciences Workshop

11.26.2012

This post was originally written by Michael Hunger at the Neo4j Blog.

Biotechnology is one of the hot topics of the century and graph databases are on the rise in this decade, so we thought it would be a good idea to bring researchers and bioinformatics developers together for a workshop on the applicability of graph databases in biological research and applications.

Fortunately, Prof. Lennart Martens, a group leader in the Department of Medical Protein Research at VIB and Ghent University, offered to host the workshop. Neo Technology's Rik Van Bruggen and Lennart Martens organized it and invited a host of attendees from a variety of backgrounds.

Twenty-six participants found their way to the picturesque meeting hall of the University of Ghent (a former monastery) to enjoy a full day packed with presentations, discussions and a hands-on workshop. We were greeted by a life-size poster of the metabolic interaction pathways in humans.

After the introduction by Lennart and Rik, I gave a quick intro to NoSQL and to graph databases in particular, covering their applicability in a wide range of fields, with some references to existing biotech applications.

Thilo Muth, a PhD student working with Lennart, works in the area of metaproteomics, an interesting technique for mapping protein fragments to potential bacterial targets and creating meta-proteins from matching groups. He introduced the topic and explained how they used graph-oriented data models to reason about potential mappings.

Pablo Pareja of Oh no sequences! presented Bio4j, an open-source research database (and platform) integrating many different sources of protein, genome and taxonomy information. Bio4j also runs on Neo4j and currently holds almost 1 billion relationships. (Slides 1, 2, 3)

In the time until lunch I answered some questions about Neo4j, especially about the roadmap and scaling, and we highlighted some visualization approaches such as Gephi, Cytoscape and HivePlots.

During the breaks and over lunch we had lots of interesting discussions about life sciences in general, working with scientists and particular data management problems.

After lunch, Anthony Liekens presented biograph.be, a knowledge discovery system for finding relevant information in the life sciences, e.g. proteins in reactions ranked by their publication relevance. The system employs a PageRank algorithm that is implemented using matrix multiplication on a parallel processing system.
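
To illustrate the general idea of ranking by repeated matrix multiplication, here is a minimal single-machine sketch in Python. It is not biograph.be's implementation: the function names, the damping factor of 0.85, the fixed iteration count and the three-node toy graph are all assumptions made for the example, and nothing here is parallelized.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes by repeatedly multiplying a transition matrix with the rank vector.

    `links` maps every node to the list of nodes it points to; each target
    must itself be a key of `links`. All names and defaults are illustrative.
    """
    nodes = sorted(links)
    n = len(nodes)
    index = {node: i for i, node in enumerate(nodes)}

    # Column-stochastic matrix: entry [i][j] is the probability of stepping
    # from node j to node i. A node without outgoing links spreads evenly.
    matrix = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        j = index[source]
        if targets:
            for target in targets:
                matrix[index[target]][j] += 1.0 / len(targets)
        else:
            for i in range(n):
                matrix[i][j] = 1.0 / n

    # Power iteration: one matrix-vector product per step, plus damping.
    rank = [1.0 / n] * n
    for _ in range(iterations):
        walked = mat_vec(matrix, rank)
        rank = [(1.0 - damping) / n + damping * w for w in walked]
    return dict(zip(nodes, rank))


# Hypothetical toy graph: A and B both point to C, and C points back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this toy graph C receives the most incoming weight and ends up ranked first, while B, which nothing points to, ends up last. The matrix-vector product is the step a parallel system would distribute.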

Davy Suvee of Janssen Pharmaceutica and datablend.be presented different graph database use cases from his experience at a big pharmaceutical company. He closed the presentation with an intro to FluxGraph, a time-traveling graph implementation on top of Datomic.

Thilo then introduced the topic of the workshop, "Graph Databases in Life Science", and the "Reactome" database of human protein interaction pathways. He discussed some Neo4j APIs and how they can be used to import the data from flat CSV files into a graph database. The attendees set up their development environment with the Neo4reactome project that we had prepared upfront and ran the import successfully.
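
The shape of such an import can be sketched without Neo4j at all. The Python below reads a flat CSV and builds an in-memory stand-in for the graph. It does not reproduce the Neo4reactome importer or any Neo4j API: the two-column layout (`accession`, `pathway`), the sample rows and the function name are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict


def load_graph(csv_file):
    """Read (accession, pathway) rows and build lookups in both directions."""
    pathways_of = defaultdict(set)  # protein accession -> pathway names
    proteins_in = defaultdict(set)  # pathway name -> protein accessions
    for row in csv.DictReader(csv_file):
        accession = row["accession"].strip()
        pathway = row["pathway"].strip()
        pathways_of[accession].add(pathway)  # stands in for an INVOLVED_IN relationship
        proteins_in[pathway].add(accession)  # reverse direction, for traversal
    return pathways_of, proteins_in


# Hypothetical sample in the assumed two-column layout.
SAMPLE = """accession,pathway
P69905,Metabolism
P68871,Metabolism
"""

pathways_of, proteins_in = load_graph(io.StringIO(SAMPLE))
print(sorted(proteins_in["Metabolism"]))
```

A real import would create nodes and relationships in the database and index the proteins by accession, so that they can be looked up later from a query.
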
After importing the data we looked at some use cases, first visualizing pathways in the Neo4j Web UI and then running several queries in Cypher, Neo4j's query language, to find certain proteins (HBA and HBB) and their interaction pathways.
An example task looked like this:

Find the common pathways of HBA and HBB

Each of the two proteins is involved in a number of pathways, which are easy to find by querying. Here we want to retrieve only the pathways in which both proteins are involved, so the query looks up each protein by its accession number and matches the pathways they share.

    START proteinA=node:proteins(accession = "P69905"),
          proteinB=node:proteins(accession = "P68871")
    MATCH (proteinA)-[:INVOLVED_IN]->(pathway)<-[:INVOLVED_IN]-(proteinB)
    RETURN pathway

Results

  • Metabolism
  • O2/CO2 exchange in erythrocytes
  • Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes
  • Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes
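
Conceptually, the MATCH pattern asks for the intersection of the two proteins' pathway sets. The Python sketch below shows that idea on a small in-memory mapping. It is an illustration and not how Neo4j evaluates the query: the `involved_in` dictionary and the function name are made up for the example, and the two "hypothetical pathway" entries are invented only to show that unshared pathways are filtered out.

```python
def common_pathways(involved_in, protein_a, protein_b):
    """Return the pathways that both proteins are involved in, sorted by name."""
    pathways_a = involved_in.get(protein_a, set())
    pathways_b = involved_in.get(protein_b, set())
    return sorted(pathways_a & pathways_b)


# The four shared pathways are the ones listed in the results above.
SHARED = {
    "Metabolism",
    "O2/CO2 exchange in erythrocytes",
    "Uptake of Carbon Dioxide and Release of Oxygen by Erythrocytes",
    "Uptake of Oxygen and Release of Carbon Dioxide by Erythrocytes",
}

# The extra entries are hypothetical placeholders, not Reactome data.
involved_in = {
    "P69905": SHARED | {"hypothetical pathway (A only)"},
    "P68871": SHARED | {"hypothetical pathway (B only)"},
}

for name in common_pathways(involved_in, "P69905", "P68871"):
    print(name)
```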

After the workshop the discussions continued over a broad range of topics.

I want to thank Lennart Martens, Thilo Muth and Rik Van Bruggen again for organizing such a great workshop, and of course Pablo Pareja, Davy Suvee and Anthony Liekens for presenting.

We started a "neo4j-biotech" Google Group some weeks ago and would like to invite everyone to join this discussion forum to talk about the biotech domain with colleagues who share the same background and vocabulary.
Published at DZone with permission of Andreas Kollegger, author and DZone MVB. (source)
