Davy Suvee is the founder of Datablend. He is currently working as an IT Lead/Software Architect in the Research and Development division of a large pharmaceutical company. Required to work with big and unstructured scientific data sets, Davy gathered hands-on expertise and insights in the best practices on Big Data and NoSQL. Through Datablend, Davy aims at sharing his practical experience within a broader IT environment. Davy is a DZone MVB and is not an employee of DZone and has posted 27 posts at DZone.

Storing and querying RDF data in Neo4J through Sail

11.01.2011

I was recently asked to implement a storage and querying platform for biological RDF (Resource Description Framework) data. RDF data is a set of statements about resources in the form of subject-predicate-object expressions (also referred to as triples). Let’s have a look at some simple RDF triples that describe ‘me’, Davy Suvee:

<http://example.org/person/Davy_Suvee> <http://example.org/person/first_name> "Davy" .
<http://example.org/person/Davy_Suvee> <http://example.org/person/last_name> "Suvee" .
<http://example.org/person/Davy_Suvee> <http://example.org/person/age> "31" .
<http://example.org/person/Davy_Suvee> <http://example.org/company> <http://example.org/company/DataBlend> .
<http://example.org/company/DataBlend> <http://example.org/company/name> "DataBlend" .
<http://example.org/company/DataBlend> <http://example.org/company/vat> "BE0894.523.805" .
          

Each subject is identified through a URI (Uniform Resource Identifier). For instance, I identify myself as http://example.org/person/Davy_Suvee. A predicate, also identified through a URI, points either to a literal value or to another resource (which is again identified through a URI). In the example above, the first_name, last_name and age predicates all point to a literal value, while the company predicate points to http://example.org/company/DataBlend, the company I work for. The DataBlend subject in turn has a number of properties of its own, including name and VAT number. Today’s triplestores allow you to store billions of these triples, and information is retrieved through so-called SPARQL queries. For instance, to retrieve my first name and age, I can use the following SPARQL query:

PREFIX person: <http://example.org/person/>
SELECT ?first_name ?age
WHERE {
  person:Davy_Suvee person:first_name ?first_name .
  person:Davy_Suvee person:age ?age .
}
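
To make the semantics of this query concrete, here is a minimal sketch in plain Java of what the two triple patterns ask for: find the objects of the first_name and age predicates for one fixed subject and combine them into a result row. The class and method names (TriplePatternDemo, objectsOf, firstNameAndAge) are made up for this illustration, and a real SPARQL engine evaluates queries very differently (with indexes and a query planner); this only shows the matching idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TriplePatternDemo {

    static final String PERSON = "http://example.org/person/";

    // Some of the example triples from above, stored as {subject, predicate, object}.
    static final List<String[]> TRIPLES = Arrays.asList(
        new String[] {PERSON + "Davy_Suvee", PERSON + "first_name", "Davy"},
        new String[] {PERSON + "Davy_Suvee", PERSON + "last_name", "Suvee"},
        new String[] {PERSON + "Davy_Suvee", PERSON + "age", "31"},
        new String[] {PERSON + "Davy_Suvee", "http://example.org/company",
                      "http://example.org/company/DataBlend"}
    );

    // Returns the objects of all triples that match a fixed subject and predicate.
    static List<String> objectsOf(String subject, String predicate) {
        List<String> objects = new ArrayList<>();
        for (String[] triple : TRIPLES) {
            if (triple[0].equals(subject) && triple[1].equals(predicate)) {
                objects.add(triple[2]);
            }
        }
        return objects;
    }

    // Mimics the SELECT: joins the two patterns on the shared subject
    // and binds ?first_name and ?age in each result row.
    static List<Map<String, String>> firstNameAndAge(String subject) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String firstName : objectsOf(subject, PERSON + "first_name")) {
            for (String age : objectsOf(subject, PERSON + "age")) {
                Map<String, String> row = new LinkedHashMap<>();
                row.put("first_name", firstName);
                row.put("age", age);
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints [{first_name=Davy, age=31}]
        System.out.println(firstNameAndAge(PERSON + "Davy_Suvee"));
    }
}
```

Each row plays the role of one SPARQL binding set: a mapping from variable names to the values that satisfied all patterns at once.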

 

2. Neo4J as an RDF data store

Similar to SQL, SPARQL provides a set of powerful querying constructs that allow you to declaratively specify your needs. Calculating shortest paths between arbitrary subjects, by contrast, cannot easily be accomplished through SPARQL (unless one encodes the specific path structure in the query, which rather defeats the point). Being able to quickly calculate shortest paths, which is a requirement for the project I’m implementing, is one of the main selling points of graph databases. As RDF data can be thought of as a graph, it comes as no surprise that many graph databases, including Neo4J, provide native support for storing and querying RDF data. In the case of Neo4J, this is achieved through the neo4j-rdf, neo4j-rdf-sparql and neo4j-rdf-sail components. Unfortunately, I couldn’t find a recent piece of code that details the various steps for automatically importing RDF triple files into Neo4J. Hence this article. The complete source code can be found on the Datablend public GitHub repository.
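
The remark that RDF data can be thought of as a graph can be illustrated with a small plain-Java sketch: triples whose object is a URI become edges between nodes, while triples whose object is a literal become properties of the subject node. This is only one possible mapping, chosen for illustration; it is not a description of how the Neo4J RDF components lay out their data internally, and the RdfGraphView class and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RdfGraphView {

    // Outgoing edges per node; each edge is stored as {predicate, target node}.
    final Map<String, List<String[]>> edges = new LinkedHashMap<>();
    // Literal-valued predicates per node, kept as simple key/value properties.
    final Map<String, Map<String, String>> properties = new LinkedHashMap<>();

    // Adds one triple; isLiteral says whether the object is a literal or a URI.
    void add(String subject, String predicate, String object, boolean isLiteral) {
        if (isLiteral) {
            properties.computeIfAbsent(subject, k -> new LinkedHashMap<>())
                      .put(predicate, object);
        } else {
            edges.computeIfAbsent(subject, k -> new ArrayList<>())
                 .add(new String[] {predicate, object});
        }
    }

    // Builds the graph view for a few of the example triples from the introduction.
    static RdfGraphView example() {
        String person = "http://example.org/person/";
        String company = "http://example.org/company";
        RdfGraphView graph = new RdfGraphView();
        graph.add(person + "Davy_Suvee", person + "first_name", "Davy", true);
        graph.add(person + "Davy_Suvee", person + "age", "31", true);
        graph.add(person + "Davy_Suvee", company, company + "/DataBlend", false);
        graph.add(company + "/DataBlend", company + "/name", "DataBlend", true);
        return graph;
    }

    public static void main(String[] args) {
        RdfGraphView graph = example();
        // Print every edge as: source --predicate--> target
        for (Map.Entry<String, List<String[]>> entry : graph.edges.entrySet()) {
            for (String[] edge : entry.getValue()) {
                System.out.println(entry.getKey() + " --" + edge[0] + "--> " + edge[1]);
            }
        }
    }
}
```

In this view, the only edge in the example data is the company link from Davy_Suvee to DataBlend; everything else is a property of one of the two nodes.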

Start by setting up the Neo4J database connection:

// Create the sail graph database
graphDb = new EmbeddedGraphDatabase("var/flights");
indexService = new LuceneIndexService(graphDb);
fulltextIndex = new SimpleFulltextIndex(graphDb, new File("var/flights/lucene-fulltext"));
rdfStore = new VerboseQuadStore(graphDb, indexService, null, fulltextIndex);
sail = new GraphDatabaseSail(graphDb, rdfStore);
// Initialize the sail store
sail.initialize();
// Get the sail repository connection
connection = new SailRepository(sail).getConnection();

 

An embedded Neo4J graph database (EmbeddedGraphDatabase) is used for importing 5 MB of RDF triples containing airline flight information. (This example data set was found at rdfdata.org, a great resource for open RDF data sets.) To make flight information easy to look up, we full-text index our RDF triples (through Lucene). Next, we wrap the embedded Neo4J graph database in a VerboseQuadStore (one of the internal triple store implementations provided by Neo4J). Finally, we expose our triple store through the Sail interface, which is part of the openrdf.org project. By doing so, we can use the entire range of RDF utilities (parsers and query evaluators) that are part of the openrdf.org project. Once we have a Sail repository connection available, we can import the required RDF triples through the add method.

connection.add(getResource("sneeair.rdf"), null, RDFFormat.RDFXML, new Resource[]{});

That’s it! Once the import is finished, you can query your RDF triples by executing a SPARQL query. The query below, for instance, retrieves the flight number, departure city and destination city of all flights that have a duration of 1 hour and 35 minutes.

// Create query
TupleQuery durationquery = connection.prepareTupleQuery(QueryLanguage.SPARQL,
    "PREFIX io: <http://www.daml.org/2001/06/itinerary/itinerary-ont#> " +
    "PREFIX fl: <http://www.snee.com/ns/flights#> " +
    "SELECT ?number ?departure ?destination " +
    "WHERE { " +
        "?flight io:flight ?number . " +
        "?flight fl:flightFromCityName ?departure . " +
        "?flight fl:flightToCityName ?destination . " +
        "?flight io:duration \"1:35\" . " +
    "}");
// Evaluate and print results
TupleQueryResult result = durationquery.evaluate();
while (result.hasNext()) {
    BindingSet binding = result.next();
    System.out.println(binding.getBinding("number").getValue() + " " +
                       binding.getBinding("departure").getValue() + " " +
                       binding.getBinding("destination").getValue());
}

 

3. Shortest path calculation

Through the SimpleFulltextIndex we can easily look up the Neo4J node that corresponds to a particular RDF subject. Once we have the required nodes, we can use the graph algorithms provided in the neo4j-graph-algo component to calculate (shortest) paths. Very cool!
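
For readers who want to see what such a shortest path calculation boils down to, here is a library-independent sketch of breadth-first search over an unweighted, directed graph in plain Java. It does not use the neo4j-graph-algo API; the ShortestPathSketch class, its methods and the node names A to E are hypothetical and do not come from the flight data set.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ShortestPathSketch {

    // Breadth-first search over an unweighted, directed adjacency map.
    // Returns the nodes on one shortest path from start to goal,
    // or an empty list if the goal cannot be reached.
    static List<String> shortestPath(Map<String, List<String>> graph,
                                     String start, String goal) {
        // For every visited node, remember the node it was reached from.
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk back from the goal to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String node = goal; node != null; node = cameFrom.get(node)) {
                    path.add(node);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : graph.getOrDefault(current, Collections.<String>emptyList())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    // Hypothetical graph with made-up node names, for illustration only.
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("A", Arrays.asList("B", "C"));
        graph.put("B", Arrays.asList("D"));
        graph.put("C", Arrays.asList("D"));
        graph.put("D", Arrays.asList("E"));
        return graph;
    }

    public static void main(String[] args) {
        // Prints [A, B, D, E]
        System.out.println(shortestPath(exampleGraph(), "A", "E"));
    }
}
```

Because breadth-first search explores nodes in order of increasing distance from the start, the first time it reaches the goal it has found a path with the fewest possible hops. Expressing the same search in SPARQL would require spelling out each possible path length in the query.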

Published at DZone with permission of Davy Suvee, author and DZone MVB.


Comments

Amara Amjad replied on Sun, 2012/03/25 - 2:32am

Thanks for the tutorial on how to use Neo4J for storing and querying RDF.
Can you say something about the number of statements and the load and query times?
I would be really interested to find out how scalable Neo4J is for handling RDF data.
