
Max De Marzi is a seasoned web developer. He started building websites in 1996 and has worked with Ruby on Rails since 2006. The web forced Max to wear many hats and master a wide range of technologies. He can be a system admin, database developer, graphic designer, back-end engineer and data scientist in the course of one afternoon. Max is a graph database enthusiast. He built the Neography Ruby Gem, a REST API wrapper for the Neo4j Graph Database. He is addicted to learning new things, loves a challenge and enjoys finding pragmatic solutions. Max is very easy to work with, focuses under pressure and has the patience of a rock. Max is a DZone MVB, is not an employee of DZone, and has posted 22 posts at DZone.

A Step-By-Step Tutorial on How to Use Graphipedia to Import Wikipedia into Neo4j

02.16.2012

Wouldn’t it be cool to import Wikipedia into Neo4j?

Mirko Nasato thought so, and built graphipedia, which uses the batch importer to do just that.

It’s written in Java, so if you’re a pure Ruby guy, I’ll walk you through the steps.

Let’s clone the project and jump in.

git clone git://github.com/mirkonasato/graphipedia.git
cd graphipedia

If you look in here you’ll see a pom.xml file, which means you’ll need to install Maven and build the project.

sudo apt-get install maven2
mvn install

You’ll see a bunch of stuff flying by; that’s just the dependencies being downloaded. At the end you should see this:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS [1:08.932s]
[INFO] Graphipedia DataImport ................................ SUCCESS [1:16.018s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 25 seconds
[INFO] Finished at: Thu Feb 16 11:36:55 CST 2012
[INFO] Final Memory: 28M/434M
[INFO] ------------------------------------------------------------------------

OK, so now let’s get the Wikipedia dump file we need. You can download it with wget.

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Whoa, hold up. That’s a 7.6 GB file… can we try a smaller data set first?

Sure. Let’s go with Lea faka-Tonga (the Tongan-language Wikipedia) ’cause it just sounds cool… and we’ll decompress it.

wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2
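The two download URLs above differ only in the wiki code (enwiki, towiki), so the pattern can be captured in a small shell helper. This is just a sketch: dump_url is a hypothetical name, and it assumes other wikis follow the same URL layout as the two shown here.

```shell
#!/bin/sh
# Hypothetical helper: build the "latest pages-articles" dump URL for a wiki code.
# Assumes the same layout as the enwiki and towiki URLs used in this article.
dump_url() {
    wiki="$1"   # e.g. towiki or enwiki
    printf 'http://dumps.wikimedia.org/%s/latest/%s-latest-pages-articles.xml.bz2\n' "$wiki" "$wiki"
}
```

With that in place, wget "$(dump_url towiki)" fetches the same file as the command above.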

It’s a two-step process, so first let’s create a smaller intermediate XML file containing only page titles and links:

java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

You should see:

Parsing pages and extracting links...
..
2835 pages parsed in 0 seconds.
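If you get a NoClassDefFoundError instead of this output, one likely cause is that the jar named on the classpath does not exist, since Java ignores classpath entries it cannot find and only fails when it tries to load the class. A small guard like the following can catch that earlier. It is a sketch only; check_jar is a hypothetical helper, not part of graphipedia.

```shell
#!/bin/sh
# Hypothetical helper: fail early if the built jar is not where the classpath points.
check_jar() {
    jar_path="$1"
    if [ ! -f "$jar_path" ]; then
        echo "Missing jar: $jar_path (did 'mvn install' finish successfully?)" >&2
        return 1
    fi
    echo "Found jar: $jar_path"
}
```

Run it from the graphipedia directory before the java commands, for example:

check_jar ./graphipedia-dataimport/target/graphipedia-dataimport.jar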

Then we run the batch importer on this file and write the result into the graph.db directory:

java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

You should see:

Importing pages...
..
2835 pages imported in 0 seconds.
Importing links...
.....
5799 links imported in 0 seconds; 6383 broken links ignored
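Notice that both steps report the same page count (2835 here). If you save each step's output to a file, that comparison can be scripted as a quick sanity check. The sketch below assumes the log lines look exactly like the ones shown in this article and that you captured them yourself (for example with 2>&1 | tee extract.log); pages_in_log and counts_match are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: print the number from the first "<N> pages ..." line in a log file.
pages_in_log() {
    sed -n 's/^\([0-9][0-9]*\) pages .*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if both logs report the same, non-empty page count.
counts_match() {
    extracted=$(pages_in_log "$1")
    imported=$(pages_in_log "$2")
    [ -n "$extracted" ] && [ "$extracted" = "$imported" ]
}
```

For example, counts_match extract.log import.log && echo "page counts agree". The broken links reported by the importer are not checked here.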

Go inside and take a look, and you’ll see our neostore files.

cd graph.db
ls

You can copy this folder over any existing Neo4j database by overwriting the /neo4j/data/graph.db folder, and enjoy.
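Overwriting a database is hard to undo, so it may be worth moving the old one aside first. Here is a minimal sketch of that copy step; install_graph_db is a hypothetical helper, and the data/graph.db layout under the Neo4j directory is taken from the path above. It assumes the Neo4j server is stopped while the files are replaced.

```shell
#!/bin/sh
# Hypothetical helper: put a freshly imported graph.db into a Neo4j install,
# keeping a timestamped backup of whatever database was there before.
# Assumes the server is not running and the store lives in <neo4j_home>/data/graph.db.
install_graph_db() {
    src="$1"          # freshly imported store, e.g. ./graph.db
    neo4j_home="$2"   # Neo4j install directory, e.g. /neo4j
    target="$neo4j_home/data/graph.db"

    [ -d "$src" ] || { echo "No such directory: $src" >&2; return 1; }
    mkdir -p "$neo4j_home/data" || return 1

    # Move any existing database aside instead of deleting it.
    if [ -d "$target" ]; then
        backup="$target.bak.$(date +%Y%m%d%H%M%S)"
        mv "$target" "$backup" || return 1
        echo "Existing database moved to $backup"
    fi

    cp -R "$src" "$target"
}
```

From the directory that contains graph.db, that would be called as:

install_graph_db ./graph.db /neo4j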

 Source: http://maxdemarzi.com/2012/02/16/importing-wikipedia-into-neo4j-with-graphipedia/

Published at DZone with permission of Max De Marzi, author and DZone MVB.


Comments

Goel Yatendra replied on Thu, 2012/03/15 - 3:55pm

$ java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

Exception in thread "main" java.lang.NoClassDefFoundError: org/graphipedia/dataimport/ExtractLinks
