NoSQL Zone is brought to you in partnership with:

Max De Marzi, is a seasoned web developer. He started building websites in 1996 and has worked with Ruby on Rails since 2006. The web forced Max to wear many hats and master a wide range of technologies. He can be a system admin, database developer, graphic designer, back-end engineer and data scientist in the course of one afternoon. Max is a graph database enthusiast. He built the Neography Ruby Gem, a rest api wrapper to the Neo4j Graph Database. He is addicted to learning new things, loves a challenge and finding pragmatic solutions. Max is very easy to work with, focuses under pressure and has the patience of a rock. Max is a DZone MVB and is not an employee of DZone and has posted 60 posts at DZone. You can read more from them at their website. View Full User Profile

A Step-By-Step Tutorial on How to Use Graphipedia to Import Wikipedia into Neo4j

02.16.2012
| 5551 views |
  • submit to reddit

Wouldn’t it be cool to import Wikipedia into Neo4j?

Mirko Nasato thought so, and built graphipedia using the batch importer that does just that.

It’s written in Java, so if you’re a pure ruby guy, I’ll walk you through the steps.

Let’s clone the project and jump in.

git clone git://github.com/mirkonasato/graphipedia.git
cd graphipedia

If you look in here you’ll see a pom.xml file which means you’ll need to download Maven and build the project.

sudo apt-get install maven2
mvn install

 You’ll see a bunch of stuff flying by, that’s just the dependencies being downloaded. At the end you should see this:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] ------------------------------------------------------------------------
[INFO] Graphipedia Parent .................................... SUCCESS [1:08.932s]
[INFO] Graphipedia DataImport ................................ SUCCESS [1:16.018s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2 minutes 25 seconds
[INFO] Finished at: Thu Feb 16 11:36:55 CST 2012
[INFO] Final Memory: 28M/434M
[INFO] ------------------------------------------------------------------------

Ok, so now let’s get the file from wikipedia we need. You can download it with wget.

wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

 Whoa, hold up. That’s a 7.6 G file… can we try a smaller data set first?

Sure. Let’s go with Lea faka-Tonga ’cause it just sounds cool…and we’ll unzip it.

wget http://dumps.wikimedia.org/towiki/latest/towiki-latest-pages-articles.xml.bz2
bzip2 -d towiki-latest-pages-articles.xml.bz2

It is a two step process, so first lets create a smaller intermediate XML file containing page titles and links only:

java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

 You should see:

Parsing pages and extracting links...
..
2835 pages parsed in 0 seconds.

 Then we run the batch importer on this file and dump the contents on to the graphdb directory:

java -Xmx3G -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.neo4j.ImportGraph towiki-links.xml graph.db

You should see:

Importing pages...
..
2835 pages imported in 0 seconds.
Importing links...
.....
5799 links imported in 0 seconds; 6383 broken links ignored

 Go inside and take a look and you’ll see our neostore files.

cd graph.db
ls

 You can copy this folder over any existing neo4j database by overwriting the /neo4j/data/graph.db folder and enjoy.

 Source: http://maxdemarzi.com/2012/02/16/importing-wikipedia-into-neo4j-with-graphipedia/

Published at DZone with permission of Max De Marzi, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Goel Yatendra replied on Thu, 2012/03/15 - 3:55pm

$ java -classpath ./graphipedia-dataimport/target/graphipedia-dataimport.jar org.graphipedia.dataimport.ExtractLinks towiki-latest-pages-articles.xml towiki-links.xml

Exception in thread “main” java.lang.NoClassDefFoundError: org/graphipedia/dataimport/ExtractLinks

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.