
By Mitch Pronschinske, Lead Research Analyst at DZone

Tools for Moving from an SQL Database to Hadoop

03.24.2010
More organizations, small and large, are interested in moving their data from SQL relational databases to an Apache Hadoop back end.  Although SQL remains a solid option in many cases, Hadoop has advantages in data processing that have been battle-tested by several major web properties.  These days Hadoop seems to be the best game in town for "Big Data" handling.  For users who want to experiment with Hadoop or put it into production use, a few tools are available for migrating SQL data to Hadoop: Cloudera's Sqoop, cascading.jdbc, and cascading-dbmigrate.

Sqoop
Cloudera knows Hadoop - that's for sure.  Doug Cutting, who created Hadoop, now works for Cloudera.  Sqoop is a FOSS tool that was contributed to the Apache Hadoop project.  It takes your database table and automatically generates the classes needed to represent the rows from that table.  It finds the most efficient way to divide portions of the table for the MapReduce framework.  Once this is done, you can specify directions for importing the data into the Hadoop Distributed File System (HDFS).  Sqoop has performance enhancements for loading and exporting data from MySQL and Oracle, and it supports both SequenceFile and text-based targets.  However, Sqoop doesn't detect unsigned columns, and it will try to fit unsigned integers into a Java int because of MySQL's deviation from the JDBC standard.  Cloudera plans to resolve these issues.
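To make the table-splitting step concrete, here is a minimal Java sketch of dividing a table's primary-key range into contiguous chunks, one per map task.  The class and method names here are illustrative inventions, not Sqoop's actual API; this just shows the general idea of key-range partitioning.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Divide the key range [minId, maxId] into numSplits contiguous
    // half-open [lo, hi) ranges, one per parallel map task.
    static List<long[]> computeSplits(long minId, long maxId, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long span = maxId - minId + 1;
        long chunk = (span + numSplits - 1) / numSplits; // ceiling division
        for (long lo = minId; lo <= maxId; lo += chunk) {
            long hi = Math.min(lo + chunk, maxId + 1);
            splits.add(new long[] { lo, hi });
        }
        return splits;
    }

    public static void main(String[] args) {
        // Example: primary keys 1..100 divided across 4 map tasks.
        for (long[] s : computeSplits(1, 100, 4)) {
            System.out.println("WHERE id >= " + s[0] + " AND id < " + s[1]);
        }
    }
}
```

Each map task can then issue its own bounded query against the source table, so the import parallelizes cleanly.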

cascading.jdbc
Cascading is an open source API for defining and executing complex data processing workflows on a Hadoop cluster without having to 'think' in MapReduce.  It also lets developers efficiently schedule complex distributed processes based on dependencies and other metadata.  cascading.jdbc (you might've guessed) is a JDBC adapter for Cascading.  It lets you read data from and write data to an RDBMS, through standard JDBC drivers, as part of a Cascading data processing flow.  The adapter supports INSERT and UPDATE when sinking and can use custom SELECTs when sourcing.  Essentially, you can read what's in the database and run custom migration logic all in one job.  Performance, however, is hampered because cascading.jdbc pages through tables using LIMIT and OFFSET in its SQL queries.
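A small, self-contained Java sketch (not cascading.jdbc's actual code) shows what LIMIT/OFFSET paging looks like, and why deep pages get expensive: to serve page N, the database must still scan past the first N * pageSize rows before returning anything, so each successive page costs more than the last.

```java
public class OffsetPaging {
    // Build the query for one page of results using LIMIT/OFFSET paging.
    static String pageQuery(String table, int pageSize, int page) {
        return "SELECT * FROM " + table
             + " LIMIT " + pageSize
             + " OFFSET " + (page * pageSize);
    }

    public static void main(String[] args) {
        // Each deeper page forces the database to skip more rows:
        // OFFSET 0, then 1000, then 2000, and so on.
        for (int page = 0; page < 3; page++) {
            System.out.println(pageQuery("users", 1000, page));
        }
    }
}
```

Over a large table, those ever-growing skips add up, which is the performance issue the next tool was built to address.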

cascading-dbmigrate
Nathan Marz liked the approach of cascading.jdbc and wanted to build his own tool for migrating large relational databases into Hadoop clusters.  His tool tries to relieve the performance issues of cascading.jdbc with a redesigned tap, DBMigrateTap, for reading data from a database in a Cascading flow.  The tap reads data for each task with range queries over the table's primary key column.  It emits tuples containing one field for each column read, where the field names are the column names.  cascading-dbmigrate works on SQL tables whose primary key column is either an int or a long.  If that describes your tables, give cascading-dbmigrate a try; it's a relatively young open source project, and feedback is welcome.  Marz's company, BackType, is using this solution now to migrate its data from SQL databases to HDFS.
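The "one field per column, named after the column" behavior can be sketched in plain Java.  This is a conceptual illustration of what a DBMigrateTap-style reader emits for each row, not the project's real API; see the cascading-dbmigrate source for the actual Tuple types it uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TupleSketch {
    // Turn one database row into a tuple whose field names
    // are the table's column names, in column order.
    static Map<String, Object> rowToTuple(String[] columnNames, Object[] values) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.length; i++) {
            tuple.put(columnNames[i], values[i]);
        }
        return tuple;
    }

    public static void main(String[] args) {
        String[] cols = { "id", "name", "signup_date" };
        Object[] row  = { 42L, "alice", "2010-03-24" };
        System.out.println(rowToTuple(cols, row));
        // prints {id=42, name=alice, signup_date=2010-03-24}
    }
}
```

Downstream Cascading operations can then select fields by column name instead of positional index, which keeps migration logic readable.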

Comments

Sonal Goyal replied on Sat, 2010/05/08 - 4:02am

Hi Mitchell,
You can also check out the HIHO framework for moving data between an RDBMS and Hadoop. It's an Apache-licensed open source project with support for querying data, getting results in Avro, fast loads to MySQL, and a variety of other easy-to-use features. More details can be found at: http://code.google.com/p/hiho/
Thanks, Sonal

Hadoop Fan replied on Mon, 2010/05/10 - 1:59am

HIHO for Hadoop is a piece of junk. Don't bother trying it out. There are better tools out there.
