Read and Write MapReduce Jobs with 10gen's MongoDB Hadoop Connector
This article was originally written by Brendan McAdams on the 10gen blog.
10gen is pleased to announce the availability of our first GA release of the MongoDB Hadoop Connector, version 1.0. This release was a long-term goal, and represents the culmination of over a year of work to bring our users a solid integration layer between their MongoDB deployments and Hadoop clusters for data processing. Available immediately, this connector supports many of the major Hadoop versions and distributions from 0.20.x onward.
The core feature of the Connector is the ability to read MongoDB data into Hadoop MapReduce jobs and to write the results of MapReduce jobs back out to MongoDB. Users may choose to use MongoDB reads and writes together or separately, as best fits each use case. Our goal is to continue to build support for the components in the Hadoop ecosystem which our users find useful, based on feedback and requests.
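One way to picture the read/write flow is a word-count-style job: documents are read from a MongoDB collection, the map phase emits key/value pairs, and the reduced results are written back as documents. The pure-Python sketch below simulates that flow in memory; the collection contents, the `status` field, and the function names are illustrative, not the Connector's actual Java API:

```python
from collections import defaultdict

# Simulated input: documents as they might be read from a MongoDB collection.
documents = [
    {"_id": 1, "status": "ok"},
    {"_id": 2, "status": "error"},
    {"_id": 3, "status": "ok"},
]

def map_phase(docs):
    # Emit (key, 1) for each document's status field.
    for doc in docs:
        yield doc["status"], 1

def reduce_phase(pairs):
    # Sum the counts per key, as a MapReduce reducer would.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

results = reduce_phase(map_phase(documents))
# Each (key, count) pair would then be written back to MongoDB
# as a document, e.g. {"_id": "ok", "count": 2}.
```

In a real job, the Connector supplies the input documents to the mappers and persists the reducer output, so user code only implements the map and reduce logic.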
For this initial release, we have also provided support for:
- writing to MongoDB from Pig (thanks to Russell Jurney for all of his patches and improvements to this feature)
- writing to MongoDB from the Flume distributed logging system
- writing MapReduce jobs in Python, to and from MongoDB, via Hadoop Streaming.
Hadoop Streaming was one of the toughest features for the 10gen team to build, so look for a more technical post on the MongoDB blog in the next week or two detailing the issues we encountered and how to use this feature effectively.
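In classic Hadoop Streaming, the mapper and reducer are ordinary scripts that read records on stdin and emit tab-separated key/value lines on stdout, with Hadoop sorting by key between the two phases; the Connector's streaming support follows the same contract while exchanging MongoDB documents rather than plain text. A minimal sketch of that contract in Python, with an assumed record format of one status value per line:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a tab-separated (key, 1) line per input record.
    for line in lines:
        status = line.strip()
        if status:
            yield f"{status}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so equal keys arrive contiguously and can be grouped.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"

# In a real streaming job, each script is wired to the streams, e.g.:
#   for out in mapper(sys.stdin): print(out)
```

Because each phase is just a function over a stream of lines, the logic can be tested locally without a Hadoop cluster before submitting the job.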
This release involved hard work from both the 10gen team and our community. Testing, pull requests, ideas sent by email, and support tickets have all contributed to moving this product forward. One of the most important contributions came from a team of students participating in a New York University class, Information Technology Projects, which is designed to have students apply their skills to real-world projects. Under the guidance of Professor Evan Korth, four students worked closely with 10gen to test and improve the functionality of the Hadoop Connector. Joseph Shraibman, Sumin Xia, Priya Manda, and Rushin Shah all worked to enhance and improve support for splitting up MongoDB input data, as well as adding a number of testing improvements and consistency checks.
Thanks to the work done by the NYU team, as well as improvements to the MongoDB server, the MongoDB Hadoop Connector is capable of efficiently splitting input data in a variety of situations, in both sharded and unsharded setups, to parallelize Hadoop input for maximum performance.
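Conceptually, splitting means dividing the collection's key space into contiguous ranges that separate map tasks can read in parallel. The sketch below shows that idea for a simple integer `_id` key space; the function name and the even-range strategy are illustrative simplifications, not the Connector's actual split calculation:

```python
def compute_splits(min_key, max_key, num_splits):
    # Divide an integer _id key space into contiguous, non-overlapping
    # [lower, upper) ranges, one per map task. Any remainder is spread
    # across the first few splits so sizes differ by at most one.
    step, remainder = divmod(max_key - min_key, num_splits)
    splits, lower = [], min_key
    for i in range(num_splits):
        upper = lower + step + (1 if i < remainder else 0)
        splits.append((lower, upper))
        lower = upper
    return splits

# Four map tasks would each query one range, e.g. with a filter like
# {"_id": {"$gte": lower, "$lt": upper}}.
splits = compute_splits(0, 10, 4)
```

In a sharded deployment, the chunk boundaries the cluster already maintains provide natural split points, which is part of why sharded setups parallelize so well.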
In the next few months we will be working to add more features and improvements to the Hadoop Connector, including Ruby support for Streaming, Pig input support, and support for reading and writing MongoDB backup files for offline batch processing. As with all of our MongoDB projects, you can always monitor the roadmap, request features, and report bugs via the MongoDB Jira. Let us know on the MongoDB User Forum if you have any questions.
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)