

Hadoop and Mahout in Data Mining

12.13.2012

Introduction: Data Mining

This blog talks about a typical solution architecture that uses various Big Data components: Hadoop, Mahout, Sqoop, Flume, and the Mongo-Hadoop Connector. Many successful companies have been around for at least 20 years and have invested heavily in IT infrastructure. Over that period they have accumulated data in the form of transaction records and log archives, and they are eager to analyse this data to see whether it can improve their business efficiency.

Below is the component diagram showing how each of these frameworks fits into the ecosystem.

[Component diagram: Data Mining]

In a typical scenario, as customers use the company's IT systems to buy its products, customer data such as purchasing patterns, location, gender, and what similar customers are buying is stored for data mining. Based on this data, the company can help customers make buying decisions. If you look carefully, Netflix, Amazon, eBay, and Gilt are already doing this.

Mahout is one of the open source frameworks that ships with a built-in recommender engine, and it runs on Hadoop using the Map/Reduce paradigm. In a typical scenario:

  1. User-item data stored in MySQL is imported into the Hadoop Distributed File System (HDFS) using an ETL tool like Sqoop (see the sketch after this list)
  2. User click information stored in log files is exported into HDFS using log collection tools like Flume
  3. If the transaction data is stored in a NoSQL database, connectors such as the Mongo-Hadoop Connector can export that data to HDFS
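
As a concrete example of step 1, a Sqoop import of a user-item table from MySQL into HDFS is normally a single command. The sketch below drives that command from Java with ProcessBuilder; the JDBC URL, credentials, table name and target directory are hypothetical placeholders, not values from this article.

```java
import java.io.IOException;

/**
 * Minimal sketch: launch a "sqoop import" from Java to pull a MySQL table
 * into HDFS. The JDBC URL, credentials, table and target directory below
 * are illustrative placeholders.
 */
public class SqoopImportExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/shop",  // hypothetical MySQL instance
                "--username", "etl_user",
                "--password-file", "/user/etl/.sqoop_pwd",     // keep the password off the command line
                "--table", "user_item_ratings",                // user-item preference table
                "--target-dir", "/data/ratings",               // HDFS directory the Mahout job will read
                "--fields-terminated-by", ",",
                "-m", "4");                                    // number of parallel map tasks
        pb.inheritIO();                                        // stream Sqoop's output to this console
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}
```

In practice the same sqoop command is usually run straight from the shell or scheduled by a workflow engine; wrapping it in Java is just one convenient option.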

Once the data is in HDFS, we can run a Mahout batch job to analyse it and then export the processed recommendations back to the transactional database, as sketched below.
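
To make the Mahout step concrete, here is a minimal sketch using Mahout's non-distributed Taste API against a comma-separated userID,itemID,preference file such as the one Sqoop produced above. The file path is a placeholder and assumes the data has been copied to the local filesystem; the user ID is made up.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

/**
 * Minimal sketch: item-based recommendations with Mahout's Taste API.
 * Expects a CSV file of "userID,itemID,preference" lines, e.g. the data
 * Sqoop imported from MySQL (the path below is a placeholder).
 */
public class ItemRecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("/data/ratings/part-m-00000"));
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        long userId = 42L;  // hypothetical customer ID
        List<RecommendedItem> items = recommender.recommend(userId, 5);
        for (RecommendedItem item : items) {
            System.out.println("item " + item.getItemID()
                    + " score " + item.getValue());
        }
    }
}
```

The distributed equivalent of this logic is Mahout's item-based RecommenderJob (org.apache.mahout.cf.taste.hadoop.item.RecommenderJob), which reads the preference data directly from HDFS as a Map/Reduce job and writes its recommendations back to HDFS, from where Sqoop can export them into the transactional database.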

Published at DZone with permission of Krishna Prasad, author and DZone MVB.
