Sadayuki is a DZone MVB, is not an employee of DZone, and has posted 27 posts at DZone.

How Japan's Top Recipe Site Solved Its Data Management Problem

10.25.2012

Cookpad.com (TYO:2193) is Japan’s No.1 recipe website. With more than fifteen million users and one million online recipes, Cookpad.com dominates the market of recipe search and sharing. In fact, according to Cookpad.com’s internal research, 50% of Japanese women in their 20s and 30s use Cookpad.com as their primary source of recipes and cooking advice.

So, how does Cookpad.com serve up the right recipes at the right time? The short answer is “because they collect all sorts of data about their users.” The long answer is the rest of this article.

Sharded MySQL for Analytics

Cookpad.com’s web application is written in the popular open-source framework Ruby on Rails. They have over one hundred instances of Rails applications to serve 20 million monthly unique users.

Rails logs data into a local file system by default, and Cookpad.com used to consolidate these logs into a cluster of MySQL servers every night via rsync, a standard Unix utility.

The cluster of MySQL servers then computed various key metrics, such as page views per recipe, and the computed key metrics were copied over to another MySQL instance daily.
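To make a metric like "page views per recipe" concrete, here is a minimal sketch of that kind of aggregation over raw request logs. The log line format, the /recipe/<id> URL pattern, and the function name are assumptions made for illustration; the article does not show Cookpad.com's actual logs or queries.

```python
import re
from collections import Counter

# Hypothetical log format and URL pattern, chosen only for this sketch.
RECIPE_PATH = re.compile(r'"GET /recipe/(\d+)')


def page_views_per_recipe(log_lines):
    """Count how many times each recipe page was requested."""
    counts = Counter()
    for line in log_lines:
        match = RECIPE_PATH.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample lines standing in for one server's consolidated log.
sample_log = [
    '10.0.0.1 - [25/Oct/2012:10:00:01] "GET /recipe/42 HTTP/1.1" 200',
    '10.0.0.2 - [25/Oct/2012:10:00:02] "GET /recipe/7 HTTP/1.1" 200',
    '10.0.0.3 - [25/Oct/2012:10:00:05] "GET /recipe/42 HTTP/1.1" 200',
    '10.0.0.4 - [25/Oct/2012:10:00:09] "GET /search?q=curry HTTP/1.1" 200',
]
views = page_views_per_recipe(sample_log)
```

In this sample, recipe 42 is counted twice, recipe 7 once, and the search request is ignored. In the architecture described above, the equivalent computation ran inside the MySQL cluster rather than in application code.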

This last MySQL instance was responsible for answering questions about internal key metrics. It served as a data mart for Cookpad.com’s in-house dashboard and Google Spreadsheet, on which various non-engineering organizations relied every day for decision-making.

The Problems

Cookpad.com’s original architecture was based on sharded MySQL servers and had severe scaling and maintainability limitations.

  1. Limited Scalability: MySQL is great software. It’s been deployed extensively for more than a decade and the support community is active and mature. However, MySQL was not designed to support petabytes of data. To make their MySQL servers more scalable, Cookpad.com, like most other large-scale MySQL users, aggressively sharded their MySQL databases to alleviate the load on each instance. But this is not a robust solution and usually results in systems that are brittle and hard to scale.

  2. Rigid Schema: Relational databases, including MySQL, require a well-defined schema upfront. While a rigid schema can help you organize and document data, it is not well-suited to a fast-moving, data-driven company like Cookpad.com where the underlying data can change weekly if not daily.

  3. Up to 24 Hours of Delay: Because the first step of the ETL process (copying log files from the Rails servers to the MySQL servers) ran only once a day, data could be up to 24 hours stale. Because of this delay, they couldn’t evaluate the effectiveness of new features or the popularity of new content in a timely fashion. The long feedback loop meant slower product development.
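One reason sharded setups tend to be brittle can be shown with a small sketch. It assumes the simplest routing scheme, a stable hash of the key taken modulo the number of shards; the article does not say how Cookpad.com actually sharded its databases, and the function names here are hypothetical.

```python
import hashlib


def shard_for(user_id, num_shards):
    """Map a user ID to a shard using a stable hash and a modulo."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(user_ids, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1
        for uid in user_ids
        if shard_for(uid, old_shards) != shard_for(uid, new_shards)
    )
    return moved / len(user_ids)


# Illustrative IDs only: grow a 4-shard cluster to 5 shards.
user_ids = range(10_000)
moved = fraction_moved(user_ids, old_shards=4, new_shards=5)
```

With a uniform hash, a key stays on the same shard in this scheme only when its hash has the same remainder modulo 4 and modulo 5, which is expected for about one key in five. Adding a single shard would therefore relocate roughly 80% of the data, which is why capacity changes under this kind of routing are disruptive.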

The Solution: Treasure Data

By introducing Treasure Data [1], Cookpad.com transformed their infrastructure in two fundamental ways.

  • Treasure Data’s Cloud Data Warehouse replaced the cluster of MySQL servers. Now, they run scheduled jobs on Treasure Data that update the MySQL aggregation server that powers their in-house dashboard.
  • Instead of using the default file-based logging, td-agent has been installed on each Rails server to automatically forward logs to Treasure Data.

Those changes essentially solved all three of the problems that Cookpad.com was facing. Let’s look at them in detail.

  1. Scalability is no longer an issue: unlike MySQL, Treasure Data was designed from the ground up to scale. For us, adding more storage or CPU is only a few keystrokes away.
  2. Flexible Schema: Treasure Data’s proprietary columnar database implements a flexible schema model that lets you add or remove a schema at any given time. This means Cookpad.com no longer has to worry about changes in the underlying data model breaking their ETL process.
  3. Updates every 5 Minutes, not every 24 Hours: td-agent is a versatile, robust logger that can handle up to 17,000 messages per second per instance. Furthermore, it is far closer to real-time than uploading data in nightly batches: by default, td-agent buffers data locally for reliability and transfers it to Treasure Data every 5 minutes, so the data is never behind by more than 5 minutes. Of course, the size of the buffer window is configurable, so you can bring it as close to real-time as you need to.
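The buffer-then-forward behavior described in the last item can be sketched as follows. This is not td-agent's implementation: the class name, the send_batch callback, and the injected clock are hypothetical, and for simplicity the sketch checks the interval only when a new event arrives, whereas a real agent would flush on a timer.

```python
import time


class BufferedForwarder:
    """Hold events locally and hand them off in batches on a fixed interval."""

    def __init__(self, send_batch, flush_interval_s=300.0, clock=time.monotonic):
        self._send_batch = send_batch  # hypothetical upload callback
        self._flush_interval_s = flush_interval_s
        self._clock = clock
        self._buffer = []
        self._last_flush = clock()

    def emit(self, event):
        """Buffer one event, then flush if the interval has elapsed."""
        self._buffer.append(event)
        if self._clock() - self._last_flush >= self._flush_interval_s:
            self.flush()

    def flush(self):
        """Send everything buffered so far and restart the interval."""
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer.clear()
        self._last_flush = self._clock()


# A fake clock makes the 5-minute window observable without waiting.
fake_now = [0.0]
sent = []
forwarder = BufferedForwarder(
    sent.append, flush_interval_s=300.0, clock=lambda: fake_now[0]
)

forwarder.emit({"path": "/recipe/42"})  # t = 0 s: buffered only
fake_now[0] = 299.0
forwarder.emit({"path": "/recipe/7"})   # t = 299 s: still buffered
fake_now[0] = 300.0
forwarder.emit({"path": "/recipe/42"})  # t = 300 s: interval elapsed, batch sent
```

Shrinking flush_interval_s narrows the staleness window at the cost of more frequent, smaller uploads, which is the trade-off behind the configurable buffer window mentioned above.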

Cookpad.com now has a scalable, robust, high-performance data warehousing and analytics solution, and all of this was achieved in less than three weeks.

(Side note: You might have concerns about uploading access logs to a third-party service like Treasure Data. Don’t fret: we worked closely with Cookpad’s data infrastructure team to ensure identity-related information is anonymized or filtered out to stay compliant with data privacy guidelines.)

 

Published at DZone with permission of Sadayuki Furuhashi, author and DZone MVB.