Big Data/Analytics Zone is brought to you in partnership with:

Eric is living in Chapel Hill, NC. By night, he writes and edits science fiction. On weekends, he spends too much time making plumbers hop on things. Eric has posted 249 posts at DZone. You can read more from them at their website. View Full User Profile

HDFS for the Batch Layer

07.29.2013
| 5290 views |
  • submit to reddit

HDFS for the Batch Layer

By Nathan Marz and James Warren, authors of Big Data: Principles and best practices of scalable realtime data systems

The high-level requirements for storing data in the Lambda Architecture batch layer are straightforward. In fact, you can map these requirements to a required checklist for a storage solution. In this article, Big Data authors Nathan Marz and James Warren walk you through a batch layer storage solution based on Hadoop Distributed File System (HDFS) and explain how it meets their requirements checklist.

After you learn how to design a data model for the master dataset, the next step is learning how to physically store the master dataset in the batch layer so that it can be processed easily and efficiently. The master dataset is typically too large to exist on a single server, so you must choose how to distribute your data across multiple machines. The way you store your master dataset will impact how you consume it, so it's vital to devise your storage strategy with your usage patterns in mind.

In this article, you'll learn how to store your master dataset using Hadoop Distributed File System (HDFS). We'll begin with an introductory look at HDFS, and we'll conclude with a checklist that summarizes how HDFS meets the storage requirements of the batch layer.

Introducing HDFS

HDFS and Hadoop MapReduce are the two prongs of the Hadoop project: a Java framework for distributed storage and distributed processing of large amounts of data. Hadoop is deployed across multiple servers, typically called a cluster, and HDFS is a distributed and scalable filesystem that manages how data is stored across the cluster. Hadoop is a project of significant size and depth, and we'll provide only a high-level description in this article.

A Hadoop cluster has two types of HDFS nodes:

  • Multiple datanodes
  • A single namenode

When you upload a file to HDFS, it undergoes the process shown in figure 1.

Figure 1 Files are chunked into blocks, which are dispersed to datanodes in the cluster.

The file is first chunked (#1) into blocks of a fixed size, typically between 64MB and 256 MB. Each block is then replicated across multiple (typically three) datanodes that are chosen at random (#2). The namenode (#3) keeps track of the file-to-block mapping and where each block is located.

Distributing a file across many nodes allows it to be easily processed in parallel as shown in figure 2.

Figure 2 The namenode provides the client application with the block locations of a distributed file stored in HDFS.

When a program needs to access a file stored in HDFS, it contacts the namenode (#1) to determine which datanodes host the file contents. The program then accesses the file (#2) for processing. By replicating each block across multiple nodes, your data remains available even when individual nodes are offline.

Getting started with Hadoop
Setting up Hadoop can be an arduous task. Hadoop has numerous configuration parameters that should be tuned for your hardware to perform optimally. To avoid getting bogged down in details, we recommend downloading a preconfigured virtual machine for your first encounter with Hadoop. A virtual machine accelerates your learning of HDFS and MapReduce, and you'll have a better understanding when setting up your own cluster. At the time of this writing, Hadoop vendors Cloudera, Hortonworks, and MapR all have made images publicly available.

Implementing a distributed filesystem is a difficult task; this article covers the basics from a user perspective. Let's now explore how to store a master dataset using HDFS.

Storing a master dataset with HDFS

As a filesystem, HDFS offers support for files and directories, which makes storing a master dataset on HDFS straightforward. You store data units sequentially in files, with each file containing megabytes or gigabytes of data. All of the files of a dataset are then stored together in a common folder in HDFS.

To add new data to the dataset, you create and upload another file containing the new information. For example, suppose you want to store all logins on a server. The following are example logins:

$ cat logins-2012-10-25.txt 
alex     192.168.12.125    Thu Oct 25 22:33 - 22:46 (00:12) 
bob      192.168.8.251     Thu Oct 25 21:04 - 21:28 (00:24) 
charlie  192.168.12.82     Thu Oct 25 21:02 - 23:14 (02:12) 
doug     192.168.8.13      Thu Oct 25 20:30 - 21:03 (00:33) 
... 

To store this data on HDFS, you create a directory for the dataset and upload the logins file (logins-2012-10-25.txt):

$ hadoop fs -mkdir /logins 
$ hadoop fs -put logins-2012-10-25.txt /logins

The hadoop fs commands are Hadoop shell commands that interact directly with HDFS. A full list of commands is available at http://hadoop.apache.org/. In the second line, uploading a file automatically chunks and distributes the blocks across the datanodes. You can list the directory contents:

$ hadoop fs -ls -R /logins 
-rw-r--r--   3 hdfs hadoop  175802352 2012-10-26 01:38
    /logins/logins-2012-10-25.txt 

The ls command is based on the UNIX command of the same name.

And you can verify the contents of the file:

$ hadoop fs -cat /logins/logins-2012-10-25.txt 
alex     192.168.12.125   Thu Oct 25 22:33 - 22:46 (00:12)
bob      192.168.8.251    Thu Oct 25 21:04 - 21:28 (00:24)
...

As shown in figure 1, the file was automatically chunked into blocks and distributed among the datanodes when it was uploaded. To identify the blocks and their locations, use the following command, which runs a HDFS filesystem checking utility:

$ hadoop fsck /logins/logins-2012-10-25.txt -files -blocks -locations 
/logins/logins-2012-10-25.txt 175802352 bytes, 2 block(s):                         #A
OK 
0. blk_-1821909382043065392_1523 len=134217728                                     #B
   repl=3 [10.100.0.249:50010, 10.100.1.4:50010, 10.100.0.252:50010] 
1. blk_2733341693279525583_1524 len=41584624 
   repl=3 [10.100.0.255:50010, 10.100.1.2:50010, 10.100.1.5:50010] 

#A File stored in two blocks #B IP addresses and port numbers of the datanodes hosting each block

Nested folders provide an easy implementation of vertical partitioning. For this logins file example, you may want to partition your data by login date. The layout shown in figure 3 stores each day's information in a separate subfolder so that a function can pass over data not relevant to its computation.

Figure 3 A vertical partitioning scheme for login data. By separating information for each date in a separate folder, a function can select only the folders containing data relevant to its computation.

If you're new to HDFS, it's a platform worth investigating for several reasons: it's an open-source project with an active developer community; it's tightly coupled with Hadoop MapReduce, a distributed computing framework; and it's widely adopted and deployed in production systems by hundreds of companies.

Meeting the batch layer storage requirements with HDFS

Table 1 summarizes how HDFS meets the storage requirements of the batch layer. We covered most of the checklist discussion points in this article, so they should look familiar.

Table 1 HDFS checklist of storage requirements

Operations Criteria Discussion
writes Efficient appends of new data Appending new data is as simple as adding a new file to the folder containing the master dataset.
Scalable storage HDFS evenly distributes the storage across a cluster of machines. You increase storage space and I/O throughput by adding more machines.
reads Support for parallel processing HDFS integrates with Hadoop MapReduce, a parallel computing framework that can compute nearly arbitrary functions on the data stored in HDFS.
Ability to vertically partition data Vertical partitioning is done grouping data into subfolders. A function can read only the select set of subfolders needed for its computation.
both Tunable storage/processing costs You have full control over how you store your data units within the HDFS files. You choose the file format for your data as well as the level of compression.

Beyond storing your master dataset, HDFS can also help you maintain it, primarily through two operations:

  • Appending new files to the master dataset folder
  • Consolidating data to remove small files

These operations must be bulletproof to preserve the integrity and performance of the batch layer. See chapter 3 of Big Data for a discussion of these tasks as well as potential pitfalls.

Summary

HDFS is a powerful tool for storing your data. This article explored how to store a master dataset using HDFS and explained how it meets the storage requirements of the batch layer.

Among the many candidates available for storing the master dataset, HDFS is most commonly chosen because of its integration with Hadoop MapReduce.

Here are some other Manning titles you might be interested in:

Hadoop in Action

Hadoop in Action
Chuck Lam

Pig in Action

Pig in Action
M. Tim Jones

Linked Data

Linked Data
Dan McCreary and Ann Kelly


Save 50% on Big DataThe Programmer's Guide to Thrift, and Hadoop in Action with promo code dzwkd1 only at manning.com. Offer expires midnight, July 30th EST.

Published at DZone with permission of its author, Eric Gregory.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)