Cloud Zone is brought to you in partnership with:

I'm a Windows Azure MVP, the principal developer for OakLeaf Systems and the author of 30+ books about Microsoft software. The books have more than 1.25 million English copies in print and have been translated into 20+ languages. Forbes Magazine ranked me seventh in their "Who Are The Top 20 Influencers in Big Data?" article of 2/3/2012. Roger has posted 37 posts at DZone. You can read more from them at their website. View Full User Profile

How to Use Windows Azure Blobs Data w/ Hadoop on Azure CTP

04.09.2012
| 5877 views |
  • submit to reddit
Introduction

imageAvkash Chauhan (@avkashchauhan) and Denny Lee (@dennylee) have written several blog posts about the use of Windows Azure blobs as data sources for the Apache Hadoop on Windows Azure Community Technical Preview (CTP). Both authors assume readers have some familiarity with Hadoop, Hive or both and use sample files with only a few rows to demonstrate Hadoop operations. The following tutorial assumes familiarity with Windows Azure blob storage techniques, but not with Hadoop on Azure features. Each of this tutorial’s downloadable sample flightdata text files contains about 500,000 rows. An alternative flighttest_2012_01.txt file with 100 rows is provided to speed tests with Apache Hive in the Interactive Hive console.

Note: As mentioned later, an initial set of these files is available from my SkyDrive account here.

imageMy (@rogerjenn) Introducing Apache Hadoop Services for Windows Azure article, updated 4/2/2012 provides an overview of the many facets of Microsoft’s recent replacement for it’s original big-data analytic offerings from Microsoft Research, Dryad, DryadLINQ and LINQ to HPC.

The Hadoop On Azure Elastic Map Reduce (EMR) portal’s Manage Cluster page offers two methods for substituting imageremote blob storage for HDFS storage:

  1. Enabling the ASV protocol to connect to a Windows Azure Blob storage account or
  2. Setting up a connection to an Amazon Simple Storage Services (S3) account

image

Figure 1. The Hadoop on Azure EMR portal’s Manage Cluster page automates substituting Windows Azure or Amazon S3 storage for the HDFS database.

Note: My Importing Windows Azure Marketplace DataMarket DataSets to Apache Hadoop on Windows Azure’s Hive Databases tutorial of 4/3/2012 describes how to take advantage of the import from the Windows Azure Marketplace DataMarket option.

This tutorial uses the ASV protocol in the following three scenarios:

  1. Hadoop/MapReduce with the Remote Desktop Protocol (RDP) and Hadoop Command Shell
  2. Hadoop/MapReduce with the Interactive JavaScript console
  3. Apache Hive with the Interactive Hive console

Note: According to Denny Lee, ASV is an abbreviation for Azure Storage Vault.

Scenarios 1 and 2 use the Hadoop FileSystem (FS) Shell Commands to display and manipulate files. According to the Guide for these commands:

The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path.

  • For HDFS the scheme is hdfs
  • For the local filesystem the scheme is file
  • [For Azure blobs, the scheme is asv]

The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost). Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.

imageScenario 3 executes Apache Hive’s CREATE EXTERNAL TABLE instruction to generate a Hive table having Windows Azure blobs, rather than HDFS, as its data source.

Note: The technical support forum for Apache Hadoop on Windows Azure is the Apache Hadoop on Azure CTP Yahoo! Group.

Creating the Azure Blob Source Data

The source data is a set of 32 tab-delimited text files derived from the FAA’s On-Time Performance data sets, which are available in zipped *.csv format for each month of every year since 1987 from the Bureau of Transportation’s Research and Innovative Technology Administration site. For more information about these files, see The FAA On_Time_Performance Database’s Schema and Size section of my Analyzing Air Carrier Arrival Delays with Microsoft Codename “Cloud Numerics” article of 3/26/2012. Each *.csv file has 83 columns, about 500,000 rows and an average size of about 225 MB.

These files are processed as follows:

  1. Download On_Time_On_Time_Performance_YYYY_M.zip files
  2. Extract *.zip files to On_Time_On_Time_Performance_YYYY_M.csv files
  3. Open each *.csv file in in Excel 2010
  4. Delete all columns except Year, Month, DayofMonth, Carrier, Origin, Dest, DepDelayMinutes, ArrDelayMinutes
  5. Save the abbreviated file as flightdata_YYYY_MM.txt with 0 prepended to single month numbers but don’t save changes to the *.csv file
  6. Open the abbreviated *.txt file in TextPad, delete the first (column name) row (see Figure 2) and save changes
  7. Open the *.txt file in Excel and replace empty DepDelayMinutes and ArrDelayMinutes cells with 0 (see the section below for the process)
  8. Upload the files to a flightdata folder of an aircarrier container in a Windows Azure storage account with Cerebrata Cloud Storage Studio or equivalent (see Figure 3)

image

Figure 2. TextPad displaying the latest available flightdata_YYYY_MM.txt file after deleting the column names on the first line. Blank cells have been replaced with zeroes as described below.

Note: The *.txt files have an average size of about 13 MB and are available to download in *.zip format from the FlightData Files for Hadoop on Azure folder of my SkyDrive account. TextPad is recommended for editing because it opens and saves the files much more quickly than Notepad.

image

Figure 3. Cerebrata Cloud Storage Studio displaying the five files of Figure 2 after uploading them to the \oakleaf\flightdata Windows Azure blob container.

Note: Adding the *.txt files to a folder within the container enables the Hive table to use all files in the folder as its data source.

Replacing Empty DepDelayMinutes and ArrDelayMinutes Cells with Zeroes

Records for canceled and diverted flights have empty cells for departure, arrival or both delay times, as shown for rows 1 and 22 below:

image

Figure 4. Empty cells in numeric (integer) fields occur for canceled and diverted flights.

Empty (null) values in integer columns might cause unexpected results in MapReduce or HiveQL operations that depend on these values. To replace empty cells with zeroes in Excel, do the following:

1. Open the *.txt file in Excel, select the columns whose empty cells you want to fill, and press Ctrl+G to open the Go To dialog, click the Special button to open the Go To Special dialog and select Blanks:

image

Figure 5. Use the Go To Special dialog to select all empty cells in the selected columns.

2. Click OK to dismiss the dialog and select the Empty cells, and type a 0 in the first empty cell:

image

Figure 6. Type a 0 in the first selected empty cell to prepare for propagation to all empty cells.

3. Press Ctrl+Enter to replicate the replacement throughout the worksheet:

image

Figure 7. Pressing Ctrl+Enter replaces all blank cells with zeroes.

4. Save the modified worksheet.

Configuring the ASV Data Source

1. Click the Manage Cluster page’s Set Up ASV button (refer to Figure 1) to open the Configre Azure Blob Storage page. Type the Storage Account Name and its Primary Access Key value in the two text boxes:

image

Figure 8. Data required to complete configure a storage account as an Azure Storage Vault.

2. Click the Save Settings button to create the ASV connection and display this confirmation page:

image

Figure 9. Confirming the new ASV connection.

Scenario 1: Working with ASV Files in the Hadoop Command Shell via RDP

1. Click the Azure on Hadoop EMR portal landing page’s Remote Desktop tile in the Your Cluster group to start the connection and display the download bar for your connection:

image

Figure 10. The download bar for a Remote Desktop (RDP) connection.

2. Click the Open button to open the Windows Security dialog to your Hadoop on Azure instance and type your password:

image

Figure 11. Providing the Password for your Hadoop on Azure cluster.

3. Click OK to display the Remote Desktop Connection warning dialog:

image

Figure 12. The primary Remote Desktop Connection dialog to your Hadoop on Azure cluster

4. Click Yes to start (see Figure 13) and complete (see Figure 14) the connection process:

image

Figure 13. The RDP’s Connecting dialog.

image

Figure 14. A Remote Desktop Connection to a Hadoop on Azure cluster.

5. Double-click the Hadoop Command Shell icon to open the Hadoop Command Shell window.

6. To list the files you uploaded previously, change to the c:\apps\dist folder if necessary, type hadoop fs -lsr asv://aircarrier/flightdata/, and press Enter:

image

Figure 15. Using the Hadoop FileSystem -lsr command to list files in a designated container’s folder.

7. To verify that a text file contains content, for this example execute hadoop fs -tail asv://aircarrier/flightdata/flightdata_2012_01.txt or substitute another file name to display the last 1 kB of the file:

image

Figure 16. Displaying the last 1 kB of a text file with the FileSystem -tail command.

Note: Data for the last three days of January 2012 is missing because monthly data is based on full weeks.

Scenario 2: Working with ASV Files in the Interactive JavaScript Console

The Interactive JavaScript console handles most FileSystem commands you can execute in the Hadoop Command Shell. To emulate the preceding scenario, do the following:

1. On the Hadoop on Azure EMR portal’s landing page, click the Interactive Console tile and sign in if requested to open the Interactive JavaScript console by default.

2. Type #ls asv://aircarrier/flightdata/ or #lsr asv://aircarrier/flightdata/ (the recursive version) to display a list of the six files uploaded above.

3. Type #tail asv://aircarrier/flightdata/flightdata_2012_01.txt to display the the last 1 kB of the file:

image

Figure 17. Displaying a list of files and the last 1 kB of a specified file’s content in the Interactive JavaScript console.

Scenario 3: Working with ASV Files in the Interactive Hive Console

Hive tables require a schema to define their data structure before adding records from the ASV data source. Hive’s CREATE [EXTERNAL] TABLE syntax is similar to T-SQL’s, as demonstrated by the Apache Hive Language Manual’s Create Table Syntax topic. For this example, the Data Manipulation Language (DML) statement to create a table with abbreviated names of columns listed in step 4 of the file processing instructions is:

CREATE EXTERNAL TABLE flightdata_asv (
year INT,
month INT,
day INT,
carrier STRING,
origin STRING,
dest STRING,
depdelay INT,
arrdelay INT
)
COMMENT 'FAA on-time data'
ROW FORMAT DELIMITED FIELDS TERMINATED by '9'
STORED AS TEXTFILE
LOCATION 'asv://aircarrier/flightdata';

To add the table to the Hive database in the same folder as its source data, do this:

1. Click the Interactive Console tile to open the Interactive JavaScript page and click the Hive button to open the Interactive Hive Console.

2. Copy and paste the preceding CREATE EXTERNAL TABLE command in the text box at the bottom of the page, and click the Evaluate button to create the flightdata_asv table.

image

Figure 18. A multiple-exposure screen capture displaying available Hive tables and the columns of a selected table.

Note: The flighttest_asv table uses the \aircarrier\flighttest\flighttest_2012_01.txt file which contains 100 rows. The smaller table speeds tests for live demos. Clicking a >> button adds the table name or column name to the text box so you don’t need to type it.

3. To verify the table schema, click the Clear Screen button, type DESCRIBE flightdata_asv in the text box and click Evaluate:

image

Figure 19. The DESCRIBE command lists column names and data types.

4. Verify that the Windows Azure blog data is accessible by typing SELECT * FROM flightdata_asv LIMIT 10 in the text box and click the Evaluate button:

image

Figure 20. The first 10 rows return quickly.

5. To verify that the Hive table uses all six text files currently in the flightdata folder, type SELECT COUNT(*) FROM flightdata_asv in the text box and click the Evaluate button:

image

Figure 21. Each *.txt file has an average of about 500,000 rows, so a count of ~3 million (returned in about 74 seconds) verifies that the Hive table uses all tables in the flightdata folder.

The full text of Hive History is:

Hive history file=C:\Apps\dist\logs\history/hive_job_log_RogerJ_201204071855_304201235.txt
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201204071422_0001, Tracking URL = http://10.186.132.33:50030/jobdetails.jsp?jobid=job_201204071422_0001
Kill Command = C:\Apps\dist\bin\hadoop.cmd job -Dmapred.job.tracker=10.186.132.33:9010 -kill job_201204071422_0001
2012-04-07 18:55:42,800 Stage-1 map = 0%, reduce = 0%
2012-04-07 18:55:57,878 Stage-1 map = 17%, reduce = 0%
2012-04-07 18:56:01,910 Stage-1 map = 33%, reduce = 0%
2012-04-07 18:56:04,925 Stage-1 map = 67%, reduce = 0%
2012-04-07 18:56:06,941 Stage-1 map = 72%, reduce = 0%
2012-04-07 18:56:10,003 Stage-1 map = 100%, reduce = 0%
2012-04-07 18:56:13,019 Stage-1 map = 100%, reduce = 22%
2012-04-07 18:56:22,050 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201204071422_0001
OK
Time taken: 74.423 seconds

6. Click the Jobs button to display the result of the previous steps (and a few for the flighttest_asv table):

image

Figure 22. The Job History page displays items for correct and incorrect syntax, including execution times of HiveQL statements.

7. To return the average and standard deviation of arrival delays, execute a SELECT avg(arrdelay), stddev_pop(arrdelay) FROM flightdata_asv query:

image

Figure 23. Obtaining the average and standard deviation of the arrival delay column in minutes. Execution time was 77.2 seconds.

The 9.15 and 28.52 values returned are in the same range as those calculated by the Codename “Cloud Numerics” project using 32 months (~16 million rows) of data: 4.51 and 34.97 minutes reported in the Interpreting FlightDataResult.csv’s Data section of my Analyzing Air Carrier Arrival Delays with Microsoft Codename “Cloud Numerics” article of 3/26/2012.

Note: avg(column) and stddev_pop(column) are members of HiveQL’s built-in aggregate functions (UDAF).

8. To generate an array representing coordinates of a histogram having 20 unevenly spaced bins of arrival delays (x) of height (y) in total minutes, execute SELECT histogram_numeric(arrdelay, 20) FROM flightdata_asv:

image

Figure 24. Data for a 20-bin histogram of total flight delay minutes.

Using the Interactive JavaScript Console to Create a Simple Graph from Histogram Tail Data

Avkash Chauhan described how to create a graph from HDFS file data in his Apache Hadoop on Windows Azure Part 9 – Using Interactive JavaScript for Data Visualization post of 1/2/2012. The following procedure uses the Interactive JavaScript console and the JavaScript native graph class with edited data from a modified version of Figure 24’s HiveQL statement that returns data for 10 bins. JavaScript native graphs don’t have an option for a logarithmic y axis, which would be required to accommodate very large hour counts for delays less than a few hours. Therefore, the data is edited to show the last 5 bins of the tail, which contains counts for delays of 12 hours or longer. Decimal fractions are removed for clarity.

To generate a five-bin member histogram, do the following:

1. In the Interactive Hive console, execute a SELECT histogram_numeric(arrdelay, 10) FROM flightdata_asv command.

2. Select the returned data, copy it to the clipboard, and paste the array to Notepad.

3. Delete the first five members and remove the decimal fractions, leaving this array:

image

Figure 25. Reducing the number of histogram bins and removing decimal fractions from the data.

4. Select and copy the array data to the clipboard. Alternatively, select and copy this data: [{"x":726,"y":117}, {"x":892,"y":78}, {"x":1046,"y":28}, {"x":1243,"y":10},{"x":1403,"y":9}] 

5. Click the console’s JavaScript button, type data = at the js> prompt, paste the array data and press Enter to provide the graph’s data argument value (see Figure 26).

6. Set the graph options by typing options = { title: "Air Carrier Arrival Delay Histogram", orientation: 20, x: "x", y: "y" } at the prompt and pressing Enter:

image

Figure 26. Providing data and options argument values for the histogram tail graph.

7. Type graph.bar(data, options) at the prompt and press Enter to display a bar chart with departure delay hours on the x-axis and the count of departure delays for the period on the y-axis:

image

Figure 27: A histogram of the number of air-carrier arrival delays of 12 hours and longer
from August 2011 through January 2012.

Viewing Hadoop/MapReduce Job and HDFS Head Node Details The default remote desktop window includes a Hadoop MapReduce icon that opens a Hadoop/MapReduce Administration page and a Head Node icon that opens a NameNode ‘10.186.132.33:9000’ page (the Head Node for this example). MapReduce Job History in the Hadoop/MapReduce Administration Page

Here’s the page’s Completed Jobs section after running four MapReduce jobs from the Interactive Hive console:

image

Head Node Details in the NameNode ‘10.186.132.33:9000’ Page

Here’s most of the NameNode page:

image

You can browse the filesystem and Namenode logs, as well as view details of live and other nodes from links on this page. Most of this information isn’t germane to MapReduce operations generated by interactive Hive queries against ASV data.

This post will be updated when more flightdata*.txt files have been uploaded to a replacement for the oakleaf storage account. The North Central US hosts Hadoop on Azure clusters so I will create a new storage account there to reduce latency. The oakleaf storage account is in the South Central US (San Antonio, TX) data center.

Note: According to a message of 4/8/2012 from Microsoft’s Brad Sarsfield: “Plans are not final, but this may change in the next few months, likely to W[est] US or E[ast] US.”

Stay tuned.

Published at DZone with permission of its author, Roger Jennings. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)