
Writing Hive Tables from MapReduce

01.09.2013

This article is by Stephen Mouring Jr, appearing courtesy of Scott Leberknight.

This is part one of a two-part blog series on how to read and write Apache Hive data from MapReduce jobs.

Hive is an awesome tool that integrates with HDFS and MapReduce by providing a SQL syntax and schema layer on top of HDFS files. It allows you to treat HDFS files like SQL tables with a row / column format. You can run SQL queries against HDFS files or create new files by running SQL insert statements. 
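
For example, once a table has been defined over files in HDFS, you can query it and populate other tables with ordinary SQL-style statements (the table and column names here are just illustrative):

select foo, count(*) from some_table group by foo;
insert overwrite table summary_table select foo, count(*) from some_table group by foo;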

Hive translates the SQL queries into a series of MapReduce jobs that emulate that query's behavior. While Hive is very useful, it is not always efficient (or even possible) to represent your business logic as a Hive query. Like SQL, Hive is sometimes limited in what it can do. 

So this leaves you with the option of writing your own MapReduce job to perform the business logic you want in an efficient manner. But what if you want to have your data in a form that Hive can still understand so you can query it in Hive in the future? 

Thankfully, you can easily write your output files in the format Hive expects, so that Hive will recognize them as table data.

MapReduce operates on keys and values. When you write output from a MapReduce job, you have to emit a key and a value. If you want to emit multiple values you have to emit a value (or a key in some cases) that is delimited. Hive adds a layer on top of this paradigm by defining standard delimiters (which you can choose to override) and treating each delimited value in a file as a value of a column. 

By default, Hive uses ASCII character 1 ('\001') as the field delimiter, since it is a nonprintable character not likely to occur in text data. Hive also supports "array fields", which are multivalued fields that are still treated as a single column value. Array fields have a separate delimiter for their individual values. Here is a sample Hive table definition where I have, for the sake of clarity, explicitly defined the delimiters: 

create table test (
  foo string,
  bar string,
  baz array<string>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002';
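
With those delimiters in place, each row of the table is just one line in the underlying HDFS file. For example, a (made up) row with foo = "fooValue", bar = "barValue", and baz = ["baz1", "baz2"] would be stored as the following line, writing the nonprintable \001 and \002 delimiters as ^A and ^B:

fooValue^AbarValue^Abaz1^Bbaz2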


So how do we write to this table from a MapReduce job? For convenience, I like to first define a few constants: 

import org.apache.hadoop.io.NullWritable;

// Hive's default delimiters: ASCII 1 between fields, ASCII 2 between array values.
public static final String SEPARATOR_FIELD = new String(new char[] {1});
public static final String SEPARATOR_ARRAY_VALUE = new String(new char[] {2});

// A NullWritable key: TextOutputFormat skips null/NullWritable keys (and the key/value
// separator) entirely, so the output file contains only the delimited row text.
public static final NullWritable NULL_KEY = NullWritable.get();


The first two constants are the delimiters. Note the use of the char constants 1 and 2. You do NOT want to write new String("" + 1), because that gives you the ASCII value for the digit "1" (which is actually ASCII 49). You want the ASCII value 1 (which, for those who are interested, is the nonprintable "start of heading" character). The same logic applies for ASCII 2. 
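
If you want to convince yourself of the difference, a quick throwaway check (purely illustrative) shows the two character codes:

// The char constant 1 is ASCII 1 ("start of heading"):
System.out.println((int) new String(new char[] {1}).charAt(0)); // prints 1
// The string "1" contains the digit character, ASCII 49:
System.out.println((int) ("" + 1).charAt(0)); // prints 49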

Remember that MapReduce writes its output as key/value pairs. For a plain text Hive table, Hive expects the key to be null and the value to be a MapReduce Text object containing the delimited column values. Hence the NULL_KEY constant above. So to emit a Hive row, you first aggregate the column values into a Text object, then emit that Text object as the value with NULL_KEY as the key. If any of your columns are of an array type (as "baz" is in our example), the individual array values must first be joined together with the array-value delimiter.

// Build up the array values as a delimited string.
// (bazValues is assumed to be a Collection<String> holding the array column's values.)
StringBuilder bazValueBuilder = new StringBuilder();
int i = 0;
for (String bazValue : bazValues) {
    bazValueBuilder.append(bazValue);
    ++i;
    if (i < bazValues.size()) {
        bazValueBuilder.append(SEPARATOR_ARRAY_VALUE);
    }
}




// Build up the column values / fields as a delimited string.
StringBuilder hiveRow = new StringBuilder();
hiveRow.append("fooValue");
hiveRow.append(SEPARATOR_FIELD);
hiveRow.append("barValue");
hiveRow.append(SEPARATOR_FIELD);
hiveRow.append(bazValueBuilder.toString());




// Emit a null key and a Text object containing the delimited fields
context.write(NULL_KEY, new Text(hiveRow.toString()));
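
Putting it all together, a reducer that emits Hive-compatible rows might look roughly like the sketch below. This is only an illustration: the class name is made up, the reduce logic simply reuses the hard-coded fooValue / barValue fields from the snippets above, and it assumes the newer org.apache.hadoop.mapreduce API.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HiveRowReducer extends Reducer<Text, Text, NullWritable, Text> {

    public static final String SEPARATOR_FIELD = new String(new char[] {1});
    public static final String SEPARATOR_ARRAY_VALUE = new String(new char[] {2});
    public static final NullWritable NULL_KEY = NullWritable.get();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Join the multivalued "baz" column with the array-value delimiter.
        StringBuilder bazValueBuilder = new StringBuilder();
        for (Text value : values) {
            if (bazValueBuilder.length() > 0) {
                bazValueBuilder.append(SEPARATOR_ARRAY_VALUE);
            }
            bazValueBuilder.append(value.toString());
        }

        // Join the columns with the field delimiter: foo, bar, baz.
        StringBuilder hiveRow = new StringBuilder();
        hiveRow.append("fooValue").append(SEPARATOR_FIELD);
        hiveRow.append("barValue").append(SEPARATOR_FIELD);
        hiveRow.append(bazValueBuilder.toString());

        // Null key plus delimited Text value is what Hive's text format expects.
        context.write(NULL_KEY, new Text(hiveRow.toString()));
    }
}

In the driver, the job's output key class would be NullWritable, the output value class Text, and the output format the default TextOutputFormat, with the output path pointing at whichever directory you plan to load from, as discussed next.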


When configuring your MapReduce job's output path, you have two choices. You can either write the output to a temporary directory in HDFS and then load it into the table with Hive's LOAD DATA command: 

LOAD DATA INPATH 'your_jobs_output_path' OVERWRITE INTO TABLE yourtable;


Or, if your Hive table is partitioned, you can write the output directly into a partition directory inside Hive's HDFS warehouse structure (for example: /user/hive/warehouse/yourdatabase.db/yourtable/yourpartition=yourvalue) and then execute a Hive ALTER TABLE command: 

alter table yourtable add partition (yourpartition='yourvalue');
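
If you go the partitioned route, you can point the job's output directly at the partition directory when you configure it. A minimal sketch, assuming the new-API FileOutputFormat and the example warehouse path above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Write the job output straight into the Hive warehouse partition directory,
// then run the ALTER TABLE ... ADD PARTITION statement above so Hive picks it up.
FileOutputFormat.setOutputPath(job,
    new Path("/user/hive/warehouse/yourdatabase.db/yourtable/yourpartition=yourvalue"));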


Thanks for reading! Please follow me on Twitter (@marlhammer) if you like my blogs! 

Published at DZone with permission of Scott Leberknight, author and DZone MVB. (source)
