
Datawarehouse with Hadoop+Hbase+Hive+SpringBatch – Part 2

02.22.2013

The SVN codebase for this article is here.

In continuation of Part 1, this section covers:

  1. Setting up Hadoop, HBase, and Hive on a new Ubuntu VM
  2. Running Hadoop, HBase, and Hive as services
  3. Setting up the Spring Batch project and running the tests
  4. Some useful commands/tips

To begin with, let me explain the choice of Hive: the intent was not to use Hive as a JDBC equivalent, but to understand how to use Hive as a powerful data warehouse analytics engine.

Setting up Hadoop, HBase, and Hive on a new Ubuntu VM

Download the latest Hadoop, HBase, and Hive from the Apache websites. Alternatively, you can go to the Cloudera website, get the Cloudera Ubuntu VM, and use apt-get to install hadoop, hbase, and hive; that did not work for me, but if you are adventurous you can try it. You can also try MapR's VMs. Both Cloudera and MapR have good documentation and tutorials.

Unzip the archives in your home directory, then edit the .profile file and add the bin directories to the path as below:

export HADOOP_HOME=<HADOOP HOME>
export HBASE_HOME=<HBASE HOME>
export HIVE_HOME=<HIVE HOME>
export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin

Then create the Hadoop temp directory, give your login user ownership of it, and format the namenode:

sudo mkdir -p /app/hadoop/tmp
sudo chown <login user>:<group> /app/hadoop/tmp
hadoop namenode -format
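
For the format step to pick up the directory created above, it has to be referenced from Hadoop's configuration. A minimal conf/core-site.xml sketch for a single-node setup (the values are assumptions that match the directory above and the 54310 port mentioned later; adjust them to your install):

<!-- conf/core-site.xml (sketch): single-node settings assumed by this walkthrough -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>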

Make sure the HADOOP_HOME, HBASE_HOME and HIVE_HOME environment variables are picked up in your shell (open a new terminal or source the .profile).
Run ifconfig and note the IP address; it will be something like 192.168.45.129.

Go to the /etc/hosts file and add an entry like:

192.168.45.129 <machine name>

Running Hadoop, HBase, and Hive as services

Go to the Hadoop root folder and run the command below:

start-all.sh

Open a browser and access http://localhost:50070/; it will open the Hadoop (NameNode) admin console. If there are any issues, execute the command below and check for exceptions:

tail -f $HADOOP_HOME/logs/hadoop-<login username>-namenode-<machine name>.log
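
If the daemons came up cleanly, jps (which ships with the JDK) should list them, roughly:

jps
# expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (plus Jps itself)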

In this setup Hadoop (the HDFS namenode) runs on port 54310.
Go to the HBase root folder and run the commands below:

start-hbase.sh
tail -f $HBASE_HOME/logs/hbase-<login username>-master-<machine name>.log

See if there are any errors. HBase runs on port 60000 by default.
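
A quick way to confirm HBase is up is to ask the HBase shell for its status (the shell accepts commands piped on stdin):

echo "status" | hbase shell
echo "list" | hbase shell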
Go to the Hive root folder and run the command below:

hive --service hiveserver -hiveconf hbase.master=localhost:60000 -hiveconf mapred.job.tracker=local

Notice that by passing the HBase master reference we have integrated Hive with HBase. Hive's default port is 10000. Now run Hive as a command-line client as follows:

hive -h localhost
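
Note that the hiveserver command above runs in the foreground. To keep it alive as a background service, one simple option (a sketch using nohup in place of a proper init script; the log file name is arbitrary) is:

nohup hive --service hiveserver -hiveconf hbase.master=localhost:60000 -hiveconf mapred.job.tracker=local > hiveserver.log 2>&1 &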

Create the seed table and the HBase-backed table as below:

CREATE TABLE weblogs(key int, client_ip string, day string, month string, year string, hour string,
  minute string, second string, user string, loc  string) row format delimited fields terminated by '\t';

CREATE TABLE hbase_weblogs_1(key int, client_ip string, day string, month string, year string, hour string,
  minute string, second string, user string, loc string)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:client_ip,cf2:day,cf3:month,cf4:year,cf5:hour,cf6:minute,cf7:second,cf8:user,cf9:loc")
  TBLPROPERTIES ("hbase.table.name" = "hbase_weblog");

LOAD DATA LOCAL INPATH '/home/hduser/batch-wordcount/weblogs_parse1.txt' OVERWRITE INTO TABLE weblogs;

INSERT OVERWRITE TABLE hbase_weblogs_1 SELECT * FROM weblogs;
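
To sanity-check the load, compare row counts between the staging table and the HBase-backed table (plain HiveQL; both queries run as MapReduce jobs). You can also scan a few rows of the underlying table from the HBase shell:

SELECT COUNT(*) FROM weblogs;
SELECT COUNT(*) FROM hbase_weblogs_1;
-- from the HBase shell: scan 'hbase_weblog', {LIMIT => 5}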

Setting up the Spring Batch project and running the tests
To set up this project, get the latest code from the SVN location mentioned at the beginning. Download Gradle and set up its path in the .profile. Now run the command below to load the data:

gradle -Dtest=org.springframework.data.hadoop.samples.DataloadWorkflowTests test

Run the JUnit test below to get the analysis data:

gradle -Dtest=org.springframework.data.hadoop.samples.AnalyzeWorkflowTests test

The hadoopVersion is 1.0.2 and the build.gradle file looks as below,

repositories {
  // Public Spring artefacts
  mavenCentral()
  maven { url "http://repo.springsource.org/libs-release" }
  maven { url "http://repo.springsource.org/libs-milestone" }
  maven { url "http://repo.springsource.org/libs-snapshot" }
  maven { url "http://www.datanucleus.org/downloads/maven2/" }
  maven { url "http://oss.sonatype.org/content/repositories/snapshots" }
  maven { url "http://people.apache.org/~rawson/repo" }
  maven { url "https://repository.cloudera.com/artifactory/cloudera-repos/"}
}

dependencies {
  compile ("org.springframework.data:spring-data-hadoop:$version")
  { exclude group: 'org.apache.thrift', module: 'thrift' }
  compile "org.apache.hadoop:hadoop-examples:$hadoopVersion"
  compile "org.springframework.batch:spring-batch-core:$springBatchVersion"
  // update the version that comes with Batch
  compile "org.springframework:spring-tx:$springVersion"
  compile "org.apache.hive:hive-service:0.9.0"
  compile "org.apache.hive:hive-builtins:0.9.0"
  compile "org.apache.thrift:libthrift:0.8.0"
  runtime "org.codehaus.groovy:groovy:$groovyVersion"
  // see HADOOP-7461
  runtime "org.codehaus.jackson:jackson-mapper-asl:$jacksonVersion"
  testCompile "junit:junit:$junitVersion"
  testCompile "org.springframework:spring-test:$springVersion"
}
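
The $... version placeholders above come from the project's gradle.properties. The values below are only illustrative; the article fixes hadoopVersion at 1.0.2, so check the SVN codebase for the rest:

# gradle.properties (illustrative values; only hadoopVersion is stated in the article)
version=1.0.0.M1
hadoopVersion=1.0.2
springVersion=3.0.7.RELEASE
springBatchVersion=2.1.9.RELEASE
groovyVersion=1.8.5
jacksonVersion=1.8.8
junitVersion=4.10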

Spring Data Hadoop configuration looks as below,

<configuration>
  <!-- The value after the colon is the default used when hd.fs is not provided -->
  fs.default.name=${hd.fs:hdfs://localhost:9000}
  mapred.job.tracker=local
</configuration>

<hive-client host="localhost" port="10000" />
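
These snippets sit inside a single Spring context file. Roughly, its header declares the Spring Data Hadoop namespace as the default and the batch namespace with a prefix (a sketch; the exact schemaLocation entries depend on the Spring versions in use):

<beans:beans xmlns="http://www.springframework.org/schema/hadoop"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:beans="http://www.springframework.org/schema/beans"
    xmlns:batch="http://www.springframework.org/schema/batch"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/batch http://www.springframework.org/schema/batch/spring-batch-2.1.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

  <!-- <configuration>, <hive-client>, <hive-tasklet> and the <batch:job> definitions go here -->

</beans:beans>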

Spring Batch job looks as below,

<batch:job id="job1">
  <batch:step id="import">
    <batch:tasklet ref="hive-script"/>
  </batch:step>
</batch:job>

Spring Data Hive script for loading the data is as below,

<hive-tasklet id="hive-script">
  <script>
LOAD DATA LOCAL INPATH '/home/hduser/batch-analysis/weblogs_parse.txt' OVERWRITE INTO TABLE weblogs;
INSERT OVERWRITE TABLE hbase_weblogs_1 SELECT * FROM weblogs;
</script>
</hive-tasklet>

Spring Data Hive script for analyzing the data is as below,

<hive-tasklet id="hive-script">
  <script>
    SELECT client_ip, count(user) FROM hbase_weblogs_1 GROUP BY client_ip;
  </script>
</hive-tasklet>

Some useful commands/tips

For querying the Hadoop DFS you can use familiar file-based Unix-style commands, like:

hadoop dfs -ls /
hadoop dfs -mkdir /hbase
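
A few more come in handy when moving data in and out of HDFS (the paths below are just examples reusing the weblog file from earlier):

hadoop dfs -copyFromLocal /home/hduser/batch-wordcount/weblogs_parse1.txt /tmp/
hadoop dfs -cat /tmp/weblogs_parse1.txt
hadoop dfs -rmr /tmp/weblogs_parse1.txt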

If Hadoop has entered safe mode and is not starting up, you can execute the command below:

hadoop dfsadmin -safemode leave

If you want to find errors in the Hadoop filesystem, you can execute the command below:

hadoop fsck /

