
Istvan Szegedi is an IT Technical Architect at Vodafone UK. He has worked at Hewlett-Packard, Nokia Networks, Google, Morgan Stanley and Vodafone. He holds certifications such as Sun Certified System Administrator, Sun Certified Java Programmer, Sun Certified Web Component Developer, Salesforce.com Certified Force.com Developer and TOGAF Certified Enterprise Architect. As a big fan of mobile and cloud computing, he likes to believe that these technologies will eventually push aside the desktop/client-server architecture. Istvan is a DZone MVB, is not an employee of DZone, and has posted 37 posts at DZone.

Big Data on Heroku – Hadoop From Treasure Data

08.24.2012

This time I am writing about Heroku and the Treasure Data Hadoop solution, which I found to be a real ‘gem’ in the Big Data world.

Heroku is a cloud platform as a service (PaaS) owned by Salesforce.com. It originally supported Ruby as its main programming language, but support has since been extended to Java, Scala, Node.js, Python and Clojure. It also offers a long list of add-ons, including RDBMS and NoSQL stores and a Hadoop-based data warehouse developed by Treasure Data.

Treasure Data Hadoop Architecture

The architecture of the Treasure Data Hadoop solution is as follows:

Heroku Toolbelt

Heroku Toolbelt is a command-line tool set that consists of the heroku, foreman and git packages. As the Heroku Toolbelt website puts it, it is “everything you need to get started using heroku”. (The heroku CLI is based on Ruby, so you need Ruby under the hood, too.) Once you have signed up for Heroku (you need a verified account, meaning that you have provided your bank details for potential service charges) and installed the Heroku Toolbelt, you can start right away.

Depending on your environment (I am using Ubuntu 12.04 LTS), you can use an alternative installation method such as:

$ sudo apt-get install git
$ gem install heroku
$ gem install foreman

Heroku and Treasure Data add-on

If you want to use Treasure Data on Heroku, you need to add the Treasure Data Hadoop add-on. To do so, log in, create an application (heroku will generate a fancy name like boiling-tundra for you) and then add the add-on to the application you just created:

$ heroku login
Enter your Heroku credentials.
Email: xxx@mail.com
Password (typing will be hidden): 
Found existing public key: /home/istvan/.ssh/id_dsa.pub
Uploading SSH public key /home/istvan/.ssh/id_dsa.pub... done
Authentication successful.

$ heroku create
Creating boiling-tundra-1234... done, stack is cedar
http://boiling-tundra-1234.herokuapp.com/ | git@heroku.com:boiling-tundra-1234.git

$ heroku addons:add treasure-data:nano --app boiling-tundra-1234
Adding treasure-data:nano on boiling-tundra-1234... done, v2 (free)
Use `heroku addons:docs treasure-data:nano` to view documentation.

I just love the color scheme and the graphics used in the heroku console; they are simply brilliant.

Treasure Data toolbelt

To manage Treasure Data Hadoop on Heroku you need to install the Treasure Data toolbelt. It fits in well with the heroku CLI and is also based on Ruby:

$ gem install td

Then install the heroku plugin so that td commands can be run through heroku:

$ heroku plugins:install https://github.com/treasure-data/heroku-td.git
Installing heroku-td... done

To verify that everything is fine, just run:

$ heroku plugins
=== Installed Plugins
heroku-td

and

$ heroku td
usage: heroku td [options] COMMAND [args]

options:
  -c, --config PATH                path to config file (~/.td/td.conf)
  -k, --apikey KEY                 use this API key instead of reading the config file
  -v, --verbose                    verbose mode
  -h, --help                       show help
...

Treasure Data Hadoop – td commands

Now we are ready to execute td commands from heroku. td commands are used to create databases and tables, import data, run queries, drop tables, etc. Under the hood, td queries are basically HiveQL queries. (According to their website, Treasure Data plans to support Pig as well in the future.)

By default the Treasure Data td-agent prefers JSON-formatted data, though it can process various other formats (Apache log, syslog, etc.) and you can write your own parser to process the uploaded data.

Thus I converted my AAPL stock data (again thanks to http://finance.yahoo.com) into JSON format, one record per line:

{"time":"2012-08-20", "open":"650.01", "high":"665.15", "low":"649.90", "close":"665.15", "volume":"21876300", "adjclose":"665.15"}
{"time":"2012-08-17", "open":"640.00", "high":"648.19", "low":"638.81", "close":"648.11", "volume":"15812900", "adjclose":"648.11"}
{"time":"2012-08-16", "open":"631.21", "high":"636.76", "low":"630.50", "close":"636.34", "volume":"9090500", "adjclose":"634.64"}
{"time":"2012-08-15", "open":"631.30", "high":"634.00", "low":"625.75", "close":"630.83", "volume":"9190800", "adjclose":"630.83"}
{"time":"2012-08-14", "open":"631.87", "high":"638.61", "low":"630.21", "close":"631.69", "volume":"12148900", "adjclose":"631.69"}
{"time":"2012-08-13", "open":"623.39", "high":"630.00", "low":"623.25", "close":"630.00", "volume":"9958300", "adjclose":"630.00"}
{"time":"2012-08-10", "open":"618.71", "high":"621.76", "low":"618.70", "close":"621.70", "volume":"6962100", "adjclose":"621.70"}
{"time":"2012-08-09", "open":"617.85", "high":"621.73", "low":"617.80", "close":"620.73", "volume":"7915800", "adjclose":"620.73"}
{"time":"2012-08-08", "open":"619.39", "high":"623.88", "low":"617.10", "close":"619.86", "volume":"8739500", "adjclose":"617.21"}
{"time":"2012-08-07", "open":"622.77", "high":"625.00", "low":"618.04", "close":"620.91", "volume":"10373100", "adjclose":"618.26"}
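A conversion like this can be scripted. The following is only a minimal Python sketch, not the method I actually used: it assumes the downloaded file is a CSV with the header names Date, Open, High, Low, Close, Volume and Adj Close (an assumption about the export format), and the names FIELD_MAP and csv_to_json_lines are made up for illustration. Values are kept as strings, as in the records above.

```python
import csv
import io
import json

# Assumed CSV header names (not taken from the article) mapped to the
# JSON keys used in the sample records.
FIELD_MAP = {
    "Date": "time",
    "Open": "open",
    "High": "high",
    "Low": "low",
    "Close": "close",
    "Volume": "volume",
    "Adj Close": "adjclose",
}


def csv_to_json_lines(csv_text):
    """Return one JSON string per CSV data row, keeping all values as strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        json.dumps({key: row[column] for column, key in FIELD_MAP.items()})
        for row in reader
    ]


if __name__ == "__main__":
    sample = (
        "Date,Open,High,Low,Close,Volume,Adj Close\n"
        "2012-08-20,650.01,665.15,649.90,665.15,21876300,665.15\n"
    )
    # Each printed line is one record, ready to be written to aapl.json.
    for line in csv_to_json_lines(sample):
        print(line)
```

Writing the returned lines to a file, one per line, gives a file in the same shape as the aapl.json used below.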

The first step is to create the database called aapl:

$ heroku td db:create aapl --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /usr/local/heroku/lib/heroku/client.rb:129.
Database 'aapl' is created.

Then create the table called marketdata:

$ heroku td table:create aapl marketdata --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /usr/local/heroku/lib/heroku/client.rb:129.
Table 'aapl.marketdata' is created.

Check that the table has been created successfully:

$ heroku td tables --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /usr/local/heroku/lib/heroku/client.rb:129.
+----------+------------+------+-------+--------+
| Database | Table      | Type | Count | Schema |
+----------+------------+------+-------+--------+
| aapl     | marketdata | log  | 0     |        |
+----------+------------+------+-------+--------+
1 row in set

Import data:

$ heroku td table:import aapl marketdata --format json --time-key time aapl.json --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /home/istvan/.rvm/gems/ruby-1.9.2-p320/gems/heroku-2.30.3/lib/heroku/client.rb:129.
importing aapl.json...
  uploading 364 bytes...
  imported 10 entries from aapl.json.
done.
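Before running an import like the one above, it can help to check that every line of the file is a JSON object that carries the key passed to --time-key. Here is a small Python sketch of such a check; the function name check_json_lines is hypothetical, and the YYYY-MM-DD date format is simply what my sample data uses, not a statement about which formats td accepts.

```python
import json
from datetime import datetime


def check_json_lines(text, time_key="time"):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except ValueError as error:
            problems.append((number, "invalid JSON: %s" % error))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        if time_key not in record:
            problems.append((number, "missing key '%s'" % time_key))
            continue
        try:
            # Assumption: dates look like 2012-08-20, as in the sample data.
            datetime.strptime(record[time_key], "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append((number, "unparseable '%s' value" % time_key))
    return problems


if __name__ == "__main__":
    sample = '{"time":"2012-08-20", "high":"665.15"}\n{"high":"648.19"}\n'
    for number, problem in check_json_lines(sample):
        print("line %d: %s" % (number, problem))
```

For a real file, read its contents first and pass the text in, for example check_json_lines(open("aapl.json").read()).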

Check that the data import was successful. You should see the Count column showing the number of rows loaded into the table:

$ heroku td tables --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /home/istvan/.rvm/gems/ruby-1.9.2-p320/gems/heroku-2.30.3/lib/heroku/client.rb:129.
+----------+------------+------+-------+--------+
| Database | Table      | Type | Count | Schema |
+----------+------------+------+-------+--------+
| aapl     | marketdata | log  | 10    |        |
+----------+------------+------+-------+--------+
1 row in set
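If you repeat these steps often, the create, import and verify commands shown above can be assembled by a small script. The sketch below only builds and prints the same command lines used in this post; the helper names heroku_td and load_steps are my own invention and not part of the heroku or td tooling.

```python
import shlex


def heroku_td(app, *args):
    """Build the argument list for one 'heroku td' call against the given app."""
    return ["heroku", "td"] + list(args) + ["--app", app]


def load_steps(app, database, table, json_file):
    """Return the commands from this walkthrough, in order, as argument lists."""
    return [
        heroku_td(app, "db:create", database),
        heroku_td(app, "table:create", database, table),
        heroku_td(app, "table:import", database, table,
                  "--format", "json", "--time-key", "time", json_file),
        heroku_td(app, "tables"),
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is changed by accident.
    for step in load_steps("boiling-tundra-1234", "aapl", "marketdata", "aapl.json"):
        print(" ".join(shlex.quote(part) for part in step))
```

To actually execute a step, each argument list could be passed to subprocess.run(step, check=True) from the Python standard library.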

Now we are ready to run HiveQL (td query) against the dataset. This particular query returns the time, high and low values sorted by the daily high in descending order, so the highest prices of AAPL stock appear at the top (the time value is based on the UNIX epoch):

$ heroku td query -d aapl -w "SELECT v['time'] as time, v['high'] as high, v['low'] as low FROM marketdata ORDER BY high DESC" --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /home/istvan/.rvm/gems/ruby-1.9.2-p320/gems/heroku-2.30.3/lib/heroku/client.rb:129.
Job 757853 is queued.
Use 'heroku td job:show 757853' to show the status.
queued...
  started at 2012-08-21T21:06:54Z
  Hive history file=/mnt/hive/tmp/617/hive_job_log_617_201208212106_269570447.txt
  Total MapReduce jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks determined at compile time: 1
  In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=
  In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=
  In order to set a constant number of reducers:
    set mapred.reduce.tasks=
  Starting Job = job_201207250829_556135, Tracking URL = http://domU-12-31-39-0A-56-11.compute-1.internal:50030/jobdetails.jsp?jobid=job_201207250829_556135
  Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=10.211.85.219:8021 -kill job_201207250829_556135
  2012-08-21 21:07:21,455 Stage-1 map = 0%,  reduce = 0%
  2012-08-21 21:07:28,480 Stage-1 map = 100%,  reduce = 0%
  2012-08-21 21:07:37,965 Stage-1 map = 100%,  reduce = 100%
  Ended Job = job_201207250829_556135
  OK
  MapReduce time taken: 42.536 seconds
  finished at 2012-08-21T21:07:53Z
  Time taken: 53.781 seconds
Status     : success
Result     :
+------------+--------+--------+
| time       | high   | low    |
+------------+--------+--------+
| 1345417200 | 665.15 | 649.90 |
| 1345158000 | 648.19 | 638.81 |
| 1344898800 | 638.61 | 630.21 |
| 1345071600 | 636.76 | 630.50 |
| 1344985200 | 634.00 | 625.75 |
| 1344812400 | 630.00 | 623.25 |
| 1344294000 | 625.00 | 618.04 |
| 1344380400 | 623.88 | 617.10 |
| 1344553200 | 621.76 | 618.70 |
| 1344466800 | 621.73 | 617.80 |
+------------+--------+--------+
10 rows in set
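The epoch values in the time column can be turned back into readable dates. A minimal Python sketch, where the function name epoch_to_utc is hypothetical and not part of td:

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert a UNIX epoch timestamp (in seconds) to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    # First row of the query result above.
    print(epoch_to_utc(1345417200))  # 2012-08-19T23:00:00+00:00
```

The first row comes out as 23:00 UTC on 2012-08-19, one hour before midnight on 2012-08-20, which suggests that the dates were interpreted in a UTC+1 local time zone during import.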

Finally you can delete the marketdata table:

$ heroku td table:delete aapl marketdata --app boiling-tundra-1234
 !    DEPRECATED: Heroku::Client#deprecate is deprecated, please use the heroku-api gem.
 !    DEPRECATED: More information available at https://github.com/heroku/heroku.rb
 !    DEPRECATED: Deprecated method called from /home/istvan/.rvm/gems/ruby-1.9.2-p320/gems/heroku-2.30.3/lib/heroku/client.rb:129.
Do you really delete 'marketdata' in 'aapl'? [y/N]: y
Table 'aapl.marketdata' is deleted.

More details on how to use Treasure Data Hadoop can be found in the Treasure Data documentation.

Published at DZone with permission of Istvan Szegedi, author and DZone MVB. (source)
