Same is the case with Hadoop! Past few months we saw new releases on the table.
I would like to suggest the following amazing article by Jim, explaining the Evolving APIs.
A Good read and a MUST read!
Releases?? Here it goes:
- 1.x series is the continuation of 0.20 release series
- 2.x series is the continuation of 0.23 release series
The 1.x series are the stable ones and majority of commercial distribution market carry forward the same. This release is more matured in terms of authentication.
In security parlance, there are two major buzzwords:
- authentication means that a person who is performing an operation is the person who claims to be.
- authorization means what a person can do with a particular file or directory.
Previously, it was just the authorization supported, giving a wide way for malicious users spoofing and gaining a free entry pass into the network cluster! The 1.x series now is more stable regarding this with the introduction of Kerberos, open-source network authentication protocol, to authenticate the user. Now, here there is a quite good handshake between these two open source fellows.
Kerberos, authenticates that the person is the one who he/she claims to be to Hadoop and then it is the work of Hadoop to do authorization checking on the permissions. It was in 0.20.20x series, the whole Kerberos was implemented.
To be known that its not so easy with this whole lotta Kerberos thingy, too many stuffs happening deep underneath! :)
The 2.x series and 0.22 series are not greatly stable as 1.x are! But, they are good in their own ways. They own lots of new features.
1. MapReduce2: The classic runtime is replaced by a new runtime MapReduce2 also known as YARN [Yet Another Resource Negotiator], which is basically a resource management system running in your cluster in the distributed environment.
2. High Availability: The namenode is no more a SPOF, Single Point Of Failure. There are back up nodes to avoid it. Meaning, there are couple of namenode active at any moment.
3. HDFS Federation: As the namenode is the main repository to keep whole lots of metadata and to imagine a cluster can be of so many nodes and its storage is huge equally! So, this metadata which has the information about each file and the block, becomes a limiting factor in terms of memory! Thus, we have a pair of namenodes, managing part of the filesystem namespace.
MapReduce2 or YARN:
If you have a large cluster it is very obvious that you can process large data sets. However, there was an issue with the classic runtime in the previous releases, wherein a cluster of >4000 machines hit a scalability issue. It is costly to have more number of small clusters. Thus, we need large number of machines in a cluster. This architecture will tend to increase the hardware utilization, need for other paradigms other than MapReduce, more agile.
To YARN is a beauty ;)
Actually, this architectural design is now more close to the Google Map Reduce paper. Reference: http://research.google.com/archive/mapreduce.html
Its an awesome Next Generation Hadoop Framework and MapReduce is just one application in that. There are many such applications written upon YARN.
- Open MPI
- Apache Giraph
- Apache Hama
- Apache HBase [Generic Co porocessors] etc.
So, it does makes sense that it can handle other paradigms and also different versions of MapReduce. We all know how complex MapReduce can be and solving every problem in terms of MR is an uphill task!
The classic runtime of previous release had a JobTracker, which had two important functions.
- Resource Management
- Job Scheduling and Job Monitoring
But, this architecture is a little bit of re-design of the previous dealing with the splitting of JobTracker into two major separate components.
- Resource Manager: It will compute the resources required for the applications globally and assign/ manage them.
- Per-Application Application Master: The scheduling and monitoring done by JobTracker previously is done per application by the application master.
These two are separate demons.What does application mean here?
Application is nothing but a job in the classic runtime.
The Application master will talk to resource manager for the resources in the cluster ie. also termed as "Container" (i/o, CPU, Memory etc.). Container is the home where all the application processes run and its the heavy duty of the node managers to check on the applications not to rob more resources from the container.
- The MapReduce program will run the job at the client node in the client JVM.
- The Job will now get the application from the resource manager and then copy all those resources to the file system which is shared by the cluster and submit it back to the resource manager.
- It is now the duty of the resource manager to start the container at the node manager and that launching the Application Master.
- The Application Master will initialise the job at the node manager and get the input splits from the shared file system of the cluster, allocating resources at the resource manager node.
- The Application Master will start the container at the node manager where task are to be run, in parallel.
- The Node Manager will start the task JVM at the node manager node, where the YarnChild will retrieve all the resources from the shared file system for it to run the corresponding map or reduce task at the node manager node.
However, its not so simple as listed above. Too many stuffs happening!
Well, this was just an introduction and developers who wanna play with YARN, do have a look at Kitten ( released just 5days before the day of writing this), Link: https://github.com/cloudera/kitten, having a set of Java libraries and Lua-based configuration files that handle configuring, launching, and monitoring YARN applications, allowing developers to spend most of their time writing the logic of an application, not its configuration.
Will see more of all these in the coming posts.
Until then, Happy Learning!! :):)