A SMALL Cross-section of BIG Data
Big data is a term applied to data sets whose size is beyond the ability
of commonly used software tools to capture, manage, and process the
data within a tolerable elapsed time. Big data sizes are a constantly
moving target currently ranging from a few dozen terabytes to many
petabytes of data in a single data set.
IDC estimated the digital universe to be around 1.8 zettabytes by 2011.
How big is a zettabyte? It's one billion terabytes. The current world
population is about 7 billion - that is, even if you gave a 250 GB hard
disk to each person on earth, the combined storage (about 1.75
zettabytes) still wouldn't be sufficient.
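The comparison can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units and one 250 GB disk per person, and the variable names are illustrative.

```python
# Decimal (SI) units, as commonly used for storage capacities (assumption).
GB = 10**9
TB = 10**12
ZB = 10**21

digital_universe_bytes = 1.8 * ZB   # IDC estimate quoted in the text
world_population = 7 * 10**9        # population figure quoted in the text
disk_per_person_bytes = 250 * GB    # assumed: one 250 GB disk per person

total_disk_bytes = world_population * disk_per_person_bytes

print(ZB // TB)                                   # terabytes in one zettabyte
print(total_disk_bytes / ZB)                      # combined disk capacity, in ZB
print(total_disk_bytes < digital_universe_bytes)  # do the disks fall short?
```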
Many sources contribute to this flood of data...
1. The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Facebook hosts approximately 10 billion photos taking up one petabyte of storage.
3. Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
4. The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
5. The Large Hadron Collider near Geneva will produce about 15 petabytes of data per year.
6. Every day, people create the equivalent of 2.5 quintillion bytes of data
from sensors, mobile devices, online transactions and social networks.
Facebook, Yahoo! and Google found themselves collecting data on an
unprecedented scale. They were the first massive companies collecting
tons of data from millions of users.
They quickly overwhelmed traditional data systems such as Oracle and
MySQL. Even the best, most expensive vendors using the biggest hardware
could barely keep up, and certainly couldn’t give them the tools to
analyze their influx of data effectively.
In the early 2000s they developed new techniques such as MapReduce, BigTable and the Google File System
to handle their big data. Initially these techniques were kept
proprietary. But they realized that making the concepts public, while
keeping the implementations hidden, would benefit them: more people
would contribute to the ideas, and the graduates they hired would have a
good understanding of them before joining.
Around 2004/2005 Facebook, Yahoo! and Google started sharing research papers describing their big data technologies.
In 2004 Google published the research paper "MapReduce: Simplified Data Processing on Large Clusters".
MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map function
that processes a key/value pair to generate a set of intermediate
key/value pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key. Many real world tasks
are expressible in this model, as shown in this paper.
Programs written in this functional style are automatically parallelized
and executed on a large cluster of commodity machines. The run-time
system takes care of the details of partitioning the input data,
scheduling the program's execution across a set of machines, handling
machine failures, and managing the required inter-machine communication.
This allows programmers without any experience with parallel and
distributed systems to easily utilize the resources of a large
distributed system. Google's implementation of MapReduce runs on a large
cluster of commodity machines and is highly scalable.
A typical MapReduce computation processes many terabytes of data on
thousands of machines. Programmers find the system easy to use: hundreds
of MapReduce programs have been implemented and upwards of one thousand
MapReduce jobs are executed on Google's clusters every day.
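The map and reduce functions described above can be illustrated with the classic word-count example. This is a minimal single-process sketch, not Google's or Hadoop's implementation: `run_mapreduce` is a hypothetical driver that stands in for the distributed run-time, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_word_count(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver; a real run-time spreads this over machines."""
    intermediate = defaultdict(list)
    # Map phase, followed by grouping of intermediate values by key ("shuffle").
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: one reduce call per distinct intermediate key.
    return {key: reduce_fn(key, values)
            for key, values in sorted(intermediate.items())}

documents = [("doc1", "big data is big"), ("doc2", "data about data")]
result = run_mapreduce(documents, map_word_count, reduce_word_count)
print(result)
```

Because each map call only looks at one input pair and each reduce call only looks at one key, the run-time is free to execute them on different machines, which is what makes the model easy to parallelize.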
Doug Cutting, who worked on Nutch, an open-source search technology
project now managed through the Apache Software Foundation, read this
paper and another paper published by Google on its distributed file
system [GFS]. He figured out that GFS would solve Nutch's storage needs
and MapReduce would solve the scaling issues the project had
encountered, and implemented both. The GFS implementation for Nutch was
named the Nutch Distributed Filesystem [NDFS].
NDFS and the MapReduce implementation in Nutch were applicable beyond
the realm of search, and in February 2006 they moved out of Nutch to
form an independent subproject of Lucene called Hadoop. NDFS became
HDFS [Hadoop Distributed File System], which is an implementation of
GFS. Around the same time Yahoo! extended its support for Hadoop and
hired Doug Cutting.

At a very high level, this is how HDFS works. Say we have a 300 MB file.
[Hadoop also does really well with files of terabytes and petabytes.]
The first thing HDFS does is split the file into blocks. The default
block size on HDFS right now is 128 MB, so we end up with two blocks of
128 MB and another of 44 MB. HDFS then makes 'n' copies/replicas of each
of these blocks ['n' is configurable - say 'n' is three] and stores the
replicas on different DataNodes of the HDFS cluster. We also have a
single NameNode, which keeps track of the replicas and the DataNodes.
The NameNode knows where a given replica resides. Whenever it detects
that a replica is corrupted [DataNodes keep running checksums on
replicas] or that the corresponding DataNode is down, it finds out where
else that replica is in the cluster and tells other nodes to copy it
until the block is back to 'n' replicas. The NameNode is a single point
of failure. To avoid that, we can have a secondary NameNode that is kept
in sync with the primary, so that when the primary is down the secondary
can take control. The Hadoop project is currently working on
implementing distributed NameNodes.
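The block splitting and replication described above can be sketched in a few lines. This is a toy model under stated assumptions, not HDFS code: the round-robin placement, the DataNode names and all three function names are hypothetical, and real HDFS uses a more elaborate placement policy.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is an assumption made for illustration.
    Requires replication <= len(datanodes).
    """
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def handle_datanode_failure(placement, failed, datanodes):
    """NameNode-style bookkeeping: re-replicate blocks that lost a replica."""
    live = [node for node in datanodes if node != failed]
    for holders in placement.values():
        if failed in holders:
            holders.remove(failed)
            # Copy the block to a live node that does not hold it yet
            # (assumes at least one such node exists).
            spare = next(node for node in live if node not in holders)
            holders.append(spare)
    return placement

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical cluster
blocks = split_into_blocks(300)                    # the 300 MB file from the text
placement = place_replicas(len(blocks), datanodes, replication=3)
placement = handle_datanode_failure(placement, "dn3", datanodes)
print(blocks)
print(placement)
```

After the simulated failure of `dn3`, every block is back to three replicas on three different live nodes, which is the invariant the NameNode tries to maintain.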
In 2006 Google published another paper, "Bigtable: A Distributed Storage System for Structured Data".
Bigtable is a distributed storage system for managing structured data
that is designed to scale to a very large size, petabytes of data across
thousands of commodity servers. Many projects at Google store data in
Bigtable, including web indexing, Google Earth, and Google Finance.
These applications place very different demands on Bigtable, both in
terms of data size (from URLs to web pages to satellite imagery) and
latency requirements (from backend bulk processing to real-time data
serving). Despite these varied demands, Bigtable has successfully
provided a flexible, high-performance solution for all of these Google
products. This
paper describes the simple data model provided by Bigtable, which
gives clients dynamic control over data layout and format, and describes
the design and implementation of Bigtable.
BigTable maps two arbitrary string values (row key and column key) and a
timestamp (hence a three-dimensional mapping) into an associated
arbitrary byte array. It is not a relational database and can be better
defined as a sparse, distributed, multi-dimensional sorted map.
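The three-dimensional mapping described above can be sketched as a small in-memory structure. This is only an illustration of the data model, assuming a single process: the class name `SparseSortedMap` and its methods are hypothetical and say nothing about how Bigtable or HBase store data on disk or distribute it.

```python
import bisect

class SparseSortedMap:
    """Toy (row key, column key, timestamp) -> bytes map kept in sorted order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key tuple -> bytes; only written cells take space

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for the cell, or None if it is empty."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_row(self, row):
        """Yield (column, timestamp, value) for one row, in sorted key order."""
        i = bisect.bisect_left(self._keys, (row,))
        while i < len(self._keys) and self._keys[i][0] == row:
            _, column, neg_ts = self._keys[i]
            yield column, -neg_ts, self._values[self._keys[i]]
            i += 1

table = SparseSortedMap()
table.put("com.cnn.www", "contents:", 3, b"<html>v3")
table.put("com.cnn.www", "contents:", 5, b"<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")

latest = table.get("com.cnn.www", "contents:")
row_cells = list(table.scan_row("com.cnn.www"))
print(latest)
print(row_cells)
```

Keeping the keys sorted is what makes range scans over a row (or a run of adjacent rows) cheap, and storing only the cells that were written is what makes the map sparse.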
In essence, the BigTable paper describes how to build a distributed data store on top of GFS.
Apache HBase, from the Hadoop ecosystem, is an implementation of
BigTable. HBase is a distributed, column-oriented database that uses
HDFS for its underlying storage and supports both batch-style
computation using MapReduce and point queries.
In 2007 Amazon published a research paper, "Dynamo: Amazon’s Highly Available Key-value Store".
Dynamo is a highly available key-value storage system that some of
Amazon’s core services use to provide an “always-on” experience. Apache
Cassandra, written in Java, brings together Dynamo's fully distributed
design and BigTable's data model. It is a NoSQL solution that was
initially developed by Facebook, open sourced in 2008, and powered
Facebook's Inbox Search feature until late 2010. In fact much of the
initial development work on Cassandra was performed by two Dynamo
engineers recruited to Facebook from Amazon. However, Facebook abandoned
Cassandra in late 2010 when it built the Facebook Messaging platform on
HBase.
Besides using BigTable's way of modeling data, Cassandra has properties
such as eventual consistency, the Gossip protocol, and a master-master
way of serving read and write requests, all inspired by Amazon's Dynamo.
One of the important properties, eventual consistency, means that given
a sufficiently long period of time over which no changes are sent, all
updates can be expected to propagate eventually through the system and
all the replicas will be consistent.
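Eventual consistency can be illustrated with a small simulation. This is a deliberately simplified sketch, not how Cassandra or Dynamo work internally: it assumes last-write-wins conflict resolution with integer timestamps and a fixed ring in which each replica pushes its state to its neighbour, and the `Replica` class and helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Replica:
    name: str
    value: Optional[str] = None
    timestamp: int = 0  # 0 means "never written"

    def apply(self, value, timestamp):
        # Last-write-wins (assumption): keep an update only if it is newer.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

def gossip_round(replicas):
    """Each replica pushes its current state to its neighbour on a ring."""
    for i, replica in enumerate(replicas):
        peer = replicas[(i + 1) % len(replicas)]
        peer.apply(replica.value, replica.timestamp)

def is_consistent(replicas):
    """True when every replica holds the same (value, timestamp) pair."""
    return len({(r.value, r.timestamp) for r in replicas}) == 1

replicas = [Replica(f"node{i}") for i in range(5)]
replicas[0].apply("v1", timestamp=1)  # an older write lands on node0
replicas[3].apply("v2", timestamp=2)  # a newer write lands on node3

# No further writes arrive; keep exchanging state until the replicas agree.
rounds = 0
while not is_consistent(replicas):
    gossip_round(replicas)
    rounds += 1
print(rounds, [r.value for r in replicas])
```

Right after the two writes the replicas disagree, so a read could return either value or nothing depending on which node serves it. Once the writes stop, repeated exchanges are enough for every replica to settle on the newest value.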
I used the term 'NoSQL' when talking about Cassandra. NoSQL (sometimes
expanded to "not only SQL") is a broad class of database management
systems that differ from the classic model of the relational database
management system (RDBMS) in some significant ways. These data stores
may not require fixed table schemas, usually avoid join operations, and
typically scale horizontally.
The name "NoSQL" was in fact first used by Carlo Strozzi in 1998 as the
name of a file-based database he was developing. Ironically, it is a
relational database, just one without a SQL interface. The term
resurfaced in 2009 when Eric Evans used it to name the current surge in
non-relational databases.
There are four categories of NoSQL databases.
1. Key-value stores : This is based on Amazon's Dynamo paper.
2. ColumnFamily / BigTable clones : Examples are HBase, Cassandra
3. Document Databases : Examples are CouchDB, MongoDB
4. Graph Databases : Examples are AllegroGraph, Neo4j
As per Marin Dimitrov, the following are the use cases for NoSQL
databases - in other words, the cases where relational databases do not
perform well.
1. Massive Data Volumes
2. Extreme Query Volume
3. Schema Evolution
With NoSQL, we get advantages such as massive scalability, high
availability, lower cost (than competitive solutions at that scale),
predictable elasticity and schema flexibility.
For application programmers, the major difference between relational databases and Cassandra is its data model,
which is based on BigTable. The Cassandra data model is designed for
distributed data on a very large scale. It trades ACID-compliant data
practices for important advantages in performance, availability, and
operational manageability.
If you want to compare Cassandra with HBase, see the BigTable model comparison [9] and the HBase vs Cassandra debate [15] in the references below.
References :
[1]: MapReduce: Simplified Data Processing on Large Clusters
[2]: Bigtable: A Distributed Storage System for Structured Data
[3]: Dynamo: Amazon’s Highly Available Key-value Store
[4]: The Hadoop Distributed File System
[5]: ZooKeeper: Wait-free coordination for Internet-scale systems
[6]: An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
[7]: Cassandra - A Decentralized Structured Storage System
[8]: NOSQL Patterns
[9]: BigTable Model with Cassandra and HBase
[10]: LinkedIn Tech Talks : Apache Hadoop - Petabytes and Terawatts
[11]: O'Reilly Webcast: An Introduction to Hadoop
[12]: Google Developer Day : MapReduce
[13]: WSO2Con 2011 - Panel: Data, data everywhere: big, small, private, shared, public and more
[14]: Scaling with Apache Cassandra
[15]: HBase vs Cassandra: why we moved
[16]: A Brief History of NoSQL

Comments
Goel Yatendra replied on Thu, 2012/03/15 - 1:24pm
1 zettabyte ~ 1.1 trillion GB
250GB * 7 billion people = ~ 1.750 trillion GB