Monday, July 22, 2013

Tackling it right – The ‘Big Data’

Big Data is the new buzzword for the exponential growth of data, both structured and unstructured. It refers to huge data sets characterized by larger volumes, greater variety and complexity, generated at a higher velocity than any organization has faced before. Organizations now have to manage this ever-increasing volume, variety and velocity of information, and the goal of any organization with access to large data collections should be to harness the most relevant data and use it for optimized decision making. Until recently, organizations were limited to analysing a subset of their data, and were constrained to traditional analysis techniques even when their data processing platforms held large volumes of data. But the question is: is there a technology with which organizations can make use of most of this exploding data?

A number of technology advancements now enable organizations to make the most of big data. These technologies support the ability to collect large amounts of data, and also the ability to understand and harness the most relevant data to support decision making. Industries such as telecommunications, manufacturing, retail, banking and finance, airlines, healthcare providers, etc., are in great need of big data analytics for their unstructured data sources. The data sources from these industries typically fall under the following categories: transaction data, machine-generated data, human-generated data, biometric data, and web and social media data.

The use of big data will become a key basis of competition and growth for individual firms in the near future. From the standpoint of competitiveness and the potential capture of value, all companies need to take big data seriously. In most industries, established competitors and new entrants alike will leverage data-driven strategies to innovate, compete, and capture value from deep and up-to-real-time information.


According to Gartner, Big Data was a major technology initiative in 2012. Big Data is already a $12 billion market and is expected to grow at 40% every year. Telecom, retail, utilities, manufacturing, financial services, airlines and other industries are evaluating this technology, and some have already implemented innovative solutions by mining unstructured information from social media.

Hadoop – Tackling Big Data the way it should be

Apache Hadoop is an open-source software framework, licensed under the Apache License 2.0, that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework. This enables applications to work with thousands of computationally independent machines and petabytes of data. The entire Apache Hadoop “platform” is now commonly considered to consist of the Hadoop kernel, MapReduce and the Hadoop Distributed File System (HDFS), as well as a number of related projects, including Apache Hive, Apache HBase, and others.
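To make the idea of many small fragments of work concrete, below is a minimal sketch in the spirit of the classic WordCount example, written against the org.apache.hadoop.mapreduce API. The class names and the input/output paths are illustrative, and minor details vary between Hadoop releases; treat it as a sketch rather than a drop-in program.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper works on one fragment (input split) of the data,
    // wherever the framework schedules it, and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups the pairs by word and the reducer sums the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}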

Hadoop is written in the Java programming language and is an Apache top-level project being built and used by a global community of contributors. Hadoop and its related projects (Hive, HBase, Zookeeper, and so on) have many contributors from across the ecosystem. Though Java code is most common, any programming language can be used with "streaming" to implement the "map" and "reduce" parts of the system.

What is HDFS (Hadoop Distributed File System)?

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, which together form the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication, and clients use remote procedure calls (RPC) to communicate with the namenode and the datanodes.
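As a small illustration of how a client talks to HDFS, here is a hedged sketch using the FileSystem Java API; the namenode address, the configuration property and the file path are assumptions that depend on the cluster (older releases use fs.default.name instead of fs.defaultFS).

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // The client asks the namenode (over RPC) where blocks live,
        // then streams block data to and from the datanodes.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // illustrative address
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them across datanodes.
        Path file = new Path("/tmp/hello.txt"); // illustrative path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read it back; each block is fetched from a datanode holding a replica.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}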

An advantage of using HDFS is data awareness between the job tracker and the task trackers: the job tracker schedules map and reduce tasks on task trackers with an awareness of the data location. For example, if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker will schedule node B to perform map or reduce tasks on (a, b, c) and node A to perform map or reduce tasks on (x, y, z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer, which can have a significant impact on job-completion times for data-intensive jobs. When Hadoop is used with other file systems, this advantage is not always available.

HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations.


The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s examine this statement in more detail:

Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.

Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure. It is also worth examining the applications for which HDFS does not work so well. While this may change in the future, these are areas where HDFS is not a good fit today:
1.) Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

2.) Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory (see the sketch after this list). While storing millions of files is feasible, billions is beyond the capability of current hardware.

3.) Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file. (These might be supported in the future, but they are likely to be relatively inefficient.)
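Here is the back-of-the-envelope estimate from point 2 above, written out as a tiny sketch; the 150-bytes-per-object figure is only a rule of thumb, and the class name is illustrative.

public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        final long bytesPerObject = 150;              // rough rule of thumb per file, directory or block
        long files = 1000000L;                        // one million files...
        long blocksPerFile = 1;                       // ...each small enough to occupy a single block
        long objects = files + files * blocksPerFile; // one file entry plus one block entry each
        long megabytes = objects * bytesPerObject / 1000000;
        System.out.println("approximate namenode heap needed: " + megabytes + " MB"); // prints 300
    }
}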


HDFS Concepts
Blocks
A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. This is generally transparent to the filesystem user who is simply reading or writing a file of whatever length. However, there are tools to do with filesystem maintenance, such as df and fsck, that operate on the filesystem block level.

HDFS too has the concept of a block, but it is a much larger unit: 64 MB by default. Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage. When unqualified, the term “block” refers to a block in HDFS.
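To see the block size in practice, a short sketch along these lines can query both the cluster default and the size recorded for a particular file; the path is illustrative and the exact API names have shifted slightly across Hadoop versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // Cluster-wide default (64 MB out of the box; many installations raise it).
        System.out.println("default block size: " + fs.getDefaultBlockSize() + " bytes");

        // The block size is recorded per file, so individual files can use a different value.
        FileStatus status = fs.getFileStatus(new Path("/tmp/hello.txt")); // illustrative path
        System.out.println("block size of this file: " + status.getBlockSize() + " bytes");
    }
}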

Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made to be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.

A quick calculation shows that if the seek time is around 10ms, and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
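The same calculation, spelled out as a throwaway sketch under the stated assumptions (10 ms seek, 100 MB/s transfer, seeks allowed to cost about 1% of transfer time):

public class BlockSizeEstimate {
    public static void main(String[] args) {
        double seekSeconds = 0.010;   // assumed average seek time: 10 ms
        double transferMBps = 100.0;  // assumed sustained transfer rate: 100 MB/s
        double seekFraction = 0.01;   // target: seek time is ~1% of transfer time
        // seekSeconds = seekFraction * (blockMB / transferMBps)  =>  solve for blockMB
        double blockMB = seekSeconds * transferMBps / seekFraction;
        System.out.println("block size of roughly " + blockMB + " MB"); // prints 100.0
    }
}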

This argument shouldn’t be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. In fact, it would be possible, if unusual, to store a single file on an HDFS cluster whose blocks filled all the disks in the cluster.

Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Simplicity is something to strive for in all systems, but it is especially important for a distributed system in which the failure modes are so varied. The storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata concerns (blocks are just chunks of data to be stored; file metadata such as permissions information does not need to be stored with the blocks, so another system can handle metadata orthogonally).

Furthermore, blocks fit well with replication for providing fault tolerance and availability.

Several issues need to be addressed to capture the full potential of big data. Policies related to privacy, security, intellectual property, and even liability will need to be addressed in a big data world. Organizations need not only to put the right talent and technology in place but also to structure workflows and incentives to optimize the use of big data. Access to data is critical: companies will increasingly need to integrate information from multiple data sources, often from third parties, and the incentives have to be in place to enable this.



Equally important, the desired business impact must drive an integrated approach to data sourcing, model building, and organizational transformation. That’s how you avoid the common trap of starting with the data and simply asking what it can do for you. Leaders should invest sufficient time and energy in aligning managers across the organization in support of the mission.




1 comment:

  1. The overall Big Data market is estimated to be $11.3 billion, a growth of 58% over a revised 2011 figure of $7.2 billion. The Big Data market is projected to grow rapidly, with a growth of 59% in 2013, and 54% in 2014, before the market slows a little.
