The Enterprise Use of Hadoop (v1)


Internet Research Group
November 2011




About The Internet Research Group
www.irg-intl.com

The Internet Research Group (IRG) provides market research and
market strategy services to product and service vendors. IRG services
combine the formidable and unique experience and perspective of the
two principals: John Katsaros and Peter Christy, each an experienced
industry veteran. The overarching mission of IRG is to help clients
make faster and better decisions about product strategy, market entry,
and market development. Katsaros and Christy published a book on
high tech business strategy, Getting It Right the First Time (Praeger,
2005); see www.gettingitrightthefirsttime.com.




     © 2011 Internet Research Group – all rights reserved




                                              Table of Contents


1.  Overview
2.  Background
3.  What Is Hadoop?
4.  Why Is Embedded Processing So Important?
5.  MapReduce Analytics
6.  What is “Big” Data?
7.  The Major Components of Hadoop
8.  The Hadoop Application Ecology
9.  Cloud Economics
10. Why Is Hadoop So Interesting?
11. What Are the Interesting Sources of Big Data?
12. How Important Is Big Data Analytics?
13. Things You Don’t Want to Do with Hadoop
14. Horizontal Hadoop Applications
15. Summary

1.    Overview
 The last decade has seen remarkable, continuing progress in computer
 technology, systems and implementations, as evidenced by Web and Internet
 systems of unprecedented scale such as Google and Facebook.

 Although most enterprise CIOs yearn to take advantage of the performance and
 cost efficiencies that these pioneering Web systems deliver, the enterprise path
 to Cloud computing is intrinsically complex: existing applications must be
 brought forward, and organizational structure and skill sets must evolve.
 Achieving those economies will therefore take some time.

 Hadoop, an Apache Foundation Open Source project, represents a way for
 enterprise IT to take advantage of Cloud and Internet capabilities sooner when it
 comes to the storage and processing of huge (by enterprise IT standards)
 amounts of data. Hadoop provides a means of implementing storage systems
 with Internet economics and of doing large-scale processing on that data. It is
 not a general replacement for existing enterprise data management and analysis
 systems, but for many companies it is an attractive complement to those systems,
 as well as a way of making use of the large-volume data sets that are increasingly
 available. The Yahoo! Hadoop team argues that within five years, 50% of
 enterprise data will be stored in Hadoop – they might well be right.




2.    Background
 The last decade has been remarkable for the advances in computer technology
 and systems:

 • There has been continuing, relentless “Moore’s Law” progress in semiconductor
   technology (CPUs, DRAM and now SSD).

 • There has been even faster progress in disk price/performance.

 • Google demonstrated the remarkable performance and cost-effectiveness that
   could be achieved using mega-scale systems built from commodity technology,
   as well as pioneering the application and operational adaptations needed to take
   advantage of such systems.

 The compounded impact of these improvements is seen most dramatically in
 various Cloud offerings (starting with Google or Amazon Web Services) where
 the cost of storage or computation is dramatically (orders of magnitude) cheaper
 than in typical enterprise computing.




 Hadoop presents an opportunity for enterprises to take advantage of Cloud
 economics immediately, especially in terms of storage, as we will sketch below.




3.     What Is Hadoop?


 Hadoop builds on a massive file system (Google File System or GFS) and a
 parallel application model (MapReduce) originally developed at Google. Google
 has an unbelievable number of servers compared to typical large enterprises (in
 all likelihood more than a million). Search is a relatively easy task to parallelize:
 many search requests can be run in parallel because they only have to be loosely
 synchronized (the same search done at the same time doesn’t have to get exactly
 the same response). GFS was developed as a file system for applications running
 at this scale. MapReduce was developed as a means of performing data analysis
 using these resources.

 Hadoop is an Open Source reimplementation of GFS and MapReduce. Google’s
 systems run a unique and proprietary software “stack,” so no one else could run
 Google’s MapReduce even if Google permitted it. Hadoop is designed to run on
 a conventional Linux stack. Google has encouraged the development of
 Hadoop, recognizing the value in a broader population of people trained in the
 methodology and tools. Much of the development of Hadoop has been driven by
 Yahoo!, which is also a large Hadoop user, internally running more than 40,000
 servers in Hadoop clusters.

 Operationally we talk about a Hadoop “cluster”: a set of servers dedicated to a
 particular instance of Hadoop, ranging from just a few servers to the clusters of
 more than 4,000 servers in use at Yahoo!.

 Today a typical Hadoop server might have two sockets and a total of 8 cores
 (two 4-core processors), 48 GB of DRAM, and 8-16 directly attached disks,
 typically cost-per-byte optimized (e.g., 2 or 3 TB 3.5” SATA drives). When
 implemented with high-volume commodity technology, the majority of the
 server cost is the disk drive complement, and each server will have 20-50 TB of
 storage.








4.     Why Is Embedded Processing So Important?
 A useful way of thinking about a Hadoop cluster is as a very high-capacity
 storage system built with “Cloud” economics (using inexpensive, high-capacity
 drives), with substantial, general-purpose, embedded processing power. The
 importance of having local processing capability becomes clear as soon as you
 realize that even when using the fastest LAN links (10 Gbits/sec), it takes 40
 minutes to transfer the contents of a single 3 TB disk drive. Big data sets may be
 remarkably inexpensive to store, but they aren’t easy to move around, even
 within a data center using high-speed network connections.1
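
 The arithmetic behind that figure is worth making explicit. The short Python
 sketch below (a back-of-the-envelope calculation using only the numbers quoted
 in this section and in footnote 1) reproduces it:

    # Back-of-the-envelope data-movement arithmetic, using the figures quoted
    # in this section and footnote 1 (3 TB drive, 10 Gbit/s LAN, ~1 Gbit/s of
    # sustained read bandwidth per SATA drive).
    DISK_TB = 3
    LAN_GBPS = 10
    DRIVE_GBPS = 1

    disk_bits = DISK_TB * 1e12 * 8                # one drive's contents, in bits
    minutes = disk_bits / (LAN_GBPS * 1e9) / 60   # time to push it over the LAN
    print("One 3 TB drive over a 10 Gbit/s link: %.0f minutes" % minutes)  # -> 40

    # Aggregate *local* read bandwidth, by contrast:
    node_gbps = 12 * DRIVE_GBPS      # a 12-disk node: roughly 10+ Gbit/s
    cluster_gbps = 50 * node_gbps    # a 50-node cluster: ~500 Gbit/s
    print("Node: ~%d Gbit/s, cluster: ~%d Gbit/s" % (node_gbps, cluster_gbps))

 The comparison makes the design pressure obvious: a cluster can read its disks
 locally far faster than any practical network could ship their contents elsewhere.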

 In the past we brought the data to the program: we ran a program on a server,
 opened a file on a network-based storage system, brought the file to the server,
 processed the data, and then probably wrote new data back out to the storage
 system.2 With Hadoop, this is reversed, reflecting the fact that it’s much easier
 to move the program to the data than the data to the program. Modern servers
 and large-capacity disks enable affordable storage systems of enormous
 capacity, but you have to process the data in place when possible; you can’t
 move it.

 Some “Cloud” storage applications require only infrequent access to the stored
 data. Almost all the activity in a Cloud-based backup service is writing the
 protected data to the disks; reading the stored data is done only infrequently
 (although being able to read a backup file when needed is the key value
 proposition). The same is true, to an only slightly lesser degree, when pictures,
 videos or music are stored in the Cloud. Only a small percentage of that data is
 ever accessed, and that small fraction can be (and is) cached on higher-
 performance, more expensive storage. Analysis is very different: data will be
 processed repeatedly as it is used to answer diverse questions. PC backup or
 picture storage are write-once/read-never applications. Analysis is write-
 once/read-many.




 1 A modern SATA drive can transfer data between the disk and server at a
 sustained rate of about 1 Gbit/second. On a 12-disk node, the aggregate read rate
 could be up to about 10 Gbits/second. On a 50-node cluster the total aggregate
 read rate could approach 500 Gbits/second.
 2 A 10 MB file (80 Mbits) can be transmitted in about 0.1 second over a
 Gbit/second link.





5.     MapReduce Analytics
 The use of Hadoop has created a lot of interest in large-scale analytics (the
 MapReduce part of Hadoop). This kind of “divide and conquer” methodology
 has been used in numerical analysis for many years as a way of dealing with
 problems known to be bigger than the biggest machine available. MapReduce is
 an elegant way of structuring this kind of algorithm: it isolates the
 analyst/programmer from the specific details of managing the pieces of work
 that get distributed to the available machines, and it imposes an application
 architecture that doesn’t depend on any specific structuring of the data.
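
 To make the structure concrete, here is a minimal Python sketch of the
 map/shuffle/reduce flow: a toy word count run on in-memory data, not Hadoop
 itself. The point to notice is that the analyst writes only the two small
 functions; everything between them belongs to the framework:

    # Toy, in-memory illustration of the MapReduce flow (word count).
    # In real Hadoop the framework performs the grouping ("shuffle") and
    # spreads the map and reduce calls across the machines of the cluster.
    from collections import defaultdict

    def map_fn(record):
        # map: emit a (key, value) pair for every word in one input record
        for word in record.split():
            yield word.lower(), 1

    def reduce_fn(word, values):
        # reduce: combine all the values emitted for one key
        return word, sum(values)

    records = ["the quick brown fox", "the lazy dog", "the fox"]

    groups = defaultdict(list)          # the "shuffle": group values by key
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    print(sorted(reduce_fn(k, v) for k, v in groups.items()))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]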

 As Hadoop evolves, the basic ideas will be adapted to more computer system
 architectures than just the commodity scale-out systems used by mega Web
 properties like Google and Yahoo!. A MapReduce computation cluster could also
 be used with data stored in a high-performance, high-bandwidth storage
 subsystem, which would make a lot of sense if the data were already stored there
 for other reasons. We expect many such variants of the original architecture to
 emerge over time.




6.     What is “Big” Data?
 Google and Yahoo! use MapReduce for purposes that are unique to extremely
 large-scale systems (e.g., search optimization, ad delivery optimization). That
 fact notwithstanding, almost all companies have important sources of big data.
 For example:

 • World-wide markets: The Internet enables any company, large or small, to
   interact with the billions of people world-wide who are connected. Modern
   logistics services such as UPS, FedEx and USPS let any company sell to global
   markets. A successful company has to think in terms of millions of people and
   build business systems capable of running at that scale. That’s big data.

 • Machine-generated data: IT infrastructure (the stuff that all modern companies
   run on) comprises thousands of devices (PCs and mobile devices, servers,
   storage, network and security devices), all of which are capable of generating a
   stream of log data summarizing normal and abnormal activity. In aggregate this
   stream is a rich source of business-process, operational, security and regulatory-
   compliance analysis. That’s big data.

 We’ll talk more later about how big data will impact enterprises over time.








7.     The Major Components of Hadoop
 The core of the Hadoop Open Source project is HDFS (the Hadoop Distributed
 File System) and MapReduce: reimplementations of the Google File System and
 of Google’s MapReduce, as defined by the public documents Google has
 published. HDFS is the basic file storage, capable of storing a large number of
 large files. MapReduce is the programming model by which data is analyzed
 using the processing resources within the cluster.

 HDFS has these goals:

 • Build very large data management systems from commodity parts, where
   component failure has to be assumed and dealt with as part of the basic design of
   the data system (in contrast to most enterprise storage, where great attention is
   paid to making the components reliable).

 • A file system capable of storing files that are huge by historical standards (many
   files larger than 1 GB).

 • A file system optimized on the assumption that files typically change by data
   being appended (e.g., additions to a log file) rather than by the modification of
   internal pieces of the file.

 • A system whose file system APIs reflect the needs of these new applications.

 The motivation for MapReduce is more complicated. Today’s world of
 commodity servers and inexpensive disk drives is completely different from
 yesterday’s world of enterprise IT. Historically, analytics ran on expensive,
 high-end servers and used expensive, enterprise-class disk drives. Buying a new
 database server is a big decision and comes with software licensing costs, as
 well as incremental operational needs (e.g., a database administrator). In the
 Hadoop world, adding more nodes isn’t a major capital expense (< $10K per
 server) and doesn’t trigger new software licenses or additional administrators.
 MapReduce was designed for such an environment, where adding more
 hardware is a perfectly reasonable approach to problem solving: progress is
 more easily made by adding hardware than by thinking hard about the problem
 and carefully crafting an optimized solution.

 MapReduce allows the scale of the solution to grow with minimal need for the
 analyst or programmer to adapt the program. The MapReduce infrastructure
 distributes the work among the available processors (the application
 programmer shouldn’t have to worry about how big the actual cluster is),
 monitors progress, restarts work that stalls or fails, and balances the work among
 the available nodes.
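
 As an illustration of how little of the programmer’s code changes as the cluster
 grows, consider a word count written for Hadoop Streaming, the standard
 Hadoop facility that runs ordinary stdin/stdout programs as map and reduce
 tasks. This is a hedged sketch rather than production code, but the same two
 functions run unchanged whether the job is given four nodes or four thousand:

    #!/usr/bin/env python
    # wordcount.py -- mapper and reducer for a Hadoop Streaming word count.
    # Streaming feeds each task its input on stdin and collects its output
    # from stdout; reducer input arrives sorted by key, courtesy of the
    # framework's shuffle.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word.lower())

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").split("\t")
            if word != current:            # key changed: emit the finished count
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

 A job like this is submitted with the streaming jar and its -input, -output,
 -mapper and -reducer options (the data paths in any such invocation are specific
 to the installation). Nothing in the scripts themselves mentions the size of the
 cluster.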

 Using MapReduce is by no means simple, nor something that many business
 analysts would ever want to do directly (or be able to do, for that matter).
 Google has required all of its summer college interns to develop a MapReduce
 application; even these excellent programmers, with the benefit of experienced
 colleagues, found it difficult to do. Google has supported the Hadoop effort in
 part so that it could be used in education to train more knowledgeable
 individuals. This isn’t a reason why the impact of MapReduce will be limited,
 however; it’s the motivation for a software ecology, built on top of HDFS and
 MapReduce, that makes the capability usable by a broader population.




8.     The Hadoop Application Ecology
 It is useful to think of Hadoop as a platform, like Windows or Linux. Although
 Hadoop was developed based on the specific Google application model, the
 interest in Hadoop has spawned the creation of a set of related programs. The
 Apache Open Source project includes these:

 • HBase – the Hadoop database

 • Pig – a high-level language for writing data analysis programs

 • Hive – a data warehouse system

 • Mahout – a set of machine learning tools

 There is other software that can be licensed for use with Hadoop, including:

 • MapR – an alternative storage system

 • Cloudera – management tools

 Database, BI and ETL vendors also offer software for use with Hadoop:

 • Various database and BI vendors offer connectors that make it easy to control an
   attached Hadoop system and import the output of Hadoop processing.

 • Similarly, the “ETL” vendors offer connectors so that Hadoop can be a source (or
   sink) of data in that process.




9.     Cloud Economics
 Now that we have introduced Hadoop and HDFS, we can explain in more detail
 what we mean by “Cloud Economics.” If you walked into any modern large-
 scale data center (Google, Yahoo!, Facebook, Microsoft) you would see
 something that looked very different from an enterprise data center. The
 enterprise data center would be filled with top-of-the-line systems (“enterprise
 class”); the Web data center would be filled with something looking more like
 what you would find in a thrift shop: inexpensive “white box” servers and
storage. As the cost of the hardware continues to decline, lots of other aspects of
IT have to evolve as well (e.g., software licensing fees, operational costs) if the
value of the hardware is to be exploited. The basic system and application
design have to evolve as well.

Perhaps most importantly, Google recognized that in large-scale computing
failure and reliability had to be reconsidered. In large-scale systems, failure was
the rule rather than the exception (with millions of disk drives, disk drive failure
is ongoing). In large-scale systems, it makes more sense to achieve reliability
and availability in the higher-level system (e.g., HDFS) and application (e.g.,
MapReduce) layers, not by using “enterprise-class” subsystems (e.g., RAID disk
systems). HDFS is a very reliable data storage subsystem because the file data is
replicated and distributed. MapReduce anticipates that individual tasks will fail
on an ongoing basis (because of some combination of software and hardware
failure) and manages the redistribution of work so that the overall job is
completed in a timely manner.

Consider how this plays out with storage. In the enterprise data center, the data
would likely be stored on a shared SAN (storage area network) system.
Because this SAN system held key data for multiple important applications, the
performance, reliability and availability of the SAN system were critical:

 • Redundant disks would be included and the data spread among multiple disks so
   that the loss of one or more of the disks wouldn’t result in the loss or
   unavailability of the data.

 • Critical elements (the controller, SAN switches and links, power supplies, host
   adaptors) would all be replicated for availability.

 • Because the SAN system supported multiple applications concurrently,
   performance was critical, so the fastest (and most expensive) disks would be
   used, with the fastest (and most expensive) connection to the controller. The
   controller would include substantial RAM memory for caching.

In contrast, a Hadoop cluster of 50 nodes has 500-1,000 high-capacity, low-cost
disk drives:

 • The disks are selected to be cost optimized – lowest cost per byte stored, least
   expensive attachment directly to a server (no storage network, no Fibre Channel
   attachment).

 • The design has no redundancy at the disk level (no RAID configurations, for
   example). The HDFS file system assumes that disk failures are an ongoing issue
   and achieves high-availability data storage despite that.

Cloud economics of storage means cost-effective drives directly connected to a
commodity server with the least expensive connection. In a typical Hadoop
node, 70% of the cost of the node is the cost of the disk drives, and those drives
are the most cost-effective available. It can’t get any cheaper than that! A
Hadoop cluster is a large data store built in the most cost-effective way possible.








10.  Why Is Hadoop So Interesting?

As we noted earlier, big data is relevant to essentially all businesses, because of
Internet markets and machine-generated log data if for no other reason. For
dealing with big data, Hadoop is unquestionably a game changer:

 • It enables the purchase and operation of very large-scale data systems at a much
   lower cost because it uses cost-optimized, commodity components. Adding
   500 TB of Hadoop storage is clearly affordable; adding 500 TB to a conventional
   database system is often not.

 • Hadoop is designed to move programs to data rather than the inverse. This basic
   paradigm change is required to deal with modern, high-volume disk drives.

 • Because of the Open Source community, Hadoop software is available for free
   rather than at current database and data warehouse licensing fees. The use of
   Hadoop isn’t free, but the elimination of traditional license fees makes it much
   easier to experiment (for example).

 • Because Hadoop is designed to deal with unstructured data and unconstrained
   analysis (in contrast to a data warehouse that is carefully schematized and
   optimized), it doesn’t require database-trained individuals (e.g., a DBA),
   although it clearly requires specialized expertise.

 • The MapReduce model minimizes the parallel programming experience and
   expertise required. To program MapReduce directly requires significant
   programming skills (Java and functional programming), but the basic Hadoop
   model is designed to use scaling (adding more nodes, especially as they get
   cheaper) as an alternative to careful parallel programming optimization.

Hadoop represents a quite dramatic rethinking of “data processing,” driven by
the increasing volumes of data being processed and by the opportunity to follow
the pioneering work of Google and others and use commodity system technology
at a much lower price. The downside of taking a new approach is twofold:

 • There is a lot of learning to do. Conventional data management and analysis is a
   large and well-established business. There are many analysts trained to use
   today’s tools, and a lot of technical people trained for the installation, operation
   and maintenance of those tools.

 • The “whole” product still needs some fleshing out. A modern data storage and
   analysis product is complicated: tools to import data, tools to transform data, job
   and work management systems, data management and migration tools, and
   interfaces to and integration with popular analysis tools, for a beginning. By this
   standard Hadoop is still pretty young.

From a product perspective, the biggest deficiencies are probably the adaptation
of Hadoop for operation in an IT shop rather than a large Web property, and the
development of tools that let users with more diverse skill sets (e.g., business
analysts) make productive use of Hadoop-stored data. All of this is being
worked on, either within the Open Source community or as licensed proprietary
software to use in conjunction with Hadoop. Companies providing Hadoop
support and training services have discovered a vibrant and growing market. The
usability of Hadoop (both operationally and as a data tool) is improving all the
time. But it does have some more distance to go.




11.  What Are the Interesting Sources of Big Data?

There is no single answer. Different companies will have different data sets of
interest. Some of the common ones are these:

 • Integration of data from multiple data warehouses: Most big companies have
   multiple data warehouses, in part because each may have a particular divisional
   or departmental focus, and in part to keep each at an affordable and manageable
   level, since traditional data warehouses all tend to increase in cost rapidly beyond
   some capacity. Hadoop provides a tool by which multiple sources of data can be
   brought together and analyzed, and by which a bigger “virtual” data warehouse
   can be built at a more affordable price.

 • Clickstream data: A Web server can record (in a log file) every interaction with a
   browser/user that it sees. This detailed record of use provides a wealth of
   information on the optimality of the Web site design, the Web system
   performance and, in many cases, the underlying business. For the large Web
   properties, clickstream analysis is the source of fundamental business analysis
   and optimization. For other businesses, the value depends on the importance of
   Web systems to the business.

 • Log file data: Modern systems, subsystems, applications and devices can all be
   configured to log “interesting” events. This is potentially the source of a wealth
   of information, ranging from security/attack analysis to design correctness and
   system utilization.

 • Information scraped from the Web: Every year more information, and more
   valuable information, is captured on the Web. Much of it is free to use for the
   cost of finding it and recording it.

 • Specific sources such as Twitter produce high-volume data streams potentially of
   value.

Where is all this information coming from? There are multiple sources, but to
begin with, consider:

 • The remarkable and continuing growth of the World Wide Web. The Web has
   become a remarkable repository of data to analyze, in terms of all the contents of
   the Web and, for a Web site owner, the ability to analyze in complete detail the
   use of the Web site.

 • The remarkable and growing use of mobile devices. The iPhone has only existed
   for the last five years (and the iPad for less), but this kind of mobile device has
   transformed how we deal with information. Specifically, more and more of what
   we do is in text form (not written notes, nor faxes nor phone calls) and available
   for analysis one way or another. Mobile devices also provide valuable (albeit
   frightening) information on where and when the data was created or read.

 • The rise of “social sites.” There has been rapid growth in Facebook and
   LinkedIn, as well as in customer feedback on specific products (both at shared
   sites like Amazon and on vendor sites). Twitter provides remarkable volumes of
   data with possible value.

 • The rise of customer self-service. Increasingly, companies look for ways for the
   community of their customers to help one another through shared Web sites.
   This not only is cost-effective, but generally leads to the earlier identification
   and solution of problems, as well as providing a rich source of data by which to
   assess customer sentiment.

 • Machine-generated data. Almost all “devices” are now implemented in software
   and capable of providing log data (see above) if it can be used productively.




12.  How Important Is Big Data Analytics?

The only reasonable answer is “it depends.” Big data evangelists note that
analytics can be worth 5% on the bottom line, meaning that intelligent analysis
of business data can have a significant impact on the financial performance of a
company. Even if that is true, for most companies most of the value will come
from the analysis of “small data,” not from the incremental analysis of data that
is infeasible to store or analyze today.

At the same time, there are unquestionably companies for which the ability to do
big data analytics is essential (Google and Facebook for example). These
companies depend on the analysis of huge data sets (clickstream data from large
on-line user communities) that cannot be practically processed by conventional
database and analytics solutions.

For most companies, big data analytics can provide incremental value, but the
larger value will come from small data analytics. Over time, the value will
clearly shift toward big data as more and more interesting data becomes
available. There will almost always be value in the analysis of some very large
data set. The more important question, from a business optimization perspective,
is whether the highest-priority requirement is based on big data, or whether there
is still untapped, higher-value “small” data.








13.  Things You Don’t Want to Do with Hadoop

The Hadoop source distribution is “free,” and a bright Java programmer can
often “find” enough “underutilized” servers with which to stand up a small
Hadoop cluster and do experiments. While it is true that almost every large
company has real large-data problems of interest, to date much of the
experimentation has been on problems that don’t really need this class of
solution. Here is a partial list of workloads that probably don’t justify going to
Hadoop:

 • Non-huge problems. Keep in mind that a relatively inexpensive server can easily
   have 10 cores and 200 GB of memory. 200 GB is a lot of data, especially in a
   compressed format (Microsoft PowerPivot – an Excel plugin – can process
   100 M rows of compressed fact-table data in 5% of that storage). Having the
   data resident in DRAM makes a huge difference (PowerPivot can scan 1 trillion
   rows a minute with fewer than 5 cores). If a compressed version of the data can
   reside in a large commodity server’s memory, that is almost certain to be a better
   solution (there are various in-memory database tools available).

 • Only for data storage. Although Hadoop (HDFS) is a good very-large-scale
   storage system, unless you want to do embedded processing there are often
   better storage solutions around.

 • Only for parallel processing. If you just want to manage the parallel execution of
   a distributed Java program, there are simpler and better solutions.

 • For HPC applications. Although a larger Hadoop cluster (100 nodes) comprises a
   significant amount of processing power and memory, you wouldn’t want to run
   traditional HPC algorithms (e.g., FEA, CFD, geophysical data analysis) in
   Hadoop rather than in a more traditional computational grid.




14.  Horizontal Hadoop Applications

With some very bright programmers, Hadoop can be applied wherever the
functional model fits. One generic class of applications is characterized by the
following:

 • Data sets that are clearly too large to economically store in traditional enterprise
   storage systems (SAN and NAS) and clearly too large to analyze with traditional
   data warehouse systems.

 • Think of Hadoop as a place where you can now store the data economically,
   and use MapReduce to preprocess the data and extract data that can be fed into
   an existing data warehouse and analyzed, along with existing structured data,
   using existing analysis tools (see the sketch after this list).


 • Alternatively, you can think of Hadoop as a way of “extending” the capacity of
   an existing storage and analysis system when the cost of the solution starts to
   grow faster than linearly as more capacity is required.

 • As introduced above, Hadoop can also be used as a means of integrating data
   from multiple existing warehouse and analysis systems.
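
As a concrete sketch of the preprocessing pattern above, a map-only job might
reduce raw Web-server logs to the few fields a warehouse bulk load needs. The
log layout and the fields kept here are invented for illustration:

    # extract.py -- hypothetical map-only preprocessing step: turn raw
    # Apache-style access-log lines into a narrow CSV ready for a data
    # warehouse bulk load. The log format and field choices are illustrative.
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 7:
            continue                        # skip malformed records
        ip = parts[0]
        timestamp = parts[3].lstrip("[")    # e.g. 10/Oct/2011:13:55:36
        url = parts[6]
        if url.startswith("/product/"):     # keep only the traffic of interest
            print("%s,%s,%s" % (ip, timestamp, url))

Run under Hadoop Streaming with no reduce step, a script like this can turn
terabytes of raw logs into an extract small enough for a conventional data
warehouse to load and join against existing structured data.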




15.  Summary

Technology progress and the increased use of the Internet are creating very large
new data sets of increasing value to businesses, and are making the processing
power to analyze them affordable. The size of these data sets suggests that
exploiting them may well require a new category of data storage and analysis
systems, with different system architectures (parallel processing capability
integrated with high-volume storage) and different use of components (more
exploitation of the same high-volume, commodity components that are used
within today’s very large Web properties). Hadoop is a strong candidate for
such a new processing tier. Beyond its origins in Google’s designs, the fact that
it is today a vibrant Open Source effort suggests that additional disruptive impact
on product pricing and the economics of use is possible.





The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Analyst Report : The Enterprise Use of Hadoop

  • 1. The Enterprise Use of Hadoop (v1) Internet Research Group November 2011 About The Internet Research Group www.irg-intl.com The Internet Research Group (IRG) provides market research and market strategy services to product and service vendors. IRG services combine the formidable and unique experience and perspective of the two principals: John Katsaros and Peter Christy, each an experienced industry veteran. The overarching mission of IRG is to help clients make faster and better decisions about product strategy, market entry, and market development. Katsaros and Christy published a book on high tech business strategy Getting It Right the First Time – Praeger, 2005 www.gettingitrightthefirsttime.com. © 2011 Internet Research Group – all rights reserved
  • 2.
  • 3. IRG 2011: The Enterprise Use of Hadoop (v1) page i Table of Contents 1. Overview .................................................................................................................... 1 2. Background ............................................................................................................... 1 3. What Is Hadoop? ...................................................................................................... 2 4. Why Is Embedded Processing So Important? ....................................................... 3 5. MapReduce Analytics ............................................................................................... 4 6. What is “Big” Data? .................................................................................................. 4 7. The Major Components of Hadoop ......................................................................... 5 8. The Hadoop Application Ecology............................................................................ 6 9. Cloud Economics ..................................................................................................... 6 10. Why Is Hadoop So Interesting? ............................................................................... 8 11. What Are the Interesting Sources of Big Data? ..................................................... 9 12. How Important Is Big Data Analytics? .................................................................. 10 13. Things You Don’t Want to Do with Hadoop .......................................................... 11 14. Horizontal Hadoop Applications ........................................................................... 11 15. Summary ................................................................................................................. 12 © 2011 Internet Research Group – all rights reserved
  • 4.
Hadoop, an Apache Foundation Open Source project, represents a way for enterprise IT to take advantage of Cloud and Internet capabilities sooner when it comes to the storage and processing of huge (by enterprise IT standards) amounts of data. Hadoop provides a means of implementing storage systems with Internet economics and doing large-scale processing on that data. It is not a general replacement for existing enterprise data management and analysis systems, but for many companies it is an attractive complement to those systems, as well as a way of making use of the large-volume data sets that are increasingly available. The Yahoo! Hadoop team argues that in five years, 50% of enterprise data will be stored in Hadoop – they might well be right.

2. Background

The last decade has been remarkable for the advances in computer technology and systems:

• There has been continuing, relentless "Moore's Law" progress in semiconductor technology (CPUs, DRAM and now SSD).
• There has been even faster progress in disk price/performance.
• Google demonstrated the remarkable performance and cost-effectiveness that can be achieved using mega-scale systems built from commodity technology, and pioneered the application and operational adaptations needed to take advantage of such systems.

The compounded impact of these improvements is seen most dramatically in the various Cloud offerings (starting with Google and Amazon Web Services), where the cost of storage or computation is dramatically (orders of magnitude) lower than in typical enterprise computing.
Hadoop presents an opportunity for enterprises to take advantage of Cloud economics immediately, especially in terms of storage, as we sketch below.

3. What Is Hadoop?

Hadoop builds on a massive file system (the Google File System, or GFS) and a parallel application model (MapReduce), both originally developed at Google. Google operates an enormous number of servers compared to typical large enterprises (in all likelihood more than a million). Search is a relatively easy task to parallelize: many search requests can be run in parallel because they only have to be loosely synchronized (the same search done at the same time doesn't have to return exactly the same response). GFS was developed as a file system for applications running at this scale, and MapReduce as a means of performing data analysis using these resources.

Hadoop is an Open Source reimplementation of GFS and MapReduce. Google's systems run a unique and proprietary software "stack," so no one else could run Google's MapReduce even if Google permitted it; Hadoop is designed to run on a conventional Linux stack. Google has encouraged the development of Hadoop, recognizing the value of a broader population of people trained in the methodology and tools. Much of the development of Hadoop has been driven by Yahoo!, which is also a large Hadoop user, running more than 40,000 servers in Hadoop clusters internally.

Operationally we talk about a Hadoop "cluster": a set of servers dedicated to a particular instance of Hadoop, ranging from just a few machines to the clusters of more than 4,000 servers in use at Yahoo!.

Today a typical Hadoop server might have two sockets (a total of 8 cores from two 4-core processors), 48 GB of DRAM, and 8-16 directly attached disks, typically cost-per-byte optimized (e.g., 2 or 3 TB 3.5" SATA drives). When a server is implemented with high-volume commodity technology, the majority of its cost is the disk drive complement, and each server will have 20-50 TB of storage.
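To give a concrete sense of what programming against Hadoop looks like, below is a minimal sketch of the canonical word-count job written against the standard Apache Hadoop Java API (org.apache.hadoop.mapreduce). The map function emits a count of 1 for every word it sees and the reduce function sums the counts for each word; the framework distributes the work across however many servers the cluster has. The class names are ours for illustration, and a production job would add configuration and error handling.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with the standard hadoop jar command, the same program runs unchanged on a single test machine or a 4,000-node cluster; that indifference to cluster size is the essential property of the model.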
4. Why Is Embedded Processing So Important?

A useful way of thinking about a Hadoop cluster is as a very high-capacity storage system built with "Cloud" economics (using inexpensive, high-capacity drives) and with substantial, general-purpose, embedded processing power. The importance of having local processing capability becomes clear as soon as you realize that even over the fastest LAN links (10 Gbits/sec), it takes 40 minutes to transfer the contents of a single 3 TB disk drive. Big data sets may be remarkably inexpensive to store, but they aren't easy to move around, even within a data center using high-speed network connections. [1]

In the past we brought the data to the program: we ran a program on a server, opened a file on a network-based storage system, brought the file to the server, processed the data, and then probably wrote new data back out to the storage system. [2] With Hadoop this is reversed, reflecting the fact that it is much easier to move the program to the data than the data to the program. Modern servers and large-capacity disks enable affordable storage systems of enormous capacity, but you have to process the data in place when possible; you can't afford to move it.

Some "Cloud" storage applications require only infrequent access to the stored data. Almost all the activity in a Cloud-based backup service is writing the protected data to the disks; reading the stored data happens only infrequently (albeit being able to read a backup file when needed is the key value proposition). The same is true, to an only slightly lesser degree, when pictures, videos or music are stored in the Cloud: only a small percentage of that data is ever accessed, and that small fraction can be (and is) cached on higher-performance, more expensive storage. Analysis is very different; data will be processed repeatedly as it is used to answer diverse questions. PC backup or picture storage are write-once/read-never applications. Analysis is write-once/read-many.

[1] A modern SATA drive can transfer data between the disk and server at a sustained rate of about 1 Gbit/second. On a 12-disk node, the aggregate read rate could be up to about 10 Gbits/second; on a 50-node cluster, the total aggregate read rate could approach 500 Gbits/second.
[2] A 10 MB file (80 Mbits) can be transmitted in under 0.1 second over a Gbit/second link.
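The 40-minute figure above is easy to verify with back-of-envelope arithmetic. The throwaway snippet below is a minimal sketch of the calculation, assuming the link runs at its full nominal rate with no protocol overhead.

```java
// Back-of-envelope transfer-time check, assuming the nominal link rate
// is fully usable (real links lose some capacity to protocol overhead).
public class TransferTime {
  public static void main(String[] args) {
    double driveBits = 3e12 * 8;   // one 3 TB drive = 24 Tbits
    double lanRate = 10e9;         // 10 Gbit/s LAN link
    double seconds = driveBits / lanRate;
    // Prints: 2400 seconds (40 minutes) to drain a single drive
    System.out.printf("%.0f seconds (%.0f minutes)%n", seconds, seconds / 60.0);
  }
}
```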
5. MapReduce Analytics

The use of Hadoop has created a lot of interest in large-scale analytics (the MapReduce part of Hadoop). This kind of "divide and conquer" methodology has been used in numerical analysis for many years as a way of dealing with problems known to be bigger than the biggest machine available. MapReduce is an elegant way of structuring this kind of algorithm: it isolates the analyst/programmer from the specific details of managing the pieces of work that get distributed to the available machines, and it is an application architecture that doesn't depend on any specific structuring of the data.

As Hadoop evolves, the basic ideas will be adapted to more computer system architectures than just the commodity scale-out systems used by mega Web properties like Google and Yahoo!. A MapReduce computation cluster could also be used with data stored in a high-performance, high-bandwidth storage subsystem, which would make a lot of sense if the data were already stored there for other reasons. We expect many such variants of the original architecture to emerge over time.

6. What is "Big" Data?

Google and Yahoo! use MapReduce for purposes that are unique to extremely large-scale systems (e.g., search optimization, ad delivery optimization). That fact notwithstanding, almost all companies have important sources of big data. For example:

• World-wide markets: The Internet enables any company, large or small, to interact with the billions of people world-wide who are connected, and modern logistics services such as UPS, FedEx and USPS let any company sell to global markets. A successful company has to think in terms of millions of people and build business systems capable of running at that scale. That's big data.
• Machine-generated data: IT infrastructure (the stuff that all modern companies run on) comprises thousands of devices (PCs and mobile devices, servers, storage, network and security devices), all of which are capable of generating a stream of log data summarizing normal and abnormal activity. In aggregate this stream is a rich source of business process, operational, security and regulatory compliance analysis. That's big data.

We'll talk more later about how big data will impact enterprises over time.
7. The Major Components of Hadoop

The core of the Hadoop Open Source project is HDFS (the Hadoop Distributed File System), a reimplementation of the Google File System, together with MapReduce as defined by the public papers Google has published. HDFS provides the basic file storage, capable of holding a large number of large files. MapReduce is the programming model by which data is analyzed using the processing resources within the cluster.

HDFS has these goals:

• Build very large data management systems from commodity parts, where component failure is assumed and dealt with as part of the basic design of the data system (in contrast to most enterprise storage, where great attention is paid to making the components reliable).
• Provide a file system capable of storing files that are huge by historical standards (many files larger than 1 GB).
• Optimize for files that typically change by having data appended to them (e.g., additions to a log file) rather than by the modification of internal pieces of the file.
• Offer file system APIs that reflect the needs of these new applications.

The motivation for MapReduce is more complicated. Today's world of commodity servers and inexpensive disk drives is completely different from yesterday's world of enterprise IT. Historically, analytics ran on expensive, high-end servers and used expensive, enterprise-class disk drives; buying a new database server was a big decision that came with software licensing costs as well as incremental operational needs (e.g., a database administrator). In the Hadoop world, adding more nodes isn't a major capital expense (< $10K per server) and doesn't trigger new software licenses or additional administrators. MapReduce was designed for such an environment, where adding hardware is a perfectly reasonable approach to problem solving: progress is more easily made by adding nodes than by carefully crafting an optimized solution. MapReduce allows the scale of the solution to grow with minimal need for the analyst or programmer to adapt the program. The MapReduce infrastructure distributes the work among the available processors (the application programmer shouldn't have to worry about how big the actual cluster is), monitors progress, restarts work that stalls or fails, and balances the work among the available nodes.

Using MapReduce is by no means simple, nor something that many business analysts would ever want to do directly (or be able to do, for that matter). Google has required all of its summer college interns to develop a MapReduce application; even these excellent programmers, with the benefit of experienced colleagues, found it difficult. Google has supported the Hadoop effort in part so that it could be used in education to train more knowledgeable individuals. This isn't a reason why the impact of MapReduce will be limited, however; it is the motivation for a software ecology built on top of HDFS and MapReduce that makes the capability usable by a broader population. A minimal sketch of the HDFS file API appears below.
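For a sense of what HDFS looks like from an application's point of view, here is a minimal sketch using the standard org.apache.hadoop.fs classes. The NameNode address and file path are hypothetical; a real deployment would pick up its configuration from the cluster's config files rather than setting it in code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally supplied by core-site.xml
    // (the property name shown is the one used by recent Hadoop releases).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    // Write a (large, append-oriented) file into the distributed store.
    Path path = new Path("/logs/events.log");
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("event-1\n");
    }

    // Read it back; the client streams blocks from whichever nodes hold them.
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
  }
}
```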
8. The Hadoop Application Ecology

It is useful to think of Hadoop as a platform, like Windows or Linux. Although Hadoop was developed based on the specific Google application model, the interest in Hadoop has spawned the creation of a set of related programs. The Apache Open Source project includes these:

• HBase – the Hadoop database
• Pig – a high-level language for writing data analysis programs
• Hive – a data warehouse system
• Mahout – a set of machine learning tools

There is other software that can be licensed for use with Hadoop, including:

• MapR – an alternative storage system
• Cloudera – management tools

Various database, BI and "ETL" vendors also offer software for use with Hadoop:

• Database and BI vendors offer connectors that make it easy to control an attached Hadoop system and import the output of Hadoop processing.
• Similarly, the ETL vendors offer connectors so that Hadoop can be a source (or sink) of data in that process.

9. Cloud Economics

Now that we have introduced Hadoop and HDFS, we can explain in more detail what we mean by "Cloud economics." If you walked into any modern large-scale Web data center (Google, Yahoo!, Facebook, Microsoft) you would see something that looked very different from an enterprise data center. The enterprise data center would be filled with top-of-the-line ("enterprise class") systems; the Web data center would be filled with something looking more like what you would find in a thrift shop: inexpensive "white box" servers and storage. As the cost of the hardware continues to decline, lots of other aspects of IT have to evolve as well (e.g., software licensing fees, operational costs) if the value of the hardware is to be exploited.
The basic system and application design have to evolve as well. Perhaps most importantly, Google recognized that in large-scale computing, failure and reliability had to be reconsidered. In large-scale systems failure is the rule rather than the exception (with millions of disk drives, disk drive failure is ongoing), so it makes more sense to achieve reliability and availability in the higher-level system (e.g., HDFS) and application (e.g., MapReduce) layers than by using "enterprise-class" subsystems (e.g., RAID disk systems). HDFS is a very reliable data storage subsystem because the file data is replicated and distributed. MapReduce anticipates that individual tasks will fail on an ongoing basis (because of some combination of software and hardware failure) and manages the redistribution of work so that the overall job completes in a timely manner.

Consider how this plays out with storage. In the enterprise data center, the data would likely be stored on a shared SAN (storage area network) system. Because this SAN system holds key data for multiple important applications, its performance, reliability and availability are critical:

• Redundant disks would be included and the data spread among multiple disks, so that the loss of one or more of the disks wouldn't result in the loss or unavailability of the data.
• Critical elements (the controller, SAN switches and links, power supplies, host adaptors) would all be replicated for availability.
• Because the SAN system supports multiple applications concurrently, performance is critical, so the fastest (and most expensive) disks would be used, with the fastest (and most expensive) connection to the controller, and the controller would include substantial RAM for caching.

In contrast, a Hadoop cluster of 50 nodes has 500-1,000 high-capacity, low-cost disk drives:

• The disks are selected to be cost optimized – lowest cost per byte stored, with the least expensive attachment, directly to a server (no storage network, no Fibre Channel attachment).
• The design has no redundancy at the disk level (no RAID configurations, for example). The HDFS file system assumes that disk failures are an ongoing issue and achieves high-availability data storage despite that, by replicating each block across several nodes (see the sketch below).

Cloud economics of storage means cost-effective drives directly connected to a commodity server with the least expensive connection. In a typical Hadoop node, 70% of the cost of the node is the cost of the disk drives, and those drives are the most cost-effective available. It can't get any cheaper than that! A Hadoop cluster is a large data store built in the most cost-effective way possible.
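As an illustration of how that block replication is exposed to applications, here is a minimal sketch using the standard HDFS Java API. Three-way replication is the conventional HDFS default; the file path shown is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for new files; 3 copies is the usual HDFS setting,
    // giving availability without RAID on any individual node.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be adjusted per file after the fact, e.g. to keep
    // extra copies of a heavily read data set (hypothetical path).
    fs.setReplication(new Path("/data/clickstream/2011-11-01.log"), (short) 5);
  }
}
```

The design choice is the point: durability lives in the file system software, so the underlying drives can be the cheapest available.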
10. Why Is Hadoop So Interesting?

As we noted earlier, big data is relevant to essentially all businesses, if for no other reason than Internet markets and machine-generated log data. For dealing with big data, Hadoop is unquestionably a game changer:

• It enables the purchase and operation of very large-scale data systems at a much lower cost because it uses cost-optimized, commodity components. Adding 500 TB of Hadoop storage is clearly affordable; adding 500 TB to a conventional database system is often not.
• Hadoop is designed to move programs to data rather than the inverse. This basic paradigm change is required to deal with modern, high-volume disk drives.
• Because of the Open Source community, Hadoop software is available for free rather than at current database and data warehouse licensing fees. The use of Hadoop isn't free, but the elimination of traditional license fees makes it much easier to experiment, for example.
• Because Hadoop is designed to deal with unstructured data and unconstrained analysis (in contrast to a data warehouse that is carefully schematized and optimized), it doesn't require database-trained individuals (e.g., a DBA), although it clearly requires specialized expertise.
• The MapReduce model minimizes the parallel programming experience and expertise required. Programming MapReduce directly requires significant skills (Java and functional programming), but the basic Hadoop model is designed to use scaling (adding more nodes, especially as they get cheaper) as an alternative to hand-optimizing parallel programs.

Hadoop represents a quite dramatic rethinking of "data processing," driven by the increasing volumes of data being processed and by the opportunity to follow the pioneering work of Google and others in using commodity system technology at a much lower price. The downside of taking a new approach is twofold:

• There is a lot of learning to do. Conventional data management and analysis is a large and well-established business; there are many analysts trained to use today's tools, and many technical people trained in the installation, operation and maintenance of those tools.
• The "whole" product still needs some fleshing out. A modern data storage and analysis product is complicated: tools to import data, tools to transform data, job and work management systems, data management and migration tools, and interfaces to and integration with popular analysis tools, for a beginning. By this standard Hadoop is still pretty young.

From a product perspective, the biggest deficiencies are probably the adaptation of Hadoop for operation in an IT shop rather than a large Web property, and the development of tools that let users with more diverse skill sets (e.g., business analysts) make productive use of Hadoop-stored data.
All of this is being worked on, either within the Open Source community or as licensed proprietary software to use in conjunction with Hadoop. Companies providing Hadoop support and training services have discovered a vibrant and growing market. The usability of Hadoop (both operationally and as a data tool) is improving all the time, but it does have some distance still to go.

11. What Are the Interesting Sources of Big Data?

There is no single answer; different companies will have different data sets of interest. Some of the common ones are these:

• Integration of data from multiple data warehouses: Most big companies have multiple data warehouses, in part because each may have a particular divisional or departmental focus, and in part to keep each at an affordable and manageable size, since traditional data warehouses all tend to increase in cost rapidly beyond some capacity. Hadoop provides a tool by which multiple sources of data can be brought together and analyzed, and by which a bigger "virtual" data warehouse can be built at a more affordable price.
• Clickstream data: A Web server can record (in a log file) every interaction with a browser/user that it sees. This detailed record of use provides a wealth of information on the optimality of the Web site design, the Web system performance and, in many cases, the underlying business. For the large Web properties, clickstream analysis is the source of fundamental business analysis and optimization; for other businesses, the value depends on the importance of Web systems to the business. (A sketch of a clickstream mapper appears after this list.)
• Log file data: Modern systems, subsystems, applications and devices can all be configured to log "interesting" events. This information is potentially the source of a wealth of insight, ranging from security/attack analysis to design correctness and system utilization.
• Information scraped from the Web: Every year more information, and more valuable data, is captured on the Web. Much of it is free to use for the cost of finding and recording it.
• Specific sources such as Twitter produce high-volume data streams of potential value.
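To make the clickstream case concrete, here is a minimal sketch of a map function that pulls the requested URL out of each Web server log line and emits it for counting; the reduce side would be the same summing reducer shown in the word-count sketch earlier. The log-format handling is deliberately simplified and the class name is ours.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (requested URL, 1) for each line of a common-format access log, e.g.:
// 127.0.0.1 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 2326
public class ClickstreamMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int quote = line.indexOf('"');        // start of the request field
    if (quote < 0) return;                // skip malformed lines
    String[] request = line.substring(quote + 1).split(" ");
    if (request.length < 2) return;
    url.set(request[1]);                  // the requested path
    context.write(url, ONE);
  }
}
```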
Where is all this information coming from? There are multiple sources, but to begin with, consider:

• The remarkable and continuing growth of the World Wide Web. The Web has become a remarkable repository of data to analyze, both in terms of the contents of the Web itself and, for a Web site owner, the ability to analyze the use of the site in complete detail.
• The remarkable and growing use of mobile devices. The iPhone has existed for less than five years (and the iPad for less time still), but this kind of mobile device has transformed how we deal with information. More and more of what we do is in text form (not handwritten notes, faxes or phone calls) and is available for analysis one way or another. Mobile devices also provide valuable (albeit frightening) information on where and when the data was created or read.
• The rise of "social sites." There has been rapid growth in Facebook and LinkedIn, as well as in customer feedback on specific products (both at shared sites like Amazon and on vendor sites). Twitter provides remarkable volumes of data of possible value.
• The rise of customer self-service. Increasingly, companies look for ways for their customer communities to help one another through shared Web sites. This not only is cost-effective, but generally leads to the earlier identification and solution of problems, as well as providing a rich source of data by which to assess customer sentiment.
• Machine-generated data. Almost all "devices" are now implemented in software and capable of providing log data (see above) if it can be used productively.

12. How Important Is Big Data Analytics?

The only reasonable answer is "it depends." Big data evangelists note that analytics can be worth 5% on the bottom line, meaning that intelligent analysis of business data can have a significant impact on the financial performance of a company. Even if that is true, for most companies most of the value will come from the analysis of "small data," not from the incremental analysis of data that is infeasible to store or analyze today.

At the same time, there are unquestionably companies for which the ability to do big data analytics is essential (Google and Facebook, for example). These companies depend on the analysis of huge data sets (clickstream data from large on-line user communities) that cannot practically be processed by conventional database and analytics solutions.

For most companies, big data analytics can provide incremental value, but the larger value will come from small data analytics. Over time, the value will clearly shift toward big data as more and more interesting data becomes available. There will almost always be value in the analysis of some very large data set; the more important question, from a business optimization perspective, is whether the highest priority requirement is based on big data, or whether there is still untapped and higher-value "small" data.
13. Things You Don't Want to Do with Hadoop

The Hadoop source distribution is "free," and a bright Java programmer can often "find" enough "underutilized" servers with which to stand up a small Hadoop cluster and do experiments. While it is true that almost every large company has real large-data problems of interest, to date much of the experimentation has been on problems that don't really need this class of solution. Here is a partial list of workloads that probably don't justify going to Hadoop:

• Non-huge problems. Keep in mind that a relatively inexpensive server can easily have 10 cores and 200 GB of memory, and 200 GB is a lot of data, especially in a compressed format (Microsoft PowerPivot – an Excel plug-in – can process 100 M rows of compressed fact-table data in 5% of that storage). Having the data resident in DRAM makes a huge difference (PowerPivot can scan a trillion rows a minute with fewer than 5 cores). If a compressed version of the data can reside in the memory of a large commodity server, an in-memory solution is almost certain to be better (various in-memory database tools are available).
• Data storage only. Although Hadoop is a good, very large storage system (HDFS), unless you want to do embedded processing there are often better storage solutions around.
• Parallel processing only. If you just want to manage the parallel execution of a distributed Java program, there are simpler and better solutions.
• HPC applications. Although a larger Hadoop cluster (100 nodes) comprises a significant amount of processing power and memory, you wouldn't want to run traditional HPC algorithms (e.g., FEA, CFD, geophysical data analysis) in Hadoop rather than in a more traditional computational grid.
14. Horizontal Hadoop Applications

With some very bright programmers, Hadoop can be applied wherever the functional model can be applied. One generic class of applications is characterized by data sets that are clearly too large to store economically in traditional enterprise storage systems (SAN and NAS), and clearly too large to analyze with traditional data warehouse systems. For such data sets:

• Think of Hadoop as a place where you can now store the data economically, and use MapReduce to preprocess it and extract data that can be fed into an existing data warehouse and analyzed, along with existing structured data, using existing analysis tools.
• Alternatively, think of Hadoop as a way of "extending" the capacity of an existing storage and analysis system when the cost of the solution starts to grow faster than linearly as more capacity is required.
• As introduced above, Hadoop can also be used as a means of integrating data from multiple existing warehouse and analysis systems.

15. Summary

Technology progress and the increased use of the Internet are creating very large new data sets of increasing value to businesses, and making the processing power to analyze them affordable. The size of these data sets suggests that exploiting them may well require a new category of data storage and analysis systems, with different system architectures (parallel processing capability integrated with high-volume storage) and a different use of components (greater exploitation of the same high-volume, commodity components used within today's very large Web properties). Hadoop is a strong candidate for such a new processing tier. Beyond the pedigree of the Google designs it reimplements, the fact that Hadoop is today a vibrant Open Source effort suggests that additional disruptive impact on product pricing and the economics of use is possible.