The Enterprise Use of Hadoop (v1)


Internet Research Group
November 2011




About The Internet Research Group
www.irg-intl.com

The Internet Research Group (IRG) provides market research and
market strategy services to product and service vendors. IRG services
combine the formidable and unique experience and perspective of the
two principals: John Katsaros and Peter Christy, each an experienced
industry veteran. The overarching mission of IRG is to help clients
make faster and better decisions about product strategy, market entry,
and market development. Katsaros and Christy published a book on
high tech business strategy, Getting It Right the First Time (Praeger,
2005); see www.gettingitrightthefirsttime.com.




     © 2011 Internet Research Group – all rights reserved




                                              Table of Contents


1.  Overview
2.  Background
3.  What Is Hadoop?
4.  Why Is Embedded Processing So Important?
5.  MapReduce Analytics
6.  What is “Big” Data?
7.  The Major Components of Hadoop
8.  The Hadoop Application Ecology
9.  Cloud Economics
10. Why Is Hadoop So Interesting?
11. What Are the Interesting Sources of Big Data?
12. How Important Is Big Data Analytics?
13. Things You Don’t Want to Do with Hadoop
14. Horizontal Hadoop Applications
15. Summary

1.    Overview
 The last decade has seen remarkable, continuing progress in computer
 technology, systems and implementations, as evidenced by Web and Internet
 systems of unprecedented scale such as Google and Facebook.

 Although most enterprise CIOs yearn to take advantage of the performance and
 cost efficiencies that these pioneering Web systems deliver, the enterprise path
 to Cloud computing is intrinsically complex: existing applications must be
 brought forward, and organizational structure and skill sets must evolve.
 Achieving those economies will therefore take some time.

 Hadoop, an Apache Foundation Open Source project, represents a way for
 enterprise IT to take advantage of Cloud and Internet capabilities sooner when it
 comes to the storage and processing of huge (by enterprise IT standards)
 amounts of data. Hadoop provides a means of implementing storage systems
 with Internet economics and of doing large-scale processing on that data. It is
 not a general replacement for existing enterprise data management and analysis
 systems, but for many companies it is an attractive complement to those systems,
 as well as a way of making use of the large-volume data sets that are increasingly
 available. The Yahoo! Hadoop team argues that within five years, 50% of
 enterprise data will be stored in Hadoop – they might well be right.




2.    Background
 The last decade has been remarkable for the advances in computer technology
 and systems:

 • There has been continuing, relentless “Moore’s Law” progress in semiconductor
   technology (CPUs, DRAM and now SSD).

 • There has been even faster progress in disk price/performance.

 • Google demonstrated the remarkable performance and cost-effectiveness that
   could be achieved using mega-scale systems built from commodity technology,
   as well as pioneering the application and operational adaptations needed to take
   advantage of such systems.

 The compounded impact of these improvements is seen most dramatically in
 various Cloud offerings (starting with Google or Amazon Web Services) where
 the cost of storage or computation is dramatically (orders of magnitude) cheaper
 than in typical enterprise computing.




 Hadoop presents an opportunity for enterprises to take advantage of Cloud
 economics immediately, especially in terms of storage, as we will sketch below.




3.     What Is Hadoop?


 Hadoop builds on a massive file system (Google File System or GFS) and a
 parallel application model (MapReduce) originally developed at Google. Google
 has an unbelievable number of servers compared to typical large enterprises (in
 all likelihood more than a million). Search is a relatively easy task to parallelize:
 many search requests can be run in parallel because they only have to be loosely
 synchronized (the same search done at the same time doesn’t have to get exactly
 the same response). GFS was developed as a file system for applications running
 at this scale. MapReduce was developed as a means of performing data analysis
 using these resources.

 Hadoop is an Open Source reimplementation of GFS and MapReduce. Google’s
 systems run a unique and proprietary software “stack,” so no one else could run
 Google’s MapReduce even if Google permitted it. Hadoop is designed to run on
 a conventional Linux stack. Google has encouraged the development of
 Hadoop, recognizing the value in a broader population of people trained in the
 methodology and tools. Much of the development of Hadoop has been driven by
 Yahoo!, which is also a large Hadoop user, internally running more than 40,000
 servers in Hadoop clusters.

 Operationally we talk about a Hadoop “cluster”: a set of servers dedicated to a
 particular instance of Hadoop, ranging from just a few servers to the clusters of
 more than 4,000 servers in use at Yahoo!.

 Today a typical Hadoop server might have two sockets and a total of 8 cores
 (two 4-core processors), 48 GB of DRAM, and 8-16 directly attached disks,
 typically cost-per-byte optimized (e.g., 2 or 3 TB 3.5” SATA drives). When
 implemented with high-volume commodity technology, the majority of the
 server cost is the disk drive complement, and each server will have 20-50 TB of
 storage.








4.     Why Is Embedded Processing So Important?
 A useful way of thinking about a Hadoop cluster is as a very high-capacity
 storage system built with “Cloud” economics (using inexpensive, high-capacity
 drives), with substantial, general-purpose, embedded processing power. The
 importance of having local processing capability becomes clear as soon as you
 realize that even when using the fastest LAN links (10 Gbits/sec), it takes 40
 minutes to transfer the contents of a single 3 TB disk drive. Big data sets may be
 remarkably inexpensive to store, but they aren’t easy to move around, even
 within a data center using high-speed network connections.1
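
 The arithmetic behind that figure is worth making explicit. The short Python
 sketch below (a back-of-the-envelope calculation using only the numbers quoted
 in this section and in footnote 1) reproduces it:

    # Back-of-the-envelope data-movement arithmetic, using the figures quoted
    # in this section and footnote 1 (3 TB drive, 10 Gbit/s LAN, ~1 Gbit/s of
    # sustained read bandwidth per SATA drive).
    DISK_TB = 3
    LAN_GBPS = 10
    DRIVE_GBPS = 1

    disk_bits = DISK_TB * 1e12 * 8                # one drive's contents, in bits
    minutes = disk_bits / (LAN_GBPS * 1e9) / 60   # time to push it over the LAN
    print("One 3 TB drive over a 10 Gbit/s link: %.0f minutes" % minutes)  # -> 40

    # Aggregate *local* read bandwidth, by contrast:
    node_gbps = 12 * DRIVE_GBPS      # a 12-disk node: roughly 10+ Gbit/s
    cluster_gbps = 50 * node_gbps    # a 50-node cluster: ~500 Gbit/s
    print("Node: ~%d Gbit/s, cluster: ~%d Gbit/s" % (node_gbps, cluster_gbps))

 The comparison makes the design pressure obvious: a cluster can read its disks
 locally far faster than any practical network could ship their contents elsewhere.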

 In the past we brought the data to the program: we ran a program on a server,
 opened a file on a network-based storage system, brought the file to the server,
 processed the data, and then probably wrote new data back out to the storage
 system.2 With Hadoop, this is reversed, reflecting the fact that it’s much easier
 to move the program to the data than the data to the program. Modern servers
 and large-capacity disks enable affordable storage systems of enormous
 capacity, but you have to process the data in place when possible; you can’t
 move it.

 Some “Cloud” storage applications require only infrequent access to the stored
 data. Almost all the activity in a Cloud-based backup service is writing the
 protected data to the disks; reading the stored data is done only infrequently
 (although being able to read a backup file when needed is the key value
 proposition). The same is true, to an only slightly lesser degree, when pictures,
 videos or music are stored in the Cloud. Only a small percentage of that data is
 ever accessed, and that small fraction can be (and is) cached on higher-
 performance, more expensive storage. Analysis is very different: data will be
 processed repeatedly as it is used to answer diverse questions. PC backup or
 picture storage are write-once/read-never applications. Analysis is write-
 once/read-many.




 1 A modern SATA drive can transfer data between the disk and server at a
 sustained rate of about 1 Gbit/second. On a 12-disk node, the aggregate read rate
 could be up to about 10 Gbits/second. On a 50-node cluster the total aggregate
 read rate could approach 500 Gbits/second.
 2 A 10 MB file (80 Mbits) can be transmitted in about 0.1 second over a
 Gbit/second link.





5.     MapReduce Analytics
 The use of Hadoop has created a lot of interest in large-scale analytics (the
 MapReduce part of Hadoop). This kind of “divide and conquer” methodology
 has been used in numerical analysis for many years as a way of dealing with
 problems known to be bigger than the biggest machine available. MapReduce is
 an elegant way of structuring this kind of algorithm: it isolates the
 analyst/programmer from the specific details of managing the pieces of work
 that get distributed to the available machines, and it imposes an application
 architecture that doesn’t depend on any specific structuring of the data.
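
 To make the structure concrete, here is a minimal Python sketch of the
 map/shuffle/reduce flow: a toy word count run on in-memory data, not Hadoop
 itself. The point to notice is that the analyst writes only the two small
 functions; everything between them belongs to the framework:

    # Toy, in-memory illustration of the MapReduce flow (word count).
    # In real Hadoop the framework performs the grouping ("shuffle") and
    # spreads the map and reduce calls across the machines of the cluster.
    from collections import defaultdict

    def map_fn(record):
        # map: emit a (key, value) pair for every word in one input record
        for word in record.split():
            yield word.lower(), 1

    def reduce_fn(word, values):
        # reduce: combine all the values emitted for one key
        return word, sum(values)

    records = ["the quick brown fox", "the lazy dog", "the fox"]

    groups = defaultdict(list)          # the "shuffle": group values by key
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    print(sorted(reduce_fn(k, v) for k, v in groups.items()))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]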

 As Hadoop evolves, the basic ideas will be adapted to more computer system
 architectures than just the commodity scale-out systems used by mega Web
 properties like Google and Yahoo!. A MapReduce computation cluster could also
 be used with data stored in a high-performance, high-bandwidth storage
 subsystem, which would make a lot of sense if the data were already stored there
 for other reasons. We expect many such variants of the original architecture to
 emerge over time.




6.     What is “Big” Data?
 Google and Yahoo! use MapReduce for purposes that are unique to extremely
 large-scale systems (e.g., search optimization, ad delivery optimization). That
 fact notwithstanding, almost all companies have important sources of big data.
 For example:

 • World-wide markets: The Internet enables any company, large or small, to
   interact with the billions of people world-wide who are connected. Modern
   logistics services such as UPS, FedEx and USPS let any company sell to global
   markets. A successful company has to think in terms of millions of people and
   build business systems capable of running at that scale. That’s big data.

 • Machine-generated data: IT infrastructure (the stuff that all modern companies
   run on) comprises thousands of devices (PCs and mobile devices, servers,
   storage, network and security devices), all of which are capable of generating a
   stream of log data summarizing normal and abnormal activity. In aggregate this
   stream is a rich source of business-process, operational, security and regulatory-
   compliance analysis. That’s big data.

 We’ll talk more later about how big data will impact enterprises over time.








7.     The Major Components of Hadoop
 The core of the Hadoop Open Source project is HDFS (the Hadoop Distributed
 File System) and MapReduce: reimplementations of the Google File System and
 of Google’s MapReduce, as defined by the public documents Google has
 published. HDFS is the basic file storage, capable of storing a large number of
 large files. MapReduce is the programming model by which data is analyzed
 using the processing resources within the cluster.

 HDFS has these goals:

 • Build very large data management systems from commodity parts, where
   component failure has to be assumed and dealt with as part of the basic design of
   the data system (in contrast to most enterprise storage, where great attention is
   paid to making the components reliable).

 • A file system capable of storing files that are huge by historical standards (many
   files larger than 1 GB).

 • A file system optimized on the assumption that files typically change by data
   being appended (e.g., additions to a log file) rather than by the modification of
   internal pieces of the file.

 • A system whose file system APIs reflect the needs of these new applications.

 The motivation for MapReduce is more complicated. Today’s world of
 commodity servers and inexpensive disk drives is completely different from
 yesterday’s world of enterprise IT. Historically, analytics ran on expensive,
 high-end servers and used expensive, enterprise-class disk drives. Buying a new
 database server is a big decision and comes with software licensing costs, as
 well as incremental operational needs (e.g., a database administrator). In the
 Hadoop world, adding more nodes isn’t a major capital expense (< $10K per
 server) and doesn’t trigger new software licenses or additional administrators.
 MapReduce was designed for such an environment, where adding more
 hardware is a perfectly reasonable approach to problem solving: progress is
 more easily made by adding hardware than by thinking hard about the problem
 and carefully crafting an optimized solution.

 MapReduce allows the scale of the solution to grow with minimal need for the
 analyst or programmer to adapt the program. The MapReduce infrastructure
 distributes the work among the available processors (the application
 programmer shouldn’t have to worry about how big the actual cluster is),
 monitors progress, restarts work that stalls or fails, and balances the work among
 the available nodes.
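
 As an illustration of how little of the programmer’s code changes as the cluster
 grows, consider a word count written for Hadoop Streaming, the standard
 Hadoop facility that runs ordinary stdin/stdout programs as map and reduce
 tasks. This is a hedged sketch rather than production code, but the same two
 functions run unchanged whether the job is given four nodes or four thousand:

    #!/usr/bin/env python
    # wordcount.py -- mapper and reducer for a Hadoop Streaming word count.
    # Streaming feeds each task its input on stdin and collects its output
    # from stdout; reducer input arrives sorted by key, courtesy of the
    # framework's shuffle.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word.lower())

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").split("\t")
            if word != current:            # key changed: emit the finished count
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

 A job like this is submitted with the streaming jar and its -input, -output,
 -mapper and -reducer options (the data paths in any such invocation are specific
 to the installation). Nothing in the scripts themselves mentions the size of the
 cluster.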

 Using MapReduce is by no means simple, nor something that many business
 analysts would ever want to do directly (or be able to do, for that matter).
 Google has required all of its summer college interns to develop a MapReduce
 application; even these excellent programmers, with the benefit of experienced
 colleagues, found it difficult to do. Google has supported the Hadoop effort in
 part so that it could be used in education to train more knowledgeable
 individuals. This isn’t a reason why the impact of MapReduce will be limited,
 however; it’s the motivation for a software ecology, built on top of HDFS and
 MapReduce, that makes the capability usable by a broader population.




8.     The Hadoop Application Ecology
 It is useful to think of Hadoop as a platform, like Windows or Linux. Although
 Hadoop was developed based on the specific Google application model, the
 interest in Hadoop has spawned the creation of a set of related programs. The
 Apache Open Source project includes these:

 • HBase – the Hadoop database

 • Pig – a high-level language for writing data analysis programs

 • Hive – a data warehouse system

 • Mahout – a set of machine learning tools

 There is other software that can be licensed for use with Hadoop, including:

 • MapR – an alternative storage system

 • Cloudera – management tools

 Database, BI and ETL vendors also offer software for use with Hadoop:

 • Various database and BI vendors offer connectors that make it easy to control an
   attached Hadoop system and import the output of Hadoop processing.

 • Similarly, the “ETL” vendors offer connectors so that Hadoop can be a source (or
   sink) of data in that process.




9.     Cloud Economics
 Now that we have introduced Hadoop and HDFS, we can explain in more detail
 what we mean by “Cloud Economics.” If you walked into any modern large-
 scale data center (Google, Yahoo!, Facebook, Microsoft) you would see
 something that looked very different from an enterprise data center. The
 enterprise data center would be filled with top-of-the-line systems (“enterprise
 class”); the Web data center would be filled with something looking more like
 what you would find in a thrift shop: inexpensive “white box” servers and
storage. As the cost of the hardware continues to decline, lots of other aspects of
IT have to evolve as well (e.g., software licensing fees, operational costs) if the
value of the hardware is to be exploited. The basic system and application
design have to evolve as well.

Perhaps most importantly, Google recognized that in large-scale computing
failure and reliability had to be reconsidered. In large-scale systems, failure was
the rule rather than the exception (with millions of disk drives, disk drive failure
is ongoing). In large-scale systems, it makes more sense to achieve reliability
and availability in the higher-level system (e.g., HDFS) and application (e.g.,
MapReduce) layers, not by using “enterprise-class” subsystems (e.g., RAID disk
systems). HDFS is a very reliable data storage subsystem because the file data is
replicated and distributed. MapReduce anticipates that individual tasks will fail
on an ongoing basis (because of some combination of software and hardware
failure) and manages the redistribution of work so that the overall job is
completed in a timely manner.

Consider how this plays out with storage. In the enterprise data center, the data
would likely be stored on a shared SAN (storage area network) system.
Because this SAN system held key data for multiple important applications, the
performance, reliability and availability of the SAN system were critical:

 • Redundant disks would be included and the data spread among multiple disks so
   that the loss of one or more of the disks wouldn’t result in the loss or
   unavailability of the data.

 • Critical elements (the controller, SAN switches and links, power supplies, host
   adaptors) would all be replicated for availability.

 • Because the SAN system supported multiple applications concurrently,
   performance was critical, so the fastest (and most expensive) disks would be
   used, with the fastest (and most expensive) connection to the controller. The
   controller would include substantial RAM memory for caching.

In contrast, a Hadoop cluster of 50 nodes has 500-1,000 high-capacity, low-cost
disk drives:

 • The disks are selected to be cost optimized – lowest cost per byte stored, least
   expensive attachment directly to a server (no storage network, no Fibre Channel
   attachment).

 • The design has no redundancy at the disk level (no RAID configurations, for
   example). The HDFS file system assumes that disk failures are an ongoing issue
   and achieves high-availability data storage despite that.

Cloud economics of storage means cost-effective drives directly connected to a
commodity server with the least expensive connection. In a typical Hadoop
node, 70% of the cost of the node is the cost of the disk drives, and those drives
are the most cost-effective available. It can’t get any cheaper than that! A
Hadoop cluster is a large data store built in the most cost-effective way possible.








10.  Why Is Hadoop So Interesting?

As we noted earlier, big data is relevant to essentially all businesses, because of
Internet markets and machine-generated log data if for no other reason. For
dealing with big data, Hadoop is unquestionably a game changer:

 • It enables the purchase and operation of very large-scale data systems at a much
   lower cost because it uses cost-optimized, commodity components. Adding
   500 TB of Hadoop storage is clearly affordable; adding 500 TB to a conventional
   database system is often not.

 • Hadoop is designed to move programs to data rather than the inverse. This basic
   paradigm change is required to deal with modern, high-volume disk drives.

 • Because of the Open Source community, Hadoop software is available for free
   rather than at current database and data warehouse licensing fees. The use of
   Hadoop isn’t free, but the elimination of traditional license fees makes it much
   easier to experiment (for example).

 • Because Hadoop is designed to deal with unstructured data and unconstrained
   analysis (in contrast to a data warehouse that is carefully schematized and
   optimized), it doesn’t require database-trained individuals (e.g., a DBA),
   although it clearly requires specialized expertise.

 • The MapReduce model minimizes the parallel programming experience and
   expertise required. To program MapReduce directly requires significant
   programming skills (Java and functional programming), but the basic Hadoop
   model is designed to use scaling (adding more nodes, especially as they get
   cheaper) as an alternative to careful parallel programming optimization.

Hadoop represents a quite dramatic rethinking of “data processing,” driven by
the increasing volumes of data being processed and by the opportunity to follow
the pioneering work of Google and others and use commodity system technology
at a much lower price. The downside of taking a new approach is twofold:

 • There is a lot of learning to do. Conventional data management and analysis is a
   large and well-established business. There are many analysts trained to use
   today’s tools, and a lot of technical people trained for the installation, operation
   and maintenance of those tools.

 • The “whole” product still needs some fleshing out. A modern data storage and
   analysis product is complicated: tools to import data, tools to transform data, job
   and work management systems, data management and migration tools, and
   interfaces to and integration with popular analysis tools, for a beginning. By this
   standard Hadoop is still pretty young.

From a product perspective, the biggest deficiencies are probably the adaptation
of Hadoop for operation in an IT shop rather than a large Web property, and the
development of tools that let users with more diverse skill sets (e.g., business
analysts) make productive use of Hadoop-stored data. All of this is being
worked on, either within the Open Source community or as licensed proprietary
software to use in conjunction with Hadoop. Companies providing Hadoop
support and training services have discovered a vibrant and growing market. The
usability of Hadoop (both operationally and as a data tool) is improving all the
time. But it does have some more distance to go.




11.  What Are the Interesting Sources of Big Data?

There is no single answer. Different companies will have different data sets of
interest. Some of the common ones are these:

 • Integration of data from multiple data warehouses: Most big companies have
   multiple data warehouses, in part because each may have a particular divisional
   or departmental focus, and in part to keep each at an affordable and manageable
   level, since traditional data warehouses all tend to increase in cost rapidly beyond
   some capacity. Hadoop provides a tool by which multiple sources of data can be
   brought together and analyzed, and by which a bigger “virtual” data warehouse
   can be built at a more affordable price.

 • Clickstream data: A Web server can record (in a log file) every interaction with a
   browser/user that it sees. This detailed record of use provides a wealth of
   information on the optimality of the Web site design, the Web system
   performance and, in many cases, the underlying business. For the large Web
   properties, clickstream analysis is the source of fundamental business analysis
   and optimization. For other businesses, the value depends on the importance of
   Web systems to the business.

 • Log file data: Modern systems, subsystems, applications and devices can all be
   configured to log “interesting” events. This is potentially the source of a wealth
   of information, ranging from security/attack analysis to design correctness and
   system utilization.

 • Information scraped from the Web: Every year more information, and more
   valuable information, is captured on the Web. Much of it is free to use for the
   cost of finding it and recording it.

 • Specific sources such as Twitter produce high-volume data streams potentially of
   value.

Where is all this information coming from? There are multiple sources, but to
begin with, consider:

 • The remarkable and continuing growth of the World Wide Web. The Web has
   become a remarkable repository of data to analyze, in terms of all the contents of
   the Web and, for a Web site owner, the ability to analyze in complete detail the
   use of the Web site.

 • The remarkable and growing use of mobile devices. The iPhone has only existed
   for the last five years (and the iPad for less), but this kind of mobile device has
   transformed how we deal with information. Specifically, more and more of what
   we do is in text form (not written notes, nor faxes nor phone calls) and available
   for analysis one way or another. Mobile devices also provide valuable (albeit
   frightening) information on where and when the data was created or read.

 • The rise of “social sites.” There has been rapid growth in Facebook and
   LinkedIn, as well as in customer feedback on specific products (both at shared
   sites like Amazon and on vendor sites). Twitter provides remarkable volumes of
   data with possible value.

 • The rise of customer self-service. Increasingly, companies look for ways for the
   community of their customers to help one another through shared Web sites.
   This not only is cost-effective, but generally leads to the earlier identification
   and solution of problems, as well as providing a rich source of data by which to
   assess customer sentiment.

 • Machine-generated data. Almost all “devices” are now implemented in software
   and capable of providing log data (see above) if it can be used productively.




12.  How Important Is Big Data Analytics?

The only reasonable answer is “it depends.” Big data evangelists note that
analytics can be worth 5% on the bottom line, meaning that intelligent analysis
of business data can have a significant impact on the financial performance of a
company. Even if that is true, for most companies most of the value will come
from the analysis of “small data,” not from the incremental analysis of data that
is infeasible to store or analyze today.

At the same time, there are unquestionably companies for which the ability to do
big data analytics is essential (Google and Facebook for example). These
companies depend on the analysis of huge data sets (clickstream data from large
on-line user communities) that cannot be practically processed by conventional
database and analytics solutions.

For most companies, big data analytics can provide incremental value, but the
larger value will come from small data analytics. Over time, the value will
clearly shift toward big data as more and more interesting data becomes
available. There will almost always be value in the analysis of some very large
data set. The more important question, from a business optimization perspective,
is whether the highest-priority requirement is based on big data, or whether there
is still untapped, higher-value “small” data.








13.  Things You Don’t Want to Do with Hadoop

The Hadoop source distribution is “free,” and a bright Java programmer can
often “find” enough “underutilized” servers with which to stand up a small
Hadoop cluster and do experiments. While it is true that almost every large
company has real large-data problems of interest, to date much of the
experimentation has been on problems that don’t really need this class of
solution. Here is a partial list of workloads that probably don’t justify going to
Hadoop:

 • Non-huge problems. Keep in mind that a relatively inexpensive server can easily
   have 10 cores and 200 GB of memory. 200 GB is a lot of data, especially in a
   compressed format (Microsoft PowerPivot – an Excel plugin – can process
   100 M rows of compressed fact-table data in 5% of that storage). Having the
   data resident in DRAM makes a huge difference (PowerPivot can scan 1 trillion
   rows a minute with fewer than 5 cores). If a compressed version of the data can
   reside in a large commodity server’s memory, that is almost certain to be a better
   solution (there are various in-memory database tools available).

 • Only for data storage. Although Hadoop (HDFS) is a good very-large-scale
   storage system, unless you want to do embedded processing there are often
   better storage solutions around.

 • Only for parallel processing. If you just want to manage the parallel execution of
   a distributed Java program, there are simpler and better solutions.

 • For HPC applications. Although a larger Hadoop cluster (100 nodes) comprises a
   significant amount of processing power and memory, you wouldn’t want to run
   traditional HPC algorithms (e.g., FEA, CFD, geophysical data analysis) in
   Hadoop rather than in a more traditional computational grid.




14.  Horizontal Hadoop Applications

With some very bright programmers, Hadoop can be applied wherever the
functional model fits. One generic class of applications is characterized by the
following:

 • Data sets that are clearly too large to economically store in traditional enterprise
   storage systems (SAN and NAS) and clearly too large to analyze with traditional
   data warehouse systems.

 • Think of Hadoop as a place where you can now store the data economically,
   and use MapReduce to preprocess the data and extract data that can be fed into
   an existing data warehouse and analyzed, along with existing structured data,
   using existing analysis tools (see the sketch after this list).


 • Alternatively, you can think of Hadoop as a way of “extending” the capacity of
   an existing storage and analysis system when the cost of the solution starts to
   grow faster than linearly as more capacity is required.

 • As introduced above, Hadoop can also be used as a means of integrating data
   from multiple existing warehouse and analysis systems.
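
As a concrete sketch of the preprocessing pattern above, a map-only job might
reduce raw Web-server logs to the few fields a warehouse bulk load needs. The
log layout and the fields kept here are invented for illustration:

    # extract.py -- hypothetical map-only preprocessing step: turn raw
    # Apache-style access-log lines into a narrow CSV ready for a data
    # warehouse bulk load. The log format and field choices are illustrative.
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 7:
            continue                        # skip malformed records
        ip = parts[0]
        timestamp = parts[3].lstrip("[")    # e.g. 10/Oct/2011:13:55:36
        url = parts[6]
        if url.startswith("/product/"):     # keep only the traffic of interest
            print("%s,%s,%s" % (ip, timestamp, url))

Run under Hadoop Streaming with no reduce step, a script like this can turn
terabytes of raw logs into an extract small enough for a conventional data
warehouse to load and join against existing structured data.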




15.  Summary

Technology progress and the increased use of the Internet are creating very large
new data sets of increasing value to businesses, and are making the processing
power to analyze them affordable. The size of these data sets suggests that
exploiting them may well require a new category of data storage and analysis
systems, with different system architectures (parallel processing capability
integrated with high-volume storage) and different use of components (more
exploitation of the same high-volume, commodity components that are used
within today’s very large Web properties). Hadoop is a strong candidate for
such a new processing tier. Beyond its origins in Google’s designs, the fact that
it is today a vibrant Open Source effort suggests that additional disruptive impact
on product pricing and the economics of use is possible.





The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Analyst Report : The Enterprise Use of Hadoop

  • 1. The Enterprise Use of Hadoop (v1) Internet Research Group November 2011 About The Internet Research Group www.irg-intl.com The Internet Research Group (IRG) provides market research and market strategy services to product and service vendors. IRG services combine the formidable and unique experience and perspective of the two principals: John Katsaros and Peter Christy, each an experienced industry veteran. The overarching mission of IRG is to help clients make faster and better decisions about product strategy, market entry, and market development. Katsaros and Christy published a book on high tech business strategy Getting It Right the First Time – Praeger, 2005 www.gettingitrightthefirsttime.com. © 2011 Internet Research Group – all rights reserved
  • 2.
  • 3. IRG 2011: The Enterprise Use of Hadoop (v1) page i Table of Contents 1. Overview .................................................................................................................... 1 2. Background ............................................................................................................... 1 3. What Is Hadoop? ...................................................................................................... 2 4. Why Is Embedded Processing So Important? ....................................................... 3 5. MapReduce Analytics ............................................................................................... 4 6. What is “Big” Data? .................................................................................................. 4 7. The Major Components of Hadoop ......................................................................... 5 8. The Hadoop Application Ecology............................................................................ 6 9. Cloud Economics ..................................................................................................... 6 10. Why Is Hadoop So Interesting? ............................................................................... 8 11. What Are the Interesting Sources of Big Data? ..................................................... 9 12. How Important Is Big Data Analytics? .................................................................. 10 13. Things You Don’t Want to Do with Hadoop .......................................................... 11 14. Horizontal Hadoop Applications ........................................................................... 11 15. Summary ................................................................................................................. 12 © 2011 Internet Research Group – all rights reserved
  • 4.
Hadoop, an Apache Foundation Open Source project, represents a way for enterprise IT to take advantage of Cloud and Internet capabilities sooner when it comes to the storage and processing of huge (by enterprise IT standards) amounts of data. Hadoop provides a means of implementing storage systems with Internet economics and doing large-scale processing on that data. It is not a general replacement for existing enterprise data management and analysis systems, but for many companies it is an attractive complement to those systems, as well as a way of making use of the large-volume data sets that are increasingly available. The Yahoo! Hadoop team argues that in five years, 50% of enterprise data will be stored in Hadoop – they might well be right.

2. Background

The last decade has been remarkable for the advances in computer technology and systems:

• There has been continuing, relentless "Moore's Law" progress in semiconductor technology (CPUs, DRAM and now SSD).
• There has been even faster progress in disk price/performance.
• Google demonstrated the remarkable performance and cost-effectiveness that can be achieved using mega-scale systems built from commodity technology, and pioneered the application and operational adaptations needed to take advantage of such systems.

The compounded impact of these improvements is seen most dramatically in the various Cloud offerings (starting with Google and Amazon Web Services), where the cost of storage or computation is dramatically (orders of magnitude) lower than in typical enterprise computing.
Hadoop presents an opportunity for enterprises to take advantage of Cloud economics immediately, especially in terms of storage, as we sketch below.

3. What Is Hadoop?

Hadoop builds on a massive file system (the Google File System, or GFS) and a parallel application model (MapReduce), both originally developed at Google. Google operates an enormous number of servers compared to typical large enterprises (in all likelihood more than a million). Search is a relatively easy task to parallelize: many search requests can be run in parallel because they only have to be loosely synchronized (the same search done at the same time doesn't have to return exactly the same response). GFS was developed as a file system for applications running at this scale, and MapReduce as a means of performing data analysis using these resources.

Hadoop is an Open Source reimplementation of GFS and MapReduce. Google's systems run a unique and proprietary software "stack," so no one else could run Google's MapReduce even if Google permitted it; Hadoop is designed to run on a conventional Linux stack. Google has encouraged the development of Hadoop, recognizing the value of a broader population of people trained in the methodology and tools. Much of the development of Hadoop has been driven by Yahoo!, which is also a large Hadoop user, running more than 40,000 servers in Hadoop clusters internally.

Operationally we talk about a Hadoop "cluster": a set of servers dedicated to a particular instance of Hadoop, ranging from just a few machines to the clusters of more than 4,000 servers in use at Yahoo!.

Today a typical Hadoop server might have two sockets (a total of 8 cores from two 4-core processors), 48 GB of DRAM, and 8-16 directly attached disks, typically cost-per-byte optimized (e.g., 2 or 3 TB 3.5" SATA drives). When a server is implemented with high-volume commodity technology, the majority of its cost is the disk drive complement, and each server will have 20-50 TB of storage.
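To give a concrete sense of what programming against Hadoop looks like, below is a minimal sketch of the canonical word-count job written against the standard Apache Hadoop Java API (org.apache.hadoop.mapreduce). The map function emits a count of 1 for every word it sees and the reduce function sums the counts for each word; the framework distributes the work across however many servers the cluster has. The class names are ours for illustration, and a production job would add configuration and error handling.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with the standard hadoop jar command, the same program runs unchanged on a single test machine or a 4,000-node cluster; that indifference to cluster size is the essential property of the model.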
4. Why Is Embedded Processing So Important?

A useful way of thinking about a Hadoop cluster is as a very high-capacity storage system built with "Cloud" economics (using inexpensive, high-capacity drives) and with substantial, general-purpose, embedded processing power. The importance of having local processing capability becomes clear as soon as you realize that even over the fastest LAN links (10 Gbits/sec), it takes 40 minutes to transfer the contents of a single 3 TB disk drive. Big data sets may be remarkably inexpensive to store, but they aren't easy to move around, even within a data center using high-speed network connections. [1]

In the past we brought the data to the program: we ran a program on a server, opened a file on a network-based storage system, brought the file to the server, processed the data, and then probably wrote new data back out to the storage system. [2] With Hadoop this is reversed, reflecting the fact that it is much easier to move the program to the data than the data to the program. Modern servers and large-capacity disks enable affordable storage systems of enormous capacity, but you have to process the data in place when possible; you can't afford to move it.

Some "Cloud" storage applications require only infrequent access to the stored data. Almost all the activity in a Cloud-based backup service is writing the protected data to the disks; reading the stored data happens only infrequently (albeit being able to read a backup file when needed is the key value proposition). The same is true, to an only slightly lesser degree, when pictures, videos or music are stored in the Cloud: only a small percentage of that data is ever accessed, and that small fraction can be (and is) cached on higher-performance, more expensive storage. Analysis is very different; data will be processed repeatedly as it is used to answer diverse questions. PC backup or picture storage are write-once/read-never applications. Analysis is write-once/read-many.

[1] A modern SATA drive can transfer data between the disk and server at a sustained rate of about 1 Gbit/second. On a 12-disk node, the aggregate read rate could be up to about 10 Gbits/second; on a 50-node cluster, the total aggregate read rate could approach 500 Gbits/second.
[2] A 10 MB file (80 Mbits) can be transmitted in under 0.1 second over a Gbit/second link.
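The 40-minute figure above is easy to verify with back-of-envelope arithmetic. The throwaway snippet below is a minimal sketch of the calculation, assuming the link runs at its full nominal rate with no protocol overhead.

```java
// Back-of-envelope transfer-time check, assuming the nominal link rate
// is fully usable (real links lose some capacity to protocol overhead).
public class TransferTime {
  public static void main(String[] args) {
    double driveBits = 3e12 * 8;   // one 3 TB drive = 24 Tbits
    double lanRate = 10e9;         // 10 Gbit/s LAN link
    double seconds = driveBits / lanRate;
    // Prints: 2400 seconds (40 minutes) to drain a single drive
    System.out.printf("%.0f seconds (%.0f minutes)%n", seconds, seconds / 60.0);
  }
}
```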
5. MapReduce Analytics

The use of Hadoop has created a lot of interest in large-scale analytics (the MapReduce part of Hadoop). This kind of "divide and conquer" methodology has been used in numerical analysis for many years as a way of dealing with problems known to be bigger than the biggest machine available. MapReduce is an elegant way of structuring this kind of algorithm: it isolates the analyst/programmer from the specific details of managing the pieces of work that get distributed to the available machines, and it is an application architecture that doesn't depend on any specific structuring of the data.

As Hadoop evolves, the basic ideas will be adapted to more computer system architectures than just the commodity scale-out systems used by mega Web properties like Google and Yahoo!. A MapReduce computation cluster could also be used with data stored in a high-performance, high-bandwidth storage subsystem, which would make a lot of sense if the data were already stored there for other reasons. We expect many such variants of the original architecture to emerge over time.

6. What is "Big" Data?

Google and Yahoo! use MapReduce for purposes that are unique to extremely large-scale systems (e.g., search optimization, ad delivery optimization). That fact notwithstanding, almost all companies have important sources of big data. For example:

• World-wide markets: The Internet enables any company, large or small, to interact with the billions of people world-wide who are connected, and modern logistics services such as UPS, FedEx and USPS let any company sell to global markets. A successful company has to think in terms of millions of people and build business systems capable of running at that scale. That's big data.
• Machine-generated data: IT infrastructure (the stuff that all modern companies run on) comprises thousands of devices (PCs and mobile devices, servers, storage, network and security devices), all of which are capable of generating a stream of log data summarizing normal and abnormal activity. In aggregate this stream is a rich source of business process, operational, security and regulatory compliance analysis. That's big data.

We'll talk more later about how big data will impact enterprises over time.
7. The Major Components of Hadoop

The core of the Hadoop Open Source project is HDFS (the Hadoop Distributed File System), a reimplementation of the Google File System, together with MapReduce as defined by the public papers Google has published. HDFS provides the basic file storage, capable of holding a large number of large files. MapReduce is the programming model by which data is analyzed using the processing resources within the cluster.

HDFS has these goals:

• Build very large data management systems from commodity parts, where component failure is assumed and dealt with as part of the basic design of the data system (in contrast to most enterprise storage, where great attention is paid to making the components reliable).
• Provide a file system capable of storing files that are huge by historical standards (many files larger than 1 GB).
• Optimize for files that typically change by having data appended to them (e.g., additions to a log file) rather than by the modification of internal pieces of the file.
• Offer file system APIs that reflect the needs of these new applications.

The motivation for MapReduce is more complicated. Today's world of commodity servers and inexpensive disk drives is completely different from yesterday's world of enterprise IT. Historically, analytics ran on expensive, high-end servers and used expensive, enterprise-class disk drives; buying a new database server was a big decision that came with software licensing costs as well as incremental operational needs (e.g., a database administrator). In the Hadoop world, adding more nodes isn't a major capital expense (< $10K per server) and doesn't trigger new software licenses or additional administrators. MapReduce was designed for such an environment, where adding hardware is a perfectly reasonable approach to problem solving: progress is more easily made by adding nodes than by carefully crafting an optimized solution. MapReduce allows the scale of the solution to grow with minimal need for the analyst or programmer to adapt the program. The MapReduce infrastructure distributes the work among the available processors (the application programmer shouldn't have to worry about how big the actual cluster is), monitors progress, restarts work that stalls or fails, and balances the work among the available nodes.

Using MapReduce is by no means simple, nor something that many business analysts would ever want to do directly (or be able to do, for that matter). Google has required all of its summer college interns to develop a MapReduce application; even these excellent programmers, with the benefit of experienced colleagues, found it difficult. Google has supported the Hadoop effort in part so that it could be used in education to train more knowledgeable individuals. This isn't a reason why the impact of MapReduce will be limited, however; it is the motivation for a software ecology built on top of HDFS and MapReduce that makes the capability usable by a broader population. A minimal sketch of the HDFS file API appears below.
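For a sense of what HDFS looks like from an application's point of view, here is a minimal sketch using the standard org.apache.hadoop.fs classes. The NameNode address and file path are hypothetical; a real deployment would pick up its configuration from the cluster's config files rather than setting it in code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally supplied by core-site.xml
    // (the property name shown is the one used by recent Hadoop releases).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    // Write a (large, append-oriented) file into the distributed store.
    Path path = new Path("/logs/events.log");
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("event-1\n");
    }

    // Read it back; the client streams blocks from whichever nodes hold them.
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(in.readLine());
    }
  }
}
```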
8. The Hadoop Application Ecology

It is useful to think of Hadoop as a platform, like Windows or Linux. Although Hadoop was developed based on the specific Google application model, the interest in Hadoop has spawned the creation of a set of related programs. The Apache Open Source project includes these:

• HBase – the Hadoop database
• Pig – a high-level language for writing data analysis programs
• Hive – a data warehouse system
• Mahout – a set of machine learning tools

There is other software that can be licensed for use with Hadoop, including:

• MapR – an alternative storage system
• Cloudera – management tools

Various database, BI and "ETL" vendors also offer software for use with Hadoop:

• Database and BI vendors offer connectors that make it easy to control an attached Hadoop system and import the output of Hadoop processing.
• Similarly, the ETL vendors offer connectors so that Hadoop can be a source (or sink) of data in that process.

9. Cloud Economics

Now that we have introduced Hadoop and HDFS, we can explain in more detail what we mean by "Cloud economics." If you walked into any modern large-scale Web data center (Google, Yahoo!, Facebook, Microsoft) you would see something that looked very different from an enterprise data center. The enterprise data center would be filled with top-of-the-line ("enterprise class") systems; the Web data center would be filled with something looking more like what you would find in a thrift shop: inexpensive "white box" servers and storage. As the cost of the hardware continues to decline, lots of other aspects of IT have to evolve as well (e.g., software licensing fees, operational costs) if the value of the hardware is to be exploited.
The basic system and application design have to evolve as well. Perhaps most importantly, Google recognized that in large-scale computing, failure and reliability had to be reconsidered. In large-scale systems failure is the rule rather than the exception (with millions of disk drives, disk drive failure is ongoing), so it makes more sense to achieve reliability and availability in the higher-level system (e.g., HDFS) and application (e.g., MapReduce) layers than by using "enterprise-class" subsystems (e.g., RAID disk systems). HDFS is a very reliable data storage subsystem because the file data is replicated and distributed. MapReduce anticipates that individual tasks will fail on an ongoing basis (because of some combination of software and hardware failure) and manages the redistribution of work so that the overall job completes in a timely manner.

Consider how this plays out with storage. In the enterprise data center, the data would likely be stored on a shared SAN (storage area network) system. Because this SAN system holds key data for multiple important applications, its performance, reliability and availability are critical:

• Redundant disks would be included and the data spread among multiple disks, so that the loss of one or more of the disks wouldn't result in the loss or unavailability of the data.
• Critical elements (the controller, SAN switches and links, power supplies, host adaptors) would all be replicated for availability.
• Because the SAN system supports multiple applications concurrently, performance is critical, so the fastest (and most expensive) disks would be used, with the fastest (and most expensive) connection to the controller, and the controller would include substantial RAM for caching.

In contrast, a Hadoop cluster of 50 nodes has 500-1,000 high-capacity, low-cost disk drives:

• The disks are selected to be cost optimized – lowest cost per byte stored, with the least expensive attachment, directly to a server (no storage network, no Fibre Channel attachment).
• The design has no redundancy at the disk level (no RAID configurations, for example). The HDFS file system assumes that disk failures are an ongoing issue and achieves high-availability data storage despite that, by replicating each block across several nodes (see the sketch below).

Cloud economics of storage means cost-effective drives directly connected to a commodity server with the least expensive connection. In a typical Hadoop node, 70% of the cost of the node is the cost of the disk drives, and those drives are the most cost-effective available. It can't get any cheaper than that! A Hadoop cluster is a large data store built in the most cost-effective way possible.
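As an illustration of how that block replication is exposed to applications, here is a minimal sketch using the standard HDFS Java API. Three-way replication is the conventional HDFS default; the file path shown is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for new files; 3 copies is the usual HDFS setting,
    // giving availability without RAID on any individual node.
    conf.set("dfs.replication", "3");
    FileSystem fs = FileSystem.get(conf);

    // Replication can also be adjusted per file after the fact, e.g. to keep
    // extra copies of a heavily read data set (hypothetical path).
    fs.setReplication(new Path("/data/clickstream/2011-11-01.log"), (short) 5);
  }
}
```

The design choice is the point: durability lives in the file system software, so the underlying drives can be the cheapest available.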
10. Why Is Hadoop So Interesting?

As we noted earlier, big data is relevant to essentially all businesses, if for no other reason than Internet markets and machine-generated log data. For dealing with big data, Hadoop is unquestionably a game changer:

• It enables the purchase and operation of very large-scale data systems at a much lower cost because it uses cost-optimized, commodity components. Adding 500 TB of Hadoop storage is clearly affordable; adding 500 TB to a conventional database system is often not.
• Hadoop is designed to move programs to data rather than the inverse. This basic paradigm change is required to deal with modern, high-volume disk drives.
• Because of the Open Source community, Hadoop software is available for free rather than at current database and data warehouse licensing fees. The use of Hadoop isn't free, but the elimination of traditional license fees makes it much easier to experiment, for example.
• Because Hadoop is designed to deal with unstructured data and unconstrained analysis (in contrast to a data warehouse that is carefully schematized and optimized), it doesn't require database-trained individuals (e.g., a DBA), although it clearly requires specialized expertise.
• The MapReduce model minimizes the parallel programming experience and expertise required. Programming MapReduce directly requires significant skills (Java and functional programming), but the basic Hadoop model is designed to use scaling (adding more nodes, especially as they get cheaper) as an alternative to hand-optimizing parallel programs.

Hadoop represents a quite dramatic rethinking of "data processing," driven by the increasing volumes of data being processed and by the opportunity to follow the pioneering work of Google and others in using commodity system technology at a much lower price. The downside of taking a new approach is twofold:

• There is a lot of learning to do. Conventional data management and analysis is a large and well-established business; there are many analysts trained to use today's tools, and many technical people trained in the installation, operation and maintenance of those tools.
• The "whole" product still needs some fleshing out. A modern data storage and analysis product is complicated: tools to import data, tools to transform data, job and work management systems, data management and migration tools, and interfaces to and integration with popular analysis tools, for a beginning. By this standard Hadoop is still pretty young.

From a product perspective, the biggest deficiencies are probably the adaptation of Hadoop for operation in an IT shop rather than a large Web property, and the development of tools that let users with more diverse skill sets (e.g., business analysts) make productive use of Hadoop-stored data.
All of this is being worked on, either within the Open Source community or as licensed proprietary software to use in conjunction with Hadoop. Companies providing Hadoop support and training services have discovered a vibrant and growing market. The usability of Hadoop (both operationally and as a data tool) is improving all the time, but it does have some distance still to go.

11. What Are the Interesting Sources of Big Data?

There is no single answer; different companies will have different data sets of interest. Some of the common ones are these:

• Integration of data from multiple data warehouses: Most big companies have multiple data warehouses, in part because each may have a particular divisional or departmental focus, and in part to keep each at an affordable and manageable size, since traditional data warehouses all tend to increase in cost rapidly beyond some capacity. Hadoop provides a tool by which multiple sources of data can be brought together and analyzed, and by which a bigger "virtual" data warehouse can be built at a more affordable price.
• Clickstream data: A Web server can record (in a log file) every interaction with a browser/user that it sees. This detailed record of use provides a wealth of information on the optimality of the Web site design, the Web system performance and, in many cases, the underlying business. For the large Web properties, clickstream analysis is the source of fundamental business analysis and optimization; for other businesses, the value depends on the importance of Web systems to the business. (A sketch of a clickstream mapper appears after this list.)
• Log file data: Modern systems, subsystems, applications and devices can all be configured to log "interesting" events. This information is potentially the source of a wealth of insight, ranging from security/attack analysis to design correctness and system utilization.
• Information scraped from the Web: Every year more information, and more valuable data, is captured on the Web. Much of it is free to use for the cost of finding and recording it.
• Specific sources such as Twitter produce high-volume data streams of potential value.
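To make the clickstream case concrete, here is a minimal sketch of a map function that pulls the requested URL out of each Web server log line and emits it for counting; the reduce side would be the same summing reducer shown in the word-count sketch earlier. The log-format handling is deliberately simplified and the class name is ours.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (requested URL, 1) for each line of a common-format access log, e.g.:
// 127.0.0.1 - - [01/Nov/2011:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 2326
public class ClickstreamMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int quote = line.indexOf('"');        // start of the request field
    if (quote < 0) return;                // skip malformed lines
    String[] request = line.substring(quote + 1).split(" ");
    if (request.length < 2) return;
    url.set(request[1]);                  // the requested path
    context.write(url, ONE);
  }
}
```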
Where is all this information coming from? There are multiple sources, but to begin with, consider:

• The remarkable and continuing growth of the World Wide Web. The Web has become a remarkable repository of data to analyze, both in terms of the contents of the Web itself and, for a Web site owner, the ability to analyze the use of the site in complete detail.
• The remarkable and growing use of mobile devices. The iPhone has existed for less than five years (and the iPad for less time still), but this kind of mobile device has transformed how we deal with information. More and more of what we do is in text form (not handwritten notes, faxes or phone calls) and is available for analysis one way or another. Mobile devices also provide valuable (albeit frightening) information on where and when the data was created or read.
• The rise of "social sites." There has been rapid growth in Facebook and LinkedIn, as well as in customer feedback on specific products (both at shared sites like Amazon and on vendor sites). Twitter provides remarkable volumes of data of possible value.
• The rise of customer self-service. Increasingly, companies look for ways for their customer communities to help one another through shared Web sites. This not only is cost-effective, but generally leads to the earlier identification and solution of problems, as well as providing a rich source of data by which to assess customer sentiment.
• Machine-generated data. Almost all "devices" are now implemented in software and capable of providing log data (see above) if it can be used productively.

12. How Important Is Big Data Analytics?

The only reasonable answer is "it depends." Big data evangelists note that analytics can be worth 5% on the bottom line, meaning that intelligent analysis of business data can have a significant impact on the financial performance of a company. Even if that is true, for most companies most of the value will come from the analysis of "small data," not from the incremental analysis of data that is infeasible to store or analyze today.

At the same time, there are unquestionably companies for which the ability to do big data analytics is essential (Google and Facebook, for example). These companies depend on the analysis of huge data sets (clickstream data from large on-line user communities) that cannot practically be processed by conventional database and analytics solutions.

For most companies, big data analytics can provide incremental value, but the larger value will come from small data analytics. Over time, the value will clearly shift toward big data as more and more interesting data becomes available. There will almost always be value in the analysis of some very large data set; the more important question, from a business optimization perspective, is whether the highest priority requirement is based on big data, or whether there is still untapped and higher-value "small" data.
13. Things You Don't Want to Do with Hadoop

The Hadoop source distribution is "free," and a bright Java programmer can often "find" enough "underutilized" servers with which to stand up a small Hadoop cluster and do experiments. While it is true that almost every large company has real large-data problems of interest, to date much of the experimentation has been on problems that don't really need this class of solution. Here is a partial list of workloads that probably don't justify going to Hadoop:

• Non-huge problems. Keep in mind that a relatively inexpensive server can easily have 10 cores and 200 GB of memory, and 200 GB is a lot of data, especially in a compressed format (Microsoft PowerPivot – an Excel plug-in – can process 100 M rows of compressed fact-table data in 5% of that storage). Having the data resident in DRAM makes a huge difference (PowerPivot can scan a trillion rows a minute with fewer than 5 cores). If a compressed version of the data can reside in the memory of a large commodity server, an in-memory solution is almost certain to be better (various in-memory database tools are available).
• Data storage only. Although Hadoop is a good, very large storage system (HDFS), unless you want to do embedded processing there are often better storage solutions around.
• Parallel processing only. If you just want to manage the parallel execution of a distributed Java program, there are simpler and better solutions.
• HPC applications. Although a larger Hadoop cluster (100 nodes) comprises a significant amount of processing power and memory, you wouldn't want to run traditional HPC algorithms (e.g., FEA, CFD, geophysical data analysis) in Hadoop rather than in a more traditional computational grid.
14. Horizontal Hadoop Applications

With some very bright programmers, Hadoop can be applied wherever the functional model can be applied. One generic class of applications is characterized by data sets that are clearly too large to store economically in traditional enterprise storage systems (SAN and NAS), and clearly too large to analyze with traditional data warehouse systems. For such data sets:

• Think of Hadoop as a place where you can now store the data economically, and use MapReduce to preprocess it and extract data that can be fed into an existing data warehouse and analyzed, along with existing structured data, using existing analysis tools.
• Alternatively, think of Hadoop as a way of "extending" the capacity of an existing storage and analysis system when the cost of the solution starts to grow faster than linearly as more capacity is required.
• As introduced above, Hadoop can also be used as a means of integrating data from multiple existing warehouse and analysis systems.

15. Summary

Technology progress and the increased use of the Internet are creating very large new data sets of increasing value to businesses, and making the processing power to analyze them affordable. The size of these data sets suggests that exploiting them may well require a new category of data storage and analysis systems, with different system architectures (parallel processing capability integrated with high-volume storage) and a different use of components (greater exploitation of the same high-volume, commodity components used within today's very large Web properties). Hadoop is a strong candidate for such a new processing tier. Beyond the pedigree of the Google designs it reimplements, the fact that Hadoop is today a vibrant Open Source effort suggests that additional disruptive impact on product pricing and the economics of use is possible.