Optimize the Business Value of All Your Enterprise Data

An integrated approach that incorporates massively parallel processing (MPP) relational databases and Apache Hadoop to provide a framework for the enterprise data architecture.

White Paper
By Chad Meley, Director of eCommerce & Digital Media
EB-7873 / 10.13
Executive Summary
Few industries have evolved as quickly as data
processing, thanks to the effect of Moore’s Law coupled
with Silicon Valley–style software innovation. So it comes
as no surprise that innovations in data analysis have led
to new data, new tools, and new demands to remain
competitive. Market leaders in many industries are
adopting these new capabilities, fast followers are on their
heels, and the mainstream is not far behind.
This renaissance has affected the data warehouse in
powerful ways. In the 1990s and early 2000s, the
massively parallel processing (MPP) relational data
warehouse was the only proven and scalable place to
hold corporate memory. In the late 2000s, an explosion
of new data types and enabling technologies led some
to claim the demise of the traditional data warehouse.
A more pragmatic view has emerged recently: a
one-size-fits-all approach—whether a traditional data
warehouse or Apache™ Hadoop®—is insufficient by itself
in a time when datasets and usage patterns vary widely.
Technology advances have expanded the options to
include permutations of the data warehouse in what is
referred to as built-for-purpose solutions.
Yet even seasoned practitioners who embrace
multiplatform data environments still struggle to decide
which technology is the best choice for each use case. By
analogy, consider the transformations that have occurred
in moving physical goods around the world in the past
century—first cargo ships, then rail and trucks, and finally
airplanes. Because of our familiarity with these modes,
we know intrinsically what use cases are best for each
transportation option, and nobody questions the need for
all of them to exist within a global logistics framework.
Knowing the value propositions and economics for each,
it would be foolish for someone to say “Why would
anyone ever use an airplane to ship goods when rail is a
fraction of the cost per pound?” Or “Why would I ever
consider using a cargo ship to move oil when I can get it
to market faster using air?”
But the best fit for data platform technologies is not as
universally understood at this time.
This paper will not bring instant clarity to this complex subject; rather, the intent is to define a framework of capabilities and costs for various options to encourage informed
dialogue that will accelerate more comprehensive understanding in the industry.
2
White Paper
10.13
eb 7873
Teradata has defined the Teradata® Unified Data
Architecture™, a solution that allows the analytics renaissance to flourish while controlling costs and discovering new
analytics. As guideposts in this expansion, we have identified
workloads that fit into built-for-purpose zones of activity:
•	Integrated data warehouse
•	Interactive discovery
•	Batch data processing
•	General-purpose file system
By making use of this array of analytical environments, companies can extract significant value from a broader range
of data—much of which would have been discarded just a
few years ago. As a result, business users can solve more
high-value business problems, achieve greater operational
efficiencies, and execute faster on strategic initiatives.
While the big data landscape is spawning new and
innovative products at an astonishing pace, a great
deal of attention continues to be focused on one of
the seminal technologies that launched the big data
analytics expansion: Hadoop. An open source software
framework that supports the processing of large datasets
in a distributed applications environment, Hadoop
uses parallelism over raw files through its MapReduce
framework. It has the momentum and community support
that make it the most likely of the new breed of data
technologies to become the dominant enterprise standard
in its space.
The Teradata Unified Data Architecture
Teradata offers a hybrid enterprise data architecture that
integrates Hadoop and massively parallel processing (MPP)
relational database management systems (RDBMS). Known
as the Teradata Unified Data Architecture™, this solution
relies on input from Teradata subject-matter experts and
Teradata customers who are experienced practitioners
with both Hadoop and traditional data warehousing. This
architecture has also been validated with leading industry
analysts and provides a strong foundation for designing
next-generation enterprise data architectures.
The essence of the Teradata Unified Data Architecture™ is
captured in a comprehensive infographic that is intended
to be a reference for database architects and strategic
planners as they develop their next-generation enterprise
data architectures (Figure 1). The graphic, along with
the more detailed explanations in this paper, provides
objective criteria for deciding which technology is best
suited to particular needs within the organization.

[Figure 1. The Teradata Unified Data Architecture: a reference infographic that maps the four built-for-purpose zones (integrated data warehouse, interactive discovery, batch data processing, and general-purpose file system) against business value density, schema stability, query volume, and four-factor cost ratings.]
To provide a framework for understanding the use cases,
the following sections describe a number of important
concepts such as business value density (BVD), stable
and evolving schemas, and query and data volumes.
There are many different concepts that interplay within
the graphic, so it is broken down in a logical order.
One of the most important concepts for understanding
the Teradata Unified Data Architecture™ is BVD, defined
as the amount of business relevance per gigabyte of data
(Figure 2). Put another way: how many business insights
can be extracted from a given amount of data? A number
of factors influence BVD, including when the data was
captured, the amount of detail in the data, the percentage
of inaccurate or corrupt records (data hygiene), and how
often the data is accessed and reused (see table).

Factors Affecting Business Value Density

Data Parameter    High BVD    Low BVD
Age               Recent      Older
Form              Modeled     Raw
Hygiene           Clean       Raw
Access            Frequent    Rare
Reuse             Frequent    Rare

Before the big data revolution, organizations established
clear guidelines to determine what data would be
captured and how long it would be retained. As a result,
only the dense data (high BVD) was retained. Lower-BVD
data was discarded, compounded by the absence of
identified use cases and tools to exploit it.
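To make the ratio concrete, here is a toy calculation (our own proxy, not a metric defined by Teradata) that compares BVD across two hypothetical datasets using queries served per gigabyte as a crude stand-in for business relevance per gigabyte:

```python
# Hypothetical datasets: (name, size in GB, queries served per month).
# Queries per GB is an assumed proxy for "business relevance per
# gigabyte" -- for illustration only.
datasets = [
    ("cleansed_orders_90d", 500, 120_000),
    ("raw_weblogs_5y", 80_000, 4_000),
]

for name, size_gb, monthly_queries in datasets:
    bvd_proxy = monthly_queries / size_gb
    print(f"{name}: {bvd_proxy:.2f} queries/GB")
# cleansed_orders_90d: 240.00 queries/GB  (high BVD)
# raw_weblogs_5y: 0.05 queries/GB         (low BVD, still valuable)
```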
The big data movement has brought a fundamental shift
in data capture, retention, and processing philosophies.
Declining storage costs and file-based data capture
and processing now allow enterprises to capture and
retain most, if not all, of the information generated by
business activities. Why capture so much lower-BVD
data? Because low BVD does not mean no value. In fact,
many organizations are discovering that sparse data
that was routinely discarded not so long ago now holds
tremendous potential business value—but only if it can be
accessed efficiently.
To illustrate the concept of BVD, consider a dataset made
up of cleansed and packaged online order information for
a given time period such as the previous three months.
This dataset is relatively small and yet highly valuable to
business users in operations, marketing, finance, and other
functional areas. This order data is considered to have
high BVD; in other words, it contains a high level of useful
business insights per gigabyte.
In contrast, imagine capturing Web log data representing
every click on the company’s Web site over the past
five years. Compared to the order data described
previously, this dataset is significantly larger. While there
is potentially a treasure trove of business insights within
this dataset, the number of people and applications
interrogating it in its raw form would be far smaller than
for the dataset of cleansed and packaged orders. So this
raw Web site data has sparse BVD but is still highly
valuable.
Stable and evolving schemas
The ability to handle evolving schemas is an important
capability. In contrast to stable schemas that change
slowly (e.g., order records and product information),
evolving schemas change continually—think of new
columns being added frequently, as with Web log data
(Figure 3).

[Figure 2. Business value density. Data volume is represented by the thickness of the circle, greatest at point A and decreasing counterclockwise; BVD is lowest at point A and increases around the circle. Sparse data is shown as light rows with a few dark blue squares at point A; dense data as darker blue rows at point B.]
All data has structure. Instead of the oft-used (and
misused) terms structured, semi-structured, and
unstructured, the more useful concepts are stable and
evolving schemas. For example, even though XML and
JSON formats are often classified as semi-structured,
the schema for an individual event such as order checkout can be highly stable over long periods of time. As
a result, this information can be easily accessed using
standard ETL (extract, transform, and load) tools with
little maintenance overhead. Conversely, XML and JSON
formats frequently—and unexpectedly, from the viewpoint
of a data platform engineer—capture a new event type
such as “hovered over a particular image with pointer.”
This scenario describes evolving schema, which is
particularly challenging for traditional relational tools.
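A minimal Python sketch (hypothetical event records, ours rather than the paper's) of why evolving schemas strain relational tooling: a new event type with new fields appears without warning, and file-oriented code simply carries the extra keys along, whereas an RDBMS would need a schema change and ETL rework.

```python
import json

# Day 1: checkout events with a stable, well-known shape.
# Day 2: a new "hover" event type appears, unannounced, with new fields.
events = [
    '{"type": "checkout", "user": 42, "total": 59.90}',
    '{"type": "hover", "user": 7, "image_id": "dress-red-6", "ms": 1800}',
]

for raw in events:
    event = json.loads(raw)
    # Schema-on-read: handle the fields we know, keep the rest intact.
    known = {"type", "user"}
    extras = {k: v for k, v in event.items() if k not in known}
    print(event["type"], event["user"], extras)
```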
No-schema data
As noted previously, all data has structure and therefore
what is frequently seen as unstructured data should
be reclassified as no-schema data (Figure 4). What's
interesting about no-schema data is that it often delivers
analytical value in unanticipated ways. In fact,
a skilled data scientist can draw substantial insights from
no-schema data. Here are two real-life scenarios:
•	An online retailer is boosting revenue through image analysis. In a typical case, a merchant is marketing a red dress and supplies the search terms size 6 and Ralph Lauren along with an image of the dress itself. Using sophisticated image-analysis software, the retailer can, with a high degree of confidence, attach additional descriptors such as A-line and cardinal red, which makes searching more accurate, benefiting both merchants and buyers.
•	An innovative insurance company is using audio recordings of phone conversations between customer service representatives and policyholders to determine the likelihood of a fraudulent claim based on signals derived from voice inflections.
In both examples, the companies had made the decision to
capture the data before they had a complete idea of how
to use it. Business users developed the innovative uses
after they had become familiar with the data structure and
had access to tools to extract the hidden value.
[Figure 3. Stable and evolving schemas. Stable-schema data is the blue section of the band; note that the areas of high BVD are composed entirely of stable schemas. Evolving-schema data is the gray section; while much of the data volume corresponds to evolving schemas, the BVD is fairly low compared to the stable schemas.]

[Figure 4. No-schema data: the magenta band between the evolving and stable schemas.]
Usage and query volume
By definition, there is a strong correlation between BVD
and usage volume. For example, if a company captures
100 petabytes of data, 80 percent of all queries would be
addressed to just 20 petabytes—the high BVD portion of
the dataset (Figure 5).
Usage volume includes two primary access methods: ad-hoc and scheduled queries. Ad-hoc queries are usually
initiated by the person who needs the information using
SQL interfaces, analytical tools, and business applications.
Scheduled queries are set up and monitored by business
analysts or data platform engineers. Applicable tools
include SQL interfaces for regularly scheduled reports, automated business applications, and low-level programming
scripts for scheduled analytics and data transformations.
A significant and growing portion of usage volume is due
to applications such as campaign management, ad serving,
search, and supply chain management that depend on
insights from the data to drive more intelligent decisions.
[Figure 5. Usage and query volume. The amplitude of the outside spirals indicates usage volume; note the strong correlation between BVD and usage volume. Cross-functional reuse is shown in three colors, representing the percentage of the data reused by groups such as marketing, customer service, and finance; these groups typically need access to the same high-BVD data, such as recent orders.]
RDBMS or Hadoop
Building on the core concepts of BVD; query volume; and
stable, evolving, and no-schema data, we can draw a line
showing which data is most appropriate for an RDBMS or
Hadoop and give some background about that particular
placement.
In general, the higher the BVD, the more it makes sense
to use relational techniques; decreasing BVD indicates
that Hadoop may be the best choice. While the graphic
(Figure 6) draws the line arbitrarily through the equator,
every organization will have its own threshold based
on its information culture and maturity. Also note that
no-schema data resides solely within Hadoop because
relational constructs are often less-suited for managing
this type of data.
RDBMS technology has clear advantages over Hadoop in
terms of response time, throughput, and security, which
make it more appropriate for higher BVD data that has
greater concurrency and more security requirements
given the shared nature of the data.
These differentiators are due to the following:
•	Mature cost-based optimizers—When a query is submitted, the optimizer evaluates various execution plans and estimates the resource consumption for each. The optimizer then selects the plan that minimizes resource usage and thus maximizes throughput. (A toy illustration follows this list.)
•	Indexing—RDBMS software has a multitude of robust indexes with stored statistics to facilitate access, thus shortening response times.
•	Advanced partitioning—Today's RDBMS products feature a number of advanced partitioning methods and criteria to optimize database performance and improve manageability.
•	Workload management—RDBMS technology addresses the throughput problem that occurs when many queries are executing concurrently. The workload manager prioritizes the query queue so that short queries are executed quickly and long queries receive adequate resources to avoid excessively long execution times. Filters and throttles regulate database activity by rejecting or limiting requests. (A filter causes specific logon and query requests to be rejected, while a throttle limits the number of active sessions, query requests, or load utilities on the database.)
•	Extensive security features—Relational databases offer sophisticated row- and column-level security, which enables role-based security. They also include fine-grain security features such as authentication options, security roles, directory integration, and encryption, whereas Hadoop's equivalents are more coarse-grained.
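As a toy illustration of cost-based optimization (our own sketch, not any vendor's optimizer), the essential idea is to estimate a cost for each candidate execution plan from stored statistics and pick the cheapest; the statistics and cost weights below are invented for illustration:

```python
# Hypothetical statistics and candidate plans for one query.
stats = {"orders_rows": 10_000_000, "index_selectivity": 0.001}

candidate_plans = {
    # A full scan touches every row.
    "full_table_scan": stats["orders_rows"] * 1.0,
    # An index lookup touches only the selected fraction,
    # with an assumed per-probe overhead factor of 3.
    "index_lookup": stats["orders_rows"] * stats["index_selectivity"] * 3.0,
}

# Select the plan with the lowest estimated resource consumption.
best = min(candidate_plans, key=candidate_plans.get)
print(best, candidate_plans[best])  # index_lookup 30000.0
```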
Cost factors
Along with technological capabilities, cost drives the
design of the enterprise data architecture. The Teradata
Unified Data Architecture™ rates the relative cost of use
cases using a four-factor cost analysis:
•	Hardware and software investment—The costs associated with the acquisition of the hardware and software.
•	Development and maintenance—The ongoing cost of acquiring data and packaging it for consumption, as well as the costs of implementing systemwide changes such as software upgrades and changes to code and scripts running in the environment.
•	Usage—The costs of querying and analyzing the data to derive actionable insights, based primarily on market compensation for required skills, the time to author and alter scripts and code, and wait time as it relates to productivity. These costs are often spread across multiple departments and budgets and therefore go unnoticed; however, they are very real for business initiatives that leverage data and analytics for strategic advantage.
•	Resource consumption—The extent to which the CPU, I/O, and disk resources are utilized over time. When system resources are close to full utilization, the organization is achieving the maximum value for its investment in hardware, and resource consumption costs are low; underutilized systems waste resources and drive up costs without adding value, and therefore rate medium or high.
[Figure 6. The RDBMS-Hadoop partition. A horizontal line partitions the BVD space between high-BVD data that can be effectively managed with an RDBMS and low-BVD data that is best suited to Hadoop; the partitioning point (the intersection of the line and the data curve) is unique to each organization and may change over time. Two arcs within the data circles represent key RDBMS advantages: fast response times/throughput and fine-grain security.]
Use Case Overview
While there are a large number of possible data scenarios
in the enterprise world today, the majority fall into these
four use cases:
•	Integrated data warehouse—Provides an unambiguous view of information for timely and accurate decision making
•	Interactive discovery—Addresses the challenge of exploring large datasets with less-defined or evolving schemas
•	Batch data processing—Transforms data and performs analytics against larger datasets when storage costs are valued over interactive response times and throughput
•	General-purpose file system—Ingests and stores raw data with no transformation, making this use case an economical online archive for the lowest-BVD data
Each use case is described in more detail in the
following sections.
Integrated data warehouse
The association of the relational database and big data
occurs in the integrated data warehouse (Figure 7). The
integrated data warehouse is the overwhelming choice
for the important data that drives organizational
decision-making, where a single, accurate, timely, and
unambiguous version of the information is required.
The integrated data warehouse uses a well-defined
schema to offer a single view of the business to enable
easy data access and ensure consistent results across
the entire enterprise. It also provides a shared source
for analytics across multiple departments within the
enterprise. Data is loaded once and used many times
without the need for the user to repeatedly define and
execute agreed-upon transformation rules such as the
definitions of customer, order, and lifetime value score.
The integrated data warehouse supports ANSI SQL as
well as many mature third-party applications. Information
in the integrated data warehouse is scalable and can be
accessed by knowledge workers and business analysts
across the enterprise.
The integrated data warehouse is the tried-and-true gold
standard for high-BVD data, supporting cross-functional
reuse and the largest number of business users with
a full set of features and benefits unmatched by other
approaches to data management.
[Figure 7. Integrated data warehouse. Characteristics: single view of your business; shared source for analytics; load once, use many times; SQL and third-party applications; knowledge workers and analysts.]
Cost analysis
•	Hardware and software investment: High—Software development for the commercial engineering effort required to deliver the differentiated benefits described previously, as well as an optimized, integrated hardware platform, warrants substantial initial investment.
•	Development and maintenance expense: Medium—Realizing the maximum benefit of clean, integrated, easy-to-consume information requires data modeling and ETL operations, which drive up development costs. However, the productivity tools and people skills for developing and maintaining a relational environment are readily available in the marketplace, mitigating the development costs. Also, the data warehouse has diminishing incremental development costs because it builds on existing data and transformation rules and facilitates data reuse.
•	Usage expense: Low—Users can navigate the enterprise data and create complex queries in SQL that return results quickly, minimizing the need for expensive programmers and reducing unproductive wait times. This benefit is a result of the costs incurred in development and maintenance as described previously.
•	Resource consumption: Low—Tight vertical integration across the stack enables optimal utilization of system CPU and I/O resources, so that the maximum amount of throughput can be achieved within an environment bounded by CPU and I/O.
Interactive discovery
Interactive discovery platforms address the challenge of
exploring large datasets with less-defined or evolving
schemas by adapting methodologies that originate from
the Hadoop ecosystem within an RDBMS (Figure 8). Some
of the inherent advantages of the RDBMS technology are
particularly fast response times and throughput, as well
as the ease of use stemming from ANSI SQL compliance.
Interactive discovery requires less time spent on data
governance, data quality, and data integrity because
users are looking for new insights in advance of the
rigor required for more formal actioning of the data and
insights. The fast response times enable accelerated insight
discovery, and the ANSI SQL interface democratizes the
data across the widest possible user base.
This approach combines schema-on-read, MapReduce,
and flexible programming languages with RDBMS features
such as ANSI SQL support, low latency, fine-grain security,
data quality, and reliability. Interactive discovery has
cost and flexibility advantages over the integrated data
warehouse, but at the expense of concurrency (usage
volume) and governance control.
[Figure 8. Interactive discovery. Characteristics: accommodates both stable and evolving schemas; does not require extensive data modeling; SQL, NoSQL, MapReduce, and statistical functions; prepackaged analytic modules; analysts and data scientists.]
A key reason to use interactive discovery is analytical
flexibility (also applicable to Hadoop), which is based on
these features:

•	Schema-on-read—Structure is imposed when the data is read, unlike the schema-on-write approach of the integrated data warehouse. This feature allows complete freedom to transform and manipulate the data at a later time. The use cases in the Hadoop hemisphere also use schema-on-read techniques.
•	Low-level programming—Languages such as Java and Python can be used to construct complex queries and even perform row-over-row comparisons, both of which are extremely challenging with SQL. This kind of processing is most often needed for analyses built on row-over-row comparisons, such as time-series and pathing analysis (see the sketch after this list).
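A minimal sketch of the row-over-row processing the second bullet describes (the data and column layout are hypothetical): computing the gap between each user's consecutive events, a building block of time-series and pathing analysis that classic ANSI SQL without window functions would need awkward self-joins to express.

```python
from itertools import groupby

# Hypothetical (user, timestamp_seconds, page) rows,
# already sorted by user, then time.
rows = [
    (7, 100, "home"), (7, 130, "search"), (7, 600, "product"),
    (42, 50, "home"), (42, 65, "cart"),
]

for user, user_rows in groupby(rows, key=lambda r: r[0]):
    prev = None
    for _, ts, page in user_rows:
        gap = ts - prev if prev is not None else None  # row-over-row comparison
        print(user, page, gap)
        prev = ts
```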
Interactive discovery accommodates both stable and
evolving schemas without extensive data modeling.
It leverages SQL, NoSQL, MapReduce, and statistical
functions in a single analytical process and incorporates
prepackaged analytical modules. NoSQL and MapReduce
are particularly useful for analyses such as time series
and social graph that require complex processing
beyond the capabilities of ANSI SQL. As a result of the
ANSI SQL compliance and a myriad of prebuilt MapReduce
analytical functions that can be incorporated into an ANSI
SQL script, data scientists as well as business analysts can
use interactive discovery without additional training.

Cost analysis
•	Hardware and software investment: Medium—Interactive discovery platforms are less expensive than the integrated data warehouse.
•	Development and maintenance: Low—Interactive discovery uses light modeling techniques, which minimize efforts for ETL and data modeling.
•	Usage: Low—SQL is easy to use, reducing the user time required to generate queries. Built-in analytical functions reduce hundreds of lines of code to single statements. The performance characteristics of an RDBMS reduce unproductive wait times.
•	Resource consumption: Low—Commercial RDBMS software is optimized for efficient utilization of resources.

Batch data processing
Unlike the integrated data warehouse and interactive
discovery platforms, batch data processing lies within the
Hadoop sphere (Figure 9). A key difference between
batch data processing and interactive discovery is that
batch processing involves no physical data movement
as part of the transformation into a more usable model.
Light data modeling is applied against the raw data files
to facilitate more intuitive usage. The nature of the file
system and the ability to flexibly manipulate data make
batch processing an ideal environment for refining,
transforming, and cleansing data, as well as performing
analytics against larger datasets when storage costs are
valued over fast response times and throughput.

Since the underlying data is raw, the task of transforming
the data must be performed when the query is processed.
This is immensely valuable in that it provides a high
degree of flexibility for the user.

Batch processing incorporates a wide range of declarative
language processing using Pig, Hive, and other emerging
access tools in the Hadoop ecosystem. These tools are
especially valuable for analyzing low-BVD data when query
response time is not as critical, the logic applied to the
data is complex, and full scans of the data are required—for
example, sessionizing Web log data, counting events, and
executing complex algorithms. This approach is ideal for
analysts, developers, and data scientists.
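To make "sessionizing Web log data" concrete, here is a minimal Python sketch (the record layout and the 30-minute inactivity timeout are assumptions). In production, this logic would typically be expressed as a Pig or Hive job, or a MapReduce program keyed by user.

```python
SESSION_TIMEOUT = 30 * 60  # assumed 30-minute inactivity cutoff, in seconds

# Hypothetical clickstream: (user, timestamp_seconds), sorted by user, then time.
clicks = [(7, 0), (7, 900), (7, 900 + 3600), (42, 100)]

sessions = {}   # user -> current session number
last_seen = {}  # user -> timestamp of that user's previous click
for user, ts in clicks:
    # Start a new session on first sight or after a long gap.
    if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT:
        sessions[user] = sessions.get(user, 0) + 1
    last_seen[user] = ts
    print(f"user={user} ts={ts} session={user}-{sessions[user]}")
```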
Cost analysis
•	Hardware and software investment: Low—Batch processing is available through open source software and runs on commodity hardware.
•	Development and maintenance: Medium—The skills required to do development and maintain the Hadoop environment are relatively scarce in the marketplace, driving up labor costs. Optimizing code in the environment is primarily a burden on the development team.
•	Usage: Medium—Unlike the previous use cases that are accessible to SQL users, batch processing requires new skills for authoring queries and is not compatible with the full breadth of features and functionality found in modern business intelligence tools. In addition, query run times are longer, resulting in wait times that lower productivity.
•	Resource consumption: High—In general, Hadoop software makes less efficient use of hardware resources than RDBMS.

[Figure 9. Batch data processing. Characteristics: no transformations of data required; scripting and declarative languages; analysis against raw files; refinement, transformation, and cleansing; analysts and data scientists.]

General-purpose file system
As used in this context, the general-purpose file system
refers to the Hadoop Distributed File System (HDFS) and
flexible programming languages (Figure 10). Raw data
is ingested and stored with no transformation, making
this use case an economical online archive for the lowest-BVD
data. Hadoop allows data scientists and engineers to
apply flexible low-level programming languages such as
Java, Python, and C++ against the largest datasets without
any up-front characterization of the data.

Cost analysis
•	Hardware and software investment: Low—Like batch processing, this approach benefits from open source software and commodity hardware.
•	Development and maintenance: High—Working effectively in this environment requires not only proficiency with low-level programming languages but also a working understanding of Linux and the network configuration. The lack of mature development tools and applications and the premium salaries demanded by skilled scientists and engineers all contribute to costs.

[Figure 10. General-purpose file system. Characteristics: flexible programming languages (Java, Python, C++, etc.); economic online archive; land/source operational data; data scientists and engineers.]
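As a closing illustration of the landing/archive role, here is a minimal sketch of ingesting raw files into HDFS with no transformation, assuming a configured Hadoop client is on the path (the file name and date-partitioned layout are hypothetical):

```python
import subprocess
from datetime import date

# Land today's raw extract under a date-partitioned archive path, untouched.
local_file = "weblogs.gz"  # hypothetical raw extract
hdfs_dir = f"/landing/weblogs/{date.today():%Y/%m/%d}"

# Standard HDFS shell commands: create the directory, then copy the file in.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)
```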