Optimize the Business Value of All Your Enterprise Data

An integrated approach that incorporates massively parallel processing (MPP) relational databases and Apache Hadoop to provide a framework for the enterprise data architecture.

White Paper
By Chad Meley, Director of eCommerce & Digital Media
EB-7873 / 10.13
Executive Summary
Few industries have evolved as quickly as data
processing, thanks to the effect of Moore’s Law coupled
with Silicon Valley–style software innovation. So it comes
as no surprise that innovations in data analysis have led
to new data, new tools, and new demands to remain
competitive. Market leaders in many industries are
adopting these new capabilities, fast followers are on their
heels, and the mainstream is not far behind.
This renaissance has affected the data warehouse in
powerful ways. In the 1990s and early 2000s, the
massively parallel processing (MPP) relational data
warehouse was the only proven and scalable place to
hold corporate memory. In the late 2000s, an explosion
of new data types and enabling technologies led some
to claim the demise of the traditional data warehouse.
A more pragmatic view has emerged recently: a
one-size-fits-all approach—whether a traditional data
warehouse or Apache™ Hadoop®—is insufficient by itself
in a time when datasets and usage patterns vary widely.
Technology advances have expanded the options to
include permutations of the data warehouse in what is
referred to as built-for-purpose solutions.
Yet even seasoned practitioners who embrace
multiplatform data environments still struggle to decide
which technology is the best choice for each use case. By
analogy, consider the transformations that have occurred
in moving physical goods around the world in the past
century—first cargo ships, then rail and trucks, and finally
airplanes. Because of our familiarity with these modes,
we know intrinsically what use cases are best for each
transportation option, and nobody questions the need for
all of them to exist within a global logistics framework.
Knowing the value propositions and economics for each,
it would be foolish for someone to say “Why would
anyone ever use an airplane to ship goods when rail is a
fraction of the cost per pound?” Or “Why would I ever
consider using a cargo ship to move oil when I can get it
to market faster using air?”
But the best fit for data platform technologies is not as
universally understood at this time.
This paper will not bring instant clarity to this complex subject; rather, the intent is to define a framework of capabilities and costs for various options to encourage informed
dialogue that will accelerate more comprehensive understanding in the industry.
2
White Paper
10.13
eb 7873
Teradata has defined the Teradata® Unified Data
Architecture™, a solution that allows the analytics renaissance to flourish while controlling costs and discovering new
analytics. As guideposts in this expansion, we have identified
workloads that fit into built-for-purpose zones of activity:
•	Integrated data warehouse
•	Interactive discovery
•	Batch data processing
•	General-purpose file system
By making use of this array of analytical environments, companies can extract significant value from a broader range
of data—much of which would have been discarded just a
few years ago. As a result, business users can solve more
high-value business problems, achieve greater operational
efficiencies, and execute faster on strategic initiatives.
While the big data landscape is spawning new and
innovative products at an astonishing pace, a great
deal of attention continues to be focused on one of
the seminal technologies that launched the big data
analytics expansion: Hadoop. An open source software
framework that supports the processing of large datasets
in a distributed applications environment, Hadoop
uses parallelism over raw files through its MapReduce
framework. It has the momentum and community support
that make it the most likely of the new breed of data
technologies to become the dominant enterprise standard
in its space.
The Teradata Unified Data Architecture
Teradata offers a hybrid enterprise data architecture that
integrates Hadoop and massively parallel processing (MPP)
relational database management systems (RDBMS). Known
as the Teradata Unified Data Architecture™, this solution
relies on input from Teradata subject-matter experts and
Teradata customers who are experienced practitioners
with both Hadoop and traditional data warehousing. This
architecture has also been validated with leading industry
analysts and provides a strong foundation for designing
next-generation enterprise data architectures.
The essence of the Teradata Unified Data Architecture™ is
captured in a comprehensive infographic that is intended
to be a reference for database architects and strategic
planners as they develop their next-generation enterprise
data architectures (Figure 1). The graphic, along with
the more detailed explanations in this paper, provides
objective criteria for deciding which technology is best
suited to particular needs within the organization.

[Figure 1. The Teradata Unified Data Architecture: a reference infographic that maps the four built-for-purpose zones (integrated data warehouse, interactive discovery, batch data processing, and general-purpose file system) against business value density, schema stability, query volume, and four-factor cost ratings.]
To provide a framework for understanding the use cases,
the following sections describe a number of important
concepts such as business value density (BVD), stable
and evolving schemas, and query and data volumes.
There are many different concepts that interplay within
the graphic, so it is broken down in a logical order.
One of the most important concepts for understanding
the Teradata Unified Data Architecture™ is BVD, defined
as the amount of business relevance per gigabyte of data
(Figure 2). Put another way: how many business insights
can be extracted from a given amount of data? A number
of factors influence BVD, including when the data was
captured, the amount of detail in the data, the percentage
of inaccurate or corrupt records (data hygiene), and how
often the data is accessed and reused (see table).

Factors Affecting Business Value Density

Data Parameter    High BVD    Low BVD
Age               Recent      Older
Form              Modeled     Raw
Hygiene           Clean       Raw
Access            Frequent    Rare
Reuse             Frequent    Rare

Before the big data revolution, organizations established
clear guidelines to determine what data would be
captured and how long it would be retained. As a result,
only the dense data (high BVD) was retained. Lower-BVD
data was discarded, compounded by the absence of
identified use cases and tools to exploit it.
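To make the ratio concrete, here is a toy calculation (our own proxy, not a metric defined by Teradata) that compares BVD across two hypothetical datasets using queries served per gigabyte as a crude stand-in for business relevance per gigabyte:

```python
# Hypothetical datasets: (name, size in GB, queries served per month).
# Queries per GB is an assumed proxy for "business relevance per
# gigabyte" -- for illustration only.
datasets = [
    ("cleansed_orders_90d", 500, 120_000),
    ("raw_weblogs_5y", 80_000, 4_000),
]

for name, size_gb, monthly_queries in datasets:
    bvd_proxy = monthly_queries / size_gb
    print(f"{name}: {bvd_proxy:.2f} queries/GB")
# cleansed_orders_90d: 240.00 queries/GB  (high BVD)
# raw_weblogs_5y: 0.05 queries/GB         (low BVD, still valuable)
```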
The big data movement has brought a fundamental shift
in data capture, retention, and processing philosophies.
Declining storage costs and file-based data capture
and processing now allow enterprises to capture and
retain most, if not all, of the information generated by
business activities. Why capture so much lower-BVD
data? Because low BVD does not mean no value. In fact,
many organizations are discovering that sparse data
that was routinely discarded not so long ago now holds
tremendous potential business value—but only if it can be
accessed efficiently.
To illustrate the concept of BVD, consider a dataset made
up of cleansed and packaged online order information for
a given time period such as the previous three months.
This dataset is relatively small and yet highly valuable to
business users in operations, marketing, finance, and other
functional areas. This order data is considered to have
high BVD; in other words, it contains a high level of useful
business insights per gigabyte.
In contrast, imagine capturing Web log data representing
every click on the company’s Web site over the past
five years. Compared to the order data described
previously, this dataset is significantly larger. While there
is potentially a treasure trove of business insights within
this dataset, the number of people and applications
interrogating it in its raw form would be far smaller than
for the dataset of cleansed and packaged orders. So this
raw Web site data has sparse BVD but is still highly
valuable.
Stable and evolving schemas
The ability to handle evolving schemas is an important
capability. In contrast to stable schemas that change
slowly (e.g., order records and product information),
evolving schemas change continually—think of new
columns being added frequently, as with Web log data
(Figure 3).

[Figure 2. Business value density. Data volume is represented by the thickness of the circle, greatest at point A and decreasing counterclockwise; BVD is lowest at point A and increases around the circle. Sparse data is shown as light rows with a few dark blue squares at point A; dense data as darker blue rows at point B.]
All data has structure. Instead of the oft-used (and
misused) terms structured, semi-structured, and
unstructured, the more useful concepts are stable and
evolving schemas. For example, even though XML and
JSON formats are often classified as semi-structured,
the schema for an individual event such as order checkout can be highly stable over long periods of time. As
a result, this information can be easily accessed using
standard ETL (extract, transform, and load) tools with
little maintenance overhead. Conversely, XML and JSON
formats frequently—and unexpectedly, from the viewpoint
of a data platform engineer—capture a new event type
such as “hovered over a particular image with pointer.”
This scenario describes evolving schema, which is
particularly challenging for traditional relational tools.
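A minimal Python sketch (hypothetical event records, ours rather than the paper's) of why evolving schemas strain relational tooling: a new event type with new fields appears without warning, and file-oriented code simply carries the extra keys along, whereas an RDBMS would need a schema change and ETL rework.

```python
import json

# Day 1: checkout events with a stable, well-known shape.
# Day 2: a new "hover" event type appears, unannounced, with new fields.
events = [
    '{"type": "checkout", "user": 42, "total": 59.90}',
    '{"type": "hover", "user": 7, "image_id": "dress-red-6", "ms": 1800}',
]

for raw in events:
    event = json.loads(raw)
    # Schema-on-read: handle the fields we know, keep the rest intact.
    known = {"type", "user"}
    extras = {k: v for k, v in event.items() if k not in known}
    print(event["type"], event["user"], extras)
```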
No-schema data
As noted previously, all data has structure and therefore
what is frequently seen as unstructured data should
be reclassified as no-schema data (Figure 4). What's
interesting about no-schema data is that it often delivers
analytical value in unanticipated ways. In fact,
a skilled data scientist can draw substantial insights from
no-schema data. Here are two real-life scenarios:
•	An online retailer is boosting revenue through image analysis. In a typical case, a merchant is marketing a red dress and supplies the search terms size 6 and Ralph Lauren along with an image of the dress itself. Using sophisticated image-analysis software, the retailer can, with a high degree of confidence, attach additional descriptors such as A-line and cardinal red, which makes searching more accurate, benefiting both merchants and buyers.
•	An innovative insurance company is using audio recordings of phone conversations between customer service representatives and policyholders to determine the likelihood of a fraudulent claim based on signals derived from voice inflections.
In both examples, the companies had made the decision to
capture the data before they had a complete idea of how
to use it. Business users developed the innovative uses
after they had become familiar with the data structure and
had access to tools to extract the hidden value.
[Figure 3. Stable and evolving schemas. Stable-schema data is the blue section of the band; note that the areas of high BVD are composed entirely of stable schemas. Evolving-schema data is the gray section; while much of the data volume corresponds to evolving schemas, the BVD is fairly low compared to the stable schemas.]

[Figure 4. No-schema data: the magenta band between the evolving and stable schemas.]
Usage and query volume
By definition, there is a strong correlation between BVD
and usage volume. For example, if a company captures
100 petabytes of data, 80 percent of all queries would be
addressed to just 20 petabytes—the high BVD portion of
the dataset (Figure 5).
Usage volume includes two primary access methods: ad-hoc and scheduled queries. Ad-hoc queries are usually
initiated by the person who needs the information using
SQL interfaces, analytical tools, and business applications.
Scheduled queries are set up and monitored by business
analysts or data platform engineers. Applicable tools
include SQL interfaces for regularly scheduled reports, automated business applications, and low-level programming
scripts for scheduled analytics and data transformations.
A significant and growing portion of usage volume is due
to applications such as campaign management, ad serving,
search, and supply chain management that depend on
insights from the data to drive more intelligent decisions.
[Figure 5. Usage and query volume. The amplitude of the outside spirals indicates usage volume; note the strong correlation between BVD and usage volume. Cross-functional reuse is shown in three colors, representing the percentage of the data reused by groups such as marketing, customer service, and finance; these groups typically need access to the same high-BVD data, such as recent orders.]
RDBMS or Hadoop
Building on the core concepts of BVD; query volume; and
stable, evolving, and no-schema data, we can draw a line
showing which data is most appropriate for an RDBMS or
Hadoop and give some background about that particular
placement.
In general, the higher the BVD, the more it makes sense
to use relational techniques; decreasing BVD indicates
that Hadoop may be the best choice. While the graphic
(Figure 6) draws the line arbitrarily through the equator,
every organization will have its own threshold based
on its information culture and maturity. Also note that
no-schema data resides solely within Hadoop because
relational constructs are often less-suited for managing
this type of data.
RDBMS technology has clear advantages over Hadoop in
terms of response time, throughput, and security, which
make it more appropriate for higher BVD data that has
greater concurrency and more security requirements
given the shared nature of the data.
These differentiators are due to the following:
•	Mature cost-based optimizers—When a query is submitted, the optimizer evaluates various execution plans and estimates the resource consumption for each. The optimizer then selects the plan that minimizes resource usage and thus maximizes throughput. (A toy illustration follows this list.)
•	Indexing—RDBMS software has a multitude of robust indexes with stored statistics to facilitate access, thus shortening response times.
•	Advanced partitioning—Today's RDBMS products feature a number of advanced partitioning methods and criteria to optimize database performance and improve manageability.
•	Workload management—RDBMS technology addresses the throughput problem that occurs when many queries are executing concurrently. The workload manager prioritizes the query queue so that short queries are executed quickly and long queries receive adequate resources to avoid excessively long execution times. Filters and throttles regulate database activity by rejecting or limiting requests. (A filter causes specific logon and query requests to be rejected, while a throttle limits the number of active sessions, query requests, or load utilities on the database.)
•	Extensive security features—Relational databases offer sophisticated row- and column-level security, which enables role-based security. They also include fine-grain security features such as authentication options, security roles, directory integration, and encryption, whereas Hadoop's equivalents are more coarse-grained.
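As a toy illustration of cost-based optimization (our own sketch, not any vendor's optimizer), the essential idea is to estimate a cost for each candidate execution plan from stored statistics and pick the cheapest; the statistics and cost weights below are invented for illustration:

```python
# Hypothetical statistics and candidate plans for one query.
stats = {"orders_rows": 10_000_000, "index_selectivity": 0.001}

candidate_plans = {
    # A full scan touches every row.
    "full_table_scan": stats["orders_rows"] * 1.0,
    # An index lookup touches only the selected fraction,
    # with an assumed per-probe overhead factor of 3.
    "index_lookup": stats["orders_rows"] * stats["index_selectivity"] * 3.0,
}

# Select the plan with the lowest estimated resource consumption.
best = min(candidate_plans, key=candidate_plans.get)
print(best, candidate_plans[best])  # index_lookup 30000.0
```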
Cost factors
Along with technological capabilities, cost drives the
design of the enterprise data architecture. The Teradata
Unified Data Architecture™ rates the relative cost of use
cases using a four-factor cost analysis:
•	Hardware and software investment—The costs associated with the acquisition of the hardware and software.
•	Development and maintenance—The ongoing cost of acquiring data and packaging it for consumption, as well as the costs of implementing systemwide changes such as software upgrades and changes to code and scripts running in the environment.
•	Usage—The costs of querying and analyzing the data to derive actionable insights, based primarily on market compensation for required skills, the time to author and alter scripts and code, and wait time as it relates to productivity. These costs are often spread across multiple departments and budgets and therefore go unnoticed; however, they are very real for business initiatives that leverage data and analytics for strategic advantage.
•	Resource consumption—The extent to which the CPU, I/O, and disk resources are utilized over time. When system resources are close to full utilization, the organization is achieving the maximum value for its investment in hardware, and resource consumption costs are low; underutilized systems waste resources and drive up costs without adding value, and therefore rate medium or high.
[Figure 6. The RDBMS-Hadoop partition. A horizontal line partitions the BVD space between high-BVD data that can be effectively managed with an RDBMS and low-BVD data that is best suited to Hadoop; the partitioning point (the intersection of the line and the data curve) is unique to each organization and may change over time. Two arcs within the data circles represent key RDBMS advantages: fast response times/throughput and fine-grain security.]
Use Case Overview
While there are a large number of possible data scenarios
in the enterprise world today, the majority fall into these
four use cases:
•	Integrated data warehouse—Provides an unambiguous view of information for timely and accurate decision making
•	Interactive discovery—Addresses the challenge of exploring large datasets with less-defined or evolving schemas
•	Batch data processing—Transforms data and performs analytics against larger datasets when storage costs are valued over interactive response times and throughput
•	General-purpose file system—Ingests and stores raw data with no transformation, making this use case an economical online archive for the lowest-BVD data
Each use case is described in more detail in the
following sections.
Integrated data warehouse
The association of the relational database and big data
occurs in the integrated data warehouse (Figure 7). The
integrated data warehouse is the overwhelming choice
for the important data that drives organizational
decision-making, where a single, accurate, timely, and
unambiguous version of the information is required.
The integrated data warehouse uses a well-defined
schema to offer a single view of the business to enable
easy data access and ensure consistent results across
the entire enterprise. It also provides a shared source
for analytics across multiple departments within the
enterprise. Data is loaded once and used many times
without the need for the user to repeatedly define and
execute agreed-upon transformation rules such as the
definitions of customer, order, and lifetime value score.
The integrated data warehouse supports ANSI SQL as
well as many mature third-party applications. Information
in the integrated data warehouse is scalable and can be
accessed by knowledge workers and business analysts
across the enterprise.
The integrated data warehouse is the tried-and-true gold
standard for high-BVD data, supporting cross-functional
reuse and the largest number of business users with
a full set of features and benefits unmatched by other
approaches to data management.
[Figure 7. Integrated data warehouse. Characteristics: single view of your business; shared source for analytics; load once, use many times; SQL and third-party applications; knowledge workers and analysts.]
Cost analysis
•	Hardware and software investment: High—Software development for the commercial engineering effort required to deliver the differentiated benefits described previously, as well as an optimized, integrated hardware platform, warrants substantial initial investment.
•	Development and maintenance expense: Medium—Realizing the maximum benefit of clean, integrated, easy-to-consume information requires data modeling and ETL operations, which drive up development costs. However, the productivity tools and people skills for developing and maintaining a relational environment are readily available in the marketplace, mitigating the development costs. Also, the data warehouse has diminishing incremental development costs because it builds on existing data and transformation rules and facilitates data reuse.
•	Usage expense: Low—Users can navigate the enterprise data and create complex queries in SQL that return results quickly, minimizing the need for expensive programmers and reducing unproductive wait times. This benefit is a result of the costs incurred in development and maintenance as described previously.
•	Resource consumption: Low—Tight vertical integration across the stack enables optimal utilization of system CPU and I/O resources, so that the maximum amount of throughput can be achieved within an environment bounded by CPU and I/O.
Interactive discovery
Interactive discovery platforms address the challenge of
exploring large datasets with less-defined or evolving
schemas by adapting methodologies that originate from
the Hadoop ecosystem within an RDBMS (Figure 8). Some
of the inherent advantages of the RDBMS technology are
particularly fast response times and throughput, as well
as the ease of use stemming from ANSI SQL compliance.
Interactive discovery requires less time spent on data
governance, data quality, and data integrity because
users are looking for new insights in advance of the
rigor required for more formal actioning of the data and
insights. The fast response times enable accelerated insight
discovery, and the ANSI SQL interface democratizes the
data across the widest possible user base.
This approach combines schema-on-read, MapReduce,
and flexible programming languages with RDBMS features
such as ANSI SQL support, low latency, fine-grain security,
data quality, and reliability. Interactive discovery has
cost and flexibility advantages over the integrated data
warehouse, but at the expense of concurrency (usage
volume) and governance control.
[Figure 8. Interactive discovery. Characteristics: accommodates both stable and evolving schemas; does not require extensive data modeling; SQL, NoSQL, MapReduce, and statistical functions; prepackaged analytic modules; analysts and data scientists.]
A key reason to use interactive discovery is analytical
flexibility (also applicable to Hadoop), which is based on
these features:

•	Schema-on-read—Structure is imposed when the data is read, unlike the schema-on-write approach of the integrated data warehouse. This feature allows complete freedom to transform and manipulate the data at a later time. The use cases in the Hadoop hemisphere also use schema-on-read techniques.
•	Low-level programming—Languages such as Java and Python can be used to construct complex queries and even perform row-over-row comparisons, both of which are extremely challenging with SQL. This kind of processing is most often needed for analyses built on row-over-row comparisons, such as time-series and pathing analysis (see the sketch after this list).
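A minimal sketch of the row-over-row processing the second bullet describes (the data and column layout are hypothetical): computing the gap between each user's consecutive events, a building block of time-series and pathing analysis that classic ANSI SQL without window functions would need awkward self-joins to express.

```python
from itertools import groupby

# Hypothetical (user, timestamp_seconds, page) rows,
# already sorted by user, then time.
rows = [
    (7, 100, "home"), (7, 130, "search"), (7, 600, "product"),
    (42, 50, "home"), (42, 65, "cart"),
]

for user, user_rows in groupby(rows, key=lambda r: r[0]):
    prev = None
    for _, ts, page in user_rows:
        gap = ts - prev if prev is not None else None  # row-over-row comparison
        print(user, page, gap)
        prev = ts
```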
Interactive discovery accommodates both stable and
evolving schemas without extensive data modeling.
It leverages SQL, NoSQL, MapReduce, and statistical
functions in a single analytical process and incorporates
prepackaged analytical modules. NoSQL and MapReduce
are particularly useful for analyses such as time series
and social graph that require complex processing
beyond the capabilities of ANSI SQL. As a result of the
ANSI SQL compliance and a myriad of prebuilt MapReduce
analytical functions that can be incorporated into an ANSI
SQL script, data scientists as well as business analysts can
use interactive discovery without additional training.

Cost analysis
•	Hardware and software investment: Medium—Interactive discovery platforms are less expensive than the integrated data warehouse.
•	Development and maintenance: Low—Interactive discovery uses light modeling techniques, which minimize efforts for ETL and data modeling.
•	Usage: Low—SQL is easy to use, reducing the user time required to generate queries. Built-in analytical functions reduce hundreds of lines of code to single statements. The performance characteristics of an RDBMS reduce unproductive wait times.
•	Resource consumption: Low—Commercial RDBMS software is optimized for efficient utilization of resources.

Batch data processing
Unlike the integrated data warehouse and interactive
discovery platforms, batch data processing lies within the
Hadoop sphere (Figure 9). A key difference between
batch data processing and interactive discovery is that
batch processing involves no physical data movement
as part of the transformation into a more usable model.
Light data modeling is applied against the raw data files
to facilitate more intuitive usage. The nature of the file
system and the ability to flexibly manipulate data make
batch processing an ideal environment for refining,
transforming, and cleansing data, as well as performing
analytics against larger datasets when storage costs are
valued over fast response times and throughput.

Since the underlying data is raw, the task of transforming
the data must be performed when the query is processed.
This is immensely valuable in that it provides a high
degree of flexibility for the user.

Batch processing incorporates a wide range of declarative
language processing using Pig, Hive, and other emerging
access tools in the Hadoop ecosystem. These tools are
especially valuable for analyzing low-BVD data when query
response time is not as critical, the logic applied to the
data is complex, and full scans of the data are required—for
example, sessionizing Web log data, counting events, and
executing complex algorithms. This approach is ideal for
analysts, developers, and data scientists.
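To make "sessionizing Web log data" concrete, here is a minimal Python sketch (the record layout and the 30-minute inactivity timeout are assumptions). In production, this logic would typically be expressed as a Pig or Hive job, or a MapReduce program keyed by user.

```python
SESSION_TIMEOUT = 30 * 60  # assumed 30-minute inactivity cutoff, in seconds

# Hypothetical clickstream: (user, timestamp_seconds), sorted by user, then time.
clicks = [(7, 0), (7, 900), (7, 900 + 3600), (42, 100)]

sessions = {}   # user -> current session number
last_seen = {}  # user -> timestamp of that user's previous click
for user, ts in clicks:
    # Start a new session on first sight or after a long gap.
    if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT:
        sessions[user] = sessions.get(user, 0) + 1
    last_seen[user] = ts
    print(f"user={user} ts={ts} session={user}-{sessions[user]}")
```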
Cost analysis
•	Hardware and software investment: Low—Batch processing is available through open source software and runs on commodity hardware.
•	Development and maintenance: Medium—The skills required to do development and maintain the Hadoop environment are relatively scarce in the marketplace, driving up labor costs. Optimizing code in the environment is primarily a burden on the development team.
•	Usage: Medium—Unlike the previous use cases that are accessible to SQL users, batch processing requires new skills for authoring queries and is not compatible with the full breadth of features and functionality found in modern business intelligence tools. In addition, query run times are longer, resulting in wait times that lower productivity.
•	Resource consumption: High—In general, Hadoop software makes less efficient use of hardware resources than RDBMS.

[Figure 9. Batch data processing. Characteristics: no transformations of data required; scripting and declarative languages; analysis against raw files; refinement, transformation, and cleansing; analysts and data scientists.]

General-purpose file system
As used in this context, the general-purpose file system
refers to the Hadoop Distributed File System (HDFS) and
flexible programming languages (Figure 10). Raw data
is ingested and stored with no transformation, making
this use case an economical online archive for the lowest-BVD
data. Hadoop allows data scientists and engineers to
apply flexible low-level programming languages such as
Java, Python, and C++ against the largest datasets without
any up-front characterization of the data.

Cost analysis
•	Hardware and software investment: Low—Like batch processing, this approach benefits from open source software and commodity hardware.
•	Development and maintenance: High—Working effectively in this environment requires not only proficiency with low-level programming languages but also a working understanding of Linux and the network configuration. The lack of mature development tools and applications and the premium salaries demanded by skilled scientists and engineers all contribute to costs.

[Figure 10. General-purpose file system. Characteristics: flexible programming languages (Java, Python, C++, etc.); economic online archive; land/source operational data; data scientists and engineers.]
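As a closing illustration of the landing/archive role, here is a minimal sketch of ingesting raw files into HDFS with no transformation, assuming a configured Hadoop client is on the path (the file name and date-partitioned layout are hypothetical):

```python
import subprocess
from datetime import date

# Land today's raw extract under a date-partitioned archive path, untouched.
local_file = "weblogs.gz"  # hypothetical raw extract
hdfs_dir = f"/landing/weblogs/{date.today():%Y/%m/%d}"

# Standard HDFS shell commands: create the directory, then copy the file in.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)
```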