White Paper
10.13

Optimize the Business
Value of All Your
Enterprise Data
Integrated approach incorporates relational
databases and Apache Hadoop to provide a
framework for the enterprise data architecture
By Chad Meley
Director of eCommerce & Digital Media

eb 7873

Executive Summary
Few industries have evolved as quickly as data
processing, thanks to the effect of Moore’s Law coupled
with Silicon Valley–style software innovation. So it comes
as no surprise that innovations in data analysis have led
to new data, new tools, and new demands to remain
competitive. Market leaders in many industries are
adopting these new capabilities, fast followers are on their
heels, and the mainstream is not far behind.
This renaissance has affected the data warehouse in
powerful ways. In the 1990s and early 2000s, the
massively parallel processing (MPP) relational data
warehouse was the only proven and scalable place to
hold corporate memory. In the late 2000s, an explosion
of new data types and enabling technologies led some
to claim the demise of the traditional data warehouse.
A more pragmatic view has emerged recently: a
one-size-fits-all approach—whether a traditional data
warehouse or Apache™ Hadoop®—is insufficient by itself
in a time when datasets and usage patterns vary widely.
Technology advances have expanded the options to
include permutations of the data warehouse in what are
referred to as built-for-purpose solutions.
Yet even seasoned practitioners who embrace
multiplatform data environments still struggle to decide
which technology is the best choice for each use case. By
analogy, consider the transformations that have occurred
in moving physical goods around the world in the past
century—first cargo ships, then rail and trucks, and finally
airplanes. Because of our familiarity with these modes,
we know intrinsically what use cases are best for each
transportation option, and nobody questions the need for
all of them to exist within a global logistics framework.
Knowing the value propositions and economics for each,
it would be foolish for someone to say “Why would
anyone ever use an airplane to ship goods when rail is a
fraction of the cost per pound?” Or “Why would I ever
consider using a cargo ship to move oil when I can get it
to market faster using air?”
But the best fit for data platform technologies is not as
universally understood at this time.
This paper will not bring instant clarity to this complex subject; rather, the intent is to define a framework of capabilities and costs for various options to encourage informed
dialogue that will accelerate more comprehensive understanding in the industry.


Teradata has defined the Teradata® Unified Data
Architecture™, a solution that allows the analytics renaissance to flourish while controlling costs and discovering new
analytics. As guideposts in this expansion, we have identified
workloads that fit into built-for-purpose zones of activity:
~~Integrated data warehouse
~~Interactive discovery
~~Batch data processing
~~General-purpose file system
By making use of this array of analytical environments, companies can extract significant value from a broader range
of data—much of which would have been discarded just a
few years ago. As a result, business users can solve more
high-value business problems, achieve greater operational
efficiencies, and execute faster on strategic initiatives.
While the big data landscape is spawning new and
innovative products at an astonishing pace, a great
deal of attention continues to be focused on one of
the seminal technologies that launched the big data
analytics expansion: Hadoop. An open source software
framework that supports the processing of large datasets
in a distributed computing environment, Hadoop
applies parallelism over raw files through its MapReduce
framework. Its momentum and community support
make it the most likely, among a new breed of data
technologies, to eventually become the dominant
enterprise standard in its space.
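To make the MapReduce model concrete, here is a minimal sketch of the streaming-style pattern the framework parallelizes: a mapper that emits key–value pairs from raw log lines and a reducer that aggregates them. The file layout and field positions are hypothetical, and this is an illustration of the programming model rather than a production Hadoop job.

```python
import sys
from collections import defaultdict

def mapper(lines):
    """Emit (key, 1) for every raw log line; the key here is the
    hypothetical 'page' field in a tab-delimited clickstream record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:              # assumed layout: user_id, timestamp, page
            yield fields[2], 1

def reducer(pairs):
    """Sum the counts for each key; Hadoop performs this per key
    after shuffling mapper output across the cluster."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

if __name__ == "__main__":
    # Locally, pipe a raw file through both phases to see the idea:
    #   cat clicks.tsv | python this_script.py
    print(reducer(mapper(sys.stdin)))
```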

The Teradata Unified Data
Architecture
Teradata offers a hybrid enterprise data architecture that
integrates Hadoop and massively parallel processing (MPP)
relational database management systems (RDBMS). Known
as the Teradata Unified Data Architecture™, this solution
relies on input from Teradata subject-matter experts and
Teradata customers who are experienced practitioners
with both Hadoop and traditional data warehousing. This
architecture has also been validated with leading industry
analysts and provides a strong foundation for designing
next-generation enterprise data architectures.
The essence of the Teradata Unified Data Architecture™ is
captured in a comprehensive infographic that is intended
to be a reference for database architects and strategic
planners as they develop their next-generation enterprise
data architectures (Figure 1). The graphic, along with
the more detailed explanations in this paper, provides
objective criteria for deciding which technology is best
suited to particular needs within the organization.
To provide a framework for understanding the use cases,
the following sections describe a number of important
concepts such as business value density (BVD), stable
and evolving schemas, and query and data volumes.
There are many different concepts that interplay within
the graphic, so it is broken down in a logical order.

Business value density
One of the most important concepts for understanding
the Teradata Unified Data Architecture™ is BVD, defined
as the amount of business relevance per gigabyte of data
(Figure 2). Put another way, how many business insights
can be extracted for a given amount of data? There are
a number of factors that influence BVD, including when
the data was captured, the amount of detail in the data,
the percentage of inaccurate or corrupt records (data
hygiene), and how often the data is accessed and reused
(see table).

Factors Affecting Business Value Density

Data Parameter    High BVD    Low BVD
Age               Recent      Older
Form              Modeled     Raw
Hygiene           Clean       Raw
Access            Frequent    Rare
Reuse             Frequent    Rare
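The factors in the table can be read as a rough scoring rubric. The sketch below is not from the paper; it is a hypothetical illustration of how those qualitative factors might be turned into a comparable "relevance per gigabyte" number for two datasets, with all weights invented for the example.

```python
def bvd_score(relevance_weight, size_gb, hygiene=1.0, reuse=1.0):
    """Toy business-value-density score: business relevance per gigabyte,
    discounted by data hygiene and boosted by cross-functional reuse.
    The weights are illustrative, not a Teradata formula."""
    return relevance_weight * hygiene * reuse / size_gb

# Recent, modeled, clean, frequently reused order data (small but dense):
orders = bvd_score(relevance_weight=90, size_gb=200, hygiene=0.95, reuse=3.0)

# Five years of raw web logs (huge, raw, rarely touched):
weblogs = bvd_score(relevance_weight=60, size_gb=500_000, hygiene=0.6, reuse=1.0)

print(f"orders BVD ~ {orders:.3f}, web logs BVD ~ {weblogs:.6f}")
```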

Before the big data revolution, organizations established
clear guidelines to determine what data would be
captured and how long it would be retained. As a result,
only the dense data (high BVD) was retained. Lower-BVD
data was discarded, a practice compounded by the absence
of identified use cases and tools to exploit it.

The big data movement has brought a fundamental shift
in data capture, retention, and processing philosophies.
Declining storage costs and file-based data capture
and processing now allow enterprises to capture and
retain most, if not all, of the information generated by
business activities. Why capture so much lower-BVD
data? Because low BVD does not mean no value. In fact,
many organizations are discovering that sparse data
that was routinely discarded not so long ago now holds
tremendous potential business value—but only if it can be
accessed efficiently.

Figure 1. The Teradata Unified Data Architecture. (The full-page infographic maps the four built-for-purpose zones—integrated data warehouse, interactive discovery, batch data processing, and general-purpose file system—against business value density, query volume, data volume, and cross-functional reuse, and summarizes the relative hardware/software, development/maintenance, usage, and resource consumption costs of each; the per-zone characteristics and cost ratings are repeated with each use case below.)
To illustrate the concept of BVD, consider a dataset made
up of cleansed and packaged online order information for
a given time period such as the previous three months.
This dataset is relatively small and yet highly valuable to
business users in operations, marketing, finance, and other
functional areas. This order data is considered to have
high BVD; in other words, it contains a high level of useful
business insights per gigabyte.


In contrast, imagine capturing Web log data representing
every click on the company’s Web site over the past
five years. Compared to the order data described
previously, this dataset is significantly larger. While there
is potentially a treasure trove of business insights within
this dataset, the number of people and applications
interrogating it in its raw form is far smaller than for the
dataset made up of cleansed and packaged orders. So,
this raw Web site data has sparse BVD, but it is still highly
valuable.
Figure 2. Business value density. Legend:
~~Data volume—Represented by the thickness of the circle; greatest at point A and decreases counterclockwise around the circle
~~BVD—Lowest at point A and increases around the circle
~~Sparse/dense—Sparse data represented by light rows with a few dark blue squares at point A; dense data represented by darker blue rows at point B

Stable and evolving schemas
The ability to handle evolving schemas is an important
capability. In contrast to stable schemas that change
slowly (e.g., order records and product information),
evolving schemas change continually—think of new
columns being added frequently, for example, Web log
data (Figure 3).
All data has structure. Instead of the oft-used (and
misused) terms structured, semi-structured, and
unstructured, the more useful concepts are stable and
evolving schemas. For example, even though XML and
JSON formats are often classified as semi-structured,
the schema for an individual event such as order checkout can be highly stable over long periods of time. As
a result, this information can be easily accessed using
standard ETL (extract, transform, and load) tools with
little maintenance overhead. Conversely, XML and JSON
formats frequently—and unexpectedly, from the viewpoint
of a data platform engineer—capture a new event type
such as “hovered over a particular image with pointer.”
This scenario describes evolving schema, which is
particularly challenging for traditional relational tools.
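As a concrete, hypothetical illustration of why evolving schemas strain relational tooling, consider JSON clickstream events in which a new attribute, such as the "hover" event above, appears without warning. A fixed column list silently drops the new field, whereas a schema-flexible reader simply widens as it goes; the event names and fields below are invented for the example.

```python
import json

events = [
    '{"user": "u1", "event": "checkout", "order_id": 42}',
    '{"user": "u2", "event": "hover", "image_id": "red-dress-7"}',  # new, unplanned event type
]

STABLE_COLUMNS = ["user", "event", "order_id"]   # what the ETL job was built for

def load_fixed(raw):
    """Schema-on-write style: anything outside the agreed columns is lost."""
    rec = json.loads(raw)
    return {col: rec.get(col) for col in STABLE_COLUMNS}

def load_flexible(raw, schema):
    """Schema-on-read style: grow the observed schema as new keys arrive."""
    rec = json.loads(raw)
    schema.update(rec.keys())
    return rec

observed = set()
for e in events:
    print(load_fixed(e))              # 'image_id' silently disappears
    load_flexible(e, observed)
print("columns seen so far:", sorted(observed))
```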


No-schema data
As noted previously, all data has structure and therefore
what is frequently seen as unstructured data should
be reclassified as no-schema data (Figure 4). What’s
interesting about no-schema data with respect to analytics
is that it has analytical value in unanticipated ways. In fact,
a skilled data scientist can draw substantial insights from
no-schema data. Here are two real-life scenarios:
~~An online retailer is boosting revenue through image
analysis. In a typical case, a merchant is marketing
a red dress and supplies the search terms size 6 and
Ralph Lauren along with an image of the dress itself.
Using sophisticated image-analysis software, the
retailer can with a high degree of confidence attach
additional descriptors such as A-line and cardinal red,
which makes searching more accurate, benefiting both
merchants and buyers.
~~An innovative insurance company is using audio
recordings of phone conversations between customer
service representatives and policyholders to determine
the likelihood of a fraudulent claim based on signals
derived from voice inflections.


In both examples, the companies had made the decision to
capture the data before they had a complete idea of how
to use it. Business users developed the innovative uses
after they had become familiar with the data structure and
had access to tools to extract the hidden value.
Figure 3. Stable and evolving schemas. Legend:
~~Stable schema data—The blue section of the band. Note that the areas of high BVD are composed entirely of stable schemas.
~~Evolving schema data—The gray section of the band. While much of the data volume corresponds to evolving schemas, the BVD is fairly low compared to the stable schemas.

Figure 4. No-schema data. Legend:
~~No-schema data—The magenta band between the evolving and stable schemas.

Usage and query volume
By definition, there is a strong correlation between BVD
and usage volume. For example, if a company captures
100 petabytes of data, 80 percent of all queries would be
addressed to just 20 petabytes—the high BVD portion of
the dataset (Figure 5).
Usage volume includes two primary access methods: ad-hoc and scheduled queries. Ad-hoc queries are usually
initiated by the person who needs the information using
SQL interfaces, analytical tools, and business applications.
Scheduled queries are set up and monitored by business
analysts or data platform engineers. Applicable tools
include SQL interfaces for regularly scheduled reports, automated business applications, and low-level programming
scripts for scheduled analytics and data transformations.
A significant and growing portion of usage volume is due
to applications such as campaign management, ad serving,
search, and supply chain management that depend on
insights from the data to drive more intelligent decisions.

Figure 5. Usage and query volume. Legend:
~~Usage volume—The amplitude of the outside spirals indicates usage volume. Note the inverse correlation between BVD and usage volume.
~~Cross-functional reuse—The three colors represent the percentage of the data that is reused by groups such as marketing, customer service, and finance. These groups typically need access to the same high-BVD data such as recent orders.

RDBMS or Hadoop
Building on the core concepts of BVD; query volume; and
stable, evolving, and no-schema data, we can draw a line
showing which data is most appropriate for an RDBMS or
Hadoop and give some background about that particular
placement.
In general, the higher the BVD, the more it makes sense to
use relational techniques, while decreasing BVD indicates
that Hadoop may be the best choice. While the graphic
(Figure 6) draws the line arbitrarily through the equator,
every organization will have its own threshold based
on its information culture and maturity. Also note that
no-schema data resides solely within Hadoop because
relational constructs are often less-suited for managing
this type of data.
RDBMS technology has clear advantages over Hadoop in
terms of response time, throughput, and security, which
make it more appropriate for higher BVD data that has
greater concurrency and more security requirements
given the shared nature of the data.
These differentiators are due to the following:

~~Mature cost-based optimizers—When a query is
submitted, the optimizer evaluates various execution
plans and estimates the resource consumption for each.
The optimizer then selects the plan that minimizes
resource usage and thus maximizes throughput.
~~Indexing—RDBMS software has a multitude of robust
indexes with stored statistics to facilitate access, thus
shortening response times.
~~Advanced partitioning—Today's RDBMS products
feature a number of advanced partitioning methods
and criteria to optimize database performance and
improve manageability.
~~Workload management—RDBMS technology addresses
the throughput problem that occurs when many
queries are executing concurrently. The workload
manager prioritizes the query queue so that short
queries are executed quickly and long queries receive
adequate resources to avoid excessively long execution
times. Filters and throttles regulate database activity
by rejecting or limiting requests. (A filter causes
specific logon and query requests to be rejected, while
a throttle limits the number of active sessions, query
requests, or load utilities on the database; see the
sketch after this list.)
~~Extensive security features—Relational databases offer
sophisticated row- and column-level security, which
enables role-based security. They also include fine-grain
security features such as authentication options, security
roles, directory integration, and encryption, versus the
more coarse-grain equivalents within Hadoop.
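The filter and throttle concepts above can be pictured with a small sketch. This is a toy model, not Teradata's workload manager: a filter rejects requests that match a rule outright, while a throttle caps how many sessions may be active at once and defers the rest.

```python
class ToyWorkloadManager:
    """Illustrative only: a filter rejects matching requests, a throttle
    limits concurrent active sessions (excess requests are queued)."""

    def __init__(self, max_active_sessions, blocked_users=()):
        self.max_active = max_active_sessions
        self.blocked = set(blocked_users)
        self.active = 0

    def admit(self, user):
        if user in self.blocked:             # filter: reject outright
            return "rejected"
        if self.active >= self.max_active:   # throttle: defer until a slot frees
            return "queued"
        self.active += 1
        return "running"

    def release(self):
        self.active = max(0, self.active - 1)

wm = ToyWorkloadManager(max_active_sessions=2, blocked_users={"batch_bot"})
print([wm.admit(u) for u in ["alice", "bob", "carol", "batch_bot"]])
# ['running', 'running', 'queued', 'rejected']
```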

Cost factors
Along with technological capabilities, cost drives the
design of the enterprise data architecture. The Teradata
Unified Data Architecture™ rates the relative cost of use
cases using a four-factor cost analysis:
~~Hardware and software investment—The costs associated with the acquisition of the hardware and software.
~~Development and maintenance—The ongoing cost of
acquiring data and packaging it for consumption as
well as the costs of implementing systemwide changes
such as software upgrades and changes to code and
scripts running in the environment.


~~Usage—The costs of querying and analyzing the data
to derive actionable insights, primarily based on market
compensation for required skills, time to author and alter
scripts and code, and wait time as it relates to productivity; these costs often are spread across multiple departments and budgets and therefore often go unnoticed;
however, they are very real for business initiatives that
leverage data and analytics for strategic advantage.
~~Resource consumption—The extent to which the
CPU, I/O, and disk resources are utilized over time;
when system resources are close to full utilization,
the organization is achieving the maximum value for
its investment in hardware and therefore resource
consumption costs would be low; underutilized systems
waste resources and drive up costs without adding
value and would therefore be medium or high.

Figure 6. The RDBMS-Hadoop partition. Legend:
~~RDBMS-Hadoop partition—The horizontal line partitions the BVD space between high-BVD data that can be effectively managed with an RDBMS and low-BVD data that is best suited to Hadoop. The partitioning point (intersection of line and data curve) is unique to each organization and may change over time.
~~RDBMS features—The two arcs within the data circles represent key advantages of RDBMS: fast response times/throughput and fine-grain security.

Use Case Overview
While there are a large number of possible data scenarios
in the enterprise world today, the majority fall into these
four use cases:
~~Integrated data warehouse—Provides an unambiguous
view of information for timely and accurate decision
making
~~Interactive discovery—Addresses the challenge of exploring large datasets with less defined or evolving schemas
~~Batch data processing—Transforms data and performs
analytics against larger datasets when storage costs are
valued over interactive response times and throughput
~~General-purpose file system—Ingests and stores raw data with
no transformation, making this use case an economical
online archive for the lowest BVD data
Each use case is described in more detail in the
following sections.
Integrated data warehouse
The relational database and big data come together
in the integrated data warehouse (Figure 7). The

integrated data warehouse is the overwhelming choice
for the important data that drives organizational decision making, where a single, accurate, timely, and unambiguous
version of the information is required.
The integrated data warehouse uses a well-defined
schema to offer a single view of the business to enable
easy data access and ensure consistent results across
the entire enterprise. It also provides a shared source
for analytics across multiple departments within the
enterprise. Data is loaded once and used many times
without the need for the user to repeatedly define and
execute agreed-upon transformation rules such as the
definitions of customer, order, and lifetime value score.
The integrated data warehouse supports ANSI SQL as
well as many mature third-party applications. Information
in the integrated data warehouse is scalable and can be
accessed by knowledge workers and business analysts
across the enterprise.
The integrated data warehouse is the tried-and-true gold
standard for high-BVD data, supporting cross-functional
reuse and the largest number of business users with
a full set of features and benefits unmatched by other
approaches to data management.

Figure 7. Integrated data warehouse. (Characteristics shown: single view of your business; shared source for analytics; load once, use many times; SQL and third-party applications; knowledge workers and analysts.)

Cost analysis
~~Hardware and software investment: High—The software
development and commercial engineering effort
required to deliver the differentiated benefits
described previously, as well as an optimized,
integrated hardware platform, warrants substantial
initial investment.
~~Development and maintenance expense:
Medium—Realizing the maximum benefit of clean,
integrated, easy-to-consume information requires
data modeling and ETL operations, which drive up
development costs. However, the productivity tools
and people skills for developing and maintaining a
relational environment are readily available in the
marketplace, mitigating the development costs. Also,
the data warehouse has diminishing incremental
development costs because it builds on existing data
and transformation rules and facilitates data reuse.
~~Usage expense: Low—Users can navigate the enterprise
data and create complex queries in SQL that return
results quickly, minimizing the need for expensive
programmers and reducing unproductive wait
times. This benefit is a result of the costs incurred in
development and maintenance as described previously.
~~Resource consumption: Low—Tight vertical integration
across the stack enables optimal utilization of system
CPU and I/O resources so that the maximum amount
of throughput can be achieved within an environment
bounded by CPU and I/O.
Interactive discovery
Interactive discovery platforms address the challenge of
exploring large datasets with less-defined or evolving
schemas by adapting methodologies that originate from
the Hadoop ecosystem within an RDBMS (Figure 8). Some
of the inherent advantages of the RDBMS technology are
particularly fast response times and throughput, as well
as the ease of use stemming from ANSI SQL compliance.
Interactive discovery requires less time spent on data
governance, data quality, and data integrity because
users are looking for new insights in advance of the
rigor required for more formal actioning of the data and
insights. The fast response times enable accelerated insight
discovery, and the ANSI SQL interface democratizes the
data across the widest possible user base.
This approach combines schema-on-read, MapReduce,
and flexible programming languages with RDBMS features
such as ANSI SQL support, low latency, fine-grain security,
data quality, and reliability. Interactive discovery has
cost and flexibility advantages over the integrated data
warehouse, but at the expense of concurrency (usage
volume) and governance control.

Figure 8. Interactive discovery. (Characteristics shown: accommodates both stable and evolving schemas; does not require extensive data modeling; SQL, NoSQL, MapReduce, and statistical functions; pre-packaged analytic modules; analysts and data scientists.)

A key reason to use interactive discovery is analytical
flexibility (also applicable to Hadoop), which is based on
these features:
~~Schema-on-read—Structure is imposed when the
data is read, unlike the schema-on-write approach of
the integrated data warehouse. This feature allows
complete freedom to transform and manipulate the
data at a later time. The use cases in the Hadoop
hemisphere also use schema-on-read techniques.
~~Low-level programming—Languages such as Java and
Python can be used to construct complex queries and
even perform row-over-row comparisons, both of which
are extremely challenging with SQL. This kind of
row-over-row processing is most commonly needed for
time-series and pathing analysis (see the sketch at the
end of this section).
Interactive discovery accommodates both stable and
evolving schemas without extensive data modeling.
It leverages SQL, NoSQL, MapReduce, and statistical
functions in a single analytical process and incorporates
prepackaged analytical modules. NoSQL and MapReduce
are particularly useful for analyses such as time series
and social graph that require complex processing
beyond the capabilities of ANSI SQL. As a result of the
ANSI SQL compliance and a myriad of prebuilt MapReduce
analytical functions that can be incorporated into an ANSI
SQL script, data scientists as well as business analysts can
use interactive discovery without additional training.

Cost analysis
~~Hardware and software investment: Medium—Interactive
discovery platforms are less expensive than the
integrated data warehouse.
~~Development and maintenance: Low—Interactive
discovery uses light modeling techniques, which
minimize efforts for ETL and data modeling.
~~Usage: Low—SQL is easy to use, reducing user time
required to generate queries. Built-in analytical
functions reduce hundreds of lines of code to single
statements. The performance characteristics of an
RDBMS reduce unproductive wait times.
~~Resource consumption: Low—Commercial RDBMS
software is optimized for efficient utilization of
resources.
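Pathing work of the kind described above compares each row with its neighbors in order, which is awkward to express in plain set-based SQL. The following is a minimal procedural sketch of that row-over-row logic; the click events and the target path are hypothetical, and in practice a prebuilt pathing function would do this work inside the platform.

```python
from itertools import groupby
from operator import itemgetter

# (user, timestamp, page) click events, assumed already extracted from raw logs
clicks = [
    ("u1", 100, "search"), ("u1", 160, "product"), ("u1", 220, "checkout"),
    ("u2", 105, "search"), ("u2", 400, "product"),
]

TARGET_PATH = ("search", "product", "checkout")

def followed_path(events, path):
    """Row-over-row scan: does the user's ordered page sequence contain the path?"""
    pages = [page for _, _, page in sorted(events, key=itemgetter(1))]
    it = iter(pages)
    return all(step in it for step in path)   # consumes the iterator in order

for user, events in groupby(sorted(clicks), key=itemgetter(0)):
    print(user, "converted" if followed_path(list(events), TARGET_PATH) else "dropped off")
```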

Batch data processing
Unlike the integrated data warehouse and interactive
discovery platforms, batch processing lies within the
Hadoop sphere (Figure 9). A key difference between
batch data processing and interactive discovery is that
batch processing involves no physical data movement
as part of the transformation into a more usable model.

Figure 9. Batch data processing. (Characteristics shown: no transformations of data required; scripting and declarative languages; analysis against raw files; refinement, transformation, and cleansing; analysts and data scientists.)

Light data modeling is applied against the raw data files
to facilitate more intuitive usage. The nature of the file
system and the ability to flexibly manipulate data make
batch processing an ideal environment for refining,
transforming, and cleansing data, as well as performing
analytics against larger datasets when storage costs are
valued over fast response times and throughput.
Since the underlying data is raw, the task of transforming
the data must be performed when the query is processed.
This is immensely valuable in that it provides a high
degree of flexibility for the user.
Batch processing incorporates a wide range of declarative
language processing using Pig, Hive, and other emerging
access tools in the Hadoop ecosystem. These tools are
especially valuable for analyzing low-BVD data when query
response time is not as critical, the logic applied to the
data is complex, and full scans of the data are required—for
example, sessionizing Web log data, counting events, and
executing complex algorithms. This approach is ideal for
analysts, developers, and data scientists.
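Sessionizing Web log data is a good example of the full-scan, transform-as-you-go work this zone handles. The sketch below shows the core rule—start a new session after 30 minutes of inactivity—in plain Python; in practice this logic would be expressed in Pig, Hive, or a MapReduce job over files in HDFS, and the record layout here is hypothetical.

```python
SESSION_GAP_SECONDS = 30 * 60  # conventional 30-minute inactivity timeout

def sessionize(events):
    """Assign a session number to each (user, epoch_seconds) event,
    scanning each user's events in time order—one full pass over raw data."""
    sessions = []
    last_seen, session_id = {}, {}
    for user, ts in sorted(events):
        if user not in last_seen or ts - last_seen[user] > SESSION_GAP_SECONDS:
            session_id[user] = session_id.get(user, 0) + 1
        last_seen[user] = ts
        sessions.append((user, ts, session_id[user]))
    return sessions

raw = [("u1", 0), ("u1", 600), ("u1", 4000), ("u2", 50)]
for row in sessionize(raw):
    print(row)   # u1's third event starts session 2 after a >30-minute gap
```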
Cost analysis
~~Hardware and software investment: Low—Batch
processing is available through open source software
and runs on commodity hardware.
~~Development and maintenance: Medium—The skills
required to develop and maintain the Hadoop
environment are relatively scarce in the marketplace,
driving up labor costs. Optimizing code in the
environment is primarily a burden on the development team.
~~Usage: Medium—Unlike the previous use cases, which are
accessible to SQL users, batch processing requires new
skills for authoring queries and is not compatible with the
full breadth of features and functionality found in modern
business intelligence tools. In addition, query run times
are longer, resulting in wait times that lower productivity.
~~Resource consumption: High—In general, Hadoop
software makes less efficient use of hardware resources
than RDBMS.

General-purpose file system
As used in this context, the general-purpose file system
refers to the Hadoop Distributed File System (HDFS) and
flexible programming languages (Figure 10). Raw data
is ingested and stored with no transformation, making
this use case an economical online archive for the lowest
BVD data (see the sketch at the end of this section).
Hadoop allows data scientists and engineers to
apply flexible low-level programming languages such as
Java, Python, and C++ against the largest datasets without
any up-front characterization of the data.

Figure 10. General-purpose file system. (Characteristics shown: flexible programming languages such as Java, Python, and C++; economical online archive; land/source operational data; data scientists and engineers.)

Cost analysis
~~Hardware and software investment: Low—Like batch
processing, this approach benefits from open source
software and commodity hardware.
~~Development and maintenance: High—Working
effectively in this environment requires not only proficiency
with low-level programming languages but also a working
understanding of Linux and the network configuration.
The lack of mature development tools and applications
and the premium salaries demanded by skilled
scientists and engineers all contribute to costs.

~~Usage: High—Data processing in this environment is
essentially a development task, requiring the same skill
set and incurring the same labor costs as described
previously in development and maintenance.
~~Resource consumption: High—Hadoop is less efficient
than RDBMS software in utilizing CPU and I/O
processing cycles.
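As a sketch of the "land and archive raw data" role this zone plays, the snippet below copies a day's raw extract into a date-partitioned HDFS directory using the standard hdfs dfs command-line client. The paths and naming conventions are hypothetical; the point is simply that files land untouched, with no modeling or transformation applied.

```python
import subprocess
from datetime import date

def land_raw_file(local_path, system="orders"):
    """Copy a raw extract into a date-partitioned HDFS landing area, untouched."""
    target_dir = f"/landing/{system}/dt={date.today():%Y-%m-%d}"   # hypothetical layout
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", target_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", local_path, target_dir], check=True)
    return target_dir

# land_raw_file("/tmp/orders_extract.tsv")   # requires a Hadoop client on PATH
```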

Conclusion
Database technology is no longer a one-size-fits-all
world—maximizing the business value of all your
enterprise data requires the right tool for the right job.
This paper is intended to help IT architects and data
platform stakeholders understand how to map available
technologies—in particular, relational databases and big
data frameworks such as Hadoop—to each use case.
Integrating these and other tools into a single, unified
data platform gives data scientists, business analysts,
and other users powerful new capabilities to streamline
workflows, realize operational efficiencies, and drive
competitive advantage—exactly the value proposition of
the Teradata Unified Data Architecture™.


The integrated data warehouse is most appropriate for
the highest BVD data, where demands for the data across
the enterprise are the greatest. When deployed optimally,
there is the right balance of hardware and software costs
for the benefits realized in lower development, usage, and
resource consumption costs.
Interactive discovery is best for capturing and analyzing
both stable and evolving schema data through traditional
set-based or advanced procedural processing when there is
a premium on fast response times or on ease of access to
better democratize the data.
Batch data processing is ideal for analyzing and
transforming any kind of data through procedural
processing by end users who possess either low-level
programming or higher-order declarative language
skills, and where fast response times and
throughput are not essential.
General-purpose file system offers the greatest degree
of flexibility and lowest storage costs for engineers and
data scientists with the skills and patience to navigate all
enterprise data.
For more information, visit www.teradata.com.

10000 Innovation Drive Dayton, OH 45342

teradata.com

Unified Data Architecture is a trademark, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide.
Apache is a trademark, and Hadoop is a registered trademark of Apache Software Foundation. Teradata continually improves products as new technologies and components
become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be
marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.
Copyright © 2013 by Teradata Corporation    All Rights Reserved.    Produced in USA.
EB-7873
> 1013

Optimizing Big Data Value Across Hadoop and Relational

  • 1. White Paper 10.13 Optimize the Business Value of All Your Enterprise Data Integrated approach incorporates relational databases and Apache Hadoop to provide a framework for the enterprise data architecture BY Chad Meley Director of eCommerce & Digital Media eb 7873
  • 2. Optimize the Business Value of All Your Enterprise Data Executive Summary Few industries have evolved as quickly as data processing, thanks to the effect of Moore’s Law coupled with Silicon Valley–style software innovation. So it comes as no surprise that innovations in data analysis have led to new data, new tools, and new demands to remain competitive. Market leaders in many industries are adopting these new capabilities, fast followers are on their heels, and the mainstream is not far behind. This renaissance has affected the data warehouse in powerful ways. In the 1990s and earlier 2000s, the massively parallel processing (MPP) relational data warehouse was the only proven and scalable place to hold corporate memory. In the late 2000s, an explosion of new data types and enabling technologies lead some to claim the demise of the traditional data warehouse. A more pragmatic view has emerged recently, that a one-size-fits-all approach—whether a traditional data warehouse or Apache™ Hadoop®—is insufficient by itself in a time when datasets and usage patterns vary widely. Technology advances have expanded the options to include permutations of the data warehouse in what is referred to as built-for-purpose solutions. Yet even seasoned practitioners who embrace multiplatform data environments still struggle to decide which technology is the best choice for each use case. By analogy, consider the transformations that have occurred in moving physical goods around the world in the past century—first cargo ships, then rail and trucks, and finally airplanes. Because of our familiarity with these modes, we know intrinsically what use cases are best for each transportation option, and nobody questions the need for all of them to exist within a global logistics framework. Knowing the value propositions and economics for each, it would be foolish for someone to say “Why would anyone ever use an airplane to ship goods when rail is a fraction of the cost per pound?” Or “Why would I ever consider using a cargo ship to move oil when I can get it to market faster using air?” But the best fit for data platform technologies is not as universally understood at this time. This paper will not bring instant clarity to this complex subject; rather, the intent is to define a framework of capabilities and costs for various options to encourage informed dialogue that will accelerate more comprehensive understanding in the industry. 2 White Paper 10.13 eb 7873 Teradata has defined the Teradata® Unified Data Architecture™, a solution that allows the analytics renaissance to flourish while controlling costs and discovering new analytics. As guideposts in this expansion, we have identified workloads that fit into built-for-purpose zones of activity: ~~Integrated data warehouse ~~Interactive discovery ~~Batch data processing ~~General-purpose file system By making use of this array of analytical environments, companies can extract significant value from a broader range of data—much of which would have been discarded just a few years ago. As a result, business users can solve more high-value business problems, achieve greater operational efficiencies, and execute faster on strategic initiatives. While the big data landscape is spawning new and innovative products at an astonishing pace, a great deal of attention continues to be focused on one of the seminal technologies that launched the big data analytics expansion: Hadoop. 
An open source software framework that supports the processing of large datasets in a distributed computing environment, Hadoop uses parallelism over raw files through its MapReduce framework. Its momentum and community support make it the most likely of the new breed of data technologies to become the dominant enterprise standard in its space.

The Teradata Unified Data Architecture
Teradata offers a hybrid enterprise data architecture that integrates Hadoop and massively parallel processing (MPP) relational database management systems (RDBMS). Known as the Teradata Unified Data Architecture™, this solution relies on input from Teradata subject-matter experts and Teradata customers who are experienced practitioners with both Hadoop and traditional data warehousing. The architecture has also been validated with leading industry analysts and provides a strong foundation for designing next-generation enterprise data architectures.

The essence of the Teradata Unified Data Architecture™ is captured in a comprehensive infographic intended as a reference for database architects and strategic planners as they develop their next-generation enterprise data architectures (Figure 1).
The graphic, along with the more detailed explanations in this paper, provides objective criteria for deciding which technology is best suited to particular needs within the organization. To provide a framework for understanding the use cases, the following sections describe a number of important concepts, such as business value density (BVD), stable and evolving schemas, and query and data volumes. Many different concepts interplay within the graphic, so it is broken down here in a logical order.

Figure 1. The Teradata Unified Data Architecture (infographic contrasting the integrated data warehouse, interactive discovery, batch data processing, and general-purpose file system zones by their characteristics and relative costs)

One of the most important concepts for understanding the Teradata Unified Data Architecture™ is BVD, defined as the amount of business relevance per gigabyte of data (Figure 2). Put another way: how many business insights can be extracted from a given amount of data? A number of factors influence BVD, including when the data was captured, the amount of detail in the data, the percentage of inaccurate or corrupt records (data hygiene), and how often the data is accessed and reused (see table).

Factors Affecting Business Value Density
Data Parameter   High BVD    Low BVD
Age              Recent      Older
Form             Modeled     Raw
Hygiene          Clean       Raw
Access           Frequent    Rare
Reuse            Frequent    Rare

Before the big data revolution, organizations established clear guidelines to determine what data would be captured and how long it would be retained. As a result, only dense (high-BVD) data was retained. Lower-BVD data was discarded, a practice compounded by the absence of identified use cases and tools to exploit it.
The big data movement has brought a fundamental shift in data capture, retention, and processing philosophies. Declining storage costs and file-based data capture and processing now allow enterprises to capture and retain most, if not all, of the information generated by business activities. Why capture so much lower-BVD data? Because low BVD does not mean no value. In fact, many organizations are discovering that sparse data that was routinely discarded not so long ago now holds tremendous potential business value—but only if it can be accessed efficiently.

To illustrate the concept of BVD, consider a dataset made up of cleansed and packaged online order information for a given time period, such as the previous three months. This dataset is relatively small and yet highly valuable to business users in operations, marketing, finance, and other functional areas. This order data is considered to have high BVD; in other words, it contains a high level of useful business insights per gigabyte.

In contrast, imagine capturing Web log data representing every click on the company's Web site over the past five years. Compared to the order data described previously, this dataset is significantly larger. While there is potentially a treasure trove of business insights within it, the number of people and applications interrogating it in its raw form would be smaller than for the dataset of cleansed and packaged orders. So this raw Web site data has sparse (low) BVD but is still highly valuable.

Figure 2. Business value density
~~Data volume—Represented by the thickness of the circle; greatest at point A and decreasing counterclockwise around the circle
~~BVD—Lowest at point A and increasing around the circle
~~Sparse/dense—Sparse data is represented by light rows with a few dark blue squares at point A; dense data by darker blue rows at point B

Stable and evolving schemas
The ability to handle evolving schemas is an important capability. In contrast to stable schemas that change slowly (e.g., order records and product information), evolving schemas change continually; think of new columns being added frequently, as in Web log data (Figure 3).
All data has structure. Instead of the oft-used (and misused) terms structured, semi-structured, and unstructured, the more useful concepts are stable and evolving schemas. For example, even though XML and JSON formats are often classified as semi-structured, the schema for an individual event such as order checkout can be highly stable over long periods of time. As a result, this information can be easily accessed using standard ETL (extract, transform, and load) tools with little maintenance overhead. Conversely, XML and JSON feeds frequently—and unexpectedly, from the viewpoint of a data platform engineer—capture a new event type such as “hovered over a particular image with pointer.” This scenario describes an evolving schema, which is particularly challenging for traditional relational tools.

Figure 3. Stable and evolving schemas
~~Stable schema data—The blue section of the band. Note that the areas of high BVD are composed entirely of stable schemas.
~~Evolving schema data—The gray section of the band. While much of the data volume corresponds to evolving schemas, the BVD is fairly low compared to the stable schemas.

No-schema data
As noted previously, all data has structure, so what is frequently labeled unstructured data is better described as no-schema data (Figure 4). What is interesting about no-schema data is that it often yields analytical value in unanticipated ways; a skilled data scientist can draw substantial insights from it. Here are two real-life scenarios:
~~An online retailer is boosting revenue through image analysis. In a typical case, a merchant is marketing a red dress and supplies the search terms size 6 and Ralph Lauren along with an image of the dress itself. Using sophisticated image-analysis software, the retailer can, with a high degree of confidence, attach additional descriptors such as A-line and cardinal red, which makes searching more accurate, benefiting both merchants and buyers.
~~An innovative insurance company is using audio recordings of phone conversations between customer service representatives and policyholders to determine the likelihood of a fraudulent claim based on signals derived from voice inflections.
In both examples, the companies had decided to capture the data before they had a complete idea of how to use it. Business users developed the innovative uses after they had become familiar with the data structure and had access to tools to extract the hidden value.

Figure 4. No-schema data
~~No-schema data—The magenta band between the evolving and stable schemas
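To make the evolving-schema scenario just described more concrete, here is a minimal sketch (illustrative only; the file name, field names, and event types are hypothetical and not taken from this paper) that scans raw JSON click events and reports any fields that fall outside the originally modeled schema. A relational feed would typically need a schema change and an ETL update for each such field, whereas file-based capture simply retains the new attributes until someone decides how to exploit them.

```python
import json

# The fields the original, stable click-event schema was modeled around
# (hypothetical names, for illustration only).
KNOWN_FIELDS = {"event_type", "user_id", "timestamp", "page_url"}

def find_new_fields(path):
    """Scan raw JSON events and collect fields outside the known schema."""
    new_fields = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            for field in event:
                if field not in KNOWN_FIELDS:
                    # Remember one example value per unexpected field.
                    new_fields.setdefault(field, event[field])
    return new_fields

if __name__ == "__main__":
    # e.g., an event like {"event_type": "hover", "user_id": "u1",
    # "timestamp": "2013-10-01T12:00:00", "page_url": "/dress",
    # "image_id": "img-42"} introduces the new field "image_id".
    for field, example in find_new_fields("clickstream.json").items():
        print(f"evolving schema: new field {field!r}, e.g. {example!r}")
```

The point is less the code than the operational difference it illustrates: the raw capture layer absorbs schema drift silently, while a modeled relational layer must be deliberately extended to do the same.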
Usage and query volume
By definition, there is a strong correlation between BVD and usage volume. For example, if a company captures 100 petabytes of data, 80 percent of all queries would be addressed to just 20 petabytes—the high-BVD portion of the dataset (Figure 5).

Usage volume includes two primary access methods: ad hoc and scheduled queries. Ad hoc queries are usually initiated by the person who needs the information, using SQL interfaces, analytical tools, and business applications. Scheduled queries are set up and monitored by business analysts or data platform engineers; applicable tools include SQL interfaces for regularly scheduled reports, automated business applications, and low-level programming scripts for scheduled analytics and data transformations. A significant and growing portion of usage volume comes from applications such as campaign management, ad serving, search, and supply chain management that depend on insights from the data to drive more intelligent decisions.

Figure 5. Usage and query volume
~~Usage volume—The amplitude of the outside spirals indicates usage volume. Note the inverse correlation between BVD and usage volume.
~~Cross-functional reuse—The three colors represent the percentage of the data that is reused by groups such as marketing, customer service, and finance. These groups typically need access to the same high-BVD data, such as recent orders.

RDBMS or Hadoop
Building on the core concepts of BVD; query volume; and stable, evolving, and no-schema data, we can draw a line showing which data is most appropriate for an RDBMS or Hadoop and give some background about that placement. In general, the higher the BVD, the more it makes sense to use relational techniques, while lower BVD indicates that Hadoop may be the better choice. While the graphic (Figure 6) draws the line arbitrarily through the equator, every organization will have its own threshold based on its information culture and maturity. Also note that no-schema data resides solely within Hadoop because relational constructs are often less suited to managing this type of data.

RDBMS technology has clear advantages over Hadoop in terms of response time, throughput, and security, which make it more appropriate for higher-BVD data, where concurrency is greater and security requirements are stricter given the shared nature of the data. These differentiators are due to the following:
~~Mature cost-based optimizers—When a query is submitted, the optimizer evaluates various execution plans and estimates the resource consumption for each. The optimizer then selects the plan that minimizes resource usage and thus maximizes throughput.
~~Indexing—RDBMS software has a multitude of robust indexes with stored statistics to facilitate access, thus shortening response times.
~~Advanced partitioning—Today's RDBMS products feature a number of advanced partitioning methods and criteria to optimize database performance and improve manageability.
~~Workload management—RDBMS technology addresses the throughput problem that occurs when many queries are executing concurrently. The workload manager prioritizes the query queue so that short queries are executed quickly and long queries receive adequate resources to avoid excessively long execution times. Filters and throttles regulate database activity by rejecting or limiting requests: a filter causes specific logon and query requests to be rejected, while a throttle limits the number of active sessions, query requests, or load utilities on the database (a minimal illustration of the throttle idea appears after this list).
~~Extensive security features—Relational databases offer sophisticated row- and column-level security, which enables role-based security. They also include fine-grain security features such as authentication options, security roles, directory integration, and encryption, versus the more coarse-grained equivalents within Hadoop.
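As a rough illustration of the throttle concept only (not Teradata's workload manager, which is far more sophisticated and policy-driven), the sketch below caps the number of concurrently executing queries with a semaphore; requests beyond the limit wait their turn instead of degrading response times for everyone.

```python
import threading
import time

MAX_ACTIVE_SESSIONS = 4  # illustrative limit on concurrent queries
throttle = threading.BoundedSemaphore(MAX_ACTIVE_SESSIONS)

def run_query(query_id, seconds):
    """Simulate a query; the semaphore acts as a simple throttle."""
    with throttle:                      # blocks when the limit is reached
        print(f"query {query_id} started")
        time.sleep(seconds)             # stand-in for real query work
        print(f"query {query_id} finished")

if __name__ == "__main__":
    threads = [threading.Thread(target=run_query, args=(i, 0.5)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```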
Figure 6. The RDBMS-Hadoop partition
~~RDBMS-Hadoop partition—The horizontal line partitions the BVD space between high-BVD data that can be effectively managed with an RDBMS and low-BVD data that is best suited to Hadoop. The partitioning point (the intersection of the line and the data curve) is unique to each organization and may change over time.
~~RDBMS features—The two arcs within the data circles represent key advantages of RDBMS: fast response times/throughput and fine-grain security.

Cost factors
Along with technological capabilities, cost drives the design of the enterprise data architecture. The Teradata Unified Data Architecture™ rates the relative cost of use cases using a four-factor cost analysis (a small scoring sketch follows the list):
~~Hardware and software investment—The costs associated with acquiring the hardware and software.
~~Development and maintenance—The ongoing cost of acquiring data and packaging it for consumption, as well as the costs of implementing systemwide changes such as software upgrades and changes to code and scripts running in the environment.
~~Usage—The costs of querying and analyzing the data to derive actionable insights, based primarily on market compensation for the required skills, the time to author and alter scripts and code, and wait time as it relates to productivity. These costs are often spread across multiple departments and budgets and therefore go unnoticed, but they are very real for business initiatives that leverage data and analytics for strategic advantage.
~~Resource consumption—The extent to which CPU, I/O, and disk resources are utilized over time. When system resources are close to full utilization, the organization is achieving the maximum value for its hardware investment, so resource consumption costs are low; underutilized systems waste resources and drive up costs without adding value, so they rate medium or high.
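One way to apply this four-factor framework is to score candidate platforms for a given workload. The sketch below is a minimal illustration: the LOW/MED/HIGH ratings are taken from the use-case cost analyses later in this paper, while the numeric scores and weights are arbitrary assumptions that an organization would replace with its own economics.

```python
# Qualitative ratings from the use-case cost analyses in this paper,
# mapped to arbitrary illustrative scores (lower is cheaper).
SCORE = {"low": 1, "med": 2, "high": 3}

COST_PROFILES = {
    "integrated data warehouse":   {"hw_sw": "high", "dev_maint": "med",  "usage": "low",  "resources": "low"},
    "interactive discovery":       {"hw_sw": "med",  "dev_maint": "low",  "usage": "low",  "resources": "low"},
    "batch data processing":       {"hw_sw": "low",  "dev_maint": "med",  "usage": "med",  "resources": "high"},
    "general-purpose file system": {"hw_sw": "low",  "dev_maint": "high", "usage": "high", "resources": "high"},
}

def total_cost(profile, weights):
    """Weighted sum of the four cost factors for one use case."""
    return sum(SCORE[rating] * weights[factor] for factor, rating in profile.items())

if __name__ == "__main__":
    # Example weighting: a query-heavy workload where usage cost dominates.
    weights = {"hw_sw": 1.0, "dev_maint": 1.0, "usage": 3.0, "resources": 1.0}
    for name, profile in sorted(COST_PROFILES.items(),
                                key=lambda kv: total_cost(kv[1], weights)):
        print(f"{name:30s} relative cost = {total_cost(profile, weights):.1f}")
```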
Use Case Overview
While there are a large number of possible data scenarios in the enterprise world today, the majority fall into these four use cases:
~~Integrated data warehouse—Provides an unambiguous view of information for timely and accurate decision making
~~Interactive discovery—Addresses the challenge of exploring large datasets with less-defined or evolving schemas
~~Batch data processing—Transforms data and performs analytics against larger datasets when storage costs are valued over interactive response times and throughput
~~General-purpose file system—Ingests and stores raw data with no transformation, making this use case an economical online archive for the lowest-BVD data
Each use case is described in more detail in the following sections.

Integrated data warehouse
Within the big data architecture, the relational database's anchor use case is the integrated data warehouse (Figure 7). The integrated data warehouse is the overwhelming choice for the important data that drives organizational decision making, where a single, accurate, timely, and unambiguous version of the information is required.

The integrated data warehouse uses a well-defined schema to offer a single view of the business, enabling easy data access and ensuring consistent results across the entire enterprise. It also provides a shared source for analytics across multiple departments within the enterprise. Data is loaded once and used many times, without the user needing to repeatedly define and execute agreed-upon transformation rules such as the definitions of customer, order, and lifetime value score. The integrated data warehouse supports ANSI SQL as well as many mature third-party applications. Information in the integrated data warehouse is scalable and can be accessed by knowledge workers and business analysts across the enterprise.

The integrated data warehouse is the tried-and-true gold standard for high-BVD data, supporting cross-functional reuse and the largest number of business users with a full set of features and benefits unmatched by other approaches to data management.

Figure 7. Integrated data warehouse (characteristics: single view of your business; shared source for analytics; load once, use many times; SQL and third-party applications; knowledge workers and analysts)
Cost analysis
~~Hardware and software investment: High—The commercial engineering effort required to deliver the differentiated benefits described previously, together with an optimized, integrated hardware platform, warrants a substantial initial investment.
~~Development and maintenance: Medium—Realizing the maximum benefit of clean, integrated, easy-to-consume information requires data modeling and ETL operations, which drive up development costs. However, the productivity tools and people skills for developing and maintaining a relational environment are readily available in the marketplace, mitigating those costs. Also, the data warehouse has diminishing incremental development costs because it builds on existing data and transformation rules and facilitates data reuse.
~~Usage: Low—Users can navigate the enterprise data and create complex queries in SQL that return results quickly, minimizing the need for expensive programmers and reducing unproductive wait times. This benefit is a result of the costs incurred in development and maintenance as described previously.
~~Resource consumption: Low—Tight vertical integration across the stack enables optimal utilization of system CPU and I/O resources, so that the maximum amount of throughput can be achieved within an environment bounded by CPU and I/O.

Interactive discovery
Interactive discovery platforms address the challenge of exploring large datasets with less-defined or evolving schemas by adapting methodologies that originate in the Hadoop ecosystem within an RDBMS (Figure 8). Among the inherent advantages of RDBMS technology are particularly fast response times and throughput, as well as the ease of use stemming from ANSI SQL compliance. Interactive discovery requires less time spent on data governance, data quality, and data integrity because users are looking for new insights ahead of the rigor required before the data and insights are formally acted on. The fast response times accelerate insight discovery, and the ANSI SQL interface democratizes the data across the widest possible user base.

This approach combines schema-on-read, MapReduce, and flexible programming languages with RDBMS features such as ANSI SQL support, low latency, fine-grain security, data quality, and reliability. Interactive discovery has cost and flexibility advantages over the integrated data warehouse, but at the expense of concurrency (usage volume) and governance control.

Figure 8. Interactive discovery (characteristics: accommodates both stable and evolving schemas; does not require extensive data modeling; SQL, NoSQL, MapReduce, and statistical functions; prepackaged analytic modules; analysts and data scientists)
A key reason to use interactive discovery is analytical flexibility (also applicable to Hadoop), which is based on these features:
~~Schema-on-read—Structure is imposed when the data is read, unlike the schema-on-write approach of the integrated data warehouse. This feature allows complete freedom to transform and manipulate the data at a later time. The use cases in the Hadoop hemisphere also use schema-on-read techniques.
~~Low-level programming—Languages such as Java and Python can be used to construct complex queries and even perform row-over-row comparisons, both of which are extremely challenging with SQL. Row-over-row processing is typically needed for analyses such as time series and pathing (a short sketch appears at the end of this section).

Interactive discovery accommodates both stable and evolving schemas without extensive data modeling. It leverages SQL, NoSQL, MapReduce, and statistical functions in a single analytical process and incorporates prepackaged analytical modules. NoSQL and MapReduce are particularly useful for analyses such as time series and social graph that require complex processing beyond the capabilities of ANSI SQL. Because ANSI SQL compliance and a myriad of prebuilt MapReduce analytical functions can be combined in a single ANSI SQL script, data scientists as well as business analysts can use interactive discovery without additional training.

Cost analysis
~~Hardware and software investment: Medium—Interactive discovery platforms are less expensive than the integrated data warehouse.
~~Development and maintenance: Low—Interactive discovery uses light modeling techniques, which minimize the effort spent on ETL and data modeling.
~~Usage: Low—SQL is easy to use, reducing the user time required to generate queries. Built-in analytical functions reduce hundreds of lines of code to single statements. The performance characteristics of an RDBMS reduce unproductive wait times.
~~Resource consumption: Low—Commercial RDBMS software is optimized for efficient utilization of resources.
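As referenced above, here is a minimal sketch of row-over-row pathing analysis (the field names, page URLs, and purchase marker are hypothetical, not taken from this paper). It orders each visitor's clicks by time and counts the three-page paths that immediately precede a purchase, the kind of ordered, per-row logic that is awkward to express in pure set-based SQL but straightforward in a procedural language.

```python
from collections import Counter
from itertools import groupby

def top_paths_to_purchase(events, path_len=3):
    """Count the page sequences that immediately precede a purchase event.

    `events` is an iterable of dicts with hypothetical fields
    user_id, timestamp, and page (illustration only).
    """
    paths = Counter()
    events = sorted(events, key=lambda e: (e["user_id"], e["timestamp"]))
    for _, user_events in groupby(events, key=lambda e: e["user_id"]):
        pages = [e["page"] for e in user_events]
        for i, page in enumerate(pages):
            if page == "/checkout/complete" and i >= path_len:
                paths[tuple(pages[i - path_len:i])] += 1
    return paths.most_common(10)

if __name__ == "__main__":
    sample = [
        {"user_id": "u1", "timestamp": 1, "page": "/home"},
        {"user_id": "u1", "timestamp": 2, "page": "/search"},
        {"user_id": "u1", "timestamp": 3, "page": "/product/42"},
        {"user_id": "u1", "timestamp": 4, "page": "/checkout/complete"},
    ]
    for path, count in top_paths_to_purchase(sample):
        print(" -> ".join(path), count)
```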
Batch data processing
Unlike the integrated data warehouse and interactive discovery platforms, batch data processing lies within the Hadoop sphere (Figure 9). A key difference between batch data processing and interactive discovery is that batch processing involves no physical data movement as part of the transformation into a more usable model. Light data modeling is applied against the raw data files to facilitate more intuitive usage. The nature of the file system and the ability to flexibly manipulate data make batch processing an ideal environment for refining, transforming, and cleansing data, as well as for performing analytics against larger datasets when storage costs are valued over fast response times and throughput.

Since the underlying data is raw, the task of transforming the data must be performed when the query is processed. This is immensely valuable in that it gives the user a high degree of flexibility.

Batch processing incorporates a wide range of declarative-language processing using Pig, Hive, and other emerging access tools in the Hadoop ecosystem. These tools are especially valuable for analyzing low-BVD data when query response time is not critical, the logic applied to the data is complex, and full scans of the data are required—for example, sessionizing Web log data (see the sketch at the end of this section), counting events, and executing complex algorithms. This approach is ideal for analysts, developers, and data scientists.

Figure 9. Batch data processing (characteristics: no transformations of data required; scripting and declarative languages; analysis against raw files; refinement, transformation, and cleansing; analysts and data scientists)

Cost analysis
~~Hardware and software investment: Low—Batch processing is available through open source software and runs on commodity hardware.
~~Development and maintenance: Medium—The skills required to develop and maintain the Hadoop environment are relatively scarce in the marketplace, driving up labor costs. Optimizing code in the environment is primarily a burden on the development team.
~~Usage: Medium—Unlike the previous use cases, which are accessible to SQL users, batch processing requires new skills for authoring queries and is not compatible with the full breadth of features and functionality found in modern business intelligence tools. In addition, query run times are longer, resulting in wait times that lower productivity.
~~Resource consumption: High—In general, Hadoop software makes less efficient use of hardware resources than an RDBMS.
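Sessionizing Web log data, mentioned above, is a typical batch job over raw files. The following sketch assumes a hypothetical comma-separated log format and a 30-minute inactivity timeout; in a real deployment the same logic would run as a Pig, Hive, or MapReduce job across HDFS rather than as a single-machine script.

```python
import csv
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that ends a session

def sessionize(rows):
    """Assign a session number to each (user_id, epoch_seconds, url) click."""
    clicks = defaultdict(list)
    for user_id, ts, url in rows:
        clicks[user_id].append((int(ts), url))

    for user_id, events in clicks.items():
        events.sort()                      # order each user's clicks by time
        session, last_ts = 0, None
        for ts, url in events:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                session += 1               # long gap: start a new session
            last_ts = ts
            yield user_id, session, ts, url

if __name__ == "__main__":
    # Raw, untransformed log lines: user_id,epoch_seconds,url (illustrative).
    with open("weblog.csv") as f:
        for record in sessionize(csv.reader(f)):
            print(*record, sep="\t")
```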
General-purpose file system
As used in this context, the general-purpose file system refers to the Hadoop Distributed File System (HDFS) and flexible programming languages (Figure 10). Raw data is ingested and stored with no transformation, making this use case an economical online archive for the lowest-BVD data. Hadoop allows data scientists and engineers to apply flexible, low-level programming languages such as Java, Python, and C++ against the largest datasets without any up-front characterization of the data.

Figure 10. General-purpose file system (characteristics: flexible programming languages such as Java, Python, and C++; economical online archive; landing and sourcing of operational data; data scientists and engineers)

Cost analysis
~~Hardware and software investment: Low—Like batch processing, this approach benefits from open source software and commodity hardware.
~~Development and maintenance: High—Working effectively in this environment requires not only proficiency with low-level programming languages but also a working understanding of Linux and the network configuration. The lack of mature development tools and applications and the premium salaries demanded by skilled scientists and engineers all contribute to costs.
~~Usage: High—Data processing in this environment is essentially a development task, requiring the same skill set and incurring the same labor costs as described previously for development and maintenance.
~~Resource consumption: High—Hadoop is less efficient than RDBMS software in utilizing CPU and I/O processing cycles.

Conclusion
Database technology is no longer a one-size-fits-all world—maximizing the business value of volumes of enterprise data requires the right tool for the right job. This paper is intended to help IT architects and data platform stakeholders understand how to map available technologies—in particular, relational databases and big data frameworks such as Hadoop—to each use case. Integrating these and other tools into a single, unified data platform gives data scientists, business analysts, and other users powerful new capabilities to streamline workflows, realize operational efficiencies, and drive competitive advantage—exactly the value proposition of the Teradata Unified Data Architecture™.

The integrated data warehouse is most appropriate for the highest-BVD data, where demands for the data across the enterprise are the greatest. When deployed optimally, it strikes the right balance of hardware and software costs against the benefits realized in lower development, usage, and resource consumption costs.

Interactive discovery is best for capturing and analyzing both stable and evolving schema data through traditional set-based or advanced procedural processing when there is a premium on fast response times or on ease of access to better democratize the data.

Batch data processing is ideal for analyzing and transforming any kind of data through procedural processing by end users who possess either low-level programming or higher-order declarative language skills, and where fast response times and throughput are not essential.

The general-purpose file system offers the greatest degree of flexibility and the lowest storage costs for engineers and data scientists with the skills and patience to navigate all enterprise data.

For more information, visit www.teradata.com.

10000 Innovation Drive, Dayton, OH 45342    teradata.com

Unified Data Architecture is a trademark, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Apache is a trademark, and Hadoop is a registered trademark of Apache Software Foundation. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.

Copyright © 2013 by Teradata Corporation. All Rights Reserved. Produced in USA. EB-7873 > 1013