3. Outline
• What is the Hadoop Data Reservoir (HDR)?
• Requirements and Solutions
• Hadoop Data Reservoir in Practice
• Demo
• Q&A
4. What is the Hadoop Data Reservoir (HDR)?
• Central Hadoop cluster for the enterprise
• Serves as the Storage and the Source of data for
self-service business analytics
• Provides Processing for data preparation and
advanced analytics
The Hadoop Data Reservoir
eliminates data silos, reduces costs,
and makes business analytics agile.
6. HDR is Not a Replacement for the EDW
• EDWs require upfront planning
• EDWs require major ongoing IT
maintenance and staffing
• EDWs are not self-service
7. HDR Origin: Interviews with Enterprise IT
• Platfora interviewed over 200
enterprise IT professionals working
with Hadoop
• Summer 2011 through early 2012
• Topic of interview: challenges using
Hadoop for business intelligence &
analytics
8. What is Your Vision for Hadoop?
• “I want Hadoop to be the central repository of all the data people
need.”
• “We shouldn’t have to plan too much before we store data.”
• “Cost should only be a minor factor in how long we keep data around.”
• “I want to give everyone access to the data and break down the existing
silos. But it needs to be secure.”
• “IT would not have to be involved in day-to-day management.”
9. Out on a Limb
• “I’m a bit out on a limb here. I pushed to use Hadoop to collect data that we
were dropping before. But now it’s taking way more time to make use of it
than I expected.”
10. The Missing Link to HDR
[Diagram: the Hadoop Data Reservoir (unbounded, flexible) feeds “software
defined” data marts (automatic, fast, iterative), which in turn feed
web-based business intelligence]
Performance, Self-Service, and Security
12. Queries must be consistently fast
• Modern data discovery BI applications are driving more and more queries
all the time; each move results in a new query.
• A single HDR user should not be able to impact other users simply
because they asked the wrong question.
• “We’re addicted to sub-second. If it takes longer than that for any
reason, something is wrong.”
13. Most Queries are Straightforward, but Big
“What’s the trend of female visitors clicking on ads on the autos channel
over time?”
• Sources: traffic logs, advertising logs, clicks, user demographics
• Big Hadoop cluster: 2.4 PB total, spanning months of data
• 700M records/day, 400 GB/day, 2B user records
• Processing the answer could touch 10s of billions of records.
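The question above is a plain filter-and-count; the only hard part is scale. A minimal Python sketch of the logic (the record fields and sample values are hypothetical, and in the HDR this same scan would touch tens of billions of rows):

```python
from collections import Counter

# Hypothetical click records already joined with user demographics.
clicks = [
    {"user": "u1", "gender": "F", "channel": "autos",  "month": "2013-01"},
    {"user": "u2", "gender": "M", "channel": "autos",  "month": "2013-01"},
    {"user": "u3", "gender": "F", "channel": "autos",  "month": "2013-02"},
    {"user": "u4", "gender": "F", "channel": "sports", "month": "2013-02"},
    {"user": "u5", "gender": "F", "channel": "autos",  "month": "2013-02"},
]

# "Trend of female visitors clicking on ads on the autos channel over time":
# filter on gender and channel, then count clicks per month.
trend = Counter(
    c["month"] for c in clicks
    if c["gender"] == "F" and c["channel"] == "autos"
)

print(dict(trend))  # {'2013-01': 1, '2013-02': 2}
```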
14. Solution: Aggregate Tables Stored In-Memory
• Pre-calculated summary
tables, summarizing data to a
coarser grain
• Dramatically reduces data
required to answer a question
• Keeps redundant processing
off the batch system (Hadoop)
• Keeps summary data in memory to provide sub-second access
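The idea of the aggregate table can be shown in a few lines. A minimal sketch (toy data; not Platfora's implementation) rolling raw per-day facts up to a coarser grain so queries never touch the base rows:

```python
from collections import defaultdict

# Raw fact rows at full grain: (day, channel, clicks). The summary table
# rolls these up to channel-only grain, so a query scans a handful of
# rows instead of billions.
raw = [
    ("2013-06-01", "autos",  120),
    ("2013-06-01", "sports",  80),
    ("2013-06-02", "autos",  150),
    ("2013-06-02", "sports",  90),
]

summary = defaultdict(int)
for day, channel, clicks in raw:
    summary[channel] += clicks   # drop the day dimension

# The summary lives in memory; "total clicks by channel" is now a lookup.
print(summary["autos"])  # 270
```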
16. Finding Data in the Reservoir
• The Hadoop Distributed File System (HDFS) is organized like other
common file systems: a directory structure
• Datasets in HDFS could be a single file or 10,000+ files, commonly
organized by directory (e.g. Sales, Shipments, Web Logs, Sentiment Info,
Customer Interactions, Demographics)
• Business users must be able to find data to answer their questions
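A quick sketch of the file-to-dataset gap (the paths are hypothetical): in HDFS a "dataset" is really a directory of part files, so a business-facing catalog has to map directories to named datasets rather than expose individual files:

```python
from collections import Counter

# Hypothetical HDFS-style paths: one dataset may span thousands of files.
paths = [
    "/data/weblogs/2013/06/01/part-00000",
    "/data/weblogs/2013/06/01/part-00001",
    "/data/weblogs/2013/06/02/part-00000",
    "/data/demographics/users.csv",
]

# A browsable catalog: dataset name (top-level directory) -> file count.
catalog = Counter(p.split("/")[2] for p in paths)

print(dict(catalog))  # {'weblogs': 3, 'demographics': 1}
```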
17. Aggregations Must Be Fully Automatic
• Building aggregate tables requires planning and up-
front decisions
• Must choose the metrics, dimensions, granularity
• In practice, this is an iterative process, and the first
attempt is usually wrong
• Aggregate tables must be maintained
• Each time new data arrives
• Sliding window tables (e.g. last 30 days): data in, data out
For HDR to be self-service, this must be
automatic.
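The "data in, data out" maintenance of a sliding-window table can be done incrementally, with no full rebuild. A minimal sketch, assuming a simple per-day click total (toy numbers, not Platfora's mechanism):

```python
from collections import deque

class SlidingWindowSum:
    """Incrementally maintained sliding-window aggregate (e.g. last N days)."""

    def __init__(self, days):
        self.window = deque(maxlen=days)
        self.total = 0

    def add_day(self, clicks):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # data out: oldest day expires
        self.window.append(clicks)          # data in: newest day arrives
        self.total += clicks
        return self.total

agg = SlidingWindowSum(days=3)
for clicks in [100, 120, 90, 150]:
    agg.add_day(clicks)

print(agg.total)  # 360 -- the first day (100) has rolled off
```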
18. Drilling Through the Aggregation
Netflow example:
• Aggregate tables: Application, # of Machines, # of Flows, Total Flow
Size (KB). 100MB compressed; answers in milliseconds (fast).
• Raw data in Hadoop: Source IP Address, Destination IP Address,
Application, Packets, Bytes. 26B records/month, 400GB compressed;
answers in hours or days (slow).
• “What happened between 10:03-10:04am?”
Need to “drill through the aggregation” to get more detail, or add
dimensionality. And it needs to be self-service.
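Drill-through in miniature: answer from the small in-memory aggregate when the question fits its grain, and fall back to a (slow) scan of the raw records when the user asks for a dimension that was aggregated away. A hedged sketch with invented fields and values, not Platfora's implementation:

```python
# Raw flow records in Hadoop: full grain, slow to scan.
raw_flows = [
    {"ts": "10:03:12", "src": "10.0.0.1", "bytes": 500},
    {"ts": "10:03:40", "src": "10.0.0.2", "bytes": 700},
    {"ts": "10:04:05", "src": "10.0.0.1", "bytes": 300},
]

# Pre-aggregated per-minute totals: coarser grain, no source-IP dimension.
per_minute = {"10:03": 1200, "10:04": 300}

def total_bytes(minute, src=None):
    if src is None:
        # Question fits the aggregate's grain: fast in-memory lookup.
        return per_minute[minute]
    # The src dimension was aggregated away, so drill through to the
    # raw records back in Hadoop.
    return sum(f["bytes"] for f in raw_flows
               if f["ts"].startswith(minute) and f["src"] == src)

print(total_bytes("10:03"))                   # 1200 (fast path)
print(total_bytes("10:03", src="10.0.0.1"))   # 500  (drill-through)
```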
19. Augmenting Datasets
• Users must be able to augment data with
sources outside of the HDR
• E.g. market research or demographics
• Commonly needs to be combined at the raw
level, before data is aggregated
21. Modern Data Security Requirements
• Hadoop provides:
• File and directory based permissions (like Unix)
• Secure authentication (via Kerberos)
• However, enterprises require a finer level of data
security control
• Datasets – could be one or many files, spanning directories
• Columns – datasets likely have many columns, with
different security permissions
• Rows – can span many files, and directories
• Solution must abstract file-level security and
enforce a finer level of control
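What "abstracting file-level security" looks like in practice: a policy applied to the dataset as a whole, masking columns and filtering rows regardless of which files they live in. A minimal illustrative sketch with an invented policy shape (not any real product's API):

```python
# Dataset rows, possibly drawn from many files and directories.
rows = [
    {"user": "u1", "region": "EU", "ssn": "123-45-6789", "spend": 40},
    {"user": "u2", "region": "US", "ssn": "987-65-4321", "spend": 75},
]

# A hypothetical dataset-level policy: which columns this user may see,
# and which rows.
policy = {
    "allowed_columns": {"user", "region", "spend"},   # ssn is masked
    "row_filter": lambda r: r["region"] == "EU",      # EU analysts only
}

def apply_policy(rows, policy):
    visible = [r for r in rows if policy["row_filter"](r)]
    return [{k: v for k, v in r.items() if k in policy["allowed_columns"]}
            for r in visible]

secured = apply_policy(rows, policy)
print(secured)  # [{'user': 'u1', 'region': 'EU', 'spend': 40}]
```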
22. Strong and Secure; Collaborative Sharing
• In a self-service model, security must be strong
and clear
• End-users will need to understand what they can
access and what they can’t
• Security administrators must be able to enforce
security centrally, down to the raw data
• As a centralized system, HDR must integrate
with directory services for authentication and
group membership
24. Platfora: Interest-Driven Pipeline™
[Diagram: the Hadoop Data Reservoir (unbounded, flexible) feeds “software
defined” data marts (automatic, fast, iterative), which in turn feed
web-based business intelligence]
Performance, Self-Service, and Security
25. Edmunds.com
• Beta participant since January 2013
• Moved to Hadoop because of explosive data growth and promise of agility
• Web, mobile, visitor demographic data
• Use case: optimize the matching of visitors with the cars they are
looking for
• Correlating browsers with the cars they are actually buying
• Platfora has made big data accessible to the business
• Increased access from 5 to 50 users
• Decreased time to value from months to hours
Sidebar: Founded in 1966 “for the purpose of publishing new and used
automotive pricing guides to assist automobile buyers.” Online innovators:
first auto information website; True Market Value®, True Cost to Own®, and
My Car Match.
“Before, if we wanted access to Hadoop data, we wouldn’t even try.
With Platfora our analysts can access anything they need.”
27. Introducing Platfora’s Integrated Platform
• Vizboard: web-based business intelligence application
• Lens: scale-out, in-memory data mart & processing engine
• Dataset: automated Hadoop data refinery
Powerful Closed-loop Analysis of Big Data
28. Summary
• The Hadoop Data Reservoir vision is driven from
requirements of enterprise Hadoop users
• HDR eliminates data silos, reduces costs, and
makes business analytics agile
• To make HDR a reality, it needs to provide:
• Performance
• Self-service
• Security
Editor’s notes
Introduction to me. What I do.
First, let me explain what Hadoop is: Apache Hadoop is an open source software project, based on techniques originally published by Google, that enables the distributed processing of large data sets across clusters of commodity servers. Hadoop provides an inexpensive, massively scalable solution for storing structured and unstructured raw data. The Hadoop Data Reservoir is a vision of what Hadoop can be for your enterprise.
Before I go any further, I’d like to make sure I describe what the HDR is not. And sometimes this gets confused.
Upfront planning: what data will we collect? How will the data be modeled to answer our business questions? How will we make access to the data fast for all of our users? (The questions are almost endless.) Ongoing maintenance: when will we refresh the data in the EDW? When datasets change, do we start over? Self-service: it should be obvious that EDWs are the domain of the IT team. But the vision of the HDR implies self-service; when we see what is required, we’ll see that this is no easy task.
How did we come to the concept of the HDR? The vision came out through the interviews. Story: we developed a script of questions. People were at different places in their cycle. These were not data scientists, and not people that had built their application on Hadoop (LinkedIn “People I know”). Cross section of industries: online media, financial services (banks and credit cards), federal government, retail, ecommerce, etc.
But the reality was that none of these interviewees had reached the vision of the HDR. In fact, this is my image of the folks we were talking to. Talk about the enlightened IT user.
What is the thing that goes in between the HDR and the end user? The challenges with the Hadoop Data Reservoir: there is a missing link between the massive amount of raw data stored in the HDR and access for business users. Access has been self-limited to expert users who know data modeling and SQL. IT teams must perform expensive ad-hoc data extractions into existing infrastructure. Access to the data in the HDR must be high performance, self-service, and secure.
Should be about 1:40pm
Despite data size, queries must be fast. It’s not that queries just needed to be fast; they needed to be consistently fast. Modern tools require the ability to ask successive questions. As the centralized resource, you have many, many questions being asked at once. The problem is that when someone asks the wrong question in Hadoop, it impacts everyone.
Explain the media company data. Desire to get a 360° view of the customer on their site. A straightforward question such as the one posed here potentially requires touching 10s of billions of records to process the answer.
Highly scalable architecture. Merv Adrian, a few months ago: “One of the biggest technical challenges for BI in the Big Data era is deciding what is in memory. Fractal Cache does that efficiently and automatically.” “The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.” – Ralph Kimball
You’ve heard of “drilling down” on something, or even drilling up. Use example of Region -> States -> Metro -> City -> Stores. Back to the Netflow example of our interviewee: he had 26B rows of raw data in Hadoop, per month. We built aggregate tables which reduced the grain and removed dimensionality, and made our work really fast. But what happens if, in our self-service Data Reservoir, the end user wants to get more detail from the raw data in Hadoop? We can’t just query it directly, because it will take too long, and I won’t have a rich set of metrics or dimensions to use to answer questions. I need to be able to drill through the aggregation. And since the HDR is self-service, I need to be able to do this without involving my colleagues in IT.
Example of making sure data doesn’t get away.
Platfora addressed the challenges of the HDR with the Interest-Driven Pipeline. Platfora software instantly transforms raw data in Hadoop into interactive, in-memory business intelligence; no ETL or data warehouse required. Platfora is a full stack of technology that spans from raw data in the Hadoop Data Reservoir all the way to BI and analytics for the end user. In the past this would require at least three separate products. Platfora is the first product to completely rebuild the traditional business analytics stack from the ground up.
Platfora is made of three components, and none of these is more important than another; they all work together seamlessly. Platfora puts a very pretty face on Hadoop: a stunningly beautiful web-based BI interface. MAKES HADOOP DATA BEAUTIFUL. A scale-out, in-memory data processing engine. MAKES HADOOP DATA FAST. Platfora drives Hadoop like a work engine, automatically generating and pushing jobs to Hadoop to do the heavy lifting without needing experts. MAKES HADOOP USABLE. These components work together: based on what the user needs in the BI layer, the Lenses are automatically refined, and the Hadoop data refinery does the heavy lifting without needing programming. Story: as we were working on the early designs for the product, we thought about the old world that users were complaining about: three separate layers, each with heavy expert intervention in between. It reminded us of the way phones used to work. Remember managing contacts? iPhone analogy. Vertically integrated.