Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. In this presentation, you will learn about the similarities and differences between Hadoop and parallel data warehouses, and typical best practices. Edmunds will discuss how they increased delivery speed, reduced risk, and achieved faster reporting by combining ELT and ETL. For example, Edmunds ingests raw data into Hadoop and HBase, then reprocesses the raw data in Netezza. You will also learn how Edmunds uses prototyping to work on nearly raw data with the company’s Analytics Team using Netezza.
Hadoop World 2011: Building Scalable Data Platforms; Hadoop & Netezza Deployment Models
1. Building Scalable Data Platforms
Hadoop and Netezza Deployment Models
Krishnan Parasuraman, Netezza
Greg Rokita, Edmunds.com
2. Talking Points
• Building scalable data platforms
– Architectural considerations
• Hadoop and Massively Parallel Databases
– Similarities and differences
– Usage patterns
• Practitioner’s Viewpoint
– Edmunds.com data warehouse platform
2 Hadoop World 2011
3. Building scalable data platforms
Typical Digital Media Information Processing Pipeline
[Diagram: data sources (clicks, visits, page views, likes, tweets, impressions, locations) feed three components: Real Time Decision Engine, Data Processing, and Reporting.]
Functions shown on the diagram:
• Display Ads, Recommendation, Personalized Content
• Correlate, Structure, Consolidate, Aggregate, Summarize
• Scoring, Yield optimization, Audience Analytics
• Ad-hoc analysis
4. Building scalable data platforms
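[Diagram: the same data sources (clicks, visits, page views, likes, tweets, impressions, locations) and the same three components (Real Time Decision Engine, Data Processing, Reporting), now resting on a shared DATA PLATFORM.]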
5. Building scalable data platforms
Components: Real Time Decision Engine, Data Processing, Reporting
(each bullet below is one column of the slide's table, left to right)
Workloads
• Real Time, High Velocity, High Concurrency, Transactional
• Linearly Scalable, Disk bound, High Throughput
• Compute intensive, Full table scans, Disk bound
• Cached Queries, Low Latency, High Concurrency
Data
• Structured, Un-Structured, Key-Value pairs
• Structured, Un-Structured, Machine Gen.
• Mostly Structured, Some unstructured
• Structured, Relational
Capability
• Stream Processing, Memory resident, Key based lookups
• Low Disk I/O, Fast Processing, Low Cost/TB
• In-DB computation, SQL and MR, Analytic Libraries
• OLAP, Columnar
6. Building scalable data platforms
[Same Workloads / Data / Capability table as the previous slide, overlaid with candidate technologies: NoSQL Databases, In-Memory DB, Graph DB, Hadoop, Massively Parallel DB, and Plain Ole’ DB on steroids.]
7. Myth: A single technology will meet all the considerations for
our scalable data platform needs
Best Practices
Workloads scale differently – Monolithic architectures don’t work
Minimize components – Data movement is painful
Understand tradeoffs – Performance, Price, Effort
Start with the core architecture and work in the edge cases
8. Massively parallel data warehouses
[Architecture diagram, top to bottom: SQL and MR; hosts (host controllers); network fabric; massively parallel compute nodes, each with FPGA, CPU, and memory; distributed storage.]
9. Hadoop
[Architecture diagram, top to bottom: Map Reduce; master node (Job Tracker and Name Node); network fabric; parallel compute nodes, each with a Task Tracker and a Data Node; distributed storage.]
10. There are striking similarities….
• Massive parallelism
• Execute code & algorithms next to data
• Scalable
• Highly Available
• Map Reduce
[Diagram: the Hadoop Job Tracker and Name Node above three Task Tracker / Data Node pairs.]
11. But also key differences
Hadoop
• Schema on Read: data loading is fast
• Data Loading = File copy ("Look Ma, No ETL")
• Batch Mode data access
• Lower cost of data storage
• Process unstructured data
Netezza
• Optimized for Performance
• Real time access, random reads, query optimizer, co-located joins
• Hardware Accelerated queries
• SQL and Map Reduce
[Diagram: the Hadoop Job Tracker and Name Node above three Task Tracker / Data Node pairs.]
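To make "schema on read" concrete, here is a minimal sketch in Python. It is not taken from the talk: the sample log lines, the field names, and the `read_with_schema` helper are all hypothetical. Raw lines are stored untouched, and each reader decides at query time which fields it wants and how to treat malformed input.

```python
import json

# Hypothetical raw weblog lines, stored exactly as they arrived (no schema at load time).
RAW_LINES = [
    '{"ts": "2011-11-08T10:00:00", "page": "/inventory", "visitor": "a1"}',
    '{"ts": "2011-11-08T10:00:05", "page": "/pricing", "visitor": "b2", "ref": "ad"}',
    "this line is not valid JSON",
]


def read_with_schema(lines, fields):
    """Project each parsable record onto `fields`; the schema is chosen by the reader."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is dealt with at read time, not at load time
        yield {name: record.get(name) for name in fields}


# Two readers apply two different schemas to the same stored bytes.
page_views = list(read_with_schema(RAW_LINES, ["page", "visitor"]))
referrers = list(read_with_schema(RAW_LINES, ["visitor", "ref"]))
print(page_views)
print(referrers)
```

Loading stays a plain file copy, and the parsing cost is paid on every read. That is the tradeoff the slide contrasts with a warehouse optimized for query performance.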
12. These differences lead to opportunities for co-existence
for Hadoop in a Netezza environment
1. Scalable ETL engine
– Complex data
– Relationships not defined
– Evolving schema
2. Queryable Archive
– Moving computation is cheaper than moving data
3. Analytics sandbox
– Exploratory analysis
13. Netezza-Hadoop: Deployment Patterns
[Diagram labels: create context; analyze unstructured data (classification, text mining); parse, aggregate semi-structured data; analyze, report; active archival; analyze, report structured data; long running queries.]
14. Pattern 1: Data Processing Engine (ETL)
[Diagram: raw weblogs, a Hadoop cluster (NameNode, JobTracker, and three DataNode / TaskTracker workers), and the Netezza environment.]
16. Pattern 3: Queryable Archive
[Diagram: data sources and the Netezza environment, with three numbered steps (1, 2, 3).]
17. Edmunds.com and Scale
o Premier online resource for automotive information
launched in 1995 as the first automotive information
Web site
o 15 million unique visitors
o 210 million page views
o 1 million+ new inventory items per day
o 2 TB of new data every month
o 40 node Hadoop cluster aggregating logs,
advertising, vehicle, pricing, inventory and other data
sets
No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds Inc., and any such
disclosure requires the express approval of Edmunds Inc.
No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the
Edmunds.com, Inc., and any such disclosure requires the express approval of Edmunds.com, Inc.
18. Edmunds Proposition
We have developed an iterative
approach to data warehouse
development that has dropped the time
it takes for us to deliver reports to our
users from months to weeks.
19. How did we do it?
o Process
o Technology
o Understanding of Value
20. Process: agile approach
o Continuous and fast delivery of new features
o Collaboration between users and developers
o Make new data available quickly and
inexpensively
o Quick problem resolution
o No wasting of an entire development cycle if data is
not useful
o Encouragement of exploration and creation of
new applications
21. Process
Pre-process:
• Complete
• Raw
• Modeled as source data
• Generically loaded
• Quick turn-around
• Low retention
• Slower performance
Post-process:
• Filtered
• Transformed
• Modeled as star schema
• Optimized
• Slow turn-around
• High retention
• Fast performance
22. Post-Process Sandbox
[Flowchart]
Steps: use pre-process data; load data in ad hoc manner; prototype; change schema (by users or developers); enhance.
Decisions: "Schema is stable?" and "Data has value?"
Outcomes:
• Discard: prevents shadow production, little effort lost
• Develop Optimized Pipeline: data is confirmed to be useful, effort is warranted
23. Technology
Publishing System
• All Data
• Generic
• Thrift IDL with Versioning
Hadoop Stack
• HBase raw data
• Oozie job coordinator
• HDFS storage of pre and optimized data replica of RDBMS in files
Netezza
• All data loaded from Hadoop in batch
• Analysis and data exploration - use the speed and power
• Report generation
24. Edmunds Publishing System
25. Generic flow for pre-process
[Diagram labeled "Generic": Producers (Inventory, Pricing, Vehicle, Dealer, Leads); Broker; Consumer; HBase; Map-Reduce; Netezza Action.]
26. What architecture enables generic consumer?
Thrift
Camel
ActiveMQ
o Message o Retries
o Delivery o Throttling
o Routing
o Persistence o Versioning
o Durability o Monitoring
27. Flexibility for Producers and Consumers: Support for Topologies
Field         Example Values                 Purpose
Environment   PROD, TEST, DEV                Promotion cycle of deployment units
Index         Blue, Green, Stage             Environment Index
Data Center   LAX1, EC2                      The data center where deployment unit is located
Site          Edmunds, Insideline            Company’s Product
Application   HBase, Digital Asset Manager   Deployment Unit
28. Producer-Consumer matching
Producer (Topic Name; Publish Inventory)
• I am: Prod; Lax; Edmunds; Inventory
• Send To: Prod, Test; Lax, EC2; Edmunds; Dealer
Broker
• Destination Interceptor
Consumer (Virtual Queue Name; Publish Inventory)
• I am: Test; EC2; Edmunds; Dealer
• Receive From: Prod; Lax, EC2; Edmunds; Inventory
Match!
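The matching rule on this slide can be sketched as a two-way check: the producer's "Send To" list must admit the consumer's identity, and the consumer's "Receive From" list must admit the producer's identity. The Python below is a minimal illustration under that assumption, not the broker's actual interceptor. The `Endpoint` class, the field names, and the `matches` function are hypothetical. The values are the ones shown on the slide.

```python
from dataclasses import dataclass

# Topology fields used in this sketch (names are illustrative).
FIELDS = ("environment", "data_center", "site", "application")


@dataclass
class Endpoint:
    identity: dict  # "I am": one value per field
    accepts: dict   # "Send To" / "Receive From": a set of allowed values per field


def allows(accepts, identity):
    """True if every field of `identity` is among the values `accepts` permits."""
    return all(identity[field] in accepts[field] for field in FIELDS)


def matches(producer, consumer):
    """Both sides must agree before a message is routed."""
    return allows(producer.accepts, consumer.identity) and allows(consumer.accepts, producer.identity)


producer = Endpoint(
    identity={"environment": "Prod", "data_center": "Lax", "site": "Edmunds", "application": "Inventory"},
    accepts={"environment": {"Prod", "Test"}, "data_center": {"Lax", "EC2"},
             "site": {"Edmunds"}, "application": {"Dealer"}},
)
consumer = Endpoint(
    identity={"environment": "Test", "data_center": "EC2", "site": "Edmunds", "application": "Dealer"},
    accepts={"environment": {"Prod"}, "data_center": {"Lax", "EC2"},
             "site": {"Edmunds"}, "application": {"Inventory"}},
)
print(matches(producer, consumer))
```

With the slide's values, both checks pass, which corresponds to the "Match!" label. Changing the consumer's environment to a value outside the producer's "Send To" list would break the match.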
29. HBase: how to handle data generically
Column Family: Binary
• Serialized Thrift Object: system of record
• Hashcode of the Thrift Object: check if updates are necessary (optimization)
Column Family: Discrete
• Thrift Object Field 1, Field 2, Field 3: versioning at the most granular level for lookups
Column Family: Type 2
• Start Date, End Date, List of fields: versioning for optimized dimension tables
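Two of the roles above lend themselves to a short sketch: using a stored hash to skip unnecessary updates, and Type 2 versioning with start and end dates. The Python below is an illustration only. It uses a plain dictionary as a stand-in for an HBase table and SHA-1 where the slide says "hashcode". The column names (`binary:object`, `binary:hash`), the row key, and both function names are hypothetical.

```python
import hashlib


def put_if_changed(table, row_key, serialized):
    """Write the serialized object only when its digest differs from the stored one."""
    digest = hashlib.sha1(serialized).hexdigest()
    row = table.get(row_key)
    if row is not None and row["binary:hash"] == digest:
        return False  # unchanged: skip the write and any downstream versioning
    table[row_key] = {"binary:object": serialized, "binary:hash": digest}
    return True


def apply_type2(history, fields, now):
    """Close the current version and open a new one when the tracked fields change."""
    current = history[-1] if history else None
    if current is not None and current["fields"] == fields:
        return history  # nothing changed, keep the open version
    if current is not None:
        current["end"] = now
    history.append({"start": now, "end": None, "fields": fields})
    return history


table = {}  # stand-in for an HBase table: row key -> {column: value}
first = put_if_changed(table, "vehicle:1368", b"payload-v1")
repeat = put_if_changed(table, "vehicle:1368", b"payload-v1")
changed = put_if_changed(table, "vehicle:1368", b"payload-v2")

history = []
apply_type2(history, {"price": 100}, "2011-11-01")
apply_type2(history, {"price": 100}, "2011-11-02")
apply_type2(history, {"price": 120}, "2011-11-03")
print(first, repeat, changed, len(history))
```

Comparing digests means a republished but unchanged object costs one read and no write, which is the optimization the slide names.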
30. Generic Thrift Persistence in HBase
Column Name  Value
[ModelYear]|F:id|T:long|I:0  1368
[ModelYear]|F:midYear|T:boolean|I:1  false
[ModelYear]|F:year|T:int|I:2  1993
[ModelYear]|F:name|T:java.lang.String|I:4  Celica
[ModelYear]#[attributss][0]|F:_key|T:java.lang.Long  64
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1  Standard Sport V:GT-S 2dr
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1  Hatchback
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2  441
[ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1  V:GT-S
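The column names above follow a visible pattern: a root name in brackets, one `#[field][key]` segment per nested container, then `|F:` for the field name and `|T:` for its type. A minimal Python sketch of that flattening is shown below. It is an approximation, not the Edmunds implementation: it works on plain dictionaries and lists rather than Thrift objects, it emits Python type names, and it omits the `|I:` suffix because plain dictionaries carry no field index. The `flatten` function and the sample record are hypothetical.

```python
def flatten(root_name, record):
    """Return {column_name: value} for a nested record of scalars, lists, and maps."""
    columns = {}

    def walk(prefix, node):
        for field, value in node.items():
            if isinstance(value, list):
                children = enumerate(value)   # list element -> [field][index]
            elif isinstance(value, dict):
                children = value.items()      # map entry -> [field][key]
            else:
                type_name = type(value).__name__
                columns[f"{prefix}|F:{field}|T:{type_name}"] = value
                continue
            for key, child in children:
                # Each child is assumed to be a nested record (a dict).
                walk(f"{prefix}#[{field}][{key}]", child)

    walk(f"[{root_name}]", record)
    return columns


# Illustrative record; the values are sample data, not the real object behind the slide.
model_year = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "name": "Celica",
    "styles": [{"id": 441, "name": "GT-S"}],
}
for column, value in flatten("ModelYear", model_year).items():
    print(column, value)
```

Because every leaf becomes its own self-describing column, a generic consumer can persist any object type without a per-type table definition.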
31. Netezza: Time is Money
Compared to Oracle: Up to 12x faster load times
Business Value:
• Can reload data more frequently
• Failed workflows are no longer a big problem
• Helps in transition to real time system: We can now create intraday reports for Leads!
Compared to Oracle: Up to 400x faster query times
Business Value:
• More productive Business Intelligence
• Queries that could ‘never’ finish in Oracle are now providing business value
32. Generic and reusable Oozie actions for Netezza
• Oozie Load and Remove Action
• Apache CLI
• Nzload and Nzsql (provisioned on worker nodes using Chef)
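A generic load or remove action ultimately has to assemble an `nzload` or `nzsql` command line from its parameters. The Python sketch below shows that assembly step only and does not run anything. The slide suggests the real actions were built with Apache CLI, so this is a stand-in rather than the Edmunds code. The flag names (`-host`, `-db`, `-t`, `-df`, `-delim`, `-d`, `-c`) are written from memory of the Netezza command-line tools and should be treated as assumptions to verify against your installed version. The function names, host, table, and path are hypothetical, and credentials are left out on purpose.

```python
def build_nzload_command(host, database, table, data_file, delimiter="|"):
    """Argument list for loading one delimited file into one table (flags are assumed)."""
    return [
        "nzload",
        "-host", host,
        "-db", database,
        "-t", table,
        "-df", data_file,
        "-delim", delimiter,
    ]


def build_remove_command(host, database, table, where_clause):
    """Argument list for deleting a slice of a table before a reload (flags are assumed)."""
    # The table name and WHERE clause are assumed to come from trusted workflow config.
    sql = f"DELETE FROM {table} WHERE {where_clause};"
    return ["nzsql", "-host", host, "-d", database, "-c", sql]


load_cmd = build_nzload_command("nz.example.internal", "dw", "inventory_raw", "/tmp/inventory.psv")
remove_cmd = build_remove_command("nz.example.internal", "dw", "inventory_raw", "load_date = '2011-11-08'")
print(" ".join(load_cmd))
print(" ".join(remove_cmd))
```

Keeping the action generic means only the table, file, and predicate change between workflows. Pairing a remove with a load is one way to make a reload safe to repeat after a failed run.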
33. Value
o Data warehouse proves product value both
internally and to our customers
o Failing fast and quick turnaround allow us to
know when we are building the right reporting
and analytical products without a large up-front
investment
o By combining all data in a single system we are
enabling new products to be developed that we
previously could not
34. Building Scalable Data Platforms
Hadoop and Netezza Deployment Models
Krishnan Parasuraman, @kparasuraman
Greg Rokita, Edmunds.com