3. Why Hadoop and Why Now?
THE ADVANTAGES:
Cost reduction
Alleviate performance bottlenecks
ETL too expensive and complex
Mainframe and Data Warehouse processing → Hadoop
THE CHALLENGE:
Traditional enterprises' lack of awareness
THE SOLUTION:
Leverage the growing support system for Hadoop
Make Hadoop the data hub in the Enterprise
Use Hadoop for processing batch and analytic jobs
4. The Classic Enterprise Challenge
The Escalating Data Challenge:
• Growing Data Volumes
• Tight IT Budgets
• Shortened Processing Windows
• Latency in Data
• Escalating Costs
• ETL Complexity
• Hitting Scalability Ceilings
• Demanding Business Requirements
5. The Sears Holdings Approach
Key to our Approach:
1) allowing users to continue to use familiar consumption interfaces
2) providing inherent HA
3) enabling businesses to unlock previously unusable data
1) Implement a Hadoop-centric reference architecture
2) Move enterprise batch processing to Hadoop
3) Massively reduce ETL by transforming within Hadoop
4) Move results and aggregates back to legacy systems for consumption
5) Retain, within Hadoop, source files at the finest granularity for re-use
6) Make Hadoop the single point of truth
6. The Architecture
• Enterprise solutions using Hadoop must be an
eco-system
• Large companies have a complex environment:
– Transactional systems
– Services
– EDW and Data marts
– Reporting tools and needs
• We needed to build an entire solution
8. The Learning
Over two years of experience using Hadoop for Enterprise legacy workloads.
HADOOP
✓ We can dramatically reduce batch processing times for mainframe and EDW
✓ We can retain and analyze data at a much more granular level, with longer history
✓ Hadoop must be part of an overall solution and eco-system
IMPLEMENTATION
✓ We can reliably meet our production deliverable time-windows by using Hadoop
✓ We can largely eliminate the use of traditional ETL tools
✓ New tools allow an improved user experience on very large data sets
UNIQUE VALUE
✓ We developed tools and skills – The learning curve is not to be underestimated
✓ We developed experience in moving workload from expensive, proprietary mainframe and EDW platforms to Hadoop with spectacular results
10. The Challenge – Use-Case #1
Sales: 8.9B Line Items
Offers: 1.4B SKUs
Price Sync: Daily
Elasticity: 12.6B Parameters
Items: 11.3M SKUs
Stores: 3200 Sites
Timing: Weekly
Inventory: 1.8B rows
• Computationally intensive, with large storage requirements
• Needed to calculate item price elasticity based on 8 billion rows of sales data
• Could only be run quarterly and on a subset of data – Needed more often
• Business need – React to market conditions and new product launches
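The deck does not show the job itself, but the weekly input build for a model like this can be sketched in Pig Latin, the language the deck says was used for its migrations. All paths, schemas, and field names below are illustrative, not Sears' actual ones:

```pig
-- Hypothetical sketch: roll raw store-item sales up into weekly
-- observations that an elasticity model could consume.
sales   = LOAD '/data/sales/week_current' USING PigStorage('\t')
          AS (store_id:chararray, sku:chararray, units:long, price:double);
by_item = GROUP sales BY (store_id, sku);
weekly  = FOREACH by_item GENERATE
              FLATTEN(group) AS (store_id, sku),
              SUM(sales.units) AS units_sold,
              AVG(sales.price) AS avg_price;
STORE weekly INTO '/data/elasticity/model_input' USING PigStorage('\t');
```

Because the aggregation is expressed as a data flow, the same script scales from the quarterly subset to the full 8.9B-row set simply by pointing it at more input.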
11. The Result – Use-Case #1
Business Problem:
• Computationally intensive, with large storage requirements
• Needed to calculate store-item price elasticity based on 8 billion rows of sales data
• Could only be run quarterly and on a subset of data
• Business missing the opportunity to react to changing market conditions and new product launches
The Hadoop Result:
• Price elasticity calculated weekly
• New business capability enabled
• 100% of data set and granularity
• Meets all SLAs
12. The Challenge – Use-Case #2
Mainframe → Hadoop
Data Sources: 30+
Input Records: Billions
Scalability: Unable to scale 100-fold on 1% of data
Mainframe: 100 MIPS
• Mainframe batch business process would not scale
• Needed to process 100 times more detail to handle business-critical functionality
• Business need required processing billions of records from 30 input data sources
• Complex business logic and financial calculations
• SLA for this cyclic process was 2 hours per run
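The result slide for this use-case names the pattern: Pig for the data flow, Java UDFs for the financial calculations. A minimal sketch of that pattern, with a hypothetical jar, UDF class, and schema standing in for the real ones:

```pig
-- Hypothetical sketch: Pig handles the flow, a Java UDF holds the
-- financial logic. Jar name, class, and fields are illustrative.
REGISTER financial-udfs.jar;
DEFINE NetCharge com.example.udf.NetCharge();   -- hypothetical UDF

records = LOAD '/data/batch/input' USING PigStorage('|')
          AS (acct:chararray, txn_type:chararray, amount:double);
valid   = FILTER records BY amount IS NOT NULL;
charged = FOREACH valid GENERATE acct,
              NetCharge(txn_type, amount) AS net_amount;
by_acct = GROUP charged BY acct;
totals  = FOREACH by_acct GENERATE group AS acct,
              SUM(charged.net_amount) AS total_net;
STORE totals INTO '/data/batch/output' USING PigStorage('|');
```

Keeping the calculation in one UDF and the movement in Pig is what makes a 6000-line batch program compressible to a few hundred lines of script.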
13. The Result – Use-Case #2
Business Problem:
• Mainframe batch business process would not scale
• Needed to process 100 times more detail to handle rollout of high-value, business-critical functionality
• Time-sensitive business need required processing billions of records from 30 input data sources
• Complex business logic and financial calculations
• SLA for this cyclic process was 2 hours per run
The Hadoop Result:
• Teradata & Mainframe Data on Hadoop
• Implemented PIG for Processing
• JAVA UDFs for Financial Calculations
• Scalable Solution in 8 Weeks
• 6000 Lines Reduced to 400 Lines of PIG
• Processing Met Tighter SLA
• $600K Annual Savings
14. The Challenge – Use-Case #3
Data Storage: Mainframe DB2 Tables
Price Data: 500M Records
Processing Window: 3.5 Hours
Mainframe Jobs: 64
Mainframe unable to meet SLAs on growing data volume
15. The Result – Use-Case #3
Business Problem:
• Mainframe unable to meet SLAs on growing data volume
The Hadoop Result:
• Source Data in Hadoop
• Job Runs Over 100% Faster – Now in 1.5 Hours
• $100K in Annual Savings
• Maintenance Improvement – <50 Lines of PIG Code
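A pricing refresh of the shape described (DB2-extracted price records, consolidated from dozens of mainframe jobs into one short script) can fit well under 50 lines of Pig. A hedged sketch, with illustrative names only:

```pig
-- Hypothetical sketch: one Pig job standing in for a chain of
-- mainframe jobs. Paths and schema are illustrative.
prices  = LOAD '/data/prices/db2_extract' USING PigStorage(',')
          AS (sku:chararray, store_id:chararray, price:double,
              effective_date:chararray);
current = FILTER prices BY price > 0.0;
-- keep only the most recent price per (sku, store)
latest  = FOREACH (GROUP current BY (sku, store_id)) {
              ordered = ORDER current BY effective_date DESC;
              top     = LIMIT ordered 1;
              GENERATE FLATTEN(top);
          };
STORE latest INTO '/data/prices/current' USING PigStorage(',');
```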
16. The Challenge – Use-Case #4
Transformation: On Teradata
Teradata via Business Objects
User Experience: Unacceptable
Batch Processing Output: .CSV Files
History Retained: No
New Report Development: Slow
• Needed to enhance user experience and the ability to perform analytics on granular data
• Restricted availability of data due to space constraints
• Needed to retain granular data
• Needed Excel-style interaction, with agility, on data sources of hundreds of millions of records
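In the same spirit as the deck's other migrations, landing the granular data in Hadoop once and emitting a comma-delimited extract for spreadsheet-oriented users could be sketched in Pig (all names hypothetical):

```pig
-- Hypothetical sketch: retain full detail in Hadoop, publish a small
-- CSV summary that Excel or a BI tool can open directly.
detail  = LOAD '/data/warehouse/detail' USING PigStorage('\t')
          AS (region:chararray, sku:chararray, week:chararray,
              revenue:double);
summary = FOREACH (GROUP detail BY (region, week)) GENERATE
              FLATTEN(group) AS (region, week),
              SUM(detail.revenue) AS revenue;
STORE summary INTO '/reports/region_weekly' USING PigStorage(',');
```

The detail set never leaves Hadoop, so history can accumulate without the space constraints that capped the Teradata version.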
17. The Result – Use-Case #4
Business Problem:
• Needed to enhance user experience and the ability to perform analytics on granular data
• Restricted availability of data due to space constraints
• Needed to retain granular data
• Needed Excel-style interaction, with agility, on data sources of hundreds of millions of records
The Hadoop Result:
• Sourcing Data Directly to Hadoop
• Redundant Storage Eliminated
• Transformation Moved to Hadoop
• User Experience Expectations Met
• Over 50 Data Sources Retained in Hadoop
• Granular History Retained
• Business's Single Source of Truth
• Datameer for Additional Analytics
• PIG Scripts to Ease Code Maintenance
18. Summary
• Hadoop can handle Enterprise workload
• Can reduce strain on legacy platforms
• Can reduce cost
• Can bring new business opportunities
• Must be an eco-system
• Must be part of an overall data strategy
• Not to be underestimated
19. The Horizon – What do we need next?
• Automation tools and techniques that ease the
Enterprise integration of Hadoop
• Educate traditional Enterprise IT organizations
about the possibilities and reasons to deploy
Hadoop
• Continue development of a reusable framework
for legacy workload migration
20. For more information, visit:
www.metascale.com
Follow us on Twitter @BigDataMadeEasy
Join us on LinkedIn: www.linkedin.com/company/metascale-llc