3. Finding Business Pains
• Frequent or near-term EDW expansion/spend
• Short time windows for data
• SLA challenges with ELT
• Reports/analytics that are “too big”
• Compliance issues requiring long-term storage AND query
• Resource restrictions/contention or disenfranchised/frustrated users
4. Common Challenges with the Data Warehouse
[Diagram: OLTP and enterprise applications feed the data warehouse via extract, transform, and load; business intelligence tools query the warehouse.]
1. Slow data transformations, missed SLAs.
2. Slow queries, poor QoS, and missed opportunities.
3. Wrong or incomplete, modified copies are made.
4. Must archive. Archived data can’t provide value.
5. Constant pressure to buy additional warehouse capacity, just to maintain current quality of service.
NO room to expand use cases. NO room to innovate.
5. An EDH Complements the Data Warehouse
[Diagram: OLTP and enterprise applications feed both the data warehouse and a Cloudera enterprise data hub; the EDH handles extract, load, transform, and query alongside the warehouse and business intelligence tools.]
1. Data loaded when & where it’s needed.
2. Empowered business analysts.
3. Avoid “spreadmarts” across departments.
4. Complete view of all your products, customers, etc.
5. Cost-effective, infinitely scalable, production-ready enterprise data hub for all your data.
All data. All users.
7. 2014 Gartner MQ for Data Warehouse DBMS
“A data warehouse DBMS is now expected to coordinate data virtualization strategies, and distributed file and/or processing approaches, to address changes in data management and access requirements.”
9. Understanding Benefits for Your Organization
• Help You Assess Your Enterprise Data Warehouse Ecosystem
• Identify Viable Migration Candidates and Target Reference Architecture
• Develop a Project Plan to Deliver the Full Scope of Benefits
• Understand the Business Case for Making the Investment
10. Working With You Through the EDW Assessment Process
Information
• Collect information about your EDW environment
Analysis
• Identify migration candidates
• Determine feasibility
Recommendations
• Develop a migration plan
• Establish a business case
12. Key Hadoop Platform Requirements
• High availability
• Disaster recovery
• Downtime-less upgrades
• Auditability
• Low-latency SQL & BI support
• Deep SAS & R support
13. Customers Agree: Cloudera Delivers (Customer / Workload / Results)
• Leading Payments Company / Analytics, ETL Processing, DR: largest fraud discovery in firm history; time to report collapsed from 2 days to 2 hours; saved $30M on DR
• Global Money Center Bank / Data Processing (ELT): avoided tens of millions in expansion purchases; 42% faster processing
• Mobile Device Manufacturer / Data Processing (ELT): offloaded 90% of data volume; kept all data
• Fortune 500 Retailer / Analytics: more insights by supporting more exploration of more extensive & granular data
• Leading Financial Regulator / Data Processing (ELT) and DR: shrank EDW footprint by 4PB; 20X performance boost
14. Assessing Workloads and Data
[Diagram: data warehouse workloads (Operational Business Intelligence, Analytics, Self-Service BI, Data Processing (ELT)) mapped against data tiers (staged data, operational data, archival data).]
• Data Processing (ELT)
• Staged data, to be processed
• Temp tables, BLOB/CLOB types, etc.
• Analytics / Machine Learning
• Deep and broad data sets, within and beyond the warehouse
• Self-Service BI (Ad-Hoc Query)
• Operational data, actively used for BI
• Archival data, inactively used for BI
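The staged/operational/archival split above can be sketched as a simple rule of thumb. This is a minimal, illustrative Python sketch, not a Cloudera tool; the table names, sizes, dates, and the one-year archival threshold are all hypothetical assumptions:

```python
from datetime import date, timedelta

# Hypothetical table metadata pulled from an EDW catalog:
# name, size in GB, and last access date (all illustrative).
TABLES = [
    {"name": "stg_orders_tmp", "gb": 120, "last_access": date(2014, 6, 1)},
    {"name": "dim_customer", "gb": 40, "last_access": date(2014, 6, 28)},
    {"name": "txn_2009_archive", "gb": 900, "last_access": date(2013, 1, 15)},
]

def classify(table, today=date(2014, 7, 1)):
    """Bucket a table as staged, operational, or archival."""
    if table["name"].startswith(("stg_", "tmp_")):
        return "staged"      # data-processing (ELT) candidate
    if today - table["last_access"] > timedelta(days=365):
        return "archival"    # inactively used for BI
    return "operational"     # actively used for BI

buckets = {t["name"]: classify(t) for t in TABLES}
```

In a real assessment, the same bucketing would be driven by catalog metadata and query logs rather than naming conventions alone.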
15. Offload Data Processing (ELT)
What to Migrate:
• High-scale batch data processing
• Implemented as SQL + scripting or ETL running on expensive HW infrastructure
• Staging data stored across diverse temp tables
Influencing Factors:
• High fraction of overall EDW utilization (25–80%)
• Difficult to store and manage staging data in relational form
• Limited user-adoption risk to migrate
Better in Cloudera:
• ETL tools to simplify migration
• Over 2X the performance
• 1/10th the cost
Only Available with Cloudera Partners:
• Reliability for mission-critical workloads: high availability, disaster recovery, downtime-less upgrades
• Low-latency SQL processing, ability to absorb short-cycle ELT
• Broad support of leading data integration tools
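The “SQL + scripting” staging work described above is essentially batch group-by/aggregate processing, which is why it ports well to Hadoop. As a minimal stdlib-Python sketch of the shape of such a job (the field names and rows are hypothetical, standing in for a staging table):

```python
from collections import defaultdict

# Illustrative staging data: raw transaction rows that a nightly
# ELT job would roll up into a daily summary table.
raw_rows = [
    {"day": "2014-06-30", "store": "A", "amount": 10.0},
    {"day": "2014-06-30", "store": "A", "amount": 5.5},
    {"day": "2014-06-30", "store": "B", "amount": 7.0},
]

def daily_summary(rows):
    """Equivalent of: SELECT day, store, SUM(amount) ... GROUP BY day, store."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["day"], r["store"])] += r["amount"]
    return {key: round(total, 2) for key, total in totals.items()}

summary = daily_summary(raw_rows)
```

On the cluster, the same logic would typically run as a Hive or Impala query over the staged data rather than in-process Python.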
16. Offload Self-Service Business Intelligence
Workload:
• Self-Service BI, Exploratory BI, Data Discovery
• Uncertain business questions and uncertain data
Migration Priority:
• Fastest growing workload for many warehouses
• Comparable support for end-user tools between Cloudera and DBMS products
Better in Cloudera:
• Schema flexibility
• End-user self-service on full-fidelity data
• 1/10th the cost
Only Available with Cloudera Partners:
• Open source parallel interactive SQL engine: Cloudera Impala
• Integration and certification of every leading SSBI vendor
17. Offload Analytics / Machine Learning
Workload and Data:
• Training & scoring predictive models
• Deep and broad data sets, within and beyond the warehouse
Influencing Factors:
• Statisticians want unconstrained analysis; limited DW compute resources
• Paying top dollar for warehouse data storage only to load into ML tools
• Inability to analyze data beyond the warehouse
Better in Cloudera:
• Greater user productivity (pre-packaged ML libraries, no more down-sampling)
• Support for 3rd-party ML tools
• Greater flexibility (SQL + MR + SAS procs)
• 1/10th the cost
Only Available with Cloudera Partners:
• Ability to run SAS, R natively on the same cluster
• Interactive search and SQL experience for data exploration
• Built-in analytics libraries (Mahout, DataFu, Cloudera ML)
• Support from Cloudera’s Data Science team
18. Sample Cloudera Tools for Assisting Migration
• High-speed connector – moves data between the two systems
• Data definition – tool for mapping EDW tables & datatypes to Hive tables & datatypes
• Mainframe input/output format – supports direct feed of mainframe data into Cloudera
• Result validation – verifies SQL applications in Cloudera produce the same results as the original applications
• Support for SQL-H (planned) – remote queries from EDW to Cloudera
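The result-validation idea above boils down to proving that a migrated SQL application returns the same rows as the original. One common approach is an order-insensitive fingerprint of each result set; this is a hedged sketch of that technique, not Cloudera’s actual tool, and the sample rows are invented:

```python
import hashlib

def result_fingerprint(rows):
    """Order-independent fingerprint of a result set.

    Each row is hashed individually, the digests are sorted (so row
    order does not matter), and the sorted digests are hashed again.
    """
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()

# In practice these would come from the EDW and from Impala/Hive cursors.
edw_rows = [(1, "alice", 40.0), (2, "bob", 7.5)]
cloudera_rows = [(2, "bob", 7.5), (1, "alice", 40.0)]  # same rows, other order

match = result_fingerprint(edw_rows) == result_fingerprint(cloudera_rows)
```

A production validator would also need to normalize datatype differences (e.g., DECIMAL precision, NULL encodings) before hashing.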
20. Is Your Data Architecture Aligned to Your Use Case?
Lay the Foundation for Data Migration and Ensure Success
• Install and configure CDH and Cloudera Manager
• Run standard and specialized performance tests
• Recommend tuning, compression and decompression, and scheduler configurations
• Document recommended cluster configuration
• Train and certify Hadoop administrators
21. How Quickly and Securely Can You Transition Your Data?
Migrate Disparate Data Sources to Boost Performance
• Collect low-efficiency data from various silos
• Redeploy latent data from EDWs, RDBMSs, and Hadoop environments
• Develop, test, and implement data processing jobs
• Integrate Hadoop with relevant external systems
• Document workload migration
22. Is Your Operational Environment Ready for Handover?
Maximize ROI by Rationalizing All Systems, Teams, and Workloads
• Review current and future requirements
• Review the full ecosystem, all jobs, and regular processes
• Review application architecture, ingestion pipeline, data schema, and data partitioning system
• Review key management and monitoring processes and relevant production procedures
• Recommend additional training to assure Hadoop expertise on management and operations teams
• Document cluster configuration, solutions implementation, and production recommendations
23. How Much Additional Value Can You Capture Long-Term?
Ongoing Optimization Is Key to Deferring Additional Cost
• Expand the framework without expanding the footprint
• Rationalize beyond the initial burn-in period
• Evolve the cluster to support additional use cases
• Benchmark performance annually against diagnostics
• Balance business opportunity against technical risk
25. Prioritizing Workloads and Data
1. Current EDW Constraints
• Focus on computation constraints
• Focus on disk space constraints
2. Workload Transferability
• Similar or same SQL functionality
• Similar or same tools support
• Opportunity for performance gains
3. User Communities
• Group related workloads by user community
• Migrate one community at a time
26. The Optimization Process
Profile:
• Analyze all of the workload in your data warehouse: queries, objects, user communities
Prioritize:
• Framework-driven methodology for ordering workloads
• Balance financial opportunity with business risk
Migrate:
• Set up data ingest paths to Cloudera
• Map EDW workload to Cloudera
Validate:
• Verify results
• Evaluate performance differences & tune
• Side-by-side “burn-in” period
• Cut-over
Repeat annually to defer additional expansion.
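The “Prioritize” step above balances financial opportunity against business risk. One simple way to make that trade-off concrete is a score of expected savings discounted by migration risk; the workloads, dollar figures, and risk weights below are hypothetical placeholders, not Cloudera’s methodology:

```python
# Hypothetical candidate workloads from the profiling step:
# estimated annual savings if offloaded, and a 0-1 migration risk.
WORKLOADS = [
    {"name": "nightly_elt", "annual_savings": 900_000, "risk": 0.2},
    {"name": "ssbi_adhoc", "annual_savings": 400_000, "risk": 0.4},
    {"name": "fraud_ml", "annual_savings": 600_000, "risk": 0.7},
]

def priority(workload):
    """Risk-discounted savings: high savings and low risk migrate first."""
    return workload["annual_savings"] * (1.0 - workload["risk"])

ranked = sorted(WORKLOADS, key=priority, reverse=True)
order = [w["name"] for w in ranked]
```

A framework-driven methodology would add more factors (SQL transferability, user-community grouping, tool support), but the scoring structure stays the same.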
27. Sample EDW Rationalization Process
[Timeline: four quarters (months M1–M12), delivering releases 1 through N iteratively.]
People:
• Program Management – responsible for overall program success, resource assignment, project management, and risk mitigation
• Cloudera Migration Teams – expert resources delivering the initial project framework and advanced implementation releases
• ${Customer} Migration Teams – customer staff resources, taking on increasing responsibility for release implementation over time
Process:
• Management & risk mitigation
• Initial EDW assessment
• Architecture oversight
• Assessment and stratification process
• Detailed workload analysis
Technology:
• Implement reference architecture
• Establish repeatable migration approach
• Enhance SDLC, release, and configuration management processes
• Releases 1 through N
Migration SDLC (per release): assignment/kick-off, execution, testing, user acceptance, documentation, sign-off.
28. Workload Classification
• Cloudera Architecture – implementing Cloudera’s reference architecture(s) and building an environment to fit unique customer requirements
• Data Ecosystem Integration – BI, ETL, and other applications that require integration with the big data platform, including the existing EDW
• Data Processing – high-scale batch data processing, implemented as SQL + scripting or via ETL tools, with staging data stored across diverse temp tables
• Self-Service BI – exploratory BI, data discovery, uncertain business questions and uncertain data
• Analytics – training & scoring predictive models, deep and broad data sets (within and beyond the warehouse)
• Archival Processes – traditional archive storage and processes
29. Workload Complexity
Basic:
• Leverages pre-existing architecture and integrations
• Utilizes all off-the-shelf components
• Repeatable solutions from existing training/documentation
Moderate:
• Requires minimal modifications to existing architecture, integrations, or other dependencies
• Some expertise required for new design decisions
Advanced:
• Establishing new reference architectures
• Several new design decisions involved
• Unique skill sets required (e.g., machine learning)
30. Sample Complexity vs. Time for Various Project Types
[Chart: complexity of task (low/moderate/high) plotted against estimated phase (1–4) for each project type below.]
• Machine learning modeling
• Graph analytics modeling
• Hadoop cluster install/config
• One-off ingest/ETL processes
• Predictive analytics modeling
• Production certification
• Hadoop storage schemas
• Decision tree/forest/ensemble
• Data pipelining
• Generic ingest/ETL processes
31. Mapping Resources to Project Task Type
[Chart: complexity of task (low/moderate/high) vs. estimated phase (1–4), with roles mapped to regions of the chart.]
• Data Scientist
• Senior Architect
• Consultant
• Architect
• Principal Architect
32. 32
Developers AdminData Warehouse
Specialist
Architects
Technology & Ops
Management & Leadership
Big Data
Visionary
Executive
Sponsor
Program
Manager
Business & Data
Lead Data
Scientist
Lead Business
Analyst
LOB Rep
LOB Rep
LOB Rep
Data
Wranglers
Typical Big Data COE Program Roles
Staff Centrally and Train to Scale
33. Benefits Summary
1. Lower costs of data management, growth
2. Improve quality of service
• Meet critical data processing SLAs
• Faster BI queries
3. Extend existing warehouse capacity
• Increase ROI from current investments
• More operational data – volume and schemas
• More business intelligence and analytics workloads
4. Retain all data for analysis
5. Deliver a foundation for innovation
• Bring more applications to Hadoop data for low incremental cost
In this session, we will explore using Hadoop to address questions and issues surrounding:
• Cost of storage
• Value of accessibility
• Getting maximum return on your IT investments and all of your data