2. The Enterprise Data Warehouse
[Diagram. Data sources: ERP, CRM, RDBMS, machines; files, images, videos, logs, clickstreams; external data sources. Systems: servers, marts, DW, documents, storage, search, archive.]
Complex Architecture
•Many special-purpose systems, silos of data
•Moving data around
•No complete views
4
Visibility
•Leaving data behind
•Risk and compliance
•High cost of storage
1
Time to Data
•Up-front modeling
•Transforms slow
•Transforms lose data
2
Cost of Analytics
•Existing systems strained
•No agility
•BI backlog
3
3. Cloudera for the Enterprise Data Hub
Multi-workload analytic platform
•Bring applications to data
•Combine different workloads on common data (e.g. SQL + Search)
•True BI agility
4
Active archive
•Full fidelity original data
•Indefinite time, any source
•Lowest cost storage
1
Data management, transforms
•One source of data for all analytics
•Persist state of transformed data
•Significantly faster & cheaper
2
Self-service exploratory BI
•Simple search + BI tools
•“Schema on read” agility
•Reduce BI user backlog requests
3
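The "schema on read" bullet above can be illustrated with a minimal sketch. It assumes raw records are kept unchanged and that each consumer supplies its own column names and types at query time. The sample events, the schema names, and the `read_with_schema` helper are all hypothetical and are not part of any Cloudera API.

```python
import csv
import io

# Raw events kept exactly as they arrived (full fidelity); values are made up.
RAW_EVENTS = """\
2014-03-01,alice,login,
2014-03-01,bob,purchase,19.99
2014-03-02,alice,purchase,5.00
"""


def read_with_schema(raw, schema):
    """Apply a schema, given as (column name, converter) pairs, at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        row = {}
        # zip stops at the shorter input, so a narrow schema ignores extra fields.
        for (name, convert), value in zip(schema, record):
            row[name] = convert(value) if value != "" else None
        rows.append(row)
    return rows


# Two different views over the same untouched raw data.
ACTIVITY_SCHEMA = [("day", str), ("user", str), ("action", str)]
SALES_SCHEMA = ACTIVITY_SCHEMA + [("amount", float)]

activity = read_with_schema(RAW_EVENTS, ACTIVITY_SCHEMA)
sales = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
total_sales = sum(r["amount"] for r in sales if r["amount"] is not None)
```

Neither view required remodeling or reloading the stored data; a new question only needs a new schema definition, which is the agility the slide refers to.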
[Diagram. Data sources: ERP, CRM, RDBMS, machines; files, images, videos, logs, clickstreams; external data sources. Systems: servers, marts, DW, documents, storage, search, archive.]
6. EDW optimisation: Active Archive
What to Migrate
•Archive datasets
•Infrequently accessed tables
•Large corpus of data
Influencing Factors
•Frequency of data access
•Changing regulatory compliance requirements
•Data volume growth
Better in Cloudera
•Data remains accessible
•Data is not lost
•1/10th the cost
Only Available with Cloudera
•Reliability for mission-critical workloads: high availability, disaster recovery, downtime-less upgrades
•Low-latency SQL processing, ability to absorb short-cycle ELT
•Broad support of leading data integration tools
Key Partners
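As a rough illustration of choosing what to migrate on this slide, the sketch below ranks warehouse tables by how long they have gone unaccessed and by size. The table names, sizes, dates, and the 180-day threshold are invented for the example; a real selection would draw on the warehouse's own access statistics and compliance rules.

```python
from datetime import date

# Hypothetical table statistics: (name, size in GB, last access date).
TABLES = [
    ("orders_2009", 800, date(2013, 1, 15)),
    ("orders_current", 120, date(2014, 3, 1)),
    ("web_logs_2011", 2500, date(2012, 6, 30)),
    ("customers", 40, date(2014, 3, 2)),
]


def archive_candidates(tables, today, min_idle_days=180):
    """Return (name, size_gb, idle_days) for idle tables, largest first."""
    idle = [
        (name, size_gb, (today - last_access).days)
        for name, size_gb, last_access in tables
        if (today - last_access).days >= min_idle_days
    ]
    return sorted(idle, key=lambda t: t[1], reverse=True)


candidates = archive_candidates(TABLES, today=date(2014, 3, 3))
freed_gb = sum(size_gb for _, size_gb, _ in candidates)
```

Sorting by size puts the tables that free the most warehouse storage first; a different ordering (for example by idle time) may suit other priorities.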
7. EDW optimisation: Transformation
What to Migrate
•High-scale batch data processing
•Implemented as SQL + scripting or ETL running on expensive HW infrastructure
•Staging data stored across diverse temp tables
Influencing Factors
•High fraction of overall EDW utilization (25–80%)
•Difficult to store, manage staging data in relational form
•Limited user adoption risk to migrate
•ETL tools to simplify migration
Better in Cloudera
•Over 2X the performance
•1/10th the cost
•Persistent staging, tracked lineage
Only Available with Cloudera
•Reliability for mission-critical workloads: high availability, disaster recovery, downtime-less upgrades
•Low-latency SQL processing, ability to absorb short-cycle ELT
•Broad support of leading data integration tools
Key Partners
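The idea of persistent staging with tracked lineage from this slide can be sketched as follows. This is an in-memory illustration only: the `StagingStore` class, the dataset names, and the sample rows are hypothetical, and a real deployment would persist both the data and the lineage records on the cluster rather than in Python dictionaries.

```python
import hashlib
import json


class StagingStore:
    """In-memory stand-in for persistent staging storage (hypothetical)."""

    def __init__(self):
        self.datasets = {}
        self.lineage = []

    def put(self, name, rows, parents=(), step=None):
        """Keep a staged dataset and record which inputs and step produced it."""
        self.datasets[name] = rows
        payload = json.dumps(rows, sort_keys=True).encode()
        self.lineage.append({
            "dataset": name,
            "parents": list(parents),
            "step": step,
            "rows": len(rows),
            "sha256": hashlib.sha256(payload).hexdigest(),
        })

    def trace(self, name):
        """Return the dataset name followed by its upstream datasets."""
        by_name = {rec["dataset"]: rec for rec in self.lineage}
        order, queue = [], [name]
        while queue:
            current = queue.pop(0)
            if current in order:
                continue
            order.append(current)
            queue.extend(by_name[current]["parents"])
        return order


def parse_amounts(rows):
    """Convert amounts to floats, skipping rows that fail to parse."""
    parsed = []
    for row in rows:
        try:
            parsed.append({"id": row["id"], "amount": float(row["amount"])})
        except ValueError:
            continue
    return parsed


store = StagingStore()
raw = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "bad"}, {"id": 3, "amount": "4"}]
store.put("raw_orders", raw)
clean = parse_amounts(raw)
store.put("clean_orders", clean, parents=["raw_orders"], step="parse_amounts")
total = sum(r["amount"] for r in clean)
store.put("order_totals", [{"total": total}], parents=["clean_orders"], step="sum_amounts")
upstream = store.trace("order_totals")
```

Because every intermediate result stays in the store, the row dropped by `parse_amounts` is still available in `raw_orders`, and `trace` shows which steps led to the final total.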
8. EDW optimisation: Self Service BI
Workload
•Self-Service BI, Exploratory BI, Data Discovery
•Uncertain business questions and uncertain data
Migration Priority
•Fastest growing workload for many warehouses
•Comparable support for end user tools between Cloudera and DBMS products
Better in Cloudera
•Schema flexibility
•End user self-service on full fidelity data
•1/10th the cost
Only Available with Cloudera
•Open source parallel interactive SQL engine: Cloudera Impala
•Integration and certification of every leading SSBI vendor
Key Partners
9. EDW optimisation: Multi-workload
Workload and Data
•Training & scoring predictive models
•Deep and broad data sets, within and beyond the warehouse
Influencing Factors
•Statisticians want unconstrained analysis; limited DW compute resources
•Paying top dollar for warehouse data storage only to load into ML tools
•Inability to analyze data beyond the warehouse
Better in Cloudera
•Greater user productivity (pre-packaged ML libraries, no more down-sampling)
•Support for 3rd-party ML tools
•Greater flexibility (SQL + MR + Search + Spark + SAS procs)
•1/10th the cost
Only Available with Cloudera
•Ability to run SAS, R natively on the same cluster
•Interactive search and SQL experience for data exploration
•Built-in analytics libraries (Mahout, DataFu, ClouderaML)
•Support from Cloudera’s Data Science team
Key Partners
10. Why EDW optimisation?
1.Lower costs of data management, allow growth
2.Improve quality of service
•Shorten ETL windows
•Faster BI queries
3.Extend existing warehouse capacity
•Increase ROI from current investments
•More operational data – volume and schemas
•More business intelligence and analytics workloads
4.Retain all data for more varied analysis
5.Deliver a foundation for innovation
•Bring more applications to Hadoop data for low incremental cost
11. Customers agree, Cloudera delivers
Customer | Workload | Results
Leading Payments Company | Analytics, ETL Processing, DR | Largest fraud discovery in firm history; time to report collapsed from 2 days to 2 hours; save $30M on DR
Global Money Center Bank | Data Processing (ELT) | Avoided tens of millions in expansion purchases; 42% faster processing
Mobile Device Manufacturer | Data Processing (ELT) | Offloaded 90% of data volume; keep all data
Fortune 500 Retailer | Analytics | More insights by supporting more exploration of more extensive & granular data
Leading Financial Regulator | Data Processing (ELT) and DR | Shrank EDW footprint by 4PB, 20X perf. boost