1. © Cloudera, Inc. All rights reserved.
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Alex Gutow | Product Marketing
Greg Rahn | Product Management
2.
What’s Driving Analytics to the Cloud?
Big data deployments in the cloud are accelerating:
● Increased Agility: End-user self-service
● Elasticity: Optimize infrastructure usage
● Lower Overall TCO
● Executive Mandate: Minimize on-prem datacenter footprint
3.
POLL
What best describes your cloud strategy today?
• Standardizing on single-cloud deployment (now/future)
• Multi-cloud deployment (now/future)
• Hybrid on-prem and cloud (now/future)
• Not sure
• No cloud plans yet
4.
Key Applications
EDW Optimization: Use your EDW more efficiently by offloading workloads to Hadoop
Data Preparation: Fast, flexible ETL over large data volumes, so data is always ready for your business
Self-Service BI & Exploration: Fastest time-to-insights with a modern analytic database designed with Hadoop’s flexibility and agility
5.
Key Benefits
An analytic database designed for Hadoop
High-Performance BI and SQL Analytics
Flexibility for Data and Use Case Variety
Cost-effective Scale for Today and Tomorrow
Go Beyond SQL with an Open Architecture
6.
Anatomy of an Analytic Database
Cloudera Decoupled by Design
Monolithic Analytic Database: Query Engine, Storage Engine, Catalog
Modern Analytic Database: Query Engine (Impala), Catalog (HMS), Storage (Kudu), Storage (S3), Storage (HDFS)
7.
Traditional Monolithic Analytic Databases
No Cloud Elasticity or Cloud Storage Integration
Rigid Data Model with Tightly Coupled Storage/Compute
Limited to SQL with Data Movement Necessary
Static Sizing
[Diagram labels: COMPUTE, STORE]
8.
Advantages of Our Approach
Cloud-Native & On-Premise
Go Beyond SQL
• Open Architecture: Open formats and open storage
• Shared data across SQL and non-SQL workloads
Data Flexibility
• Faster, more agile data acquisition
• Data portability: Open formats and open storage
Cost-Effective Scalability
• Elastic scale on-prem or in the cloud
• Cloud-native pay-per-use and transience
• Proven at big data scale
Hybrid
• Runs across multi-cloud & on-prem
• Multi-storage over S3, HDFS, Kudu, Isilon, DSSD, etc.
[Diagram label: Shared Data]
9.
Cost-Efficiencies & Flexibility in the Cloud
Primary Analytic Database Patterns
ETL: Reduce Operating Costs
Only pay for what you need, when you need it
▪ Transient clusters
▪ Object storage centric
▪ Cloud-native deployment
BI/Analytics: New Insights, New Revenue
Explore and analyze all data, wherever it lives
▪ Long-running clusters
▪ Object storage or local storage
▪ Lift-and-shift deployment
10.
Benefits Across the Business
ETL and BI/Analytics in the cloud
Add Use Cases, Analytics, and Data On-Demand
• Avoid the IT backlog with instant access to all data
• On-demand clusters query directly on shared object storage
Predictable Results Whenever You Want
• Consistent query performance, even during peak times
• Multi-tenancy via isolated clusters on shared data
Just-in-Time Resources
• Real-time capacity for your needs, as they change
• Elastically grow/shrink your cluster via decoupled architecture
Contention-Free ETL
• ETL anytime without impacting other workloads or risking SLAs
• Separate ETL clusters as needed on shared data
12.
Deployment Patterns: ETL in the Cloud
Transient Batch (most flexible)
Spin up clusters as needed
● On-demand/spot instances
● Usage-based pricing
● Sized for workload
● Cluster per tenant/user
Persistent Batch (most control)
Persistent cluster(s) for frequent ETL
● Reserved instances
● Node-based pricing
● Grow/shrink
● Cluster per tenant group
Persistent Batch on HDFS (fastest)
Top performance for frequent ETL
● Reserved instances
● Node-based pricing
● Grow/shrink
● Shared across tenant groups
[Diagram labels: Object Storage, Batch Cluster, Persistent Cluster, HDFS]
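The choice between usage-based pricing (transient clusters) and node-based pricing (persistent clusters) largely comes down to how many hours per month the cluster is actually busy. The following is a minimal sketch of that break-even arithmetic. The function names and the hourly rates are hypothetical and chosen only for illustration; they are not figures from this deck or from any cloud provider.

```python
HOURS_IN_MONTH = 730.0  # average hours per month, an assumption for this sketch


def monthly_cost_transient(nodes: int, hours_used: float, on_demand_rate: float) -> float:
    """Usage-based pricing: pay only for the hours the cluster exists."""
    return nodes * hours_used * on_demand_rate


def monthly_cost_persistent(nodes: int, reserved_rate: float) -> float:
    """Node-based pricing: pay for every hour of the month, used or not."""
    return nodes * HOURS_IN_MONTH * reserved_rate


def break_even_hours(on_demand_rate: float, reserved_rate: float) -> float:
    """Hours of use per month above which a persistent cluster is cheaper."""
    return HOURS_IN_MONTH * reserved_rate / on_demand_rate


# Hypothetical rates in USD per node-hour, for illustration only.
ON_DEMAND_RATE = 0.50
RESERVED_RATE = 0.30

threshold = break_even_hours(ON_DEMAND_RATE, RESERVED_RATE)
print(f"Persistent clusters are cheaper above {threshold:.0f} hours/month")
```

Under these assumed rates, an ETL job that runs a few hours per night favors a transient cluster, while near-continuous ETL favors a persistent one. Real decisions would also need to account for spot pricing, spin-up delay, and storage costs, which this sketch ignores.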
13.
Deployment Patterns: BI/Analytics in the Cloud
Three Architecture Options to Optimize Price/Performance
Transient BI (infrequent usage)
Spin up clusters when needed
● On-demand instances
● Usage-based pricing
● Grow/shrink
● Cluster per tenant or user
Persistent BI (regular usage) [Default Choice]
Persistent clusters for BI any time
● Reserved instances
● Node-based pricing
● Grow/shrink
● Cluster per tenant group
Persistent BI with Local Storage (fastest)
Max speed for more regular workloads
● Reserved instances
● Node-based pricing
● Less frequent grow/shrink
● Shared cluster for shared local data
[Diagram labels: Object Storage, Transient Cluster, Persistent Cluster, HDFS and/or Kudu]
14.
Persistent BI on Object Storage
Best for elasticity (and speed vs transient)
● This is usually the best choice
● Best when workloads are:
o Flexible and changing
o Frequent during most working days
o Not scheduled for fixed hours
● Benefits include:
o Predictable results readily available
o Full multi-tenant isolation
o Common data in shared object storage
o Grow/shrink for TCO efficiency
● Tradeoffs:
o Per-node performance of object storage (use more, cheaper nodes)
Persistent BI (regular usage) [Default Choice]
Persistent clusters for ready availability
● Reserved instances
● Node-based pricing
● Grow/shrink
● Cluster per tenant group
[Diagram labels: Object Storage, Persistent Cluster]
15.
Persistent BI with Locally-Attached Storage
Best performance for consistent workloads
● Best when workloads are:
o Regular and consistent
o Consistently querying common data
o Tight SLAs for performance
o Fast changing data (that needs Kudu)
o Running without object storage (e.g., Azure, GCE)
● Benefits include:
o Faster performance per node on local data
o Ability to query object storage for the rest of the data
● Tradeoffs:
o Less elastic than object-storage-based clusters
o Less isolation for multi-tenant workloads using the same HDFS data
o Cost of idle nodes if there are off-peak hours
Persistent BI with HDFS (fastest)
Max speed for more regular workloads
● Reserved instances
● Node-based pricing
● Less frequent grow/shrink
● Shared cluster for shared HDFS data
[Diagram labels: Object Storage, Persistent Cluster, HDFS and/or Kudu]
16.
Transient BI on Object Storage
Best TCO for infrequent usage
● Best when workloads are:
o Infrequent or scheduled
● Benefits include:
o Lowest TCO with clusters only when needed
o Full multi-tenant isolation
o Common data in shared object storage
● Tradeoffs:
o Delay to spin up clusters when needed
o Capability of BI users to spin up clusters
o Per-node performance of object storage (use more, cheaper nodes)
Transient BI (infrequent usage)
Spin up clusters when needed
● On-demand instances
● Usage-based pricing
● Grow/shrink
● Cluster per tenant or user
[Diagram labels: Object Storage, Cloudera Director, Shared HMS DB, Transient Cluster]
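The "best when" criteria for the three BI deployment patterns can be read as a simple decision procedure. The sketch below is one simplified reading of those criteria, not an official selection tool. The `Workload` fields and the `choose_bi_pattern` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    frequent: bool                         # runs during most working days
    tight_sla: bool                        # needs the fastest per-node performance
    needs_kudu: bool                       # fast-changing data that needs Kudu
    object_storage_available: bool = True  # False when running without object storage


def choose_bi_pattern(w: Workload) -> str:
    """Map workload traits to one of the three BI deployment patterns."""
    # Local storage wins when performance, Kudu, or missing object storage forces it.
    if w.tight_sla or w.needs_kudu or not w.object_storage_available:
        return "Persistent BI with local storage"
    # The default choice: frequent, flexible workloads on shared object storage.
    if w.frequent:
        return "Persistent BI on object storage"
    # Infrequent or scheduled workloads get the lowest TCO from transient clusters.
    return "Transient BI on object storage"


print(choose_bi_pattern(Workload(frequent=True, tight_sla=False, needs_kudu=False)))
```

The ordering of the checks is a design choice in this sketch: hard requirements (SLAs, Kudu, no object storage) are tested first, and cost optimization only applies once those are ruled out.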
17.
Basic Guidelines for BI in the cloud
• Engine: Impala
• Data Formats:
o Use Parquet, Snappy, and 256 MB blocks for best I/O efficiency (especially on object storage)
o Avoid small files to avoid performance issues (especially with object storage)
• Metadata (Hive metastore):
o Use local RDBMS for clusters that use locally-stored HDFS/Kudu data
o Use local RDBMS for persisted clusters (or transient clusters with Kerberos or Sentry)
o Use a shared RDBMS for transient single-tenant clusters without Kerberos or Sentry
o Avoid sharing RDBMS for non-data sharing clusters
• Deployment and Administration:
o Use Cloudera Manager for persistent clusters
o Use Cloudera Director for transient clusters
o Deploy NLB for persistent clusters as usual (and only when needed for transience)
o Monitor workloads with Cloudera Manager
• Security:
o Transient BI clusters: use VPC with strict security groups
o Persistent BI clusters: (see next slide)
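Two of the guidelines above lend themselves to a quick check: the 256 MB file-size target and the choice between a local and a shared Hive metastore RDBMS. The sketch below is a hedged illustration of both. The function names, the `small_fraction` threshold, and the boolean parameters are assumptions made for this example, and the metastore logic is one reading of the bullets rather than a definitive rule.

```python
import math

TARGET_FILE_BYTES = 256 * 1024 * 1024  # 256 MB, per the data format guideline


def plan_compaction(file_sizes, small_fraction=0.5):
    """Return (small_file_count, target_file_count) for one table or partition.

    small_fraction is a hypothetical threshold: a file counts as "small"
    when it is below that fraction of the 256 MB target.
    """
    small = sum(1 for size in file_sizes if size < small_fraction * TARGET_FILE_BYTES)
    total = sum(file_sizes)
    # Number of roughly 256 MB files the same data would occupy after rewriting.
    target_files = math.ceil(total / TARGET_FILE_BYTES) if total else 0
    return small, target_files


def choose_metastore_db(persistent: bool, local_storage: bool,
                        secured: bool, shares_data: bool) -> str:
    """Pick a metastore RDBMS placement; 'secured' means Kerberos or Sentry."""
    # Any of these conditions calls for a local RDBMS under the guidelines.
    if persistent or local_storage or secured or not shares_data:
        return "local RDBMS"
    # Remaining case: transient, unsecured, data-sharing clusters on object storage.
    return "shared RDBMS"


MB = 1024 * 1024
sizes = [10 * MB] * 100 + [256 * MB] * 2
print(plan_compaction(sizes))
print(choose_metastore_db(persistent=False, local_storage=False,
                          secured=False, shares_data=True))
```

In this example, 102 files holding about 1.5 GB would fit in 6 files at the target size, which is the kind of gap the "avoid small files" guideline is meant to flag.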
18.
Security Guidelines for BI in the cloud
Persistent clusters
• Identity:
o Tie the cluster to the corporate user directory (AD or LDAP) using Linux SSSD
• Authentication:
o Kerberos for internal CDH services
o AD (LDAP) for BI user/tools with Impala
• Authorization:
o Use Sentry for authorization per BI cluster (shared RDBMS with Sentry is not yet available)
• Encryption:
o With AWS, use SSE-S3 server-side encryption
o If you need HDFS-level encryption or keys outside AWS, then use Persisted Cluster with Local Storage using CM
NOTE: if all users for a cluster have access to all data in the cluster, skip steps above and just use VPC with strict security groups
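The security guidelines reduce to a short branch: network isolation alone for transient clusters or clusters where every user may see all data, and the full identity, authentication, authorization, and encryption stack otherwise. The sketch below restates that branch as a checklist generator. The `security_controls` function and its parameters are hypothetical names for illustration, and the returned strings only summarize the bullets above.

```python
from typing import List


def security_controls(persistent: bool, all_users_see_all_data: bool) -> List[str]:
    """Return the controls the guidelines suggest for one BI cluster."""
    # Transient clusters, and clusters where every user may read all data,
    # rely on network isolation alone per the guidelines.
    if not persistent or all_users_see_all_data:
        return ["VPC with strict security groups"]
    # Persistent clusters with per-user data restrictions get the full list.
    return [
        "Identity: AD or LDAP via Linux SSSD",
        "Authentication: Kerberos for CDH services, AD (LDAP) for BI users",
        "Authorization: Sentry per BI cluster",
        "Encryption: SSE-S3 server-side encryption on AWS",
    ]


for control in security_controls(persistent=True, all_users_see_all_data=False):
    print(control)
```

The HDFS-level encryption case (keys outside AWS) is left out of this sketch because it changes the deployment pattern itself rather than adding a control to an existing cluster.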
20.
Helping Companies with a Global Value Chain Save Millions
• 360° view of supply chain process in seconds with data from suppliers, manufacturing, equipment, field service, IoT and repair
• Improved product quality by identifying and addressing supply chain issues in near-real time
• $15-$25 million savings annually for Siemens clients
• Costs 90% less per TB than RDBMS; 75% less per TB than Netezza
CUSTOMER 360
21.
Providing a complete view of consumer watching and buying habits
• Helps customers optimize their ad spend for greater campaign ROI
• Improves processing performance as data volumes double
• Boosts agility and flexibility and reduces risk with hybrid and multi-cloud strategy
CUSTOMER 360
22.
Measure user interaction across the ecosystem, help direct R&D and development spend
• Virtuous cycle: Identify features that facilitate sharing of content that drive new customers
• Real-time streaming and batch data from product logs, web analytics, channel data and ERP
• Impala connects to third-party data wrangling and BI tools for fast reporting
23.
Next Steps
Learn about Operational and Data Engineering Workloads in the Cloud
• www.cloudera.com/about-cloudera/events/webinars/cloud-webinar-series.html
Get Started with Impala in the Cloud
• www.cloudera.com/downloads
• www.cloudera.com/documentation/enterprise/latest/topics/impala_s3.html
Check Out the Latest Benchmarks
• http://blog.cloudera.com/blog/2016/09/apache-impala-incubating-vs-amazon-redshift-s3-integration-elasticity-agility-and-cost-performance-benefits-on-aws/