In this presentation, Microsoft joins Cloudera to introduce a new Platform-as-a-Service (PaaS) offering that helps data engineers use on-demand cloud infrastructure to speed the creation and operation of data pipelines that power sophisticated, data-driven applications, without onerous administration.
19. Azure Data Lake Store (ADLS)
A hyper-scale repository for big data analytics workloads
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
ENTERPRISE GRADE
No limits to SCALE
Optimized for analytic workload PERFORMANCE
[Diagram labels: YARN; Hive | Spark | Impala; Cloudera 5.1x; Azure PaaS Services; ADL Store; Compute; Data]
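As a rough illustration of "HDFS for the cloud": Hadoop-ecosystem engines usually address ADLS through an `adl://` URI rather than an `hdfs://` one. The sketch below only builds such a URI. The account name and path are hypothetical placeholders, and the helper `adl_uri` is not part of any library.

```python
def adl_uri(account: str, path: str) -> str:
    """Build an adl:// URI for a path in an ADLS account.

    `account` is the (hypothetical) Data Lake Store account name;
    the host suffix follows the usual ADLS naming convention.
    """
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account and path, for illustration only.
print(adl_uri("exampleaccount", "/warehouse/events"))
```

A table location or job input path written this way could then be handed to Hive, Spark, or Impala in place of an HDFS path, assuming the cluster has been configured with credentials for the store.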
22. ADLS – Under the hood
[Architecture diagram labels: Data Lake Client; Data Lake Management Client; Data Lake Client SDK; File System/HDFS API; REST API (Data Access); Management API; Data Lake Store Frontend; Metadata Service; Naming Service; SSD-backed Data Lake Ingestion layer; Data Lake Store Backend; Scale-out Storage; Azure ML; Microsoft R Server; callouts numbered 1 to 6]
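The data-access REST API in the diagram is WebHDFS-compatible, so a request can be sketched as a URL with an `op` query parameter. The snippet below only constructs the URL and makes no network call. The account name and path are hypothetical, `webhdfs_url` is an illustrative helper rather than an SDK function, and a real request would also need an OAuth bearer token in the `Authorization` header.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for an ADLS data-access request."""
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact, so nested paths stay readable.
    encoded_path = quote(path.lstrip("/"))
    return (
        f"https://{account}.azuredatalakestore.net"
        f"/webhdfs/v1/{encoded_path}?{query}"
    )


# Hypothetical account and directory: list the contents of /data/raw.
print(webhdfs_url("exampleaccount", "/data/raw", "LISTSTATUS"))
```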
23. Comparison between storage options
Option            | VHDs on WASB            | Premium Storage  | WASB              | ADLS
Type              | Block based             | Block based      | Filesystem based  | Filesystem based
Maximum volume    | 4 TB per disk           | 4 TB per disk    | 500 TB            | No limit (tested > exabytes)
Maximum item size | N/A                     | N/A              | 4.75 TB           | No limit (tested > petabytes)
Physical media    | HDD                     | Flash/SSD        | HDD               | SSD + HDD
Replication       | LRS and GRS             | None             | LRS and GRS       | LRS
Throughput        | 60 MBps per disk        | 250 MBps per disk | 60 MBps per blob | Extremely high
RBAC              | N/A                     | N/A              | N/A               | POSIX compliant (file & folder level)
Encryption        | SSE or Azure Key Vault  | N/A              | N/A               | Transparent (AES 256 + TLS 1.2)
Workloads         | any                     | any              | low TBs           | >10 TBs
Locations         | all                     | most             | all               | 4 and growing
https://docs.microsoft.com/en-us/azure/storage/storage-scalability-targets
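To make the per-disk limits concrete, here is a back-of-the-envelope sketch using only the figures from the comparison above (4 TB per disk; 60 or 250 MBps per disk). It assumes data is spread evenly across disks and ignores HDFS replication and any per-VM throughput caps, so the results are upper bounds rather than sizing guidance. The function names are illustrative.

```python
import math

# Figures taken from the comparison on this slide.
DISK_CAPACITY_TB = 4
THROUGHPUT_MBPS_PER_DISK = {
    "VHDs on WASB": 60,
    "Premium Storage": 250,
}


def disks_needed(dataset_tb: float) -> int:
    """Minimum number of 4 TB disks that can hold the dataset."""
    return math.ceil(dataset_tb / DISK_CAPACITY_TB)


def aggregate_throughput_mbps(dataset_tb: float, option: str) -> int:
    """Best-case combined throughput if every disk is read in parallel."""
    return disks_needed(dataset_tb) * THROUGHPUT_MBPS_PER_DISK[option]


# A 100 TB dataset on block-based storage:
print(disks_needed(100))                                  # 25
print(aggregate_throughput_mbps(100, "VHDs on WASB"))     # 1500
print(aggregate_throughput_mbps(100, "Premium Storage"))  # 6250
```

With a filesystem-based option such as ADLS there is no per-disk bookkeeping of this kind, which is part of the case the comparison makes for workloads above roughly 10 TB.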
24. Why Cloudera on Azure Data Lake Store?
Separation of Compute & Storage
Transient clusters for flexibility, lower TCO
Shared storage for many optimized clusters
[Diagram labels: Compute over time (M T W R F S S); Data Lake Store (shown three times)]
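The TCO argument for transient clusters can be sketched as simple node-hour arithmetic. The schedule below is entirely hypothetical (10 nodes, 8 hours on weekdays, nothing at weekends) and is compared with the same 10 nodes running around the clock. Storage cost is left out because the data stays in Data Lake Store in both cases.

```python
def node_hours(schedule):
    """Sum node-hours over a list of (nodes, hours) pairs, one per day."""
    return sum(nodes * hours for nodes, hours in schedule)


# Hypothetical week, Monday to Sunday.
always_on = [(10, 24)] * 7                # cluster never shut down
transient = [(10, 8)] * 5 + [(0, 0)] * 2  # weekday jobs only

always_on_hours = node_hours(always_on)   # 1680
transient_hours = node_hours(transient)   # 400
saving = 1 - transient_hours / always_on_hours

print(always_on_hours, transient_hours, f"{saving:.0%}")
```

Because compute is separated from storage, shutting the cluster down between runs does not lose data, which is what makes a schedule like the transient one feasible.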