"Increasing demands to collect, store, and analyze massive amounts of data often means that the same tools and approaches that worked in the past, don't work anymore. That’s why many organizations are shifting to a data lake architecture. A data lake is an architectural approach that allows you to store massive amounts of data into a central location, so it’s readily available to be categorized, processed, analyzed and consumed by diverse groups within an organization. In this tech talk, we introduce key concepts for a data lake and present aspects related to its implementation. We highlight the core components of a data lake, such as storage, compute, analytics, databases, stream processing, data management, and security. We discuss how to choose the right technologies for each component of the data lake, based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. We also provide a reference architecture and recommendations to get started with a data lake implementation on AWS.
Learning Objectives:
• Understand key concepts and architectural components of a data lake architecture
• Describe how and when to use a broad set of analytic and data management tools in a data lake architecture
• Get insights on how to get started with a data lake implementation on AWS
2. What to expect from this session
• Data Lake concept
• Important capabilities of a Data Lake
• Matomy’s Data Lake implementation
• Big Data Reference Architecture
4. What is a Data Lake?
A data lake is an increasingly popular way to store and analyze massive volumes and heterogeneous types of data in a centralized repository.
5. Benefits of a Data Lake – Quick Ingest
“How can I collect data quickly from various sources and store it efficiently?”
Quickly ingest data without needing to force it into a pre-defined schema.
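As an illustrative sketch (not from the talk), schema-free ingest can be as simple as landing records as newline-delimited JSON; the `to_ndjson` helper name is hypothetical.

```python
import json

def to_ndjson(records):
    """Serialize heterogeneous records as newline-delimited JSON.

    No schema is enforced: each record keeps whatever fields it arrived
    with, so new sources can be ingested without upfront modeling.
    """
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Records from different sources with different shapes land side by side.
batch = to_ndjson([
    {"source": "clickstream", "user": 42, "page": "/home"},
    {"source": "billing", "invoice": "INV-7", "amount": 19.99},
])
```

The resulting blob can be written to object storage as-is; any structure is imposed later, at read time.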
6. Benefits of a Data Lake – All Data in One Place
“Why is the data distributed in many locations? Where is the single source of truth?”
Store and analyze all of your data, from all of your sources, in one centralized location.
7. Benefits of a Data Lake – Storage vs Compute
“How can I scale up with the volume of data being generated?”
Separating your storage and compute allows you to scale each component as required.
8. Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple analytics and processing frameworks to the same data?”
A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
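To make the schema-on-read idea concrete, here is a minimal sketch (names and field layout are assumptions, not from the talk): the same raw records are read twice, each time through a different projected schema.

```python
import json

# Raw events ingested as-is, with differing shapes; schema-on-write
# would have forced one structure on them at load time.
RAW = [
    '{"user": "7", "page": "/home", "ts": "100"}',
    '{"user": "8", "referrer": "ads", "ts": "101"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields and cast types.

    Different analyses can project different schemas over the same
    raw data without rewriting it.
    """
    out = []
    for line in lines:
        rec = json.loads(line)
        out.append({
            field: cast(rec[field]) if field in rec else None
            for field, cast in schema.items()
        })
    return out

# Two different "tables" over the same raw data.
pageviews = read_with_schema(RAW, {"user": int, "page": str})
timeline = read_with_schema(RAW, {"user": int, "ts": int})
```

Engines such as Athena, Spark, or EMR apply the same principle at much larger scale: the table definition lives in the query layer, not in the stored files.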
10. Important components of a Data Lake
Ingest and Store
Catalog & Search
Protect & Secure
Access & User Interface
11. Ingest and Store
Ingest streaming and batch data
Support for any type of data at scale
Durable
Low cost
12. Amazon S3 as your cluster’s persistent data store
Separate compute and storage
Resize and shut down analytics compute environments with no data loss
Point multiple compute clusters at the same data in Amazon S3
13. Data Ingestion into Amazon S3
AWS Direct Connect
AWS Snowball
ISV Connectors
Amazon Kinesis Firehose
S3 Transfer Acceleration
AWS Storage Gateway
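Whichever ingestion path is used, a common convention (an assumption here, not prescribed by the talk) is to land objects under date-partitioned, Hive-style keys so downstream engines can prune by date; the `s3_key` helper name is hypothetical.

```python
from datetime import datetime, timezone

def s3_key(source, event_time, suffix="json"):
    """Build a date-partitioned S3 object key.

    The year=/month=/day= layout is a widely used Hive-style
    convention that lets engines such as EMR or Athena skip
    partitions that a query does not need.
    """
    t = event_time.astimezone(timezone.utc)
    return (f"raw/{source}/year={t:%Y}/month={t:%m}/day={t:%d}/"
            f"{t:%H%M%S}.{suffix}")

key = s3_key("clickstream", datetime(2017, 5, 3, 14, 30, tzinfo=timezone.utc))
```

The actual upload would then be a single `put_object` call against the bucket; only the key layout is shown here so the sketch runs without AWS access.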
14. Use S3 as Data Substrate for Compute
Amazon S3: highly durable, low-cost, scalable storage
Compute and analytics services pointed at the same data: EMR, Spark, Storm, Kinesis, Redshift, DynamoDB, RDS, Athena
Bulk transfer: Import/Export Snowball
15. Metadata lake (Catalog & Search)
Used for summary statistics and data classification management
Simplified model for data discovery & governance
16. Catalog & Search Architecture
Data collectors (EC2, ECS) put objects into an S3 bucket.
Object Created and Object Deleted events trigger an AWS Lambda function, which puts an item into the metadata index (Amazon DynamoDB).
The DynamoDB update stream triggers a second AWS Lambda function, which extracts the search fields and updates the search index (Amazon Elasticsearch).
17. API & User Interface
Exposes the data lake to customers
Programmatically query the catalogue
Expose a search API
Ensures that entitlements are respected
18. API & UI Architecture
Users access the data lake through a static website (UI) and an API.
API Gateway fronts AWS Lambda functions for the API and user management.
The Lambda functions query the metadata index (Amazon DynamoDB) and the search index (Amazon Elasticsearch).
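Inside the search Lambda, the request typically becomes an Elasticsearch query body. The sketch below builds such a body using the standard `match` query; the searched field name (`key`) and result size are assumptions, and the call that sends the body to the Amazon Elasticsearch endpoint is omitted so the sketch runs standalone.

```python
def search_request(text, max_results=10):
    """Build an Elasticsearch query body for the catalog search API.

    The API Lambda would POST this body to the search index and
    return matching catalog entries, after checking the caller's
    entitlements.
    """
    return {
        "size": max_results,
        "query": {"match": {"key": text}},
    }

body = search_request("clickstream")
```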
19. Protect and Secure
Access control – authentication & authorization
Data protection – encryption
Logging and monitoring
20. Implement the right controls
Security
§ Identity & Access Management
§ Bucket policies
§ Access Control Lists (ACLs)
§ Query string authentication
§ Private VPC endpoints to Amazon S3
Encryption
§ SSL endpoints
§ Server-side encryption (SSE-S3, SSE-C, SSE-KMS)
§ Client-side encryption
Compliance
§ Bucket access logs
§ Lifecycle management policies
§ Versioning & MFA deletes
§ Certifications – HIPAA, PCI, SOC 1/2/3, etc.
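One concrete way bucket policies and encryption combine is the widely documented S3 policy that denies any `PutObject` request lacking a server-side-encryption header. The sketch below only builds the policy document (the `put_bucket_policy` call that would apply it is omitted); the function name is hypothetical.

```python
import json

def require_sse_policy(bucket):
    """S3 bucket policy denying PutObject requests that do not
    specify server-side encryption.

    Uses the Null condition on the s3:x-amz-server-side-encryption
    key: if the header is absent ("Null": true), the request is denied.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnencryptedPuts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            "Condition": {
                "Null": {"s3:x-amz-server-side-encryption": "true"},
            },
        }],
    }

policy_json = json.dumps(require_sse_policy("data-lake"))
```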
22. A Data Lake on AWS
Central Storage: S3
Collect & Ingest: Kinesis Firehose, Snowball, Database Migration Service, Direct Connect
Catalog & Search: DynamoDB, Elasticsearch
Access & User Interface: API Gateway, Identity & Access Management, Cognito
Process & Analyze: QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis Analytics, RDS
Protect & Secure: Security Token Service, CloudWatch, CloudTrail, Key Management Service