AWS offers everything you need to deploy a secure and flexible data lake in the cloud. Discover how services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift can be used together to build and manage your own data lake, and how AWS Lake Formation makes it possible to set up a data lake in days. We walk through an example architecture together, covering everything from data storage to data analytics.
3. Agenda
Why a Data Lake?
Key Concepts
Data Lakes on AWS
An Example
Best Practices
4. Related DevChats
DVC10 - Lessons from the backyard: A connected BBQ grill and smoker
DVC06 - Use Neptune to discover where & when events can impact local businesses
6. The Data Science Hierarchy of Needs
AI
Learn/Optimize
Aggregate/Label
Explore/Transform
Move/Store
Collect
“You need a solid foundation for your data before being effective with AI and machine learning.”
Monica Rogati, Data Science and AI Advisor
7. The Data Warehouse Solution
[Diagram: a Data Warehouse and three Data Marts]
Advantages
Provides precise reporting and BI
Standardized, consistent data
Drawbacks
Limited to pre-determined questions
No low-level data visibility
8. Considerations for a Modern Solution
Centralized Data Storage: Store all data reliably in one location
Multiple User Communities: Business analysts, data professionals
Schema on Read: Schema applied at time of analysis
Storage vs. Compute: Scale storage and compute independently
Data Types & Formats: Structured, semi-structured, unstructured, raw data
Security: Control access to the data
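To make "schema on read" concrete, here is a minimal sketch using only the Python standard library. The sample records, the `SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical and exist only for illustration. The raw data is stored as-is, and a schema is applied only when the data is read for analysis.

```python
import io
import json

# Raw events stored as-is (JSON lines). The sample values are made up and
# deliberately inconsistent: a string number, an extra field, a missing field.
RAW_EVENTS = """\
{"device": "sensor-1", "temp_f": "225", "ts": "2020-01-05T12:00:00"}
{"device": "sensor-2", "temp_f": 240.5, "ts": "2020-01-05T12:01:00", "extra": "x"}
{"device": "sensor-3", "ts": "2020-01-05T12:02:00"}
"""

# Hypothetical schema chosen by the analyst at read time: column -> caster.
SCHEMA = {"device": str, "temp_f": float, "ts": str}


def read_with_schema(lines, schema):
    """Yield one dict per raw record, projected onto and cast to `schema`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        row = {}
        for column, cast in schema.items():
            value = record.get(column)
            # Missing fields become None instead of failing the whole read.
            row[column] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(io.StringIO(RAW_EVENTS), SCHEMA))
```

A different analysis could read the same raw lines with a different schema; nothing about the stored data has to change.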
9. Photo by Yifan Liu on Unsplash
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.
10. Photo by arsalan arianmehr on Unsplash
Onboard relevant data
Metadata should exist in a data catalog
Data governance policies and procedures govern storage and access
Automated processes manage data flow and data cleaning, and enforce practices
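As a rough illustration of what "metadata in a data catalog" can mean, the sketch below keeps a tiny in-memory catalog. It is not the AWS Glue API: the `DatasetEntry` and `Catalog` classes, their fields, and the sample bucket path are hypothetical. A real data lake would use a managed catalog service instead.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one dataset in the lake (hypothetical fields)."""
    name: str
    location: str      # where the data lives, e.g. an object-store prefix
    data_format: str   # e.g. "json" or "parquet"
    columns: dict      # column name -> type name
    owner: str
    tags: set = field(default_factory=set)


class Catalog:
    """A toy in-memory catalog: register datasets and search their metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, term):
        """Return names of datasets whose name, columns, or tags match `term`."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or any(term in c.lower() for c in e.columns)
            or any(term in t.lower() for t in e.tags)
        )


catalog = Catalog()
catalog.register(DatasetEntry(
    name="sensor_events",
    location="s3://example-bucket/raw/sensor_events/",  # placeholder path
    data_format="json",
    columns={"device": "string", "temp_f": "double", "ts": "timestamp"},
    owner="data-platform",
    tags={"raw", "iot"},
))
matches = catalog.search("temp")
```

The point is that users find data by searching metadata, without having to open the underlying files.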
12. Data Ingestion
Amazon Kinesis Data Firehose: Easily and reliably stream data into data lakes
AWS Snowball: Migrate large datasets using secure devices
AWS Storage Gateway: Gain on-premises access to AWS cloud storage
AWS Database Migration Service: Migrate databases to AWS quickly and securely
AWS Direct Connect: Establish a dedicated network connection to AWS
13. Catalog & Search
Amazon DynamoDB: Fully managed NoSQL database service
Amazon Elasticsearch Service: Fully managed Elasticsearch service
AWS Glue: Store metadata in a data catalog
14. Move & Transform
Amazon Kinesis Data Firehose: Easily and reliably stream data into data lakes
AWS Glue: Fully managed ETL service
AWS Lambda: Event-driven, serverless computing
15. Access & User Interfaces
AWS AppSync: Manage and synchronize mobile app data in real time across devices and users
Amazon Cognito: Add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily
Amazon API Gateway: Fully managed service for creating, publishing, maintaining, and monitoring secure APIs at scale
16. Analytics & Serving
Amazon Redshift: Fast, simple, cost-effective data warehousing service
Amazon Athena: Serverless, interactive query service
Amazon QuickSight: Fast, cloud-powered business intelligence service
AWS Glue: Store metadata in a data catalog
Amazon DynamoDB: Fully managed NoSQL database service
Amazon EMR: Run & scale Spark, Hadoop, and other big data frameworks
AWS Direct Connect: Establish a dedicated network connection to AWS
Amazon Elasticsearch Service: Fully managed Elasticsearch service
Amazon Neptune: Fully managed graph database service
Amazon RDS: Distributed relational database service
17. Manage & Secure
AWS KMS: Manage cryptographic keys and control their use across services
AWS IAM: Securely manage access to AWS services and resources
AWS CloudTrail: Enable governance, compliance, operational auditing, and risk auditing
Amazon CloudWatch: Monitor your AWS resources and the applications you run on AWS in real time
18. A Data Lake in Days
AWS Lake Formation: Source crawlers, ETL and data prep, data catalog, security settings, access control
Identify data sources
Data lake storage
Provide self-service access
19. An Example
Amazon S3, AWS Lambda, AWS CloudTrail, AWS IAM, AWS Glue, Amazon Athena, Amazon QuickSight
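The slide lists the services but not how they are wired together. One common arrangement, assumed here rather than taken from the talk, is for Amazon S3 to invoke an AWS Lambda function whenever a new object lands in the bucket. The sketch below shows how such a handler might pull the bucket and key out of an S3 event notification. The helper name `extract_s3_objects` and the sample event values are hypothetical.

```python
import urllib.parse


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # not an S3 record; skip it
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: log each new object and report how many were seen."""
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"new object: s3://{bucket}/{key}")
    return {"processed": len(objects)}


# A trimmed, made-up event in the shape S3 sends to Lambda.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "raw/year%3D2020/my+file.json"}}}
    ]
}
result = lambda_handler(sample_event, None)
```

In a fuller pipeline the handler might validate the file or start a catalog or ETL job, but that step depends on the workload and is left out here.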
21. Some Best Practices
Encrypt data at rest and in transit
Partition data
Compress data
Use columnar file formats
Use lifecycle policies
Automate, automate, automate
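Several of these practices can be sketched with the Python standard library alone. The helper names, the `events` prefix, and the day thresholds below are assumptions made for illustration. The lifecycle dictionary is intended to follow the shape that boto3's `put_bucket_lifecycle_configuration` accepts, but no AWS call is made here. Columnar formats such as Parquet need a third-party library and are not shown.

```python
import gzip
import json
from datetime import date


def partitioned_key(prefix, day, filename):
    """Build a Hive-style partitioned object key (year=/month=/day=)."""
    return f"{prefix}/year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"


def compress_records(records):
    """Serialize records as JSON lines and gzip them."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(payload.encode("utf-8"))


def lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=90):
    """Describe a tiering rule: move objects to cheaper storage as they age.

    The thresholds are illustrative defaults, not recommendations.
    """
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix}",
                "Filter": {"Prefix": f"{prefix}/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


key = partitioned_key("events", date(2020, 1, 5), "part-0000.json.gz")
blob = compress_records([{"device": "sensor-1", "temp_f": 225.0}])
rules = lifecycle_rules("events")
```

Partitioning by date lets query engines skip objects outside the requested range, and compression reduces both the storage used and the bytes scanned per query.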