Uncovering new, valuable insights from big data requires organizations to collect, store, and analyze increasing volumes of data from multiple, often disparate sources, arriving at different points in time. This makes big data difficult to handle with data warehouses or relational database management systems alone. A Data Lake allows you to store massive amounts of data in its original form, without enforcing a predefined schema, enabling a far more agile and flexible architecture and making it easier to gain new types of analytical insights from your data.
Learning Objectives:
• Introduce key architectural concepts to build a Data Lake using Amazon S3 as the storage layer
• Explore storage options and best practices to build your Data Lake on AWS
• Learn how AWS can help enable a Data Lake architecture
• Understand some of the key architectural considerations when building a Data Lake
• Hear important implementation considerations when using Amazon S3 as your Data Lake
9. Components of a Data Lake
• Collect & Store – a foundation of highly durable data storage and streaming of any type of data
• Catalogue & Search – a search index and workflow which enable data discovery
• Entitlements – a robust set of security controls; governance through technology, not policy
• API & UI – an API and user interface that expose these features to internal and external users
15. Entitlements
• Encryption for data protection
• Authentication & authorisation
• Access control & restrictions
16. Data Protection via Encryption
• AWS CloudHSM – dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2
• AWS Key Management Service – automated key rotation & auditing; integration with other AWS services
• AWS server-side encryption – AWS managed key infrastructure
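As a minimal sketch of server-side encryption with KMS-managed keys (the bucket name, object key, and KMS key alias below are placeholders, not values from this deck), a single boto3 call requests SSE-KMS per object:

import boto3

s3 = boto3.client("s3")

# Upload an object encrypted server-side under a KMS key.
# "my-data-lake" and "alias/my-data-lake-key" are hypothetical names.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2016/07/21/events.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-lake-key",
)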
17. Entitlements – Access to Encryption Keys
[Diagram: a customer master key encrypts per-object customer data keys. The plaintext key encrypts the MyData object client-side and is then discarded; the ciphertext key is stored in the S3 object's metadata (Name: MyData, Key: Ciphertext Key). Callers reach S3 using IAM temporary credentials issued by the Security Token Service.]
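A hedged sketch of the envelope-encryption flow in the diagram above, assuming a KMS customer master key already exists under the hypothetical alias "alias/my-master-key":

import boto3

kms = boto3.client("kms")

# Ask KMS for a data key under the master key. The response holds the
# key twice: in plaintext (to encrypt data locally, then discard) and
# encrypted under the master key (safe to store next to the object).
resp = kms.generate_data_key(KeyId="alias/my-master-key", KeySpec="AES_256")
plaintext_key = resp["Plaintext"]        # use for encryption, then discard
ciphertext_key = resp["CiphertextBlob"]  # persist with the S3 object

# Later, any principal entitled to the master key can recover the
# plaintext data key from the stored ciphertext key.
recovered = kms.decrypt(CiphertextBlob=ciphertext_key)
assert recovered["Plaintext"] == plaintext_key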
18. API & UI
• Exposes the data lake to customers
• Programmatically query the catalogue
• Expose a search API
• Ensures that entitlements are respected
19. API & UI Architecture
[Diagram: users call an API Gateway endpoint backed by AWS Lambda functions over the metadata index; the UI and a token vending machine (TVM) run on Elastic Beanstalk, with IAM issuing credentials.]
23. Why Amazon S3 for Data Lake?
• Durable – designed for 11 9s of durability
• Available – designed for 99.99% availability
• High performance – multipart upload, Range GET
• Scalable – store as much as you need; scale storage and compute independently; no minimum usage commitments
• Integrated – Amazon EMR (Elastic MapReduce), Amazon Redshift, Amazon DynamoDB
• Easy to use – simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
24. Why Amazon S3 for Data Lake?
• Natively supported by frameworks such as Spark, Hive, and Presto
• Can run transient Hadoop clusters
• Multiple clusters can use the same data
• Highly durable, available, and scalable
• Low cost: S3 Standard starts at $0.0275 per GB per month
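To illustrate the point about shared data (the bucket, path, and column name here are hypothetical), a Spark job on a transient EMR cluster can read the data lake in place, with no copy into HDFS:

from pyspark.sql import SparkSession

# On EMR the s3:// scheme is backed by EMRFS, so Spark reads S3
# directly; several clusters can point at the same path at once.
spark = SparkSession.builder.appName("data-lake-example").getOrCreate()

events = spark.read.json("s3://my-data-lake/raw/events/2016/07/21/")
events.groupBy("event_type").count().show()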
25. Data Ingestion into Amazon S3
• AWS Direct Connect
• AWS Snowball
• ISV connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway
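As one example of streaming ingestion (the delivery stream name is a placeholder), records pushed to an Amazon Kinesis Firehose delivery stream are buffered and delivered into the configured S3 bucket automatically:

import boto3

firehose = boto3.client("firehose")

# Firehose batches incoming records and writes them to the S3
# destination configured on the delivery stream.
firehose.put_record(
    DeliveryStreamName="my-data-lake-stream",
    Record={"Data": b'{"event": "page_view", "ts": "2016-07-21T12:00:00Z"}\n'},
)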
26. Choice of storage classes on S3
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
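Objects can be written straight into a cheaper class when their access pattern is known up front; a minimal sketch with placeholder names:

import boto3

s3 = boto3.client("s3")

# Write directly to Standard - Infrequent Access for data that is
# kept long-term but rarely read.
s3.put_object(
    Bucket="my-data-lake",
    Key="monthly-reports/2016/06/report.csv",
    Body=b"col1,col2\n1,2\n",
    StorageClass="STANDARD_IA",
)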
27. Implement the right controls
Security:
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Query string authentication
• SSL endpoints
Encryption:
• Server-side encryption (SSE-S3)
• Server-side encryption with provided keys (SSE-C, SSE-KMS)
• Client-side encryption
Compliance:
• Bucket access logs
• Lifecycle management policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3, etc.
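One concrete control, sketched with a hypothetical bucket name: a bucket policy that denies any request not made over SSL:

import json
import boto3

s3 = boto3.client("s3")

# Deny every S3 action on the bucket unless the request uses TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "RequireSSL",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            "arn:aws:s3:::my-data-lake",
            "arn:aws:s3:::my-data-lake/*",
        ],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))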
28. Use Case
“We use S3 as the ‘source of truth’ for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.”
– Eva Tse, Director, Big Data Platform, Netflix
Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
29. Tip #1: Use versioning
• Protects from accidental overwrites and deletes
• New version with every upload
• Easy retrieval of deleted objects and rollback to previous versions
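Enabling versioning is a one-time bucket setting; a minimal boto3 sketch (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# With versioning on, every PUT to an existing key creates a new
# version instead of overwriting, and DELETE only adds a marker.
s3.put_bucket_versioning(
    Bucket="my-data-lake",
    VersioningConfiguration={"Status": "Enabled"},
)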
30. Tip #2: Use lifecycle policies
• Automatic tiering and cost controls
• Includes two possible actions:
  – Transition: archives objects to Standard - IA or Amazon Glacier based on the object age you specify
  – Expiration: deletes objects after a specified time
• Actions can be combined
• Set policies at the bucket or prefix level
• Set policies for current or non-current versions
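A hedged sketch combining both actions on a hypothetical "logs/" prefix (day counts are examples). Note that putting a lifecycle configuration replaces the previous one, so in practice all rules for a bucket go into a single configuration:

import boto3

s3 = boto3.client("s3")

# Tier log objects to cheaper storage as they age, then delete them.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    },
)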
32. Expired object delete marker policy
• Deleting a versioned object makes a delete marker the current version of the object
• Removing expired object delete markers can improve LIST performance
• A lifecycle policy can automatically remove the current-version delete marker once previous versions of the object no longer exist
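A sketch of such a rule (names and day counts are placeholders): expire noncurrent versions, and ask S3 to clean up delete markers that no longer shield any versions:

import boto3

s3 = boto3.client("s3")

# Remove old noncurrent versions; once none remain, the lone delete
# marker is "expired" and S3 removes it as well.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "clean-up-delete-markers",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            "Expiration": {"ExpiredObjectDeleteMarker": True},
        }]
    },
)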
34. Incomplete multipart upload expiration policy
• Partial uploads do incur storage charges
• Best practice: set a lifecycle policy to automatically expire incomplete multipart uploads after a predefined number of days
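A minimal sketch of that policy (the bucket name and day count are placeholders):

import boto3

s3 = boto3.client("s3")

# Abort multipart uploads that were never completed so their parts
# stop accruing storage charges.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-incomplete-mpu",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)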
36. Considerations for organizing your Data Lake
• Amazon S3 storage uses a flat keyspace
• Separate data by business unit, application, type, and time
• Natural data partitioning is very useful (see the key-layout sketch below)
• Paths should be self-documenting and intuitive
• Changing the prefix structure later is hard and costly
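A hedged sketch of one self-documenting, partitioned key layout (the bucket, fields, and Hive-style partition names are illustrative, not prescribed by this deck):

from datetime import datetime, timezone

def event_key(business_unit: str, app: str, event_type: str,
              ts: datetime, filename: str) -> str:
    """Build an S3 key partitioned by unit, application, type, and time."""
    return (f"{business_unit}/{app}/{event_type}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"{filename}")

# -> sales/webapp/page_view/year=2016/month=07/day=21/events-0001.json
print(event_key("sales", "webapp", "page_view",
                datetime(2016, 7, 21, tzinfo=timezone.utc),
                "events-0001.json"))

Hive, Spark, and Presto can all prune partitions expressed in this key=value style, which keeps queries over the flat keyspace cheap.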
37. Best Practices for your Data Lake
• Always store a copy of the raw input; this is the first rule of thumb
• Use automation with S3 Events to enable trigger-based workflows (see the sketch below)
• Use a format that supports your data, rather than forcing your data into a format
• Apply compression everywhere to reduce the network load
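A minimal sketch of a trigger-based workflow (the bucket name and Lambda function ARN are placeholders, and the function must already grant S3 permission to invoke it):

import boto3

s3 = boto3.client("s3")

# Invoke a Lambda function whenever a new object lands under raw/;
# the function can then validate, transform, and catalogue the data.
s3.put_bucket_notification_configuration(
    Bucket="my-data-lake",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-raw",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "raw/"},
            ]}},
        }]
    },
)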