Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series


Uncovering new, valuable insights from big data requires organizations to collect, store, and analyze increasing volumes of data from multiple, often disparate sources at disparate points in time. This makes it difficult to handle big data with data warehouses or relational database management systems alone. A Data Lake allows you to store massive amounts of data in its original form, without the need to enforce a predefined schema, enabling a far more agile and flexible architecture, which makes it easier to gain new types of analytical insights from your data.

Learning Objectives:
• Introduce key architectural concepts to build a Data Lake using Amazon S3 as the storage layer
• Explore storage options and best practices to build your Data Lake on AWS
• Learn how AWS can help enable a Data Lake architecture
• Understand some of the key architectural considerations when building a Data Lake
• Hear some important Data Lake implementation considerations when using Amazon S3 as your Data Lake

Published in: Technology


1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rahul Bhartia, Solutions Architect; Susan Chan, Senior Product Manager, Amazon S3. August 2016. Building a Data Lake with Amazon S3
2. Evolution of “Data Lakes”
3. Evolution of big data architecture: databases (transactions) feed the data warehouse via extract, transform and load (ETL).
4. Evolution of big data architecture: databases and files (transactions and logs) feed the data warehouse via two ETL paths.
5. Evolution of big data architecture: databases, files, and streams (transactions, logs, events); ETL still feeds the data warehouse, but where do the new streams and events go? Hadoop?
6. A growing ecosystem… Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Amazon CloudSearch, Amazon Kinesis-enabled apps, AWS Lambda, Amazon Machine Learning, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams
7. The Genesis of “Data Lakes”: databases, files, and streams (transactions, logs, events) land in a Data Lake alongside the data warehouse.
8. What really is a “Data Lake”?
9. Components of a Data Lake: Collect & Store, Catalogue & Search, Entitlements, API & UI
    • An API and user interface that expose these features to internal and external users
    • A robust set of security controls – governance through technology, not policy
    • A search index and workflow which enables data discovery
    • A foundation of highly durable data storage and streaming of any type of data
10. Storage: high durability; stores raw data from input sources; support for any type of data; low cost
11. Data Lake – Hadoop (HDFS) as the storage: search, access, query, process, archive
12. Data Lake – Amazon S3 as the storage: search, access, query, process, archive (transactions), with Amazon RDS, Amazon DynamoDB, Amazon Elasticsearch Service, Amazon Glacier, Amazon S3, Amazon Redshift, Amazon Elastic MapReduce, Amazon Machine Learning, and Amazon ElastiCache
13. Catalogue & search: a metadata lake used for summary statistics and data classification management; a simplified model for data discovery & governance
14. Catalogue & Search Architecture
15. Entitlements: encryption for data protection; authentication & authorisation; access control & restrictions
16. Data Protection via Encryption
    • AWS CloudHSM: dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2
    • AWS Key Management Service: automated key rotation & auditing; integration with other AWS services
    • AWS server-side encryption: AWS-managed key infrastructure
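As a concrete illustration of server-side encryption with KMS-managed keys (SSE-KMS), here is a minimal boto3 sketch; the bucket name, object key, and KMS key alias are placeholders, not values from the deck:

```python
import boto3

s3 = boto3.client("s3")

# Upload an object encrypted server-side with an AWS KMS key (SSE-KMS).
# "my-data-lake" and the key alias are illustrative placeholders.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/events/2016/08/events.json",
    Body=b'{"event": "example"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-lake-key",
)
```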
17. Entitlements – Access to Encryption Keys: the Security Token Service issues IAM temporary credentials; the customer master key protects per-object customer data keys, so the S3 object stores MyData together with its ciphertext key (Name: MyData, Key: Ciphertext Key) while the plaintext key never needs to be persisted.
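The envelope-encryption flow sketched on this slide maps onto the KMS GenerateDataKey API; the key alias below is an assumed placeholder:

```python
import boto3

kms = boto3.client("kms")

# KMS returns the data key in two forms: a plaintext key to encrypt
# MyData locally (then discard), and a ciphertext key to store
# alongside the S3 object, as on the slide.
resp = kms.generate_data_key(KeyId="alias/my-data-lake-key", KeySpec="AES_256")
plaintext_key = resp["Plaintext"]
ciphertext_key = resp["CiphertextBlob"]

# Later, a caller whose temporary credentials entitle it to the
# customer master key can unwrap the stored ciphertext key.
recovered = kms.decrypt(CiphertextBlob=ciphertext_key)["Plaintext"]
assert recovered == plaintext_key
```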
18. API & UI: exposes the data lake to customers; programmatically queries the catalogue; exposes a search API; ensures that entitlements are respected
19. API & UI Architecture: API Gateway, a UI on Elastic Beanstalk, AWS Lambda, the metadata index, users, IAM, and a TVM (token vending machine) on Elastic Beanstalk
20. Putting It All Together
21. Amazon Kinesis, Amazon S3, Amazon Glacier, IAM, encrypted data, the Security Token Service, AWS Lambda, the search index, the metadata index, API Gateway, users, a UI on Elastic Beanstalk, and KMS, spanning Collect & Store, Catalogue & Search, Entitlements & Access Controls, and APIs & UI
22. Amazon S3 - Foundation for your Data Lake
23. Why Amazon S3 for Data Lake?
    • Durable: designed for 11 9s of durability
    • Available: designed for 99.99% availability
    • High performance: multipart upload, Range GET
    • Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments
    • Integrated: AWS Elastic MapReduce, Amazon Redshift, Amazon DynamoDB
    • Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
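The high-performance features named above, multipart upload and Range GET, look like this in a boto3 sketch; the bucket, keys, and byte range are illustrative:

```python
import boto3

s3 = boto3.client("s3")

# Multipart upload: the boto3 transfer manager splits large files
# into parts and uploads them in parallel automatically.
s3.upload_file("big-file.gz", "my-data-lake", "raw/logs/big-file.gz")

# Range GET: fetch only the first MiB of the object, which lets
# consumers read large objects in parallel or resume downloads.
resp = s3.get_object(
    Bucket="my-data-lake",
    Key="raw/logs/big-file.gz",
    Range="bytes=0-1048575",
)
first_chunk = resp["Body"].read()
```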
24. Why Amazon S3 for Data Lake?
    • Natively supported by frameworks like Spark, Hive, Presto, etc.
    • Can run transient Hadoop clusters
    • Multiple clusters can use the same data
    • Highly durable, available, and scalable
    • Low cost: S3 Standard starts at $0.0275 per GB per month
25. Data Ingestion into Amazon S3: AWS Direct Connect, AWS Snowball, ISV connectors, Amazon Kinesis Firehose, S3 Transfer Acceleration, AWS Storage Gateway
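One of these ingestion paths, Amazon Kinesis Firehose delivering records into S3, can be sketched as follows; the delivery stream name and record payload are assumptions for illustration:

```python
import boto3

firehose = boto3.client("firehose")

# Firehose buffers incoming records and delivers them to S3 in
# batches; "events-to-s3" is a placeholder delivery stream.
firehose.put_record(
    DeliveryStreamName="events-to-s3",
    Record={"Data": b'{"event": "click", "ts": "2016-08-01T12:00:00Z"}\n'},
)
```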
26. Choice of storage classes on S3: Standard for active data, Standard - Infrequent Access for infrequently accessed data, Amazon Glacier for archive data
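A storage class can also be chosen at write time, as in this minimal sketch; bucket, key, and payload are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Write infrequently accessed data directly into Standard - IA;
# colder data can later move to Glacier via a lifecycle rule.
s3.put_object(
    Bucket="my-data-lake",
    Key="reports/2015/summary.csv",
    Body=b"id,value\n1,42\n",
    StorageClass="STANDARD_IA",
)
```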
27. Implement the right controls
    • Security: Identity and Access Management (IAM) policies, bucket policies, access control lists (ACLs), query string authentication, SSL endpoints
    • Encryption: server-side encryption (SSE-S3), server-side encryption with provided keys (SSE-C, SSE-KMS), client-side encryption
    • Compliance: bucket access logs, lifecycle management policies, access control lists (ACLs), versioning & MFA deletes, certifications – HIPAA, PCI, SOC 1/2/3, etc.
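As one example of these controls, a bucket policy can deny unencrypted uploads; the statement below is an illustrative sketch, not a policy from the deck:

```python
import json
import boto3

s3 = boto3.client("s3")

# Deny any PutObject request that does not ask for server-side
# encryption; the bucket name and Sid are illustrative.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedPuts",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-data-lake/*",
        "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
    }],
}
s3.put_bucket_policy(Bucket="my-data-lake", Policy=json.dumps(policy))
```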
28. Use Case: “We use S3 as the ‘source of truth’ for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.” Eva Tse, Director, Big Data Platform. Source: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
29. Tip #1: Use versioning
    • Protects from accidental overwrites and deletes
    • New version with every upload
    • Easy retrieval of deleted objects and rollback to previous versions
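Tip #1 in code: a minimal boto3 sketch enabling versioning on a placeholder bucket, then listing versions for rollback:

```python
import boto3

s3 = boto3.client("s3")

# With versioning on, every upload creates a new version and a
# DELETE only adds a delete marker, so nothing is lost silently.
s3.put_bucket_versioning(
    Bucket="my-data-lake",
    VersioningConfiguration={"Status": "Enabled"},
)

# To roll back, list the versions of a key and GET an older one.
versions = s3.list_object_versions(Bucket="my-data-lake", Prefix="raw/events/")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"], v["IsLatest"])
```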
30. Tip #2: Use lifecycle policies
    • Automatic tiering and cost controls
    • Includes two possible actions: Transition (archives to Standard - IA or Amazon Glacier based on the object age you specify) and Expiration (deletes objects after a specified time)
    • Actions can be combined
    • Set policies at the bucket or prefix level
    • Set policies for current or noncurrent versions
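Tip #2 in code: a lifecycle rule combining the Transition and Expiration actions described above; the rule ID, prefix, and day counts are illustrative choices:

```python
import boto3

s3 = boto3.client("s3")

# Tier objects under raw/ to Standard - IA after 30 days and to
# Glacier after 90, then expire them after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={"Rules": [{
        "ID": "tier-and-expire-raw",
        "Prefix": "raw/",
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
    }]},
)
```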
31. Versioning + lifecycle policies
32. Expired object delete marker
    • Deleting a versioned object makes a delete marker the current version of the object
    • Removing expired object delete markers can improve list performance
    • A lifecycle policy automatically removes the current-version delete marker when previous versions of the object no longer exist
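A sketch of that policy in boto3; the noncurrent-version window of 30 days is an assumed value, and the empty prefix applies the rule bucket-wide:

```python
import boto3

s3 = boto3.client("s3")

# Expire noncurrent versions after 30 days and remove delete
# markers that no longer shield any versions, keeping LIST fast.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={"Rules": [{
        "ID": "clean-expired-delete-markers",
        "Prefix": "",
        "Status": "Enabled",
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        "Expiration": {"ExpiredObjectDeleteMarker": True},
    }]},
)
```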
33. Enable policy with the console (console screenshot)
34. Incomplete multipart upload expiration (best practice)
    • Partial uploads do incur storage charges
    • Set a lifecycle policy to automatically expire incomplete multipart uploads after a predefined number of days
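The same best practice as a lifecycle rule in boto3; the 7-day window is an illustrative choice:

```python
import boto3

s3 = boto3.client("s3")

# Abort multipart uploads still incomplete 7 days after initiation,
# so abandoned parts stop accruing storage charges.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={"Rules": [{
        "ID": "abort-stale-multipart-uploads",
        "Prefix": "",
        "Status": "Enabled",
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }]},
)
```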
35. Enable policy with the Management Console
36. Considerations for organizing your Data Lake
    • Amazon S3 storage uses a flat keyspace
    • Separate data by business unit, application, type, and time
    • Natural data partitioning is very useful
    • Paths should be self-documenting and intuitive
    • Changing the prefix structure in the future is hard/costly
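One possible self-documenting, time-partitioned key scheme along these lines; the business-unit, application, and type names are invented examples, not a layout prescribed by the deck:

```python
from datetime import datetime

def object_key(unit: str, app: str, dtype: str,
               ts: datetime, filename: str) -> str:
    """Build a key like unit/app/type/year=YYYY/month=MM/day=DD/file."""
    return (f"{unit}/{app}/{dtype}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"{filename}")

print(object_key("marketing", "clickstream", "raw",
                 datetime(2016, 8, 1), "events-0001.json.gz"))
# marketing/clickstream/raw/year=2016/month=08/day=01/events-0001.json.gz
```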
37. Best Practices for your Data Lake
    • Always store a copy of raw input as the first rule of thumb
    • Use automation with S3 Events to enable trigger-based workflows
    • Use a format that supports your data, rather than forcing your data into a format
    • Apply compression everywhere to reduce network load
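For the trigger-based workflows mentioned above, a bucket notification can invoke a Lambda function on every new object; the function ARN and prefix filter are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Invoke a Lambda function whenever a new object lands under raw/.
# The Lambda permission allowing S3 to invoke the function must
# already be in place.
s3.put_bucket_notification_configuration(
    Bucket="my-data-lake",
    NotificationConfiguration={"LambdaFunctionConfigurations": [{
        "LambdaFunctionArn":
            "arn:aws:lambda:us-east-1:123456789012:function:ingest-raw",
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {"Key": {"FilterRules": [
            {"Name": "prefix", "Value": "raw/"},
        ]}},
    }]},
)
```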
38. Thank you!
