Architecting a Serverless Data Lake (ARC302) - AWS re:Invent 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serverless Data Lake Workshop
Amardeep Chudda
Solutions Architect
Amazon Web Services
ARC302
Mike Gillespie
Solutions Architect
Amazon Web Services
Agenda
Development Environment Setup
Review Data Lake Architecture
Why Serverless?
Glue Extract Transform Load (ETL)
Data Governance
Bonus Content
Related breakouts
Tuesday, Nov 27
ANT354-R - [REPEAT] Build a Query to Analyze Data in Your Amazon Redshift Warehouse & S3 Data Lake Together
Time – 8:30 AM to 9:30 AM | Mirage
Friday, Nov 30
AIM405-R1 - [REPEAT 1] Better Analytics Through Natural Language Processing
Time – 11:30 AM to 12:30 PM | Venetian
Thursday, Nov 29
ADT301 - Create a Serverless Web Event Pipeline
Time – 4:00 PM to 5:00 PM | Mirage
Scenario
You support a successful online ecommerce website with millions of users. The
website tracks end-user activity and online buying habits.
Your analytics team wants to query this data both ad hoc and through
business intelligence (BI) tools, with the end goal of helping business teams make
their marketing campaigns more efficient. You want to enable your analytics team
without losing focus on data quality and governance controls.
Data sources include weblogs, NoSQL databases, and other data sources.
Your task is to build a cost-effective solution that provides a unified analytics
environment.
re:Invent workshop summary
• Ingest data from various data sources and join them together
• Enrich raw data
• Convert data to parquet for efficient querying
• Grant access to roles based on the data classification
• SQL Access for Data Scientists
• Data Visualization with charts and graphs
Requirements
1. Your own device for console access
2. An AWS account that you are able to use for testing
(should not be used for production or other purposes)
3. Workshop on GitHub at https://bit.ly/2RX54o3
Development environment
Your Cloud Engineering team has deployed a development environment for you
Ingestion / Data Generation
Kinesis / Log Data
Data Generation Lambda Functions
Amazon Simple Storage Service (Amazon S3) Buckets
Amazon DynamoDB
AWS Glue Management Console / Development Endpoint
Amazon Athena
Amazon QuickSight
Deploy the lab environment
1. Deploy the Lab CloudFormation template from https://bit.ly/2RX54o3
2. Examine the environment in AWS CloudFormation Designer
3. Deploy your stack
(Diagram: Template, Stack)
High-level architecture
Amazon Kinesis Data Firehose
• Serverless, easy to use
• Seamless integration with AWS data stores
• Support for serverless transformation
• Near real-time ingestion
• Pay only for what you use
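As a rough illustration of near real-time ingestion, the sketch below batches JSON events for the Firehose `PutRecordBatch` API, which accepts at most 500 records per call. The stream name `weblogs` and the helper names `to_record` and `send_in_batches` are hypothetical. The client is passed in so the logic can be tried without AWS credentials; in practice it would be `boto3.client("firehose")`. The per-call size limit and retries of failed records are not handled here.

```python
import json
from typing import Any, Dict, Iterable

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_record(event: Dict[str, Any]) -> Dict[str, bytes]:
    """Serialize one event as a newline-terminated JSON record."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


def send_in_batches(client: Any, stream_name: str,
                    events: Iterable[Dict[str, Any]]) -> int:
    """Send events with put_record_batch and return the count of failed records.

    `client` is assumed to expose put_record_batch like boto3's Firehose client.
    """
    records = [to_record(event) for event in events]
    failed = 0
    for start in range(0, len(records), MAX_BATCH):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + MAX_BATCH],
        )
        # Firehose reports partial failures per batch instead of raising.
        failed += response.get("FailedPutCount", 0)
    return failed
```

The trailing newline keeps one JSON document per line in the delivered S3 objects, which makes the data easier to crawl and query later.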
Amazon Simple Storage Service (Amazon S3)
• Object store
• Highly durable
• Limitless scalability
• Pay for what you use
• Comprehensive security & compliance capabilities
• Support for query in place
AWS Glue
• Serverless ETL
• Universal Data Catalog
• Open source Apache Spark environment
• DynamicFrame – Built in functions
• Seamless integration with AWS services
• Support for on-premises data stores
Amazon Athena
• Serverless interactive query service
• Integrated with AWS Glue Data Catalog
• Open source, built on Presto, query with standard SQL
• Pay per query
• Support for standard formats like CSV, JSON, ORC, Avro and Parquet
• Fast parallel query execution
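To make "query with standard SQL" concrete, here is a minimal sketch that builds the keyword arguments for Athena's `StartQueryExecution` call. The database `weblogs_db`, the table `weblogs`, the column `page`, and the results bucket are all hypothetical and are not taken from the workshop.

```python
from typing import Any, Dict


def athena_query_request(sql: str, database: str,
                         output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        # The database is looked up in the AWS Glue Data Catalog.
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical names used only for illustration.
request = athena_query_request(
    sql=("SELECT page, COUNT(*) AS views FROM weblogs "
         "GROUP BY page ORDER BY views DESC LIMIT 10"),
    database="weblogs_db",
    output_location="s3://example-athena-results/queries/",
)
```

With credentials configured, the request could be submitted as `boto3.client("athena").start_query_execution(**request)`.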
Amazon QuickSight
• Serverless, end to end BI solution
• Built-in SPICE engine
• Smart visualizations
• Seamless integration with AWS services
• On-premises database support
• Pay only for what you use
• Multiple device support
• Share and collaborate
Data classification and security
• Grant S3 access by role to bucket / prefix
• Approaches to segment data
  • Multiple copies of the data in different buckets
  • Tokenization, join to tokenized tables, and views to resolve them
(Diagram: bucket with objects, role permissions)
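Granting S3 access by role to a bucket and prefix can be sketched as an IAM policy document attached to the role. The bucket `example-datalake`, the prefix `curated/public`, and the helper `read_only_prefix_policy` are assumptions for illustration; the workshop's actual policies may differ.

```python
import json
from typing import Any, Dict


def read_only_prefix_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """Return an IAM policy allowing read-only access to one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, limited here by prefix.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action on keys under the prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix for a non-sensitive data classification.
policy = read_only_prefix_policy("example-datalake", "curated/public")
print(json.dumps(policy, indent=2))
```

A role for a more sensitive classification would receive a similar policy that points at a different prefix or bucket.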
20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Duplication
UserProfile
ID  First  Last
1   Sam    Smith
2   Jane   Jones
UserProfileSecure
ID  First  Last   SSN
1   Sam    Smith  111-11-1111
2   Jane   Jones  222-22-2222
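The duplication approach amounts to keeping a full copy for privileged roles and writing a second copy with the sensitive columns removed. A minimal sketch in plain Python, assuming the rows are already loaded as dictionaries (the helper `strip_sensitive` is hypothetical; in the workshop this step would more likely run as an ETL job):

```python
from typing import Any, Dict, Iterable, List, Set

SENSITIVE_COLUMNS = {"SSN"}  # columns that must not appear in the open copy


def strip_sensitive(rows: Iterable[Dict[str, Any]],
                    sensitive: Set[str] = SENSITIVE_COLUMNS) -> List[Dict[str, Any]]:
    """Return a copy of the rows without the sensitive columns."""
    return [
        {key: value for key, value in row.items() if key not in sensitive}
        for row in rows
    ]


# Sample rows from the slide.
user_profile_secure = [
    {"ID": 1, "First": "Sam", "Last": "Smith", "SSN": "111-11-1111"},
    {"ID": 2, "First": "Jane", "Last": "Jones", "SSN": "222-22-2222"},
]
user_profile = strip_sensitive(user_profile_secure)
```

Each copy would then be stored in its own bucket or prefix so that access can be granted per role.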
Tokenization
UserProfile
ID  First  Last   SSN_Token
1   Sam    Smith  8c9d409dcc43
2   Jane   Jones  06a38ea94e69
SSN_Tokens
Token         SSN
8c9d409dcc43  111-11-1111
06a38ea94e69  222-22-2222
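One way to produce tables shaped like the ones above is to replace each sensitive value with a random token and keep the mapping in a separate, tightly controlled table. The sketch below is an assumption about how such tokens could be generated (12 random hex characters via `secrets.token_hex`); the slides do not say how the workshop creates them, and the helper `tokenize` is hypothetical.

```python
import secrets
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def tokenize(rows: Iterable[Row], column: str = "SSN") -> Tuple[List[Row], List[Row]]:
    """Replace `column` with a random token and return (tokenized rows, token table)."""
    token_by_value: Dict[str, str] = {}
    tokenized: List[Row] = []
    for row in rows:
        value = row[column]
        if value not in token_by_value:
            # 6 random bytes give a 12-character hex token; collisions are
            # extremely unlikely at this scale and are not checked here.
            token_by_value[value] = secrets.token_hex(6)
        new_row = {k: v for k, v in row.items() if k != column}
        new_row[f"{column}_Token"] = token_by_value[value]
        tokenized.append(new_row)
    token_table = [{"Token": t, column: v} for v, t in token_by_value.items()]
    return tokenized, token_table


# Sample rows from the slide.
secure_rows = [
    {"ID": 1, "First": "Sam", "Last": "Smith", "SSN": "111-11-1111"},
    {"ID": 2, "First": "Jane", "Last": "Jones", "SSN": "222-22-2222"},
]
user_profile, ssn_tokens = tokenize(secure_rows)
```

Because the tokens are random rather than derived from the SSN, they cannot be reversed without access to the token table.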
Tokenization
ProfileView
ID  First  Last
1   Sam    Smith
2   Jane   Jones
ProfileSecureView
ID  First  Last   SSN
1   Sam    Smith  111-11-1111
2   Jane   Jones  222-22-2222
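The two views can be sketched with ordinary SQL. The example below uses an in-memory SQLite database purely as a local stand-in so it runs anywhere; in the workshop the views would be defined in Athena or Redshift Spectrum, whose SQL dialects differ in detail. Table, view, and column names follow the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserProfile (ID INTEGER, First TEXT, Last TEXT, SSN_Token TEXT);
CREATE TABLE SSN_Tokens (Token TEXT, SSN TEXT);

INSERT INTO UserProfile VALUES
  (1, 'Sam', 'Smith', '8c9d409dcc43'),
  (2, 'Jane', 'Jones', '06a38ea94e69');
INSERT INTO SSN_Tokens VALUES
  ('8c9d409dcc43', '111-11-1111'),
  ('06a38ea94e69', '222-22-2222');

-- Open view: hides the token column entirely.
CREATE VIEW ProfileView AS
  SELECT ID, First, Last FROM UserProfile;

-- Secure view: joins to the token table to resolve the real SSN.
CREATE VIEW ProfileSecureView AS
  SELECT p.ID, p.First, p.Last, t.SSN
  FROM UserProfile AS p
  JOIN SSN_Tokens AS t ON t.Token = p.SSN_Token;
""")

profile_rows = conn.execute("SELECT * FROM ProfileView ORDER BY ID").fetchall()
secure_view_rows = conn.execute("SELECT * FROM ProfileSecureView ORDER BY ID").fetchall()
```

Only roles that can read the `SSN_Tokens` data can use the secure view; everyone else is limited to `ProfileView`.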
Redshift Spectrum
UserProfileSecure
ID  First  Last   SSN
1   Sam    Smith  111-11-1111
2   Jane   Jones  222-22-2222
Bonus Content
• AWS Glue Development Endpoints – Apache Zeppelin notebook
• Amazon Redshift/Spectrum Integration
• AWS Database Migration Service (DMS) - Importing files from S3 to DynamoDB
Thank you!
Amar, Mike