The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
1. Public Sector Summit
Washington DC
2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Public Sector Summit
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
Stephen Moon
Specialist Solutions Architect
AWS
301318
3.
Agenda
DataOps
Data Supply Chain Pipeline
AWS Lake Formation
5.
What is DataOps?
An automated, process-oriented methodology, used by analytic and data
teams, to improve the quality and reduce the cycle time of data analytics.
The DataOps Engineer orchestrates and automates the data analytics
pipeline, promotes features to production and automates quality.
‒ Wikipedia ‒
6.
DataOps Principles (www.dataopsmanifesto.org)
1. Continually satisfy your customer (Customer Obsession):
Our highest priority is to satisfy the customer through the early and continuous
delivery of valuable analytic insights from a couple of minutes to weeks.
3. Embrace change (Deliver Results):
We welcome evolving customer needs, and in fact, we embrace them to generate
competitive advantage. We believe that the most efficient, effective, and agile
method of communication with customers is face-to-face conversation.
8. Reflect (Learn and Be Curious):
Analytic teams should fine-tune their operational performance by self-reflecting,
at regular intervals, on feedback provided by their customers, themselves, and
operational statistics.
7.
DataOps Principles (www.dataopsmanifesto.org)
12. Disposable environments (Frugality):
We believe it is important to minimize the cost for analytic team members to
experiment by giving them easy to create, isolated, safe, and disposable
technical environments that reflect their production environment.
13. Simplicity (Invent and Simplify):
We believe that continuous attention to technical excellence and good design
enhances agility; likewise simplicity--the art of maximizing the amount of work
not done--is essential.
14. Analytics is manufacturing:
Analytic pipelines are analogous to lean manufacturing lines. We believe a
fundamental concept of DataOps is a focus on process-thinking aimed at
achieving continuous efficiencies in the manufacture of analytic insight.
9.
Data Supply Chain Pipeline Mission Statement
Securely democratize data and deliver it to Communities of
Interest when they need it and how they need it.
10.
Operating Model
Ross, Jeanne W, et al. Enterprise Architecture As Strategy: Creating a Foundation for Business Execution. Harvard Business Review Press, 2006.
https://www.amazon.com/dp/B004OC07EE/ref=dp-kindle-redirect?_encoding=UTF8&btkr=1
[Figure: operating model, current state and future state]
11.
Architecture & Design Principles
Principle: Minimal Disruption
Statement: Minimize disruption to data producers in how they deliver their data
Principle: Configuration (80/20 Rule)
Statement: Focus on the 80% of use cases that can be satisfied with configurable components
Principle: Right Tool for the Right Job
Statement: Processes drive tooling; not the other way around
Principle: Conscious Decoupling
Statement: The right tool today may not be the right tool tomorrow
Principle: Data Residency
Statement: Users should access the data where IT lives regardless of where THEY live
12.
Conceptual Architecture
13.
Ingest
There is no single “tool” for receiving, inspecting, staging, and archiving data
Focus on cultivating the organizational competencies and the processes for engaging with Data Suppliers
Build tiger teams who understand the organizational domains of the Data Suppliers
Develop templates for Memorandums of Understanding (MoU) and Interface Control Documents (ICD) to govern the relationships with Data Suppliers
The result will be a small set of common patterns that can be standardized, automated, and scaled to service hundreds to thousands of Data Suppliers.
14.
Process
Extract & Load
• Cleanse
  Application of Universal Data Rules and Business Data Rules
  Entities and attributes remain distinct from other instances of the same entities and attributes
Entity Resolution
• Aggregate
  Instantiating two or more occurrences of the same entity as a single instance
  Attributes of aggregated entities remain distinct even though the attributes may be similar or the same
  Disparate IDs of the same entity become an attribute linked to a natural or synthetic UUID/GUID
• Associate – Defining the relationships among entities via the application of Business Relationship Rules
Master Data Management
• Merge – Combining aggregated instances of entity attributes into a single version of the truth
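The Aggregate step above can be illustrated with a minimal Python sketch. The record fields, the sample data, and the matching rule (same normalized email means same entity) are all assumptions made for illustration; the slides do not prescribe a matching rule or a data model.

```python
import uuid
from collections import defaultdict

# Hypothetical cleansed records from two source systems (illustrative fields).
records = [
    {"source": "crm", "source_id": "C-17", "name": "Ada Lovelace", "email": "ada@example.org"},
    {"source": "erp", "source_id": "9041", "name": "A. Lovelace", "email": "ADA@example.org"},
    {"source": "crm", "source_id": "C-22", "name": "Alan Turing", "email": "alan@example.org"},
]


def match_key(record):
    """Assumed rule: records sharing a normalized email are the same entity."""
    return record["email"].strip().lower()


def aggregate(records):
    """Group occurrences of the same entity under one synthetic UUID.

    Attributes stay distinct per occurrence (nothing is merged yet), and each
    source ID becomes an attribute linked to the entity's UUID.
    """
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    entities = []
    for key, occurrences in groups.items():
        entities.append({
            # uuid5 is deterministic, so a re-run produces the same entity ID.
            "entity_id": str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
            "source_ids": {(r["source"], r["source_id"]) for r in occurrences},
            "occurrences": occurrences,
        })
    return entities


entities = aggregate(records)
```

Keeping every occurrence intact leaves the later Merge step free to decide which attribute values form the single version of the truth.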
15.
Enrich
• Assimilate
  Organize entities & attributes for consumption by Communities of Interest
  Structured as Facts, Graphs, Time-series, and/or Matrices
  Driven by questions generated by the Communities of Interest
  CRISP-DM project scope
• Transform – Standardize
• Engineer – Normalize, Interpolate, Extrapolate
• Synthesize
  Obfuscate – Mask identifying data
  Anonymize – Apply privacy models
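As a rough illustration of the Obfuscate item, the sketch below masks an email address for display and replaces an identifier with a keyed hash so that records can still be joined. The function names and the key are hypothetical, and this is only one of many masking approaches; it does not implement the privacy models mentioned under Anonymize.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_email(email):
    """Keep the first character of the local part and the domain; mask the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (same input gives the same token)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


masked = mask_email("ada@example.org")
token = pseudonymize("123-45-6789")
```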
16.
Catalog & Profile
• Business Glossary
• Concept Descriptions
• Data Models
• Classifications (Labeling)
• Summary Statistics (supports Discovery and Exploration)
  Maximum
  Minimum
  Mean & Skew
  Mode
  Quartiles
  Standard Deviation
  Correlation Coefficient
  Depth & Breadth
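Most of the summary statistics listed above can be computed with the Python standard library. The sketch below profiles a single numeric column and computes a Pearson correlation between two columns; the function names and the use of population (rather than sample) formulas are assumptions, and Depth & Breadth are not covered.

```python
import math
import statistics


def profile(values):
    """Summary statistics for one numeric column."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    # Population skewness: mean cubed deviation divided by stdev cubed.
    skew = 0.0
    if stdev:
        skew = sum((v - mean) ** 3 for v in values) / len(values) / stdev ** 3
    return {
        "count": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean": mean,
        "skew": skew,
        "mode": statistics.mode(values),
        "quartiles": tuple(statistics.quantiles(values, n=4)),
        "stdev": stdev,
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


stats = profile([2, 4, 4, 4, 5, 5, 7, 9])
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
```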
17.
Circles of Concern for Ingest
Control
Applications/Systems which are owned and/or directly managed by the ingesting organization
Influence
Applications/Systems of which the ingesting organization is an internal or external stakeholder but does not own or manage the application/system
Interest
Applications/Systems of which the ingesting organization has a concern for the data but does not have control or influence over the application/system
Why is this important?
Determines how data is going to be ingested!
18.
Logical Architecture
19.
Our portfolio
Broad and deep portfolio, purpose-built for builders
Business Intelligence & Machine Learning
• QuickSight
• SageMaker
Analytics
• Redshift: Data warehousing
• EMR: Hadoop + Spark
• Kinesis Data Analytics: Real time
• Elasticsearch Service: Operational Analytics
• Athena: Interactive analytics
Databases
• RDS: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
• Aurora: MySQL, PostgreSQL
• DynamoDB: Key value, Document
• ElastiCache: Redis, Memcached
• Neptune: Graph
• Timestream: Time Series
• QLDB: Ledger Database
• RDS on VMware
Data Lake
• S3/Glacier
• Glue: ETL & Data Catalog
• Lake Formation: Data Lakes
Data Movement
• Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams
20.
What is a Data Lake?
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
21.
Why data lakes?
Data Lakes provide:
Relational and non-relational data
Scale-out to exabytes (EB)
Diverse set of analytics and machine learning tools
Work on data without any data movement
Designed for low cost storage and analytics
[Figure: sources (OLTP, ERP, CRM, LOB; devices, web, sensors, social), a data warehouse, and a cataloged data lake, serving business intelligence, machine learning, DW queries, big data processing, interactive, and real-time workloads]
22.
Physical Architecture
[Figure: physical architecture diagram. Components shown: databases, AWS DataSync, AWS Database Migration Service (AWS DMS), Amazon Kinesis, Amazon Aurora (operational data store and data warehouse), Amazon S3 (data lake), AWS Glue, Amazon EMR, Amazon Athena, Amazon Redshift, Amazon SageMaker, Amazon QuickSight, and other tools. Flow labels: Extract Warehouse Data, Load Raw Data, Load Data Warehouse, Build Data Marts]
23.
The Power of Data Lakes
Data Warehouse
• Permanent data store for structured data
• No direct access
Data Lake
• Ephemeral/Dynamic data storage for structured data
• Data sets purpose-built based on use cases (right tool)
• Many-to-One ratio of Tools-to-Data
• Only pay for data processing as it's needed
[Figure: a data warehouse on Amazon Aurora beside a data lake on Amazon S3 that is used by Amazon Redshift, Amazon Neptune, Amazon EMR, and Apache MXNet on AWS]
24.
Data Lake Challenges
• Maintaining a data catalog / enabling self-service access
• Configuring and managing access controls / Data governance
• Audit logging
Building data lakes can still take months
26.
Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prepare, and catalog data
4. Configure and enforce security and compliance policies (permissions)
5. Make data available for analytics
27.
How it works
28.
Key Components
• Blueprints / Data Importers – templates for ETL, metadata (schema) and partition management
• Enhanced Data Catalog – enables users to record more metadata and tag Data Catalog objects (i.e. databases, tables, columns)
• ML Transformations – ML algorithms that customers can use to create their own ML Transforms (e.g. record de-duplication)
• Enhanced Security & Governance – security and governance layer at the Data Catalog level
29.
Register existing data or import new
Amazon S3 forms the storage layer for Lake Formation
Register existing S3 buckets that contain your data
Ask Lake Formation to create required S3 buckets and import data into them
Data is stored in your account. You have direct access to it. No lock-in.
[Figure: Lake Formation building blocks: data lake storage, Data Catalog, access control, data import, crawlers, ML-based data prep]
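Registering an existing bucket could look roughly like the sketch below. It only builds the request arguments, so it runs without AWS credentials; the bucket and prefix names are hypothetical. The parameter names follow the `register_resource` operation of the boto3 `lakeformation` client as it exists today, which is an assumption relative to this talk, since the talk predates general availability.

```python
def register_request(bucket, prefix=""):
    """Build keyword arguments for registering an S3 location with Lake Formation."""
    arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        arn += "/" + prefix.strip("/")
    return {
        "ResourceArn": arn,
        # Let Lake Formation use its service-linked role to access the location.
        "UseServiceLinkedRole": True,
    }


# Hypothetical bucket and prefix.
request = register_request("example-data-lake", "raw/")

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("lakeformation").register_resource(**request)
```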
30.
Easily load data to your data lake
[Figure: logs and databases loaded into data lake storage through blueprints, as one-shot or incremental loads; Lake Formation building blocks shown: Data Catalog, access control, data import, crawlers, ML-based data prep]
31.
Blueprints / Data Importers
Blueprints are templates for data ingestion, transformation, metadata
(schema) and partition management. Blueprints help customers to
quickly and easily build and maintain a data lake.
Templates
32.
With blueprints
You
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
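Two of the blueprint responsibilities listed above, partitioning and keeping track of already processed data, can be sketched in plain Python. This is not how blueprints are implemented; the column names, the integer bookmark, and the Hive-style `column=value` partition keys are assumptions chosen for illustration. Schema discovery and format conversion are left out.

```python
from collections import defaultdict


def incremental_load(rows, bookmark, bookmark_column="updated_at",
                     partition_column="event_date"):
    """Return (partitions, new_bookmark) for rows newer than the bookmark.

    A bookmark of None means nothing has been processed yet (one-shot load).
    """
    new_rows = [r for r in rows if bookmark is None or r[bookmark_column] > bookmark]

    partitions = defaultdict(list)
    for row in new_rows:
        partitions[f"{partition_column}={row[partition_column]}"].append(row)

    # If nothing new arrived, the bookmark stays where it was.
    new_bookmark = max((r[bookmark_column] for r in new_rows), default=bookmark)
    return dict(partitions), new_bookmark


rows = [
    {"id": 1, "event_date": "2019-06-10", "updated_at": 1},
    {"id": 2, "event_date": "2019-06-10", "updated_at": 2},
    {"id": 3, "event_date": "2019-06-11", "updated_at": 3},
]
partitions, bookmark = incremental_load(rows, bookmark=None)

# A later run only picks up the row added since the last bookmark.
rows.append({"id": 4, "event_date": "2019-06-11", "updated_at": 4})
later_partitions, later_bookmark = incremental_load(rows, bookmark)
```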
33.
Blueprints build on AWS Glue
34.
Enhanced Data Catalog
AWS Lake Formation has an enhanced Data Catalog to enable users to
record more metadata and Tags for Databases, Tables and Columns. All
of the data is searchable.
35.
Search and collaborate across multiple users
Text-based, faceted search across all metadata
Add attributes like data owners, stewards, and others as table properties
Add data sensitivity level, column definitions, and others as column properties
Text-based search and filtering
Query data in Amazon Athena
36.
ML Transformations
AWS Lake Formation includes specialized ML-based dataset
transformation algorithms customers can use to create their own ML
Transforms. These include record de-duplication and match finding.
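To make the idea of record de-duplication concrete, here is a small stand-in that uses plain string similarity from the Python standard library. It is not machine learning and not the algorithm Lake Formation uses; the normalization rule, the 0.85 threshold, and the sample records are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def deduplicate(records, threshold=0.85):
    """Keep the first record of every group of near-identical records."""
    kept = []
    for record in records:
        candidate = normalize(record)
        is_duplicate = any(
            SequenceMatcher(None, candidate, normalize(seen)).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


unique = deduplicate([
    "Jon Smith, 12 Main St",
    "John Smith, 12 Main St.",
    "Mary Jones, 4 Elm Rd",
])
```

A learned matcher is trained on labeled examples instead of relying on a fixed threshold, which is what makes it better suited to messy real-world records.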
37.
De-duplicate
38.
Enhanced Governance Layer
AWS Lake Formation provides a security and governance layer at the Data
Catalog level. Users can grant or revoke permissions to the Data Catalog
objects such as databases, tables and columns for IAM principals (IAM
users and roles). This functionality will be extended to row level access in
subsequent releases.
39.
Security permissions in Lake Formation
Control data access with simple grant and revoke permissions
Specify permissions on tables and columns rather than on buckets and objects
Easily view policies granted to a particular user
Audit all data access in one place
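A table-level or column-level grant might be expressed as in the sketch below. It only builds the request arguments, so it runs without AWS access; the role ARN, database, table, and column names are hypothetical. The request shape follows the `grant_permissions` and `revoke_permissions` operations of the boto3 `lakeformation` client as it exists today, which may differ from what was shown at the time of this talk.

```python
def permission_request(principal_arn, database, table, columns=None,
                       permissions=("SELECT",)):
    """Build keyword arguments for a table-level or column-level grant or revoke."""
    if columns:
        # Column-level: only the listed columns are covered.
        resource = {"TableWithColumns": {
            "DatabaseName": database,
            "Name": table,
            "ColumnNames": list(columns),
        }}
    else:
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": list(permissions),
    }


# Hypothetical principal and catalog objects.
request = permission_request(
    "arn:aws:iam::111122223333:role/analyst",
    database="sales",
    table="orders",
    columns=["order_id", "order_date"],
)

# With credentials configured, the same arguments serve both operations:
#   import boto3
#   client = boto3.client("lakeformation")
#   client.grant_permissions(**request)
#   client.revoke_permissions(**request)
```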
40.
Security permissions in Lake Formation
Search and view permissions granted to a user, role, or group in one place
Verify permissions granted to a user
Easily revoke policies for a user
41.
Audit and monitor in real time
See detailed alerts in the console
Download audit logs for further analytics
Data ingest and catalog notifications are also published to Amazon CloudWatch Events
42.
Secure once, access in multiple ways
[Figure: an admin defines access control once on the Data Catalog and data lake storage; Amazon QuickSight and Amazon SageMaker are shown accessing the data]
43.
Grant table and column-level permissions
[Figure: table and column-level permissions granted to User 1 and User 2]
44.
Lake Formation Security Workflow
User
• IAM Users
• IAM Roles
• Active Directory (Federation)
45.
Example: A data lake in 3 easy steps
1. Use blueprints/data importers to ingest data
2. Grant permissions to securely share data
3. Query the data (Amazon Athena)
46.
Step 1: Use data importers to ingest data
47.
Imported data as table in the data lake
48.
Step 2: Grant permissions to securely share data
49.
Step 3: Run query in Amazon Athena
50.
AWS Lake Formation Pricing
No additional charges – Only pay for the underlying services used.
51.
Lake Formation FAQ
Q: When is Lake Formation going to be GA?
A: GA for the service will be Q2 2019.
Q: Will there be support for data lineage in the enhanced Lake Formation data catalog?
A: Lineage is on the roadmap for this year. We’ll have a better date after AWS Lake Formation goes GA.
Q: Will AWS Glue’s existing certifications extend over to AWS Lake Formation?
A: Yes.
52.
Thank you!
Stephen Moon
moonstep@amazon.com