S U M M I T
SYDNEY
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Serverless Analytics
Pipelines with AWS Glue
Tom McMeekin
Solutions Architect
Amazon Web Services
Drew Paterson
Solutions Architect
Amazon Web Services
There are more people accessing data
And more requirements for making data available
Data Engineering
Data stewardship
Data pipelines
Data structures
Data lakes
Extract, Transform, Load
Data modelling
Data marts
Data warehouse
AWS Glue
Serverless data catalogue and ETL service
AWS Glue Crawlers
Amazon S3 Data Lake Storage
AWS Glue Data Catalogue
OLTP
ERP
CRM
LOB
Devices
Web
Sensors
Social
Automatically build your Data Catalogue and keep it in sync
Built-in classifiers; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless
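The two ways of running a crawler (ad hoc or on a schedule) can be sketched as a small helper that builds the crawler definition. This is a minimal sketch, not the presenters' code: the crawler name, role ARN, database name and S3 path are hypothetical placeholders, and the dictionary keys are assumed to match the keyword arguments of the boto3 Glue `create_crawler` call.

```python
def build_crawler_request(name, role_arn, database, s3_path,
                          schedule=None, classifiers=()):
    """Build keyword arguments describing a Glue crawler over one S3 path."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if classifiers:
        # Names of custom classifiers (for example Grok-based ones).
        request["Classifiers"] = list(classifiers)
    if schedule is not None:
        # Omit the schedule entirely for a crawler that is only run ad hoc.
        request["Schedule"] = schedule
    return request


# Hypothetical values, for illustration only.
nightly = build_crawler_request(
    name="raw-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales",
    s3_path="s3://example-data-lake/raw/sales/",
    schedule="cron(0 1 * * ? *)",  # every day at 01:00 UTC
)
ad_hoc = build_crawler_request(
    name="raw-sales-crawler-adhoc",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales",
    s3_path="s3://example-data-lake/raw/sales/",
)
```

With credentials in place, a request like `nightly` could be passed as `boto3.client("glue").create_crawler(**nightly)`, and an ad hoc run started with `start_crawler(Name=...)`.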
AWS Glue Data Catalogue
Amazon Athena
Amazon Redshift
Amazon EMR
Amazon QuickSight
Amazon SageMaker
Amazon S3 Data Lake Storage
Search metadata for data discovery
Single view across all users, accounts, and workloads
AWS Glue Data Catalogue
Use AWS Glue to cleanse, prep, and move
Serverless Apache Spark or Python environment
Auto-generate, write or bring your own Python or Scala code
Amazon S3 (Raw data)
Amazon S3 (Staging data)
Amazon S3 (Processed data)
AWS Glue Data Catalogue
Apache Spark and AWS Glue ETL
AWS Glue builds on Apache Spark to offer ETL-specific functionality
Apache Spark Core: RDDs
Apache Spark DataFrames
AWS Glue DynamicFrame
Apache SparkSQL
AWS Glue ETL
Apache Spark is a distributed data processing engine for complex analytics
DataFrames and DynamicFrames
DataFrames
• Core data structure for SparkSQL
• Like structured tables
• Need schema up-front
• Each row has same structure
• Suited for SQL-like analytics
DynamicFrames
• Like DataFrames for ETL
• Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
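The contrast between a fixed up-front schema and self-describing records can be illustrated in plain Python. This is a conceptual sketch only, not the AWS Glue implementation or API: it computes a schema after reading the records and flags any field that appears with more than one type, which is the situation a DynamicFrame represents as a choice to be resolved later. The sample records are invented for illustration.

```python
from collections import defaultdict


def infer_schema(records):
    """Return {field: sorted list of type names seen across all records}."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def choice_fields(schema):
    """Fields whose type is ambiguous and would need resolving before SQL use."""
    return sorted(field for field, types in schema.items() if len(types) > 1)


# Semi-structured input: rows differ in both fields and types (made-up data).
records = [
    {"id": 1, "price": 9.5},
    {"id": "2", "price": 12.0, "coupon": "SPRING"},
    {"id": 3},
]

schema = infer_schema(records)
ambiguous = choice_fields(schema)
```

Here `id` is seen as both an integer and a string, so it is the one field reported as ambiguous; a DataFrame-style table would have needed that decision made before loading.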
Developer Endpoints / Notebooks
Raw Dataset
Amazon SageMaker Notebook
Optimised Dataset
Connect your IDE to an AWS Glue development endpoint
Environment to interactively develop, debug, and test ETL code
AWS Glue Data Catalogue
AWS Glue: Job Execution - Serverless
There is no need to provision, configure, or manage servers
• Specify the capacity that gets allocated to each job
• Pay only for the resources you consume
• Auto-configure VPC and role-based access
• Connect to on-premises JDBC data stores as source
VPC
Amazon RDS
AWS Glue
Corporate data center
Database
AWS Direct Connect
Three ways to orchestrate an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
Schedule-driven
Crawl raw dataset
Run ‘optimise’ job
Crawl optimised dataset
SLA deadline
Ready for reporting
Work backwards from a daily SLA deadline
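Working backwards from the deadline is simple date arithmetic, sketched below. The step durations, the 06:00 deadline and the 15-minute safety buffer are all assumptions made up for the example; in practice they would come from observed run times.

```python
from datetime import datetime, timedelta


def plan_backwards(deadline, steps, buffer=timedelta(minutes=15)):
    """Given (name, expected_duration) steps in execution order, return
    (name, start_time) pairs so the last step ends `buffer` before the deadline."""
    end = deadline - buffer
    plan = []
    for name, duration in reversed(steps):
        start = end - duration
        plan.append((name, start))
        end = start  # the previous step must finish when this one starts
    plan.reverse()
    return plan


# Hypothetical durations for the three pipeline steps.
steps = [
    ("crawl raw dataset", timedelta(minutes=10)),
    ("run optimise job", timedelta(minutes=45)),
    ("crawl optimised dataset", timedelta(minutes=10)),
]
deadline = datetime(2019, 5, 1, 6, 0)  # reporting must be ready by 06:00

plan = plan_backwards(deadline, steps)
for name, start in plan:
    print(f"{start:%H:%M}  {name}")
```

Each start time then becomes the schedule for the corresponding crawler or job trigger. The weakness of this approach is that any step running longer than expected eats into the buffer, which is what the event-driven and state machine–driven options avoid.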
Event-driven
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
Crawl raw dataset
Run ‘optimise’ job
Crawl optimised dataset
SLA deadline
Ready for reporting
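One way to picture the event-driven option is a single Lambda function that reacts to Glue state-change events and starts the next step. This is a sketch under assumptions, not the presenters' implementation: the crawler and job names are hypothetical, and the event field names (`detail-type`, `crawlerName`, `jobName`, `state`) follow the Glue event shapes as commonly documented and should be checked against the current AWS documentation.

```python
# Hypothetical resource names, used only for illustration.
RAW_CRAWLER = "raw-dataset-crawler"
OPTIMISE_JOB = "optimise-job"
OPTIMISED_CRAWLER = "optimised-dataset-crawler"


def route_event(event, glue):
    """Start the next pipeline step based on a Glue state-change event.

    `glue` is any object exposing start_job_run and start_crawler,
    such as a boto3 Glue client.
    """
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})

    if (detail_type == "Glue Crawler State Change"
            and detail.get("crawlerName") == RAW_CRAWLER
            and detail.get("state") == "Succeeded"):
        glue.start_job_run(JobName=OPTIMISE_JOB)
        return "started job"

    if (detail_type == "Glue Job State Change"
            and detail.get("jobName") == OPTIMISE_JOB
            and detail.get("state") == "SUCCEEDED"):
        glue.start_crawler(Name=OPTIMISED_CRAWLER)
        return "started crawler"

    return "ignored"


def lambda_handler(event, context):
    # Imported lazily so route_event can be exercised without AWS access.
    import boto3
    return route_event(event, boto3.client("glue"))
```

Keeping the routing logic separate from the handler means it can be tested with a stand-in client before it is wired to a CloudWatch Events rule.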
State machine–driven
Let AWS Step Functions drive the pipeline
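A state machine for the same three steps could look roughly like the definition built below. This is a sketch under assumptions: the state names, job name and Lambda ARN are hypothetical, the crawl steps are assumed to be handled by a Lambda function that runs a crawler and waits for it, and the job step uses the Step Functions Glue integration resource `arn:aws:states:::glue:startJobRun.sync`.

```python
import json


def build_definition(crawl_lambda_arn, job_name):
    """Build an Amazon States Language definition for crawl -> job -> crawl."""
    return {
        "Comment": "Crawl raw data, run the optimise job, crawl the output",
        "StartAt": "CrawlRawDataset",
        "States": {
            "CrawlRawDataset": {
                "Type": "Task",
                "Resource": crawl_lambda_arn,
                "Parameters": {"crawler": "raw-dataset-crawler"},
                "Next": "RunOptimiseJob",
            },
            "RunOptimiseJob": {
                "Type": "Task",
                # .sync makes the state wait until the job run finishes.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Next": "CrawlOptimisedDataset",
            },
            "CrawlOptimisedDataset": {
                "Type": "Task",
                "Resource": crawl_lambda_arn,
                "Parameters": {"crawler": "optimised-dataset-crawler"},
                "End": True,
            },
        },
    }


def execution_order(definition):
    """Follow Next pointers from StartAt to the End state (linear flows only)."""
    order, name = [], definition["StartAt"]
    while True:
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


# Hypothetical ARN and job name, for illustration only.
definition = build_definition(
    "arn:aws:lambda:ap-southeast-2:123456789012:function:example-run-crawler",
    "optimise-job",
)
definition_json = json.dumps(definition, indent=2)
```

Compared with the event-driven option, the ordering and any retry or failure handling live in one definition rather than being spread across event rules.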
Data Engineering
DevOps
CI/CD
Canary deployments
Feature flags
Chaos engineering
Configuration management
CI/CD for AWS Glue ETL
AWS CodePipeline
• Help Data Engineers write quality code
• Automate the ETL job release management process
• Mitigate risk
CI/CD for AWS Glue ETL
AWS CodePipeline
pipe_line_template.yaml
etl_job.py
live_test.py
AWS CodeCommit
CI/CD for AWS Glue ETL
AWS CloudFormation
Amazon S3 (Raw data)
Amazon S3 (Test data)
AWS CodePipeline
AWS CodeCommit
pipe_line_template.yaml
etl_job.py
Role
CI/CD for AWS Glue ETL
Amazon S3 (Raw data)
Amazon S3 (Test data)
AWS Glue Data Catalogue
AWS CodeBuild
AWS CloudFormation
AWS CodeCommit
live_test.py
CI/CD for AWS Glue ETL
Amazon Athena
AWS CodeBuild
AWS CloudFormation
AWS CodePipeline
Amazon S3 (Data Lake)
Amazon S3 (Test Data)
SELECT count(*) FROM "sales"."data_lake";
SELECT count(*) FROM "sales_parquet"."test_data";
AWS CodeCommit
✓
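The row-count check on this slide could be automated along the following lines. This is a guess at the shape of such a test, not the contents of the presenters' live_test.py: `run_query` is a hypothetical callable that submits SQL (for example to Amazon Athena) and returns the single count value, and the database and table names are taken from the two queries on the slide.

```python
def row_count(run_query, database, table):
    """Count rows in one table using the caller-supplied query runner."""
    sql = f'SELECT count(*) FROM "{database}"."{table}";'
    return int(run_query(sql))


def live_test(run_query):
    """Fail if the ETL output does not have the same row count as its source."""
    source = row_count(run_query, "sales", "data_lake")
    target = row_count(run_query, "sales_parquet", "test_data")
    if source != target:
        raise AssertionError(f"row count mismatch: {source} != {target}")
    return source
```

Run from the build stage, a raised error would stop the pipeline before the job is promoted to production. A count comparison only catches dropped or duplicated rows, so it is a smoke test rather than a full data-quality check.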
CI/CD for AWS Glue ETL
AWS CodeCommit
AWS CodeBuild
AWS CloudFormation
AWS CloudFormation
AWS CodePipeline
Amazon S3 (Raw data)
Amazon S3 (Prd data)
pipe_line_template.yaml
etl_job.py
Role
Go learn
• Remember the three steps to build a serverless data pipeline
• Use AWS Glue features
• Leverage the breadth of the AWS Platform
• Scan your badge to receive links to learning resources
Thank you!
Tom McMeekin Drew Paterson

Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019

  • 1. S U M M I T SYDNEY
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Building Serverless Analytics Pipelines with AWS Glue Tom McMeekin Solutions Architect Amazon Web Services Drew Paterson Solutions Architect Amazon Web Services
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T There are more people accessing data And more requirements for making data available
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Data Engineering Data stewardship Data pipelines Data structures Data lakes Extract Transform Load Data modelling Data marts Data warehouse
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Serverless data catalogue and ETL service
  • 6. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 7. AWS Glue Crawlers: Automatically build your Data Catalogue and keep it in sync. Built-in classifiers; custom classifiers using Grok expressions. Run ad hoc or on a schedule; serverless. Diagram labels: OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social; Amazon S3 Data Lake Storage; AWS Glue Data Catalogue
  • 8. AWS Glue Data Catalogue: Search metadata for data discovery. Single view across all users, accounts, and workloads. Diagram labels: Amazon Athena, Amazon Redshift, Amazon EMR, Amazon QuickSight, Amazon SageMaker; Amazon S3 Data Lake Storage; AWS Glue Data Catalogue
  • 10. Use AWS Glue to cleanse, prep, and move: Serverless Apache Spark or Python environment. Auto-generate, write, or bring your own Python or Scala code. Diagram labels: Amazon S3 (Raw data), Amazon S3 (Staging data), Amazon S3 (Processed data), AWS Glue Data Catalogue
  • 11. Apache Spark and AWS Glue ETL: Apache Spark is a distributed data processing engine for complex analytics. AWS Glue builds on Apache Spark to offer ETL-specific functionality. Diagram labels: Apache Spark Core: RDDs; Apache Spark DataFrames; AWS Glue DynamicFrame; Apache SparkSQL; AWS Glue ETL
  • 12. DataFrames and DynamicFrames. DataFrames: core data structure for SparkSQL; like structured tables; need schema up-front; each row has same structure; suited for SQL-like analytics. DynamicFrames: like DataFrames for ETL; designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
  • 13. Developer Endpoints / Notebooks: Connect your IDE to an AWS Glue development endpoint. Environment to interactively develop, debug, and test ETL code. Diagram labels: Raw Dataset, Amazon SageMaker Notebook, Optimised Dataset, AWS Glue Data Catalogue
  • 15. AWS Glue: Job Execution - Serverless. There is no need to provision, configure, or manage servers. Specify the capacity that gets allocated to each job; pay only for the resources you consume; auto-configure VPC and role-based access; connect to on-premises JDBC data stores as source. Diagram labels: VPC, Amazon RDS, AWS Glue, Corporate data center, Database, AWS Direct Connect
  • 16. Three ways to orchestrate an AWS Glue ETL pipeline: schedule-driven; event-driven; state machine–driven
  • 17. Schedule-driven: Work backwards from a daily SLA deadline. Diagram labels: Crawl raw dataset; Run ‘optimise’ job; Crawl optimised dataset; Ready for reporting; SLA deadline
  • 18. Event-driven: Let Amazon CloudWatch Events and AWS Lambda drive the pipeline. Diagram labels: Crawl raw dataset; Run ‘optimise’ job; Crawl optimised dataset; Ready for reporting; SLA deadline
  • 19. State machine–driven: Let AWS Step Functions drive the pipeline
  • 21. Data Engineering: DevOps, CI/CD, canary deployments, feature flags, chaos engineering, configuration management
  • 22. CI/CD for AWS Glue ETL: Help data engineers write quality code; automate the ETL job release management process; mitigate risk. Diagram labels: AWS CodePipeline
  • 23. CI/CD for AWS Glue ETL. Diagram labels: AWS CodePipeline; AWS CodeCommit; pipe_line_template.yaml; etl_job.py; live_test.py
  • 24. CI/CD for AWS Glue ETL. Diagram labels: AWS CodeCommit; AWS CodePipeline; AWS CloudFormation; pipe_line_template.yaml; etl_job.py; Role; Amazon S3 (Raw data); Amazon S3 (Test data)
  • 25. CI/CD for AWS Glue ETL. Diagram labels: AWS CodeCommit; AWS CloudFormation; AWS CodeBuild; live_test.py; Amazon S3 (Raw data); Amazon S3 (Test data); AWS Glue Data Catalogue
  • 26. CI/CD for AWS Glue ETL. Diagram labels: AWS CodeCommit; AWS CodePipeline; AWS CloudFormation; AWS CodeBuild; Amazon Athena; Amazon S3 (Data Lake); Amazon S3 (Test Data). Queries: SELECT count(*) FROM "sales"."data_lake"; SELECT count(*) FROM "sales_parquet"."test_data"; ✓
  • 27. CI/CD for AWS Glue ETL. Diagram labels: AWS CodeCommit; AWS CodeBuild; AWS CloudFormation (shown twice); AWS CodePipeline; pipe_line_template.yaml; etl_job.py; Role; Amazon S3 (Raw data); Amazon S3 (Prd data)
  • 28. Go learn: Remember the three steps to build a serverless data pipeline; use AWS Glue features; leverage the breadth of the AWS Platform; scan your badge to receive links to learning resources
  • 29. Thank you! Tom McMeekin, Drew Paterson
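
For the schedule-driven approach on slide 17, working backwards from the SLA deadline comes down to simple time arithmetic. The sketch below is one way to do it in Python. The step names come from the slide, but the step durations, the safety buffer, and the 06:00 deadline are illustrative assumptions, not figures from the talk.

```python
from datetime import datetime, timedelta

# Pipeline steps from slide 17, in execution order.
# The durations are ASSUMED for illustration; measure your own jobs.
STEPS = [
    ("Crawl raw dataset", timedelta(minutes=10)),
    ("Run 'optimise' job", timedelta(minutes=45)),
    ("Crawl optimised dataset", timedelta(minutes=10)),
]


def schedule_backwards(deadline, steps, buffer=timedelta(minutes=15)):
    """Return (step name, start time) pairs, in execution order, such that
    the last step finishes `buffer` before `deadline`."""
    end = deadline - buffer
    plan = []
    for name, duration in reversed(steps):
        start = end - duration
        plan.append((name, start))
        end = start  # the preceding step must finish when this one starts
    return list(reversed(plan))


if __name__ == "__main__":
    sla = datetime(2019, 5, 1, 6, 0)  # hypothetical 06:00 SLA deadline
    for name, start in schedule_backwards(sla, STEPS):
        print(f"{start:%H:%M}  {name}")
```

Each computed start time could then become a schedule for the corresponding crawler or job. The fixed durations are the weak point of this approach: if a step overruns, the next one starts before its input is ready, which is what the event-driven and state machine–driven options avoid.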
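
Slide 26 shows the live test comparing a row count from the data lake table with a row count from the Parquet test output. A minimal Python sketch of that check follows. The two queries are taken from the slide. The `run_count` callable is a hypothetical stand-in for whatever executes a query and returns its single integer result (for example, a wrapper around Amazon Athena), and it is stubbed here so the sketch runs without AWS access. The actual contents of `live_test.py` are not shown in the deck.

```python
from typing import Callable, Dict

# Queries as shown on slide 26.
SOURCE_QUERY = 'SELECT count(*) FROM "sales"."data_lake";'
TARGET_QUERY = 'SELECT count(*) FROM "sales_parquet"."test_data";'


def row_counts_match(run_count: Callable[[str], int]) -> bool:
    """Return True when the ETL output has as many rows as its source.

    `run_count` is a hypothetical helper that runs a count query and
    returns the integer result.
    """
    return run_count(SOURCE_QUERY) == run_count(TARGET_QUERY)


def make_stub(counts: Dict[str, int]) -> Callable[[str], int]:
    """Build a stand-in query runner backed by a dict (local runs only)."""
    return lambda query: counts[query]


if __name__ == "__main__":
    stub = make_stub({SOURCE_QUERY: 1000, TARGET_QUERY: 1000})
    print("live test passed" if row_counts_match(stub) else "live test failed")
```

In a build stage, a failed comparison would typically exit with a non-zero status so the pipeline stops before the production deployment step.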