SlideShare a Scribd company logo
1 of 13
Download to read offline
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Serverless Data Prep with AWS Glue
ABD215
R o y H a s s o n – G l o b a l B u s i n e s s D e v e l o p m e n t M a n a g e r
S a n t o s h C h a n d r a c h o o d – S o f t w a r e D e v e l o p m e n t M a n a g e r
L i a V a d e r – E n t e r p r i s e S o l u t i o n s A r c h i t e c t
N o v e m b e r 2 0 1 7
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
Chat on AWS Glue & Spark
Data Transformation Machine Learning Explore
Review workshop architecture
We talk
You build
Check access to required products
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue – Overview
 Hive Metastore compatible with enhanced functionality
 Crawlers automatically extracts metadata and creates tables
 Integrated with Amazon Athena, Amazon Redshift Spectrum
 Run jobs on a serverless Spark platform
 Provides flexible scheduling
 Handles dependency resolution, monitoring and alerting
 Auto-generates ETL code
 Build on open frameworks – Python and Spark
 Developer Endpoint with Interactive Notebook
Job Authoring
Job Execution
Data Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue – Data Catalog
Unified metadata repository across relational databases, Amazon RDS, Amazon
Redshift, and Amazon S3 accessible via Amazon Athena, Amazon Redshift Spectrum,
Amazon EMR and API
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue – ETL
Automatically generated ETL code running on serverless Apache Spark with the power
and flexibility to bring data together.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue – Developer Endpoint
Explore, visualize and develop using a personal, serverless environment with
interactive REPL and Notebooks.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark
Apache Spark is a fast, easy to use general engine for large-scale data processing and
machine learning.
Spark Core
Spark
SQL
Spark
Streaming
MLlib GraphX
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Real World Application
1. Web scraping – Automate a process to crape forum comments to analyze customer
experience and challenges with a product
• Automate scraping, parsing and reformatting of data
• Prepare data for machine learning
• Build machine learning models to extract insight from data
2. Venue Ratings – Build graph representation of users, venues and ratings
• Consume a collection of venue checkins and ratings
• Map users to venues
• Map venues to rating
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architecture
Web
Forums
Venue
Ratings
Zeppelin
Notebook
AWS
Glue
Amazon
S3
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting Started
1. Make sure your AWS user account has the following permissions:
• AmazonEC2FullAccess
• IAMFullAccess
2. Visit the link below to setup permissions and launch your dev endpoint
3. At the same link, download the 3 workshop notebooks to your machine
4. Login to Zeppelin running on your dev endpoint and upload the notebooks
5. Work through each notebook at your own pace
http://workshop-public.s3-website-
us-east-1.amazonaws.com/
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cleanup
To make sure you don’t incur unnecessary costs please make sure to remove all
resources created.
1. From AWS CloudFormation console, select the AWS Glue Notebook stack, delete it
2. From AWS Glue console, select the Dev Endpoint and delete it
3. From AWS Glue console, select the databases, tables and crawlers created during
the session and delete them
4. From S3 console, select any buckets or prefixes (folders) you used for the workshop
and delete them
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Continue Learning
• AWS Glue
• Apache Spark
• Apache Zeppelin
• Hands on workshop using AWS Glue, Amazon Athena and Amazon Redshift Spectrum
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!

More Related Content

What's hot

AMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AIAMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AIAmazon Web Services
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017Amazon Web Services
 
ABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseAmazon Web Services
 
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...Amazon Web Services
 
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...Amazon Web Services
 
ABD330_Combining Batch and Stream Processing to Get the Best of Both Worlds
ABD330_Combining Batch and Stream Processing to Get the Best of Both WorldsABD330_Combining Batch and Stream Processing to Get the Best of Both Worlds
ABD330_Combining Batch and Stream Processing to Get the Best of Both WorldsAmazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 
ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...
ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...
ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...Amazon Web Services
 
ABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsAmazon Web Services
 
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...Amazon Web Services
 
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & SpectrumABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & SpectrumAmazon Web Services
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...Amazon Web Services
 
DEV337_Deploy a Data Lake with AWS CloudFormation
DEV337_Deploy a Data Lake with AWS CloudFormationDEV337_Deploy a Data Lake with AWS CloudFormation
DEV337_Deploy a Data Lake with AWS CloudFormationAmazon Web Services
 
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...Amazon Web Services
 
ABD301-Analyzing Streaming Data in Real Time with Amazon Kinesis
ABD301-Analyzing Streaming Data in Real Time with Amazon KinesisABD301-Analyzing Streaming Data in Real Time with Amazon Kinesis
ABD301-Analyzing Streaming Data in Real Time with Amazon KinesisAmazon Web Services
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...Amazon Web Services
 
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...Amazon Web Services
 
FSV302_An Architecture for Trade Capture and Regulatory Reporting
FSV302_An Architecture for Trade Capture and Regulatory ReportingFSV302_An Architecture for Trade Capture and Regulatory Reporting
FSV302_An Architecture for Trade Capture and Regulatory ReportingAmazon Web Services
 

What's hot (20)

AMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AIAMF305_Autonomous Driving Algorithm Development on Amazon AI
AMF305_Autonomous Driving Algorithm Development on Amazon AI
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
ABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For Enterprise
 
ABD217_From Batch to Streaming
ABD217_From Batch to StreamingABD217_From Batch to Streaming
ABD217_From Batch to Streaming
 
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
ABD324_Migrating Your Oracle Data Warehouse to Amazon Redshift Using AWS DMS ...
 
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
 
ABD330_Combining Batch and Stream Processing to Get the Best of Both Worlds
ABD330_Combining Batch and Stream Processing to Get the Best of Both WorldsABD330_Combining Batch and Stream Processing to Get the Best of Both Worlds
ABD330_Combining Batch and Stream Processing to Get the Best of Both Worlds
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 
ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...
ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...
ABD322_Implementing a Flight Simulator Interface Using AI, Virtual Reality, a...
 
ABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data Applications
 
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
ABD316_American Heart Association Finding Cures to Heart Disease Through the ...
 
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & SpectrumABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
ABD304-R-Best Practices for Data Warehousing with Amazon Redshift & Spectrum
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
How Netflix Monitors Applications in Near Real-time w Amazon Kinesis - ABD401...
 
DEV337_Deploy a Data Lake with AWS CloudFormation
DEV337_Deploy a Data Lake with AWS CloudFormationDEV337_Deploy a Data Lake with AWS CloudFormation
DEV337_Deploy a Data Lake with AWS CloudFormation
 
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
 
ABD301-Analyzing Streaming Data in Real Time with Amazon Kinesis
ABD301-Analyzing Streaming Data in Real Time with Amazon KinesisABD301-Analyzing Streaming Data in Real Time with Amazon Kinesis
ABD301-Analyzing Streaming Data in Real Time with Amazon Kinesis
 
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi...
 
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
 
FSV302_An Architecture for Trade Capture and Regulatory Reporting
FSV302_An Architecture for Trade Capture and Regulatory ReportingFSV302_An Architecture for Trade Capture and Regulatory Reporting
FSV302_An Architecture for Trade Capture and Regulatory Reporting
 

Similar to Serverless Data Prep with AWS Glue

CON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWSCON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWSAmazon Web Services
 
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017Amazon Web Services
 
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)Amazon Web Services
 
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...Amazon Web Services
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Amazon Web Services
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAmazon Web Services
 
Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...Amazon Web Services
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Amazon Web Services
 
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...Amazon Web Services
 
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdfDEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdfAmazon Web Services
 
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017Amazon Web Services
 
LFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdfLFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdfAmazon Web Services
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...Amazon Web Services
 
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...Amazon Web Services
 
DAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudDAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudAmazon Web Services
 
Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018Brendan Bouffler
 
Design, Build, and Modernize Your Web Applications with AWS
 Design, Build, and Modernize Your Web Applications with AWS Design, Build, and Modernize Your Web Applications with AWS
Design, Build, and Modernize Your Web Applications with AWSDonnie Prakoso
 
Serverless Architecture Patterns
Serverless Architecture PatternsServerless Architecture Patterns
Serverless Architecture PatternsAmazon Web Services
 

Similar to Serverless Data Prep with AWS Glue (20)

CON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWSCON319_Interstella GTC CICD for Containers on AWS
CON319_Interstella GTC CICD for Containers on AWS
 
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
Interstella 8888: CICD for Containers on AWS - CON319 - re:Invent 2017
 
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
 
Building Web Apps on AWS
Building Web Apps on AWSBuilding Web Apps on AWS
Building Web Apps on AWS
 
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
Leo Zhadanovsky - Building Web Apps with AWS CodeStar and AWS Elastic Beansta...
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
 
Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...Building .NET-based Serverless Architectures and Running .NET Core Microservi...
Building .NET-based Serverless Architectures and Running .NET Core Microservi...
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
 
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
GPSBUS220-Refactor and Replatform .NET Apps to Use the Latest Microsoft SQL S...
 
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdfDEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
DEV305_Manage Your Applications with AWS Elastic Beanstalk.pdf
 
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
High-Throughput Genomics on AWS - LFS309 - re:Invent 2017
 
LFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdfLFS309-High-Throughput Genomics on AWS.pdf
LFS309-High-Throughput Genomics on AWS.pdf
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
 
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
I Want to Analyze and Visualize Website Access Logs, but Why Do I Need Server...
 
DAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the CloudDAT317_Migrating Databases and Data Warehouses to the Cloud
DAT317_Migrating Databases and Data Warehouses to the Cloud
 
Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018Genomics on aws-webinar-april2018
Genomics on aws-webinar-april2018
 
Design, Build, and Modernize Your Web Applications with AWS
 Design, Build, and Modernize Your Web Applications with AWS Design, Build, and Modernize Your Web Applications with AWS
Design, Build, and Modernize Your Web Applications with AWS
 
Serverless Architecture Patterns
Serverless Architecture PatternsServerless Architecture Patterns
Serverless Architecture Patterns
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Serverless Data Prep with AWS Glue

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Serverless Data Prep with AWS Glue ABD215 R o y H a s s o n – G l o b a l B u s i n e s s D e v e l o p m e n t M a n a g e r S a n t o s h C h a n d r a c h o o d – S o f t w a r e D e v e l o p m e n t M a n a g e r L i a V a d e r – E n t e r p r i s e S o l u t i o n s A r c h i t e c t N o v e m b e r 2 0 1 7
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Chat on AWS Glue & Spark Data Transformation Machine Learning Explore Review workshop architecture We talk You build Check access to required products
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue – Overview  Hive Metastore compatible with enhanced functionality  Crawlers automatically extracts metadata and creates tables  Integrated with Amazon Athena, Amazon Redshift Spectrum  Run jobs on a serverless Spark platform  Provides flexible scheduling  Handles dependency resolution, monitoring and alerting  Auto-generates ETL code  Build on open frameworks – Python and Spark  Developer Endpoint with Interactive Notebook Job Authoring Job Execution Data Catalog
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue – Data Catalog Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3 accessible via Amazon Athena, Amazon Redshift Spectrum, Amazon EMR and API
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue – ETL Automatically generated ETL code running on serverless Apache Spark with the power and flexibility to bring data together.
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Glue – Developer Endpoint Explore, visualize and develop using a personal, serverless environment with interactive REPL and Notebooks.
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark Apache Spark is a fast, easy to use general engine for large-scale data processing and machine learning. Spark Core Spark SQL Spark Streaming MLlib GraphX
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real World Application 1. Web scraping – Automate a process to crape forum comments to analyze customer experience and challenges with a product • Automate scraping, parsing and reformatting of data • Prepare data for machine learning • Build machine learning models to extract insight from data 2. Venue Ratings – Build graph representation of users, venues and ratings • Consume a collection of venue checkins and ratings • Map users to venues • Map venues to rating
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architecture Web Forums Venue Ratings Zeppelin Notebook AWS Glue Amazon S3
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting Started 1. Make sure your AWS user account has the following permissions: • AmazonEC2FullAccess • IAMFullAccess 2. Visit the link below to setup permissions and launch your dev endpoint 3. At the same link, download the 3 workshop notebooks to your machine 4. Login to Zeppelin running on your dev endpoint and upload the notebooks 5. Work through each notebook at your own pace http://workshop-public.s3-website- us-east-1.amazonaws.com/
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Cleanup To make sure you don’t incur unnecessary costs please make sure to remove all resources created. 1. From AWS CloudFormation console, select the AWS Glue Notebook stack, delete it 2. From AWS Glue console, select the Dev Endpoint and delete it 3. From AWS Glue console, select the databases, tables and crawlers created during the session and delete them 4. From S3 console, select any buckets or prefixes (folders) you used for the workshop and delete them
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Continue Learning • AWS Glue • Apache Spark • Apache Zeppelin • Hands on workshop using AWS Glue, Amazon Athena and Amazon Redshift Spectrum
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU!