Serverless Data Prep with AWS Glue

In this session, you learn how to set up a crawler to automatically discover your data and build your AWS Glue Data Catalog. You then auto-generate an AWS Glue ETL script, download it, and interactively edit it using a Zeppelin notebook, connected to an AWS Glue development endpoint. After that, you upload this script to Amazon S3, reuse it across multiple jobs, and add trigger conditions to run the jobs. The resulting datasets automatically get registered in the AWS Glue Data Catalog and you can then query these new datasets from Amazon EMR and Amazon Athena. Prerequisites: Knowledge of Python and familiarity with big data applications is preferred but not required. Attendees must bring their own laptops.

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Serverless Data Prep with AWS Glue
ABD215
Roy Hasson – Global Business Development Manager
Santosh Chandrachood – Software Development Manager
Lia Vader – Enterprise Solutions Architect
November 2017

Agenda
Chat on AWS Glue & Spark
Data Transformation, Machine Learning, Explore
Review workshop architecture
We talk
You build
Check access to required products

AWS Glue – Overview
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena and Amazon Redshift Spectrum
• Runs jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting
• Auto-generates ETL code
• Built on open frameworks – Python and Spark
• Developer Endpoint with Interactive Notebook
Job Authoring
Job Execution
Data Catalog
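
As a rough illustration of the crawler bullet above, a crawler can be defined and started through the boto3 Glue client. This is a minimal sketch rather than the workshop's own code: the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the helper expects an already-constructed Glue client.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue, request):
    """Create the crawler, then start it so it populates the Data Catalog."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only.
request = crawler_request(
    name="workshop-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="workshop_db",
    s3_path="s3://example-bucket/raw/",
)

# With AWS credentials configured, this could be run as:
#   import boto3
#   create_and_start_crawler(boto3.client("glue"), request)
```

Passing the client in as an argument keeps the request-building step free of any AWS dependency, so it can be inspected before anything is created.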

AWS Glue – Data Catalog
Unified metadata repository across relational databases, Amazon RDS, Amazon
Redshift, and Amazon S3 accessible via Amazon Athena, Amazon Redshift Spectrum,
Amazon EMR and API
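
To make the "and API" access path concrete, the sketch below lists the table names registered in one catalog database using the boto3 Glue `get_tables` paginator. The database name in the usage comment is a hypothetical placeholder, and the function takes the client as a parameter so it makes no assumptions about how credentials are configured.

```python
def list_table_names(glue, database):
    """Return the names of all tables registered in a Data Catalog database."""
    names = []
    paginator = glue.get_paginator("get_tables")
    # Each page is one get_tables response; tables are under "TableList".
    for page in paginator.paginate(DatabaseName=database):
        names.extend(table["Name"] for table in page["TableList"])
    return names


# With AWS credentials configured, this could be run as:
#   import boto3
#   print(list_table_names(boto3.client("glue"), "workshop_db"))
```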

AWS Glue – ETL
Automatically generated ETL code running on serverless Apache Spark with the power
and flexibility to bring data together.
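
For orientation, an ETL script of this kind typically reads a catalog table, remaps columns, and writes the result to Amazon S3. The following is a hedged sketch, not the script the session generates: the column names, database, table, and output path are hypothetical, and the `awsglue` imports resolve only inside an AWS Glue job or development endpoint, which is why they sit inside the function.

```python
# Column mappings: (source name, source type, target name, target type).
# The column names are hypothetical.
MAPPINGS = [
    ("user_id", "string", "user_id", "string"),
    ("rating", "string", "rating", "double"),
]


def run_job(database, table_name, output_path):
    """Read a catalog table, remap its columns, and write Parquet to S3."""
    # These imports are available only in a Glue job or dev endpoint.
    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    source = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    mapped = ApplyMapping.apply(frame=source, mappings=MAPPINGS)
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )


# Inside Glue, a hypothetical invocation might look like:
#   run_job("workshop_db", "venue_ratings", "s3://example-bucket/clean/")
```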

AWS Glue – Developer Endpoint
Explore, visualize and develop using a personal, serverless environment with
interactive REPL and Notebooks.

Apache Spark
Apache Spark is a fast, easy to use general engine for large-scale data processing and
machine learning.
Components: Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX

Real World Application
1. Web scraping – Automate a process to scrape forum comments to analyze customer
experience and challenges with a product
• Automate scraping, parsing and reformatting of data
• Prepare data for machine learning
• Build machine learning models to extract insight from data
2. Venue Ratings – Build graph representation of users, venues and ratings
• Consume a collection of venue check-ins and ratings
• Map users to venues
• Map venues to ratings
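
The venue-ratings mapping can be illustrated without Spark. This is a plain-Python sketch of the idea only: the workshop notebooks run on Spark, and the check-in records below are made-up sample data.

```python
from collections import defaultdict

# Hypothetical check-in records: (user, venue, rating).
checkins = [
    ("alice", "cafe", 4),
    ("bob", "cafe", 5),
    ("alice", "museum", 3),
]

user_to_venues = defaultdict(set)     # edges from users to venues
venue_to_ratings = defaultdict(list)  # edges from venues to ratings
for user, venue, rating in checkins:
    user_to_venues[user].add(venue)
    venue_to_ratings[venue].append(rating)

# Summarize each venue by its mean rating.
average_rating = {
    venue: sum(ratings) / len(ratings)
    for venue, ratings in venue_to_ratings.items()
}
print(average_rating)  # {'cafe': 4.5, 'museum': 3.0}
```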

Architecture
[Diagram: Web Forums and Venue Ratings data sources, Zeppelin Notebook, AWS Glue, Amazon S3]

Getting Started
1. Make sure your AWS user account has the following permissions:
• AmazonEC2FullAccess
• IAMFullAccess
2. Visit the link below to set up permissions and launch your dev endpoint
3. At the same link, download the 3 workshop notebooks to your machine
4. Log in to Zeppelin running on your dev endpoint and upload the notebooks
5. Work through each notebook at your own pace
http://workshop-public.s3-website-us-east-1.amazonaws.com/

Cleanup
To avoid unnecessary costs, remove all resources created during the workshop.
1. From the AWS CloudFormation console, select the AWS Glue Notebook stack and delete it
2. From the AWS Glue console, select the Dev Endpoint and delete it
3. From the AWS Glue console, select the databases, tables and crawlers created during
the session and delete them
4. From the S3 console, select any buckets or prefixes (folders) you used for the workshop
and delete them
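
The console steps for the stack, dev endpoint, crawlers and databases could also be scripted. The sketch below is one possible approach using boto3 calls, not part of the original workshop: every resource name is a hypothetical placeholder, clients are passed in by the caller, and S3 cleanup is left to the console because deleting objects is easy to get wrong in a script.

```python
def cleanup(cloudformation, glue, stack_name, endpoint_name,
            crawler_names, database_names):
    """Delete the workshop's CloudFormation stack and Glue resources."""
    cloudformation.delete_stack(StackName=stack_name)
    glue.delete_dev_endpoint(EndpointName=endpoint_name)
    for name in crawler_names:
        glue.delete_crawler(Name=name)
    for name in database_names:
        # Deleting a database also removes the tables registered in it.
        glue.delete_database(Name=name)


# With AWS credentials configured, a hypothetical invocation might be:
#   import boto3
#   cleanup(boto3.client("cloudformation"), boto3.client("glue"),
#           "example-notebook-stack", "example-endpoint",
#           ["workshop-crawler"], ["workshop_db"])
```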

Continue Learning
• AWS Glue
• Apache Spark
• Apache Zeppelin
• Hands-on workshop using AWS Glue, Amazon Athena and Amazon Redshift Spectrum

THANK YOU!
