2. What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easier to discover,
prepare, move, and integrate data from multiple sources for analytics, machine
learning (ML), and application development.
AWS Glue is provided by Amazon as a part of Amazon Web Services (AWS). It runs
data integration jobs on demand, on a schedule, or in response to events, and
automatically manages the computing resources those jobs require. It became
generally available in August 2017.
A primary purpose of Glue is to scan data stores reachable from the same Virtual
Private Cloud (or an equivalent accessible network element, even if not provided by
AWS), particularly Amazon S3. Jobs are billed according to compute time, with a
minimum of 1 minute. Glue discovers the source data and stores the associated
metadata (e.g. the table schema: field names, types, and lengths) in the AWS Glue
Data Catalog, which is then accessible via the AWS console or APIs.
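The billing rule above (compute time with a 1-minute minimum) can be made concrete
with a little arithmetic. The sketch below assumes that compute time is metered in
data processing unit (DPU) hours. The function name and the hourly rate are
illustrative placeholders, not published prices.

```python
def estimate_job_cost(run_seconds, dpus, rate_per_dpu_hour, minimum_seconds=60):
    """Estimate the cost of one job run.

    rate_per_dpu_hour is a hypothetical value supplied by the caller;
    check the current rate for your Region before relying on it.
    """
    if run_seconds < 0 or dpus <= 0:
        raise ValueError("run_seconds must be >= 0 and dpus must be > 0")
    billed_seconds = max(run_seconds, minimum_seconds)  # apply the 1-minute minimum
    dpu_hours = dpus * billed_seconds / 3600
    return round(dpu_hours * rate_per_dpu_hour, 4)


# A 20-second run is billed like a 60-second run; 0.50 is a made-up rate.
print(estimate_job_cost(run_seconds=20, dpus=2, rate_per_dpu_hour=0.50))
print(estimate_job_cost(run_seconds=600, dpus=2, rate_per_dpu_hour=0.50))
```

Because of the minimum, very short jobs cost the same as a one-minute job, so it can
pay to batch many tiny runs into one.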
3. What is ETL?
ETL stands for extract, transform, and load. For example, with usage logs stored in
S3:
Extract: The script reads all the usage data from the S3 bucket into a single data
frame (similar to a data frame in Pandas).
Transform: Suppose the original data contains 10 log entries per second on average.
The analytics team wants the data aggregated per minute using a specific logic.
Load: The script writes the processed data to another S3 bucket for the analytics
team.
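The three steps can be sketched in plain Python. This is a minimal illustration, not
a Glue job: the input format ("timestamp,value" lines), the averaging logic, and all
function names are assumptions made for the example, and in-memory lists stand in
for the S3 buckets.

```python
from collections import defaultdict
from datetime import datetime


def extract(lines):
    """Extract: parse 'ISO-timestamp,value' lines into (datetime, float) pairs."""
    records = []
    for line in lines:
        stamp, value = line.strip().split(",")
        records.append((datetime.fromisoformat(stamp), float(value)))
    return records


def transform(records):
    """Transform: average the values that fall within each one-minute bucket."""
    buckets = defaultdict(list)
    for stamp, value in records:
        buckets[stamp.replace(second=0, microsecond=0)].append(value)
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}


def load(aggregated):
    """Load: render CSV lines; a real job would write these to the target bucket."""
    return [f"{minute.isoformat()},{avg}" for minute, avg in aggregated.items()]


raw_logs = [
    "2024-01-01T10:00:05,10",
    "2024-01-01T10:00:40,20",
    "2024-01-01T10:01:10,30",
]
for row in load(transform(extract(raw_logs))):
    print(row)
```

Keeping the three stages as separate functions mirrors the ETL split and makes each
stage easy to test on its own.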
4. How does AWS Glue work?
AWS Glue can run your extract, transform, and load (ETL) jobs as new data arrives.
For example, you can configure AWS Glue to initiate your ETL jobs to run as soon as
new data becomes available in Amazon Simple Storage Service (S3).
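One possible way to wire this up, shown here as an assumption rather than the only
option, is a small AWS Lambda function that receives S3 event notifications and
starts a Glue job through the boto3 `start_job_run` call. The job name and the
`--input_path` argument are hypothetical.

```python
from urllib.parse import unquote_plus

JOB_NAME = "usage-aggregation-job"  # hypothetical job name


def job_runs_for_event(event, job_name=JOB_NAME):
    """Build one start_job_run request per object listed in an S3 event."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        requests.append({
            "JobName": job_name,
            # "--input_path" is a made-up job argument name for this sketch
            "Arguments": {"--input_path": f"s3://{bucket}/{key}"},
        })
    return requests


def handler(event, context=None):
    """Lambda-style entry point; needs boto3 and AWS credentials to run."""
    import boto3  # imported lazily so the helper above works without it

    glue = boto3.client("glue")
    return [glue.start_job_run(**req)["JobRunId"] for req in job_runs_for_event(event)]


sample_event = {"Records": [{"s3": {"bucket": {"name": "raw-logs"},
                                    "object": {"key": "2024/01/usage+data.json"}}}]}
print(job_runs_for_event(sample_event))
```

Separating request building from the API call lets you check the event parsing
locally before deploying anything.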
5. How does AWS Glue work?
Choose your preferred data integration engine in AWS Glue to support your users
and workloads.
6. How does AWS Glue work?
You can use the Data Catalog to quickly discover and search multiple AWS datasets
without moving the data. Once the data is cataloged, it is immediately available for
search and query using Amazon Athena, Amazon EMR, and Amazon Redshift
Spectrum.
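As a rough illustration of reading the Data Catalog programmatically, the sketch
below summarizes the response of the boto3 `get_tables` call. The database and table
names are hypothetical, and the sample response is hand-written to match the
documented response shape rather than captured from a real account.

```python
def summarize_tables(response):
    """Map each table name to its list of (column name, column type) pairs."""
    summary = {}
    for table in response.get("TableList", []):
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        summary[table["Name"]] = [(c["Name"], c.get("Type", "unknown")) for c in columns]
    return summary


def list_catalog_tables(database_name):
    """Fetch the first page of tables in a database; needs boto3 and credentials."""
    import boto3  # imported lazily so summarize_tables works without it

    glue = boto3.client("glue")
    return summarize_tables(glue.get_tables(DatabaseName=database_name))


# Hand-written stand-in for a get_tables response (hypothetical table).
sample_response = {
    "TableList": [{
        "Name": "usage_logs",
        "StorageDescriptor": {"Columns": [
            {"Name": "ts", "Type": "timestamp"},
            {"Name": "bytes", "Type": "bigint"},
        ]},
    }]
}
print(summarize_tables(sample_response))
```

Only metadata is read here; the underlying data stays where it is.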
7. How does AWS Glue work?
AWS Glue Studio makes it easier to visually create, run, and monitor AWS Glue ETL
jobs. You can build ETL jobs that move and transform data using a drag-and-drop
editor, and AWS Glue automatically generates the code.
8. Use cases where AWS Glue fits
Simplify ETL pipeline development
Remove infrastructure management with automatic provisioning and worker
management, and consolidate all your data integration needs into a single service.
Discover data efficiently
Quickly identify data across multiple AWS datasets, and then make it instantly
available for querying and transforming.
Interactively explore, experiment on, and process data
Using AWS Glue interactive sessions, data engineers can interactively explore and
prepare data using the integrated development environment (IDE) or notebook of
their choice.
Support various processing frameworks and workloads
More easily support various data processing frameworks, such as ETL and ELT, and
various workloads, including batch, micro-batch, and streaming.
9. What is AWS Glue Studio?
AWS Glue Studio is a graphical interface that makes it easy to create, run, and
monitor data integration jobs in AWS Glue. You can visually compose data
transformation workflows and seamlessly run them on the Apache Spark–based
serverless ETL engine in AWS Glue.
With AWS Glue Studio, you can create and manage jobs that gather, transform, and
clean data. You can also use AWS Glue Studio to troubleshoot and edit job scripts.
AWS Glue features
AWS Glue features fall into three major categories:
•Discover and organize data
•Transform, prepare, and clean data for analysis
•Build and monitor data pipelines
10. How to access AWS Glue
You can create, view, and manage your AWS Glue jobs using the following interfaces:
•AWS Glue console – Provides a web interface for you to create, view, and manage
your AWS Glue jobs.
•AWS Glue Studio – Provides a graphical interface for you to create and edit your
AWS Glue jobs visually.
•AWS Glue section of the AWS CLI Reference – Provides AWS CLI commands that
you can use with AWS Glue.
•AWS Glue API – Provides a complete API reference for developers.
11. AWS Glue console
You use the AWS Glue console to define and orchestrate your ETL workflow. The
console calls several API operations in the AWS Glue Data Catalog and AWS Glue
Jobs system to perform the following tasks:
•Define AWS Glue objects such as jobs, tables, crawlers, and connections.
•Schedule when crawlers run.
•Define events or schedules for job triggers.
•Search and filter lists of AWS Glue objects.
•Edit transformation scripts.
12. AWS Glue Studio
AWS Glue Studio is a new graphical interface that makes it easy to create, run, and
monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually
compose data transformation workflows and seamlessly run them on AWS Glue’s
Apache Spark-based serverless ETL engine.
13. Streaming ETL in AWS Glue
AWS Glue enables you to perform ETL operations on streaming data using
continuously running jobs. AWS Glue streaming ETL is built on the Apache Spark
Structured Streaming engine, and can ingest streams from Amazon Kinesis Data
Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka
(Amazon MSK). Streaming ETL can clean and transform streaming data and load it
into Amazon S3 or JDBC data stores. Use Streaming ETL in AWS Glue to process
event data like IoT streams, clickstreams, and network logs.
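A streaming job of this kind typically handles incoming records in small batches.
The sketch below is a plain-Python illustration of that micro-batch idea only; it is
not Glue or Spark code, and the event format (timestamp in seconds plus a payload)
and the function name are assumptions.

```python
def micro_batches(events, window_seconds):
    """Group (timestamp_seconds, payload) events into consecutive time windows.

    Events are assumed to arrive ordered by timestamp. Yields
    (window_start, payloads) for every window that contains data.
    """
    current_window, batch = None, []
    for ts, payload in events:
        window = ts - ts % window_seconds  # start of the window this event falls in
        if window != current_window and batch:
            yield current_window, batch
            batch = []
        current_window = window
        batch.append(payload)
    if batch:
        yield current_window, batch


clicks = [(0, "home"), (3, "search"), (12, "checkout")]
for start, pages in micro_batches(clicks, window_seconds=5):
    print(start, pages)
```

Each yielded batch could then be cleaned and transformed like a small batch ETL run.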
14. The AWS Glue jobs system
The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL
workflow. You can create jobs in AWS Glue that automate the scripts you use to
extract, transform, and transfer data to different locations. Jobs can be scheduled and
chained, or they can be triggered by events such as the arrival of new data.
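Scheduling and chaining can be sketched as request payloads for the boto3
`create_trigger` call. The trigger names, job names, and cron schedule below are
hypothetical; the parameter names follow the AWS Glue API, but treat the overall
setup as an assumption about how you might organize your own jobs.

```python
def scheduled_trigger(name, job_name, cron_expression):
    """Request that starts a job on a time-based schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name, upstream_job, downstream_job):
    """Request that starts downstream_job once upstream_job has succeeded."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {"Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": upstream_job,
            "State": "SUCCEEDED",
        }]},
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


def register(glue_client, request):
    """Send a trigger request; glue_client is expected to be boto3.client('glue')."""
    return glue_client.create_trigger(**request)


# Hypothetical names: run "extract-job" nightly, then "aggregate-job" after it.
print(scheduled_trigger("nightly-extract", "extract-job", "0 2 * * ? *"))
print(chained_trigger("after-extract", "extract-job", "aggregate-job"))
```

Building the payloads as plain dictionaries keeps them easy to review before any
call is made against a real account.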
15. Serverless ETL jobs run in isolation
AWS Glue runs your ETL jobs in an Apache Spark serverless environment. AWS
Glue runs these jobs on virtual resources that it provisions and manages in its own
service account.
AWS Glue is designed to do the following:
•Segregate customer data.
•Protect customer data in transit and at rest.
•Access customer data only as needed in response to customer requests, using
temporary, scoped-down credentials, or with a customer's consent to IAM roles in
their account.
16. Data sources and destinations
AWS Glue allows you to read and write data from multiple systems and databases
including:
•Amazon S3
•Amazon DynamoDB
•Amazon Redshift
•Amazon Relational Database Service (Amazon RDS)
•Third-party JDBC-accessible databases
•MongoDB and Amazon DocumentDB (with MongoDB compatibility)
•Other marketplace connectors and Apache Spark plugins
Data streams
AWS Glue can stream data from the following systems:
•Amazon Kinesis Data Streams
•Apache Kafka
AWS Glue is available in several AWS Regions.
17. Components of AWS Glue
•Data catalog: The data catalog holds the metadata and the structure of the data.
•Database: A database is a logical grouping of table definitions in the data catalog
for your sources and targets.
•Table: A table is the metadata definition that describes the data in a source or
target. A database can contain one or more tables.
•Crawler and Classifier: A crawler scans a data source and uses built-in or custom
classifiers to determine the format and schema of the data. It then creates or
updates metadata tables in the data catalog.
•Job: A job is the business logic that carries out an ETL task. This logic is written
as a Python or Scala script that runs on Apache Spark.
•Trigger: A trigger starts the ETL job execution on demand, at a specific time, or
in response to an event.
•Development endpoint: It creates a development environment where the ETL job
script can be tested, developed, and debugged.
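To give a feel for what a crawler and its classifiers produce, the sketch below
infers a simple schema from sample records. It is a deliberately simplified stand-in
and not the algorithm AWS Glue uses; the type names, the widening rule, and the
sample fields are assumptions for illustration.

```python
def infer_type(value):
    """Map a Python value to a catalog-style type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(records):
    """Infer one type per field, widening when records disagree."""
    schema = {}
    for record in records:
        for field, value in record.items():
            seen = infer_type(value)
            previous = schema.setdefault(field, seen)
            if previous != seen:
                # Mixed integers and floats widen to double; anything else to string.
                numeric = {previous, seen} == {"bigint", "double"}
                schema[field] = "double" if numeric else "string"
    return schema


sample_records = [
    {"user": "a", "bytes": 120, "latency": 0.5},
    {"user": "b", "bytes": 98.5, "latency": 1},
]
print(infer_schema(sample_records))
```

The resulting field-to-type mapping corresponds to the kind of table metadata that
ends up in the data catalog.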
18. Summary
AWS Glue makes it easy to integrate data across your architecture. It integrates with
AWS analytics services and Amazon S3 data lakes. AWS Glue has integration
interfaces and job-authoring tools that are easy to use for all users, from developers
to business users, with tailored solutions for varied technical skill sets.
With the ability to scale on demand, AWS Glue helps you focus on high-value
activities that maximize the value of your data. It scales for any data size, and
supports all data types and schema variances. To increase agility and optimize costs,
AWS Glue provides built-in high availability and pay-as-you-go billing.
AWS Glue consolidates major data integration capabilities into a single service.
These include data discovery, modern ETL, cleansing, transforming, and centralized
cataloging. It's also serverless, which means there's no infrastructure to manage.
With flexible support for workloads such as ETL, ELT, and streaming in one service,
AWS Glue serves many types of users.