AWS Data Glue.pptx

  1. AWS Data Glue
  2. What is AWS Glue?
     ● AWS Glue is a serverless data integration service that helps you discover, prepare, and combine data for analytics, machine learning (ML), and application development.
     ● Users can find and access data through the AWS Glue Data Catalog. Data engineers can extract, transform, and load (ETL) data. Developers can visually create, run, and monitor ETL workflows in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code.
  3. What problem does AWS Glue solve?
     1. Taking care of provisioning and managing the resources, such as servers, storage, and runtime environments, that are required to run your ETL operations. When an AWS Glue ETL job is started, the service allocates capacity from its warm pool of resources to run the workload.
     2. Avoiding tasks such as installing, patching, or updating ETL software, because AWS Glue is a fully managed service. Such tasks can be time-consuming and exhaust resources.
     3. Generating code, based on your job configuration, to transform your data from source to target. You can also provide your own scripts in the AWS Glue console or API to process your data.
     4. Including both visual and code-based interfaces to help make data preparation and movement fast and cost-optimized.
  4. AWS Glue streamlines many tasks by:
     ● Taking care of provisioning and managing resources
     ● Avoiding tasks such as installing, patching, or updating ETL software
     ● Generating code based on your job configuration
     ● Including both visual and code-based interfaces
  5. AWS Glue component descriptions
     Data stores – AWS Glue can connect to many types of data stores, both on AWS and outside of AWS. It is common to use an AWS Glue crawler to automatically discover data across many different data stores and catalog it in the Data Catalog.
     Data sources and targets – AWS Glue can read from and write to Amazon Simple Storage Service (Amazon S3) or databases on AWS or on premises. It can also use a JDBC connection for your data sources and targets. For a comprehensive list of supported connections, refer to the Resources section at the end of the course.
     Data Catalog – The Data Catalog is a persistent metadata store. You can use this managed service to store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. The Data Catalog holds all of the information about your tables and table schemas, so you can quickly view information about your tables. Examples include the physical location where the data is stored and table properties such as file type, compression type, record size, and record count.
     AWS Glue crawler – Using AWS Glue, you can also set up crawlers to scan data in all kinds of repositories, classify it, extract schema information, and store the metadata automatically in the Data Catalog. The Data Catalog can then be used to guide ETL operations. Some examples of using a crawler include:
     ● Crawling clickstream data landing in Amazon S3 on an hourly schedule for schema validation
     ● Crawling database files that were migrated to Amazon S3 for further consumption by analytical engines like Athena or Amazon Redshift; as the crawler runs over time, it can automatically add new partitions that it discovers in Amazon S3
     ● Crawling an on-premises database or a database running in AWS to perform ETL jobs against it
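The hourly clickstream crawl described above can be sketched as a request for the boto3 Glue client's `create_crawler` call. This is a minimal sketch rather than a definitive setup: the crawler name, role ARN, database name, and bucket path are hypothetical placeholders, and the schema change policy is just one reasonable choice.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str, role_arn: str, database: str, s3_path: str
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        # Data Catalog database that receives the discovered tables.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # AWS cron fields: minute hour day-of-month month day-of-week year.
        # This expression fires at minute 0 of every hour.
        "Schedule": "cron(0 * * * ? *)",
        # Assumed policy: update changed schemas in place, only log deletions.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="clickstream-hourly",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="clickstream_db",
    s3_path="s3://example-bucket/clickstream/",
)
print(request["Schedule"])
```

With AWS credentials configured, the request could then be submitted with `boto3.client("glue").create_crawler(**request)`. Building the arguments separately keeps the sketch runnable without an AWS account.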
  6. AWS Glue component descriptions (continued)
     AWS Glue Jobs – The AWS Glue Jobs system provides managed infrastructure to orchestrate ETL workflows. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be activated by events such as the arrival of new data. Examples of using an AWS Glue job for ETL include:
     ● Raw source data arrives as uncompressed CSV files. To reduce cost and improve query performance, you can use AWS Glue ETL to transform those files into Apache Parquet format and compress them with Snappy.
     ● You can use AWS Glue ETL to preaggregate data to speed up analytical queries.
     ● AWS Glue comes with built-in transformations to help flatten JSON, join datasets, map fields to data types, and much more. You can use these transformations in AWS Glue Studio, which offers a graphical user interface (GUI) and generates the code for you.
     AWS Glue DataBrew – Using this visual data preparation tool, you can clean, enrich, format, and normalize your datasets with over 250 built-in transformations. You can create a “recipe” for a dataset using the transformations of your choice, and then reuse that recipe repeatedly as your business continues to collect new data.
     AWS Glue Streaming ETL – You can consume real-time data from either an Amazon Kinesis data stream or an Amazon Managed Streaming for Apache Kafka stream. Use the Data Catalog to register the stream as a source. As data comes in from the stream, you can apply the same AWS Glue ETL transformations. Then, you can output your transformed data to an S3 bucket or another JDBC-compatible target, such as the Amazon Redshift data warehouse.
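As a rough illustration of how jobs can be scheduled and chained, the sketch below builds two requests for the boto3 Glue client's `create_trigger` call: one that runs a job nightly, and one that starts a second job once the first succeeds. The trigger and job names are hypothetical, and activation by data arrival (event triggers) is not covered here.

```python
from typing import Any, Dict


def scheduled_trigger(name: str, job_name: str, cron: str) -> Dict[str, Any]:
    """Arguments for create_trigger that start job_name on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name: str, upstream_job: str, downstream_job: str) -> Dict[str, Any]:
    """Arguments for create_trigger that start downstream_job after upstream_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical job names: convert CSV to Parquet at 02:00 UTC, then preaggregate.
nightly = scheduled_trigger("nightly-convert", "csv-to-parquet", "0 2 * * ? *")
chained = chained_trigger("after-convert", "csv-to-parquet", "preaggregate-sales")
print(nightly["Schedule"], "->", chained["Actions"][0]["JobName"])
```

Each dictionary could be passed as `boto3.client("glue").create_trigger(**nightly)` once the referenced jobs exist.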
  7. What are typical use cases for AWS Glue?
     ETL pipeline building – If you want to extract data, transform it, and load or store it, then AWS Glue ETL is a good fit. You can write your own ETL scripts in Python or Scala, or auto-generate them. For example, in Section 2, the architecture diagram shows the many ways that you can crawl and catalog your different data stores. After that, you can run different types of AWS Glue ETL jobs against your data. Depending on your use case, you might use an AWS Glue job for real-time data. You might also use AWS Glue to run a nightly batch job for a different data source.
  8. Data preparation and data profiling without coding – Business analysts, data scientists, and data engineers who want to analyze and format raw data can consider using DataBrew. With DataBrew, you can clean and normalize data without writing any code. You can evaluate your raw data by profiling it and identifying patterns to detect anomalies. You can choose from over 250 prebuilt transformations in DataBrew to automate data preparation tasks, such as filtering anomalies, standardizing formats, and correcting invalid values. After the data is prepared, you can immediately use it for analytics and ML. For more information about DataBrew.
  9. Quick job orchestration and visualization with drag and drop – AWS Glue Studio is a great option if you want to go fast and not write all of your ETL processes out by hand. You can use it to quickly create, run, and monitor AWS Glue ETL jobs. The drag-and-drop editor means that you can visualize and create ETL jobs, and then AWS Glue will generate the code for you. AWS Glue Studio also has built-in monitoring and dashboarding features to help scale and manage hundreds or thousands of jobs.
  10. Real-time data processing – Batch processing is good when processed data is used or visualized at certain times of the day, such as for a daily sales summary. However, sometimes you want to process and use data when it becomes available. Examples include user login patterns, social media data, network logs, and clickstream data, which are logs of how users navigate through a website. In these cases, AWS Glue Streaming ETL can process data in real time. You can create streaming ETL jobs that run continuously and consume data from streaming sources. These ETL jobs process data in real time, and then load the data into Amazon S3 or JDBC data stores.
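At the API level, a streaming ETL job is defined much like a batch job. The sketch below assumes the boto3 Glue client's `create_job` call, where the command name `gluestreaming` marks a job as streaming and `glueetl` marks a batch Spark job. The job names, role ARN, script locations, Glue version, and worker sizing are all hypothetical choices for illustration.

```python
from typing import Any, Dict


def build_job_definition(
    name: str, role_arn: str, script_location: str, streaming: bool = False
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 Glue client's create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            # "gluestreaming" runs continuously; "glueetl" is a batch Spark job.
            "Name": "gluestreaming" if streaming else "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # Assumed version and capacity; adjust for the actual workload.
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
    }


# Hypothetical placeholders throughout.
role = "arn:aws:iam::123456789012:role/ExampleGlueJobRole"
streaming_job = build_job_definition(
    "clickstream-streaming",
    role,
    "s3://example-bucket/scripts/clickstream_stream.py",
    streaming=True,
)
batch_job = build_job_definition(
    "daily-sales-summary", role, "s3://example-bucket/scripts/daily_sales.py"
)
print(streaming_job["Command"]["Name"], batch_job["Command"]["Name"])
```

Either definition could be submitted with `boto3.client("glue").create_job(**streaming_job)`. The script at `ScriptLocation` would hold the actual transformation logic, which is outside the scope of this sketch.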
  11. AWS Glue is not a database and does not store your data itself. It works best with structured and semi-structured data held in external stores, such as Amazon S3 or relational databases reached over JDBC. Hence, you need a separate storage system, such as a SQL database, to implement AWS Glue successfully.
