
AWS re:Invent 2016: Workshop: Building Your First Big Data Application with AWS (BDM202)

Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS? Join us in this workshop as we build a big data application in real time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS, and give you access to a take-home lab so that you can rebuild and customize the application yourself.


  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rahul Bhartia, Principal Solutions Architect, AWS Ryan Nienhuis, Senior Product Manager, AWS Tony Gibbs, Solutions Architect, AWS Radhika Ravirala, Solutions Architect, AWS November 29, 2016 BDM202 Workshop Building Your First Big Data Application on AWS
  2. 2. Objectives for Today Build an end-to-end big data application that captures, stores, processes, and analyzes web logs. At the end of the workshop, you will: 1. Understand how a common scenario is implemented using AWS big data tools 2. Know how to use 1. Amazon Kinesis for real-time data 2. Amazon Redshift for data warehousing 3. Amazon EMR for data processing 4. Amazon QuickSight for data visualization
  3. 3. Your Big Data Application Architecture Amazon Kinesis Producer UI Amazon Kinesis Firehose Amazon Kinesis Analytics Amazon Kinesis Firehose Amazon EMR Amazon Redshift Amazon QuickSight Generate web logs Collect web logs and deliver to S3 Process & compute aggregate metrics Deliver processed web logs to Amazon Redshift Raw web logs from Firehose Run SQL queries on processed web logs Visualize web logs to discover insights Amazon S3 Bucket Ad hoc analysis of web logs
  4. 4. Getting started
  5. 5. What is qwikLABS? Access to AWS services for this bootcamp No need to provide a credit card Automatically deleted when you’re finished http://events-aws.qwiklab.com Create an account with the same email that you used to register for this bootcamp
  6. 6. Sign in and start course Once the course is started, you will see “Create in Progress” in the upper right corner.
  7. 7. Navigating qwikLABS Connect tab: Access and login information Addl Info tab: Links to interfaces Lab Instructions tab: Scripts for your labs
  8. 8. Everything you need for the lab Open the AWS Management Console, log in, and verify that you have: • Two Amazon Kinesis Firehose delivery streams • One Amazon EMR cluster • One Amazon Redshift cluster Sign up for: • Amazon QuickSight
  9. 9. Real-time data with Amazon Kinesis
  10. 10. Amazon Kinesis: Streaming Data Made Easy Services that make it easy to capture, deliver, and process streams on AWS: Amazon Kinesis Streams, Amazon Kinesis Analytics, Amazon Kinesis Firehose
  11. 11. Amazon Kinesis Streams • Easy administration • Build real time applications with framework of choice • Low cost
  12. 12. Amazon Kinesis Firehose • Zero administration • Direct-to-data store integration • Seamless elasticity
  13. 13. Amazon Kinesis - Firehose vs. Streams Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks. Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service, and a data latency of 60 seconds or higher.
  14. 14. Activity 1 Collect web logs using a Firehose delivery stream
  15. 15. Capture data using a Firehose delivery stream Time: 5 minutes We are going to: A. Use a simple producer application that writes web logs into a Firehose delivery stream configured to write data into Amazon S3. 75.35.230.210 - - [20/Jul/2009:22:22:42 -0700] "GET /images/pigtrihawk.jpg " 200 29236 "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)"
  16. 16. Activity 1A: Writing to a Firehose delivery stream To simplify this, we are going to use a simple tool for creating sample data and writing it to an Amazon Kinesis Firehose delivery stream. 1. Go to the simple data producer UI http://bit.ly/kinesis-producer 2. Copy and paste your credentials into the form. The credentials do not leave your client (locally hosted site). Additionally, the AWS JavaScript SDK is implemented in the site, which uses HTTPS. In normal situations, you do not want to use your credentials in this way unless you trust the provider. 3. Specify the region (US West (Oregon)) and the delivery stream name (qls-somerandomnumber-FirehoseDeliveryStream-somerandom-11111111111). 4. Specify a data rate between 100 and 500; pick a unique number. This will control the rate at which records are sent to the stream.
  17. 17. Activity 1A: Writing to a Firehose delivery stream 5. Copy and paste this into the text editor {{internet.ip}} - - [{{date.now("DD/MMM/YYYY:HH:mm:ss Z")}}] "{{random.weightedArrayElement({"weights":[0.6,0.1,0.1,0.2],"data":["GET","POST","DELETE","PUT"]})}} {{random.arrayElement(["/list","/wp-content","/wp-admin","/explore","/search/tag/list","/app/main/posts","/posts/posts/explore"])}}" {{random.weightedArrayElement({"weights": [0.9,0.04,0.02,0.04], "data":["200","404","500","301"]})}} {{random.number(10000)}} "-" "{{internet.userAgent}}" Make sure there is a new line after your record when you copy and paste. The tool generates records based on the above format; it creates hundreds of these records and sends them in a single batched call (PutRecordBatch). Finally, the tool will drain your battery, so if you don't have a power cord, keep it running only while we are completing the activities.
  18. 18. Activity 1A: Writing to a Firehose delivery stream
  19. 19. Review: Monitoring a Delivery Stream Go to the Amazon CloudWatch Metrics Console and search “IncomingRecords”. Select this metric for your Firehose delivery stream and choose a 1 Minute SUM. What are the most important metrics to monitor?
  20. 20. Activity 2 Real-time processing using Amazon Kinesis Analytics
  21. 21. Amazon Kinesis Analytics • Apply SQL on streams • Build real time, stream processing applications • Easy scalability
  22. 22. Use SQL to build real-time applications Easily write SQL code to process streaming data Connect to streaming source Continuously deliver SQL results
  23. 23. Process Data using Amazon Kinesis Analytics Time: 20 minutes We are going to: A. Create an Amazon Kinesis Analytics application that reads from the delivery stream with log data B. Write a SQL query to parse and transform the log C. Add another query to compute an aggregate metric for an interesting statistic on the incoming data
  24. 24. Activity 2A: Create an Amazon Kinesis Analytics Application 1. Go to the Amazon Kinesis Analytics Console 1. https://us-west-2.console.aws.amazon.com/kinesisanalytics/home?region=us-west-2 2. Click “Create new application” and provide a name 3. Click “Connect to a source” and select the stream you created in the previous exercise (qls-randomnumber-FirehoseDeliveryStream-randomnumber) 4. Amazon Kinesis Analytics will discover the schema for you if your data is UTF-8 encoded CSV or JSON data. In this case, we have a custom log format, so discovery incorrectly assumes a delimiter
  25. 25. Activity 2A: Create an Amazon Kinesis Analytics Application Amazon Kinesis Analytics comes with powerful string manipulation and log parsing functions. So we are going to define our own schema to handle this format and then parse it in SQL code. 5. Click “Edit schema” 6. Verify that the format is CSV, the row delimiter is '\n', and change the column delimiter to '\t' 7. Remove all the existing columns, and create a single column named DATA of type VARCHAR(5000). 8. Click “Save schema and update stream samples”. This will run your application (takes about a minute).
  26. 26. Activity 2A: Create an Amazon Kinesis Analytics Application There are two system columns added to each record. • The ROWTIME represents the time the application read the record, and is a special column used for time series analytics. This is also known as the process time. • The APPROXIMATE_ARRIVAL_TIME is the time the delivery stream received the record. This is also known as ingest time.
  27. 27. Activity 2A: Create an Amazon Kinesis Analytics Application 10. Click “Exit (done)” on Schema editing page. 11. Click “Cancel” on Source configuration page. 12. Click “Go To SQL Editor”. You now have a running stream processing application!
  28. 28. Activity 2B: Writing SQL over Streaming Data Writing SQL over streaming data using Amazon Kinesis Analytics follows a two-part model: 1. Create an in-application stream for storing intermediate SQL results. An in-application stream is like a SQL table, but is continuously updated. 2. Create a PUMP which will continuously read FROM one in-application stream and INSERT INTO a target in-application stream (Diagram: SOURCE_SQL_STREAM_001 (source stream) → TRANSFORM_PUMP → DESTINATION_SQL_STREAM (Part 1, Activity 2-B) → AGGREGATE_PUMP → AGGREGATE_STREAM (Part 2, Activity 2-C); results are then sent to Redshift)
  29. 29. Activity 2B: Add your first query Our first query will use a special function called W3C_LOG_PARSE to separate the log into specific columns. Remember, this is a two-part process: 1. We will create the in-application stream to store intermediate SQL results. 2. We will create a pump that selects data from the source stream and inserts into the target in-application stream. Your application code is replaced each time you click “Save and Run”. As you add queries, append them to the bottom of the SQL editor. Note: You can download all of the application code from qwikLABS. Click on the Lab Instructions tab in qwikLABS and then download the Kinesis Analytics SQL file.
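The exact application code is in the Kinesis Analytics SQL file you download from qwikLABS; the sketch below only illustrates the shape of part 1, following the W3C_LOG_PARSE pattern from the AWS documentation. The stream name, column names and sizes, and the 'COMMON LOG FORMAT' literal are illustrative assumptions, not the lab's actual code.

    -- Part 1 (sketch): an in-application stream plus a pump that parses the raw log line.
    -- Names, column sizes, and the log-format literal are illustrative; use the SQL file
    -- from qwikLABS for the actual lab code.
    CREATE OR REPLACE STREAM "TRANSFORMED_STREAM" (
        host_address  VARCHAR(32),
        request_time  VARCHAR(32),
        request       VARCHAR(1024),
        response_code VARCHAR(8),
        response_size VARCHAR(16));

    CREATE OR REPLACE PUMP "TRANSFORM_PUMP" AS
    INSERT INTO "TRANSFORMED_STREAM"
    SELECT STREAM l.r.COLUMN1, l.r.COLUMN4, l.r.COLUMN5, l.r.COLUMN6, l.r.COLUMN7
    FROM (SELECT STREAM W3C_LOG_PARSE("DATA", 'COMMON LOG FORMAT') AS r
          FROM "SOURCE_SQL_STREAM_001") AS l;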
  30. 30. Activity 2C: Calculate an aggregate metric Calculate a count using a tumbling window and a GROUP BY clause. A tumbling window is similar to a periodic report, where you specify your query and a time range, and results are emitted at the end of the range. (Example: COUNT the number of items by key every 10 seconds)
  31. 31. Activity 2C: Calculate an aggregate metric The window is defined by the following expression in the GROUP BY clause of the SELECT statement. Note that the ROWTIME column is implicitly included in every stream query, and represents the processing time of the application. FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO SECOND) (DO NOT INSERT THIS INTO YOUR SQL EDITOR) This is known as a tumbling window.
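Part 2 then reads from the transformed stream and emits a per-second count. A minimal sketch, again with illustrative stream and column names (the downloadable SQL file is authoritative):

    -- Part 2 (sketch): a per-second tumbling-window count, grouped on the
    -- FLOOR(... TO SECOND) expression shown above. Names are illustrative.
    CREATE OR REPLACE STREAM "AGGREGATE_STREAM" (
        request       VARCHAR(1024),
        request_count INTEGER);

    CREATE OR REPLACE PUMP "AGGREGATE_PUMP" AS
    INSERT INTO "AGGREGATE_STREAM"
    SELECT STREAM request, COUNT(*) AS request_count
    FROM "TRANSFORMED_STREAM"
    GROUP BY request, FLOOR("TRANSFORMED_STREAM".ROWTIME TO SECOND);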
  32. 32. Review – In-Application SQL streams Your application has multiple in-application SQL streams, including TRANSFORMED_STREAM and AGGREGATE_STREAM. These in-application streams are like SQL tables that are continuously updated. What else is unique about an in-application stream aside from its continuous nature?
  33. 33. Data Warehousing with Amazon Redshift
  34. 34. Columnar MPP OLAP Amazon Redshift
  35. 35. Amazon Redshift Cluster Architecture Massively parallel, shared nothing Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, backup, restore (Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC; the leader node coordinates compute nodes, each with 128GB RAM, 16TB disk, and 16 cores, over 10 GigE (HPC); ingestion, backup, and restore run against S3 / EMR / DynamoDB / SSH)
  36. 36. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 • Accessing dt with row storage: – Need to read everything – Unnecessary I/O aid loc dt CREATE TABLE loft_deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date );
  37. 37. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 • Accessing dt with columnar storage: – Only scan blocks for relevant column aid loc dt CREATE TABLE loft_deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date );
  38. 38. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 • Columns grow and shrink independently • Effective compression ratios due to like data • Reduces storage requirements • Reduces I/O aid loc dt CREATE TABLE loft_deep_dive ( aid INT ENCODE LZO ,loc CHAR(3) ENCODE BYTEDICT ,dt DATE ENCODE RUNLENGTH );
  39. 39. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 aid loc dt CREATE TABLE loft_deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ); • In-memory block metadata • Contains per-block MIN and MAX value • Effectively prunes blocks which cannot contain data for a given query • Eliminates unnecessary I/O
  40. 40. SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013' MIN: 01-JUNE-2013 MAX: 20-JUNE-2013 MIN: 08-JUNE-2013 MAX: 30-JUNE-2013 MIN: 12-JUNE-2013 MAX: 20-JUNE-2013 MIN: 02-JUNE-2013 MAX: 25-JUNE-2013 Unsorted Table MIN: 01-JUNE-2013 MAX: 06-JUNE-2013 MIN: 07-JUNE-2013 MAX: 12-JUNE-2013 MIN: 13-JUNE-2013 MAX: 18-JUNE-2013 MIN: 19-JUNE-2013 MAX: 24-JUNE-2013 Sorted By Date Zone Maps
  41. 41. Activity 4 Deliver streaming results to Amazon Redshift
  42. 42. Activity 4: Deliver data to Amazon Redshift using Firehose Time: 5 minutes We are going to: A. Connect to an Amazon Redshift cluster and create a table to hold weblogs data. B. Update the Amazon Kinesis Analytics application to send data to Amazon Redshift, via the Firehose delivery stream.
  43. 43. Activity 4A: Connect to Amazon Redshift You can connect with pgweb • Installed and configured for the cluster • Just navigate to pgWeb and start interacting Note: Click on the Addl. Info tab in qwikLABS and then open the pgWeb link in a new window. Or, Use any JDBC/ODBC/libpq client • Aginity Workbench for Amazon Redshift • SQL Workbench/J • DBeaver • Datagrip
  44. 44. Activity 4B: Create table in Amazon Redshift Create a table weblogs to capture the incoming data from the Firehose delivery stream: CREATE TABLE weblogs ( row_time timestamp encode raw, host_address varchar(512) encode lzo, request_time timestamp encode raw, request_method varchar(5) encode lzo, request_path varchar(1024) encode lzo, request_protocol varchar(10) encode lzo, response_code int encode delta, response_size int encode delta, referrer_host varchar(1024) encode lzo, user_agent varchar(512) encode lzo ) DISTSTYLE EVEN SORTKEY(request_time); Note: You can download all of the application code from qwikLABS. Click on the Lab Instructions tab in qwikLABS and then download the Amazon Redshift SQL file.
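The encodings above are hand-picked for this lab. If you want a second opinion once some rows have landed in the table, Amazon Redshift can recommend encodings from a sample of the loaded data:

    -- Suggest column encodings based on a sample of rows already in the table
    ANALYZE COMPRESSION weblogs;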
  45. 45. Activity 4C: Deliver Data to Amazon Redshift using Firehose Update Amazon Kinesis Analytics application to send data to a Firehose delivery stream. Firehose delivers the streaming data to Amazon Redshift. 1. Go to the Amazon Kinesis Analytics console 2. Select your application and choose Application details 3. Choose Connect to a destination
  46. 46. Activity 4C: Deliver Data to Amazon Redshift using Firehose 4. Configure your destination 1. Choose the Firehose “qls-xxxxxxx-RedshiftDeliveryStream-xxxxxxxx” delivery stream. 2. Choose CSV as the “Output format” 3. Choose “Create/update <app name> role” 4. Click “Save and continue” 5. It will take about 1 – 2 minutes for everything to be updated and for data to start appearing in Amazon Redshift.
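Under the hood, a Redshift delivery stream stages the records in Amazon S3 and then issues a COPY into the weblogs table. A rough sketch of what that COPY looks like is below; the bucket, prefix, and IAM role ARN are placeholders, and the actual command is part of the delivery stream's configuration:

    -- Illustrative only: bucket, prefix, and role ARN are placeholders for the values
    -- configured on the qls-...-RedshiftDeliveryStream delivery stream.
    COPY weblogs
    FROM 's3://<intermediate-bucket>/<prefix>'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::<account-id>:role/<firehose-delivery-role>'
    FORMAT AS CSV
    TIMEFORMAT 'auto';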
  47. 47. Review: Amazon Redshift Test Queries • Find the distribution of response codes over days SELECT TRUNC(request_time), response_code, COUNT(1) FROM weblogs GROUP BY 1,2 ORDER BY 1,3 DESC; • Count the 404 (PAGE NOT FOUND) responses SELECT COUNT(1) FROM weblogs WHERE response_code = 404; • Show the most requested path among the PAGE NOT FOUND responses SELECT TOP 1 request_path, COUNT(1) FROM weblogs WHERE response_code = 404 GROUP BY 1 ORDER BY 2 DESC;
  48. 48. Data processing with Amazon EMR
  49. 49. Why Amazon EMR? Easy to Use Launch a cluster in minutes Low Cost Pay an hourly rate Elastic Easily add or remove capacity Reliable Spend less time monitoring Secure Manage firewalls Flexible Control the cluster
  50. 50. The Hadoop ecosystem can run in Amazon EMR
  51. 51. Spot Instances for task nodes Up to 90% off Amazon EC2 on-demand pricing On-Demand for core nodes Standard Amazon EC2 pricing for on-demand capacity Easy to use Spot Instances Meet SLA at predictable cost Exceed SLA at lower cost
  52. 52. Amazon S3 as your persistent data store Separate compute and storage Resize and shut down Amazon EMR clusters with no data loss Point multiple Amazon EMR clusters at same data in Amazon S3 EMR EMR Amazon S3
  53. 53. EMRFS makes it easier to leverage S3 Better performance and error handling options Transparent to applications – Use “s3://” Consistent view • For consistent list and read-after-write for new puts Support for Amazon S3 server-side and client-side encryption Faster listing using EMRFS metadata
  54. 54. Apache Spark • Fast, general-purpose engine for large-scale data processing • Write applications quickly in Java, Scala, or Python • Combine SQL, streaming, and complex analytics
  55. 55. Apache Zeppelin • Web-based notebook for interactive analytics • Multiple language back end • Apache Spark integration • Data visualization • Collaboration https://zeppelin.incubator.apache.org/
  56. 56. Activity 5 Ad hoc analysis using Amazon EMR
  57. 57. Activity 5: Process and query data using Amazon EMR Time: 20 minutes We are going to: A. Use a Zeppelin Notebook to interact with the Amazon EMR cluster B. Process the data delivered to Amazon S3 by Firehose using Apache Spark C. Query the data processed in the earlier stage and create simple charts
  58. 58. Activity 5A: Open the Zeppelin interface 1. Click on the Lab Instructions tab in qwikLABS and then download the Zeppelin Notebook 2. Click on the Addl. Info tab in qwikLABS and then open the Zeppelin link in a new window. 3. Import the Notebook using the Import Note link on the Zeppelin interface
  59. 59. Using Zeppelin interface Run the Notebook Run the paragraph
  60. 60. Activity 5B: Run the notebook Enter the S3 bucket name created by qwikLabs in the Notebook. • Execute Step #1 (logs-##########-us-west-2) • Execute Step #2 to create an RDD from the dataset delivered by Firehose • Execute Step #3 to print the first row of the output • Notice the broken columns. We’re going to fix that next.
  61. 61. Activity 5B: Run the notebook • Execute Step #4 to process the data • Combine fields: “A B” → A B C • Define a schema for the data • Map the RDD to a data frame • Execute Step #5 to print the weblogsDF • See how different it is from the previous output
  62. 62. Activity 5B: Run the notebook • Execute Step #6 to register the data frame as a temporary table • Now you can run SQL queries on the temporary tables. • Execute the next 3 steps and observe the charts created. • What did you learn about the dataset?
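The charts in those steps are driven by Spark SQL over the registered temporary table. If you want to experiment beyond the notebook's own paragraphs, a %sql paragraph along these lines works; the table name weblogs is an assumption and should match whatever name Step #6 registered:

    -- Extra %sql paragraph (hypothetical): replace weblogs with the temporary table
    -- name registered in Step #6 of the notebook.
    SELECT response_code, COUNT(*) AS requests
    FROM weblogs
    GROUP BY response_code
    ORDER BY requests DESC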
  63. 63. Review: Ad hoc analysis using Amazon EMR • You just learned how to process and query data using Amazon EMR with Apache Spark • Amazon EMR has many other frameworks available for you to use • Hive, Presto, MapReduce, Pig • Hue, Oozie, HBase
  64. 64. Data Visualization with Amazon QuickSight
  65. 65. Fast, Easy Ad-Hoc Analytics for Anyone, Everywhere • Ease of use targeted at business users. • Blazing fast performance powered by SPICE. • Broad connectivity with AWS data services, on-premises data, files and business applications. • Cloud-native solution that scales automatically. • 1/10th the cost of traditional BI solutions. • Create, share and collaborate with anyone in your organization, on the web or on mobile.
  66. 66. Connect, SPICE, Analyze QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources and import it to SPICE or query directly. Users can then easily explore, analyze, and share their insights with anyone. Amazon RDS Amazon S3 Amazon Redshift
  67. 67. Activity 6 Visualize results in QuickSight
  68. 68. Activity 6: Visualization with QuickSight We are going to: A. Register for a QuickSight account B. Connect to the Amazon Redshift cluster C. Create visualizations to answer questions like: A. What are the most common HTTP requests, and how successful are they? B. Which URIs are requested the most?
  69. 69. Activity 6A: QuickSight Registration • Go to the console and choose QuickSight from the Analytics section. • Choose Sign up for QuickSight. • On the registration screen, choose the US West (Oregon) region. • Enter the AWS account number and your email. Note: You can get the AWS account number from qwikLABS in the Connect tab
  70. 70. Activity 6B: Connect to Amazon Redshift • Click Manage Data to create a new data set in QuickSight. • Choose Redshift (Auto-discovered) as the data source.
  71. 71. Activity 6B: Connect to Amazon Redshift Note: You can get the password from qwikLABS by navigating to the Custom Connection Details section on the Connect tab
  72. 72. Activity 6C: Ingest data into SPICE
  73. 73. Activity 6C: Ingest data into SPICE • SPICE is Amazon QuickSight's in-memory optimized calculation engine, designed specifically for fast, ad-hoc data visualization. • For faster analytics, you can import the data into SPICE instead of using a direct query to the database.
  74. 74. Activity 6D: Creating your first analysis What are the most common http requests made? • Simply select request_type, response_code and let AUTOGRAPH create the optimal visualization
  75. 75. Review – Creating your Analysis Exercise: Create a visual to show which URIs are the most requested
  76. 76. Congratulations!!!
  77. 77. Your Big Data Application Architecture Amazon Kinesis Producer UI Amazon Kinesis Firehose Amazon Kinesis Analytics Amazon Kinesis Firehose Amazon EMR Amazon Redshift Amazon QuickSight Generate web logs Collect web logs and deliver to S3 Process & compute aggregate metrics Deliver processed web logs to Amazon Redshift Raw web logs from Firehose Run SQL queries on processed web logs Visualize web logs to discover insights Amazon S3 Bucket Ad-hoc analysis of web logs
  78. 78. Thank you!
  79. 79. Remember to complete your evaluations!
