Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA305 - Toronto AWS Summit

Amazon Web Services
20 Sep 2018

  1. BDA305: Build Data Lakes and Analytics on AWS: Patterns & Best Practices. Alex Coqueiro, Public Sector Solutions Architecture Team, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Big Data Is Defined Many Different Ways: Volume, Velocity, Variety, Veracity, Value, Variability, Visualization
  3. Data Is Changing → Analytics Are Adopting. Capture and store new data at PB–EB scale. Do new types of analytics in a cost-effective way. New types of analytics: • Machine learning • Big data processing • Real-time analytics • Full-text search
  4. Most Important: Driving Value from Data. Organizations that successfully generate business value from their data will outperform their peers. In an Aberdeen survey, organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth.* Organic revenue growth: Leaders 24%, Followers 15%. *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
  5. Traditionally, Analytics Used to Look Like This: OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence. • Relational data • TBs–PBs scale • Schema defined prior to data load • Operational reporting and ad hoc
  6. Data Lakes Extend the Traditional Approach. OLTP, ERP, CRM, LOB → Data warehouse → Business intelligence. Devices, Web, Sensors, Social → Data lake → Big data processing, real-time, machine learning. • Relational and nonrelational data • TBs–EBs scale • Diverse analytical engines • Low-cost storage & analytics
  7. Data Lakes from AWS. • Unmatched durability and availability at EB scale • Best security, compliance, and audit capabilities • Object-level controls for fine-grain access • Fastest performance by retrieving subsets of data • The most ways to bring data in • Analyze with broadest set of analytics & ML services. Diagram: Data Lake on AWS, surrounded by Analytics, Machine learning, Real-time data movement, and On-premises data movement.
  8. Data Lakes and Analytics Portfolio from AWS: Broadest, deepest set of analytic services. Machine learning: Managed ML Service, Deep Learning AMIs, Video and Image Recognition, Conversational Interfaces, Deep-Learning Video Camera, Natural Language Processing, Language Translation, Speech Recognition, Text-to-Speech. Analytics: Interactive Analysis, Hadoop & Spark, Data Warehousing, Full-text search, Real-time analytics, Dashboards & Visualizations. On-premises data movement: Dedicated Network connection, Secure appliances, Ruggedized Shipping Container, Database migration. Real-time data movement: Connect Devices to AWS, Real-time Data Streams, Real-time Video Streams. Data Lake on AWS: Storage & Archival Storage | Data Catalog.
  9. Data Lakes and Analytics Portfolio from AWS: Broadest, deepest set of analytic services. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue.
  10. What data do I have?
  11. What Data Do I Have? Gartner: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” Data Lake on AWS: Storage | Archival Storage | Data Catalog
  12. AWS Glue. Discover (Data Catalog): Apache Hive Metastore compatible; integrated with AWS services; automatic crawling. Develop (Job Authoring): auto-generates ETL code; Python and Apache Spark; edit, debug, and share. Deploy (Job Execution): serverless execution; flexible scheduling; monitoring and alerting.
  13. What can crawlers discover? An AWS Glue crawler, running under an IAM role, connects to: databases over a JDBC connection (MySQL, MariaDB, PostgreSQL, Aurora, Oracle, Amazon Redshift); Amazon S3 over an object connection; Amazon DynamoDB over a NoSQL connection. Built-in classifiers: Avro, Parquet, ORC, XML, JSON & JSONPaths, AWS CloudTrail, BSON, logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others), delimited (comma, pipe, tab, semicolon) <ALWAYS GROWING…>. Create additional custom classifiers.
  14. Data Lake on Amazon S3 with AWS Glue. Your data: on-premises data, web app data, Amazon RDS, other databases, streaming data → Amazon QuickSight, Amazon SageMaker
  15. Other Ways of Populating the Catalog: call the AWS Glue CreateTable API; create a table manually; run a Hive DDL statement; import from an Apache Hive Metastore into the AWS Glue Data Catalog (via AWS Glue ETL).
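The CreateTable route on the slide above can be sketched as follows. This is a minimal illustration and not taken from the deck: the helper `build_table_input`, the table name, the bucket `example-data-lake`, and the database `example_db` are all hypothetical. The dict follows the `TableInput` shape that the Glue `CreateTable` API accepts, here for a Parquet table. The boto3 call is left as a comment so the sketch runs without AWS credentials.

```python
def build_table_input(name, location, columns, partition_keys):
    """Return a TableInput-shaped dict describing a Parquet table on S3.

    `columns` and `partition_keys` are lists of (name, hive_type) tuples.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }


# Hypothetical table: names and S3 location are placeholders.
table_input = build_table_input(
    name="clickstream",
    location="s3://example-data-lake/curated/clickstream/",
    columns=[("user_id", "string"), ("url", "string"), ("ts", "timestamp")],
    partition_keys=[("year", "string"), ("month", "string")],
)

# With boto3 installed and credentials configured, registration would look like:
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="example_db", TableInput=table_input)
```

Building the dict separately from the API call keeps the table definition easy to review or version before anything is written to the catalog.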
  16. But I have my own data formats …? There is a custom classifier for that … Row-based: Grok classifier. A grok pattern is a named set of regular expressions (regex) used to match data one line at a time. XML: XML classifier. Specify the XML tag that defines a table row in the XML document. JSON: JSON classifier. Specify the JSON path to the object, array, or value that defines a row of the table being created. Type the path in either dot or bracket JSON syntax using AWS Glue supported operators.
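To make the grok idea concrete, here is a small sketch using Python's `re` module. It is not AWS Glue's grok engine: the `PATTERNS` table, the `compile_grok` helper, and the sample log line are assumptions made for illustration. It only shows how `%{PATTERN:field}` tokens can expand into named regex groups that are matched one line at a time.

```python
import re

# Hypothetical named sub-patterns, in the spirit of grok's reusable pattern library.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "URIPATH": r"/\S*",
    "INT": r"\d+",
}


def compile_grok(expression):
    """Expand %{PATTERN:field} tokens into named groups and compile the result."""
    def expand(match):
        pattern_name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, PATTERNS[pattern_name])

    return re.compile("^" + re.sub(r"%\{(\w+):(\w+)\}", expand, expression) + "$")


matcher = compile_grok("%{IP:client} %{WORD:method} %{URIPATH:path} %{INT:status}")


def parse_line(line):
    """Return a dict of field name to text for one log line, or None if it does not match."""
    m = matcher.match(line)
    return m.groupdict() if m else None


row = parse_line("10.0.0.7 GET /index.html 200")
# row holds one table row: client, method, path, and status as strings
```

Each named group becomes a column, which mirrors how a row-based classifier derives a table schema from line-oriented data.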
  17. How do I hydrate my Data Lake?
  18. How do I drive value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. Traditional data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog.
  19. Ingest data based on the type of data. Open and comprehensive. Data movement from your datacenters: • Dedicated network connection (AWS Direct Connect) • Secure appliances (AWS Snowball) • Ruggedized shipping container (AWS Snowmobile) • Database migration (AWS Database Migration Service) • Gateway that lets applications write to the cloud (AWS Storage Gateway). Data movement from real-time sources: • Connect devices to AWS (AWS IoT Core) • Real-time data streams (Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams) • Real-time video streams (Amazon Kinesis Video Streams). Data lands in Amazon S3, Amazon Glacier, and AWS Glue.
  20. Real-time data movement and Data Lakes on AWS. Producers (Kinesis Agent, Apache Kafka, AWS SDK, LOG4J, Flume, Fluentd, AWS Mobile SDK, Kinesis Producer Library) feed Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose, which deliver into the Data Lake on AWS: data in Amazon S3, data definition in the AWS Glue Data Catalog.
  21. IMPORTANT: Ingest data in its raw form … Open and comprehensive (Amazon S3, Amazon Glacier, AWS Glue). • Store the data in its raw form BEFORE transforming, analyzing, manipulating, or doing … anything … to it. Formats: CSV, ORC, Grok, Avro, Parquet, JSON. • This becomes your source of record that you can always go back to … • Lifecycle policies allow you to shift it to warm and cold storage.
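A lifecycle policy of the kind mentioned in the last bullet can be sketched as a plain configuration dict. The helper name `raw_zone_lifecycle`, the `raw/` prefix, the bucket name, and the 30-day and 365-day thresholds are illustrative assumptions, not values from the deck. The structure follows the S3 lifecycle configuration format, with `STANDARD_IA` standing in for "warm" and `GLACIER` for "cold" storage.

```python
def raw_zone_lifecycle(prefix, warm_after_days, cold_after_days):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`."""
    if not 0 < warm_after_days < cold_after_days:
        raise ValueError("expected 0 < warm_after_days < cold_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-" + prefix.strip("/"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Warm tier: infrequent-access storage class.
                    {"Days": warm_after_days, "StorageClass": "STANDARD_IA"},
                    # Cold tier: archival storage class.
                    {"Days": cold_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Assumed thresholds for illustration only.
config = raw_zone_lifecycle("raw/", warm_after_days=30, cold_after_days=365)

# With boto3 installed and credentials configured, applying it would look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake", LifecycleConfiguration=config)
```

Scoping the rule to the raw prefix leaves curated, frequently queried datasets in the standard storage class.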
  22. Datasets in the Lake. Raw datasets: immutable datasets that you can always go back to. • Abstract out the complexities of how the data is stored through the catalog and SerDes. Optimizing Analytics and Machine Learning: curated datasets, query-optimized for consumption across a wide number of tools.
  23. Preparing raw data for consumption: Extract – Load – Transform (ELT). Raw data stored in Data Lake. Preparation: Normalized, Partitioned, Compressed, Storage Optimized. Data Lake on AWS: Raw Ingestion → ELT → Curated DataSets, described in the Data Catalog.
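Two of the preparation steps above, partitioning and compression, can be sketched with the standard library alone. The `curated/` prefix, the dataset name, and the sample records are hypothetical. The `year=/month=/day=` key layout is the Hive-style partition convention. The "Storage Optimized" step would normally mean a columnar format such as Parquet or ORC, which needs a third-party library and is left out here.

```python
import gzip
import json
from datetime import datetime, timezone


def partitioned_key(dataset, event_time, filename):
    """Return a Hive-style partitioned object key for a curated dataset."""
    return "curated/{}/year={:04d}/month={:02d}/day={:02d}/{}".format(
        dataset, event_time.year, event_time.month, event_time.day, filename
    )


def compress_records(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    body = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")
    return gzip.compress(body)


# Hypothetical sample data.
records = [{"user_id": "u1", "url": "/home"}, {"user_id": "u2", "url": "/cart"}]
key = partitioned_key(
    "clickstream", datetime(2018, 9, 20, tzinfo=timezone.utc), "part-0000.json.gz"
)
payload = compress_records(records)
# `key` and `payload` are what an upload step would write to the curated zone.
```

Query engines that understand this key layout can skip whole partitions when a query filters on year, month, or day, which reduces the data scanned.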
  24. Which tool should I use to analyze my data?
  25. Different tools for different users … solving different problems. Users: Business Reporting, Data Scientists, Data Engineer. Tools: IDE, SageMaker, Machine Learning/Deep Learning. Shared foundation: Data Catalog, Data Lake Central Storage.
  26. How Do I Drive Value? Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue.
  27. Amazon Athena – interactive analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30%–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, zero infrastructure, zero administration; integrated with Amazon QuickSight.
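The link between compression and per-query cost can be worked through with simple arithmetic, given that Athena bills by the amount of data a query scans. The figures below are assumptions for illustration and do not come from the deck: a price of 5.0 per TB scanned, a 2 TB raw dataset, and a 4:1 compression ratio.

```python
TB = 1024 ** 4  # bytes per terabyte, binary convention (assumed for this sketch)


def query_cost(bytes_scanned, price_per_tb):
    """Cost of one query when billing is proportional to data scanned."""
    return bytes_scanned / TB * price_per_tb


def savings_from_compression(raw_bytes, compression_ratio, price_per_tb):
    """Return (cost_before, cost_after, fraction_saved) for a full scan.

    compression_ratio is raw size divided by compressed size.
    """
    before = query_cost(raw_bytes, price_per_tb)
    after = query_cost(raw_bytes / compression_ratio, price_per_tb)
    return before, after, 1 - after / before


# Assumed inputs: 2 TB raw, 4:1 compression, 5.0 currency units per TB scanned.
before, after, saved = savings_from_compression(2 * TB, 4.0, 5.0)
```

Under these assumptions a full scan drops from 10.0 to 2.5, a 75% saving, which falls inside the 30%–90% range quoted on the slide. The actual saving depends on the data and the codec.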
  28. Familiar Technologies Under the Covers. Used for SQL queries (Presto): in-memory distributed query engine; ANSI-SQL compatible with extensions. Used for DDL functionality (Apache Hive): complex data types; multitude of formats; supports data partitioning.
  29. Exploring Data with Amazon Athena. On-premises data, web app data, Amazon RDS, other databases, streaming data → Amazon QuickSight, Amazon SageMaker
  30. Amazon EMR – big data processing. Analytics and ML at scale: 19 open-source projects (Apache Hadoop, Spark, HBase, Presto, and more); enterprise-grade security. Latest versions: updated with the latest open source frameworks within 30 days of release. Low cost: flexible billing with per-second billing, Amazon EC2 Spot, Reserved Instances, and Auto Scaling to reduce costs 50%–80%. Use Amazon S3 storage: process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector. Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning.
  31. EMR – Enterprise – Hadoop & Spark. Deploy latest releases in Hadoop and Spark ecosystems. • Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more • Updated with the latest open source frameworks within 30 days of release. EMR releases:
      • emr-4.0.0 (July 2015): Hadoop 2.6.0, Hive & Catalog 1.0.0, Mahout 0.10.0, Pig 0.14.0, Spark 1.4.1
      • emr-4.7.0 (June 2016): Hadoop 2.7.2, Ganglia 3.7.2, HBase 1.2.1, Hive & Catalog 1.0.0, Hue 3.7.1, Mahout 0.12.0, Oozie 4.2.0, Phoenix 4.7.0, Pig 0.14.0, Presto 0.147, Spark 1.6.1, Sqoop 1.4.6, Tez 0.8.3, Zeppelin 0.5.6, Zookeeper 3.4.8
      • emr-5.3.0 (January 2017): Hadoop 2.7.3, Ganglia 3.7.2, HBase 1.2.3 + S3, Hive & Catalog 2.1.1, Hue 3.11.0, Mahout 0.12.2, Oozie 4.3.0, Phoenix 4.7.0, Pig 0.16.0, Presto 0.157.1, Spark 2.1.0, Sqoop 1.4.6, Tez 0.8.4, Zeppelin 0.6.2, Zookeeper 3.4.9, Flink 1.1.4
      • emr-5.14.0 (June 2018): Hadoop 2.8.3, Ganglia 3.7.2, HBase 1.4.2 + S3, Hive & Catalog 2.3.2, Hue 4.1.0, Mahout 0.13.0, Oozie 4.3.0, Phoenix 4.13.0, Pig 0.17.0, Presto 0.194, Spark 2.3.0, Sqoop 1.4.7, Tez 0.8.4, Zeppelin 0.7.3, Zookeeper 3.4.10, Flink 1.4.2, Livy 0.4.0, MXNet 1.1.0
  32. Hadoop/Spark Analytics on AWS. Workloads: Batch, Script, Interactive, Real-time, Machine learning, NoSQL. These run on YARN (Hadoop Resource Manager) in Amazon EMR (managed Hadoop/Spark), over the Data Lake on AWS in Amazon S3 (object storage).
  33. Amazon S3 – Source of Truth, Multiple Clusters. Amazon S3 is the source of truth for two Amazon EMR clusters: an interactive Spark cluster and a transient ETL job. Each cluster uses EC2 instance memory, local disk, and HDFS; intermediates are stored on local disk or HDFS.
  34. Fitting this into the Common Data Catalog. Amazon S3 (source of truth) stores the data … The AWS Glue Data Catalog describes the data and provides a unified data view. Amazon EMR clusters (interactive Spark cluster, transient ETL job) use EMRFS and HDFS. Also shown: MySQL DB instance.
  35. Data processing with Amazon EMR (Spark). On-premises data, web app data, Amazon RDS, other databases, streaming data → Amazon QuickSight, Amazon SageMaker
  36. What if I implement machine learning to identify complex business insights?
  37. Machine Learning on Your Data Lake. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue.
  38. AWS Machine Learning. Application Services: Vision (Rekognition Image, Rekognition Video), Speech (Polly, Transcribe), Language (Lex, Translate, Comprehend). Platform Services: Amazon SageMaker. Frameworks & Infrastructure: TensorFlow, Gluon, Apache MXNet, Cognitive Toolkit, Caffe2 & Caffe, PyTorch, Keras; GPU, CPU, Mobile, IoT (Greengrass).
  39. Amazon SageMaker: (1) Notebook Instances, (2) Algorithms, (3) ML Training Service, (4) ML Hosting Service
  40. Machine Learning with Amazon SageMaker. On-premises data, web app data, Amazon RDS, other databases, streaming data → Amazon QuickSight, Amazon SageMaker
  41. Agility and Innovation Are Key. Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly. Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight. On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service. Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams. Data Lake on AWS: Amazon S3 | AWS Glue.
  42. BDA305. Thank you! Alex Coqueiro, Public Sector Solutions Architecture Team, Amazon Web Services
  43. Please complete the session survey in the summit mobile app.
  44. Submit Session Feedback 1. Tap the Schedule icon. 2. Select the session you attended. 3. Tap Session Evaluation to submit your feedback.