
Using Data Lakes

A data lake can be used as a source for both structured and unstructured data, but how? We'll look at using open-source engines including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum, and Amazon Athena to process and understand data.

Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS

  1. Using Data Lakes. Roger Dahlstrom (rdahlstr@amazon.com), Solutions Architect, AWS
  2-4. Reasons for building a data lake
     • Exponential growth in data: transactions, ERP, sensor data, billing, web logs, social, infrastructure logs
     • Diversified consumers: data scientists, business analysts, external consumers, applications
     • Multiple access mechanisms: API access, BI tools, notebooks
  5. Characteristics of a data lake: collect anything, dive in anywhere, flexible access, future proof
  6. Challenges with legacy data architectures
     • Can’t move data across silos
     • Can’t handle semi-structured or unstructured data
     • Can’t deal with dynamic data and real-time processing
     • Complex ETL processes
     • Difficulty applying consistent data governance across all data sets
  7. Legacy data architectures exist as isolated data silos: Hadoop cluster, SQL database, data warehouse appliance
  8. Navigating the data lake… A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data in a centralized repository.
  9. Data lake – centralized data, single source of truth. “Why is the data distributed in many locations? Where is the single source of truth?” Store and analyze all of your data, from all of your sources, in one centralized location.
  10. Data lake – quick ingest. “How can I collect data quickly from various sources and store it efficiently?” Quickly ingest data without needing to force it into a predefined schema.
  11. Building a Data Lake on AWS
  12. Why AWS? Implementing a data lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases. It boils down to picking the right tool for the right job, on a consumption-based model.
  13. Example of AWS services for a data lake
     • Central storage: Amazon S3
     • Data ingestion: AWS Database Migration Service, AWS Snowball*, AWS Direct Connect, Amazon Kinesis Firehose
     • Catalog: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
     • Processing and analytics: Amazon Kinesis, Amazon EMR, Amazon RDS, Amazon Athena, Amazon QuickSight, Amazon Redshift*
     • Access and secure
  14. Data lake reference architecture
     • Data ingestion – get your data into S3 quickly and securely: Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
     • Central storage – secure, cost-effective storage in Amazon S3
     • Catalog & search – access and search metadata: DynamoDB, Elasticsearch
     • Processing & analytics – use of predictive and prescriptive analytics to gain better understanding: Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
     • Access & user interface – give your users easy and secure access: API Gateway, Identity & Access Management, Cognito, QuickSight
     • Protect and secure – use entitlements to ensure data is secure and users’ identities are verified: Security Token Service, CloudWatch, CloudTrail, Key Management Service
  15. S3 – Center of the Data Lake
  16. Why Amazon S3 for a data lake?
     • Durable: designed for 11 9s of durability
     • Available: designed for 99.99% availability
     • High performance: multipart upload, range GET
     • Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
     • Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB
     • Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
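
A minimal boto3 sketch of a few of these features; the bucket name, keys, and lifecycle rule below are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-raw"  # hypothetical bucket name

# Store an object in the central data lake bucket
s3.put_object(
    Bucket=BUCKET,
    Key="weblogs/2018/06/01/access.log",
    Body=b"127.0.0.1 GET /index.html 200\n",
)

# Range GET: read only the first 1 KB of a (potentially huge) object
part = s3.get_object(
    Bucket=BUCKET, Key="weblogs/2018/06/01/access.log", Range="bytes=0-1023"
)
print(part["Body"].read())

# Lifecycle policy: tier raw logs to Glacier after 90 days
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-logs",
            "Filter": {"Prefix": "weblogs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```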
  17. Implement the right cloud security controls
     • Security: AWS Identity and Access Management (IAM) policies; bucket policies; access control lists (ACLs); private VPC endpoints to Amazon S3; SSL endpoints
     • Encryption: server-side encryption (SSE-S3); server-side encryption with provided keys (SSE-C, SSE-KMS); client-side encryption
     • Compliance: bucket access logs; lifecycle management policies; ACLs; versioning & MFA Delete; certifications – HIPAA, PCI, SOC 1/2/3, etc.
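
Two of these controls applied with boto3, as a sketch (the bucket name and policy details are assumptions, not from the deck):

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-raw"  # hypothetical bucket name

# Default server-side encryption (SSE-KMS) for all new objects
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Bucket policy denying any request that does not arrive over SSL
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```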
  18. Benefits of an AWS data lake
     • Decouple storage and compute by using Amazon S3 object storage as the data layer
     • Consolidate, centrally manage, and govern your data
     • Flexibility to use any and all tools in the ecosystem – the right tool for the job
     • Future-proof your architecture: as new use cases and new tools emerge, you can plug in the current best of breed
  19. AWS Glue – Data Catalog: unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3, accessible via Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, and the API
  20. AWS Glue – ETL: automatically generated ETL code running on serverless Apache Spark, with the power and flexibility to bring data together
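
Glue generates code of roughly this shape; here is a minimal hand-written sketch of a Glue PySpark job (the database, table, and S3 path are hypothetical):

```python
# Runs inside an AWS Glue job; the awsglue library is provided by the Glue runtime
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table registered in the Glue Data Catalog (hypothetical names)
logs = glue_context.create_dynamic_frame.from_catalog(
    database="datalake", table_name="raw_weblogs"
)

# Convert to columnar Parquet in S3 for cheaper, faster queries downstream
glue_context.write_dynamic_frame.from_options(
    frame=logs,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-curated/weblogs/"},
    format="parquet",
)
```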
  21. Process and analyze…
  22. Amazon Elastic MapReduce (EMR)
     • Hadoop/HDFS clusters; Hive, Pig, Impala, HBase, Spark, Presto
     • Easy to use, fully managed
     • On-demand, reserved instance, and Spot pricing
     • Tight integration with Amazon S3, DynamoDB, and Kinesis
  23. On-premises Hadoop clusters [diagram: server racks 1…N, ~20 core nodes each]
     • A cluster of 1U machines, typically 12 cores, 32/64 GB RAM, and 6-8 TB of HDD ($3-4K each)
     • Networking switches and racks
     • Open-source distribution of Hadoop, or a fixed licensing term from commercial distributions
     • Different node roles
     • HDFS uses local disk and is sized for 3x data replication (see the sizing sketch below)
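
A back-of-the-envelope sketch of what 3x replication does to usable capacity, using the node specs above:

```python
NODES_PER_RACK = 20
DISK_PER_NODE_TB = 8      # upper end of the 6-8 TB spec above
REPLICATION_FACTOR = 3    # HDFS default

raw_tb = NODES_PER_RACK * DISK_PER_NODE_TB   # 160 TB raw per rack
usable_tb = raw_tb / REPLICATION_FACTOR      # ~53 TB of unique data
print(f"{raw_tb} TB raw -> ~{usable_tb:.0f} TB usable per rack")
```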
  24. Workload types running on the same cluster
     • Large-scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce
     • Interactive queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
     • Machine learning and data science: Spark ML, Apache Mahout
     • NoSQL: Apache HBase
     • Stream processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm
     • Search: Elasticsearch, Apache Solr
     • Job submission: client edge node, Apache Oozie
     • Data warehouses like Pivotal Greenplum or Teradata
  25. Why Amazon EMR?
     • Low cost: pay an hourly rate
     • Open-source variety: latest versions of software
     • Managed: spend less time monitoring
     • Secure: easy-to-manage options
     • Flexible: customize the cluster
     • Easy to use: launch a cluster in minutes
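
As an illustration, launching a small Spark cluster with boto3 (the cluster name, sizes, and buckets are hypothetical; the default EMR IAM roles must already exist):

```python
import boto3

emr = boto3.client("emr")

# Launch a small Spark cluster; release label and instance types are illustrative
response = emr.run_job_flow(
    Name="datalake-analytics",             # hypothetical cluster name
    ReleaseLabel="emr-5.13.0",             # an EMR release current in 2018
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    LogUri="s3://my-data-lake-logs/emr/",  # hypothetical log bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```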
  26. Many storage layers to choose from: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3, Amazon Elasticsearch Service – all accessible from Amazon EMR
  27. Decouple compute and storage by using Amazon S3 as your data layer. S3 is designed for 11 9’s of durability and is massively scalable; intermediate results stay on the EMR cluster, in EC2 instance memory, on local disk, or in HDFS.
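
On EMR, Spark reads and writes S3 through EMRFS using plain s3:// paths; a minimal sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-data-layer").getOrCreate()

# Read raw JSON directly from S3 via EMRFS (hypothetical path)
events = spark.read.json("s3://my-data-lake-raw/events/2018/06/")

# Intermediate shuffle data stays on the cluster; only results land in S3
daily = events.groupBy("event_type").count()

# Write curated results back to S3 as Parquet; the cluster can then be
# resized or terminated without losing data
daily.write.mode("overwrite").parquet("s3://my-data-lake-curated/event_counts/")
```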
  28. HBase on Amazon S3 for scalable NoSQL
  29. Amazon Athena – interactive query service
     • Query directly from Amazon S3
     • Use ANSI SQL
     • Serverless
     • Multiple data formats
     • Pay per query
  30. Why use Athena?
     • Decouple storage from compute
     • Serverless – no infrastructure or resources to manage
     • Pay only for data scanned
     • Schema on read – same data, many views
     • Encrypted
     • Standards-compliant and open storage formats
     • Built on powerful community-supported OSS solutions
  31. Simple pricing
     • DDL operations – free
     • SQL operations – free
     • Query concurrency – free
     • Data scanned – $5/TB
     • Standard S3 rates for storage, requests, and data transfer apply
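
A sketch of running a query and reading back how much data it scanned, which is what you pay for (database, table, and result bucket are hypothetical):

```python
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) FROM weblogs GROUP BY event_type",
    QueryExecutionContext={"Database": "datalake"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes
while True:
    info = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
    if info["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Pay per query: $5 per TB scanned
scanned = info["Statistics"]["DataScannedInBytes"]
print(f"Scanned {scanned} bytes, ~${scanned / 1e12 * 5:.4f}")
```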
  32. Familiar technologies under the covers
     • Presto – used for SQL queries: in-memory distributed query engine; ANSI-SQL compatible, with extensions
     • Hive – used for DDL functionality: complex data types, a multitude of formats, supports data partitioning
  33. Hive metadata definition
     • Hive Data Definition Language
     • Data Manipulation Language (INSERT, UPDATE)
     • Create Table As
     • User-Defined Functions
     • Hive-compatible SerDes (serializer/deserializer)
     • CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
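
For example, a partitioned Parquet table defined with Hive DDL and submitted through Athena (all names and paths are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Hive DDL: external table over Parquet files in S3, partitioned by date
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS datalake.weblogs (
    request_time string,
    client_ip    string,
    uri          string,
    status       int
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-data-lake-curated/weblogs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```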
  34. Presto SQL
     • ANSI SQL compliant
     • Complex joins, nested queries & window functions
     • Complex data types (arrays, structs, maps)
     • Partitioning of data by any key: date, time, custom keys
     • Presto built-in functions
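
A small sketch of those features: a window function and an array-returning built-in, written against the hypothetical table defined above:

```python
# A Presto window function plus an array type, runnable in Athena against
# the hypothetical datalake.weblogs table from the previous sketch
query = """
SELECT client_ip,
       uri,
       count(*) OVER (PARTITION BY client_ip) AS requests_per_ip,  -- window function
       split(uri, '/')                        AS path_parts        -- array type
FROM datalake.weblogs
WHERE dt = '2018-06-01'
"""
```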
  35. Amazon Redshift – relational data warehouse
     • Massively parallel; petabyte scale
     • Fully managed
     • HDD and SSD platforms
     • $1,000/TB/year; starts at $0.25/hour
     • A lot faster, a lot simpler, a lot cheaper
  36. Amazon Redshift works with third-party analysis tools via JDBC/ODBC (e.g., Amazon QuickSight)
  37. Amazon Redshift has security built in
     • SSL to secure data in transit
     • Encryption to secure data at rest: AES-256, hardware accelerated; all blocks on disks and in Amazon S3 encrypted; HSM support
     • No direct access to compute nodes
     • Audit logging, AWS CloudTrail, AWS KMS integration
     • Amazon VPC support
     • SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
     [Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to a leader node in the customer VPC; compute nodes (16 cores, 128 GB RAM, 16 TB disk each) sit in an internal VPC on 10 GigE (HPC), with ingestion, backup, and restore against Amazon S3/Amazon DynamoDB]
  38. Redshift Spectrum – query data directly in Amazon S3
     • Leverages Amazon Redshift’s advanced cost-based optimizer
     • Pushes down projections, filters, aggregations, and join reduction
     • Dynamic partition pruning to minimize data processed
     • Automatic parallelization of query execution against Amazon S3 data
     • Efficient join processing within the Amazon Redshift cluster
     [Architecture diagram: the Redshift cluster fans out to N Spectrum nodes, which read exabyte-scale object storage in Amazon S3 using a data catalog (Apache Hive Metastore)]
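
A sketch of wiring Spectrum to the data catalog and querying S3 data. It uses the Redshift Data API for brevity, which post-dates this deck; any JDBC/ODBC client runs the same SQL. Cluster, role, schema, and table names are hypothetical:

```python
import boto3

rsd = boto3.client("redshift-data")

# One-time DDL: map an external schema onto the data catalog
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'datalake'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
    """,
)

# Join S3-resident data with a local Redshift dimension table
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="""
        SELECT d.region, count(*)
        FROM spectrum.weblogs w
        JOIN dim_customers d ON w.client_ip = d.ip
        GROUP BY d.region
    """,
)
```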
  39. Translate use cases to the right tools
     • Low-latency SQL -> Athena, Presto, or Amazon Redshift
     • Data warehouse/reporting -> Spark, Hive, Glue, or Amazon Redshift
     • Management and monitoring -> EMR console or Ganglia metrics
     • HDFS -> Amazon S3
     • Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
     • Query console -> Athena or Hue
     • Security -> Ranger (CF template), HiveServer2, or IAM roles
     [Stack diagram: storage – S3 (EMRFS), HDFS; YARN cluster resource management; batch – MapReduce; interactive – Tez; in-memory – Spark; streaming – Flink; applications – Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto, Athena, Glue, Amazon Redshift]
  40. Amazon Elasticsearch Service
     • Real-time distributed search and analytics engine
     • Managed service using Elasticsearch and Kibana
     • Fully managed – zero admin
     • Highly available and reliable
     • Tightly integrated with other AWS services
  41. Amazon Elasticsearch Service leading use cases
     • Log analytics & operational monitoring: monitor the performance of applications, web servers, and hardware; easy-to-use yet powerful data visualization tools to detect issues in near real time; dig into logs in an intuitive, fine-grained way; Kibana provides fast, easy visualization
     • Traditional search: an application or website provides search capabilities over diverse documents; tasked with making this knowledge base searchable and accessible; key search features include text matching, faceting, filtering, fuzzy search, autocomplete, and highlighting; query API to support application search
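
A minimal log-analytics sketch against a domain endpoint, assuming the domain's access policy permits the caller (production setups typically sign requests with SigV4); the endpoint and index names are hypothetical:

```python
import json
import requests

ES_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"  # hypothetical
HEADERS = {"Content-Type": "application/json"}

# Index one log event into a daily index (Elasticsearch REST API)
event = {"timestamp": "2018-06-01T12:00:00Z", "level": "ERROR",
         "message": "payment service timeout", "host": "web-01"}
r = requests.post(f"{ES_ENDPOINT}/logs-2018.06.01/_doc",
                  data=json.dumps(event), headers=HEADERS)
r.raise_for_status()

# Search recent error-level events across all daily indices
query = {"query": {"match": {"level": "ERROR"}}, "size": 10}
r = requests.get(f"{ES_ENDPOINT}/logs-*/_search",
                 data=json.dumps(query), headers=HEADERS)
print(r.json()["hits"]["total"])
```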
  42. Amazon Kinesis
     • Real-time processing
     • High throughput; elastic
     • Easy to use
     • Integrations: Amazon EMR, Amazon S3, Amazon Redshift, DynamoDB
  43. Amazon Kinesis: streaming data made easy – services make it easy to capture, deliver, and process streams on AWS
     • Amazon Kinesis Streams: for technical developers; build your own custom applications that process or analyze streaming data
     • Amazon Kinesis Firehose: for all developers and data scientists; easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch Service
     • Amazon Kinesis Analytics: for all developers and data scientists; easily analyze data streams using standard SQL queries
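
A sketch of writing one event to a stream with boto3 (the stream name and payload are hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-123", "action": "click", "ts": "2018-06-01T12:00:00Z"}

# Write one record to the stream; the partition key determines which
# shard receives the record
kinesis.put_record(
    StreamName="clickstream",                    # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```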
  44. AWS ML stack
     • Application services – Vision: Rekognition Image, Rekognition Video; Speech: Polly, Transcribe; Language: Lex, Translate, Comprehend
     • Platform services: Amazon SageMaker, Amazon Machine Learning, Spark & EMR, Mechanical Turk, AWS DeepLens
     • Frameworks & infrastructure: TensorFlow, Apache MXNet, Gluon, PyTorch, Caffe2 & Caffe, Keras, Cognitive Toolkit; AWS Deep Learning AMI; GPU (P3 instances), CPU, mobile, IoT (Greengrass)
  45. Proven customer success: the vast majority of big data use cases deployed in the cloud today run on AWS.
  46. Case study: re-architecting compliance
     • What FINRA needed: infrastructure for its market surveillance platform; support for analysis and storage of approximately 75 billion market events every day
     • Why they chose AWS: fulfillment of FINRA’s security requirements; ability to create a flexible platform using dynamic clusters (Hadoop, Hive, and HBase), Amazon EMR, and Amazon S3
     • Benefits realized: increased agility, speed, and cost savings; estimated savings of $10-20M annually by using AWS
     • “For our market surveillance systems, we are looking at about 40% [savings with AWS], but the real benefits are the business benefits: We can do things that we physically weren’t able to do before, and that is priceless.” – Steve Randich, CIO
  47. Fraud detection: FINRA uses Amazon EMR and Amazon S3 to process up to 75 billion trading events per day and securely store over 5 petabytes of data, attaining savings of $10-20M per year.
  48. Building a data lake: an architectural approach that allows you to store massive amounts of “raw” data in a central location, readily available to be categorized, processed, analyzed, and consumed by diverse groups (see the reference architecture on slide 14: data ingestion, central storage, catalog & search, processing & analytics, access & user interface, protect and secure)
  49. aws.amazon.com/activate – everything and anything startups need to get started on AWS
