Using Data Lakes

by Mamoon Chowdry, Solutions Architect

AWS Data & Analytics Week is an opportunity to learn about Amazon's family of managed analytics services. These services provide easy, scalable, reliable, and cost-effective ways to manage your data in the cloud. We explain the fundamentals and take a technical deep dive into the Amazon Redshift data warehouse; data lake services including Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum; log analytics with Amazon Elasticsearch Service; and data preparation and placement services with AWS Glue and Amazon Kinesis. You'll learn how to get started, how to support applications, and how to scale.


Using Data Lakes

  1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pop-up Loft. Using Data Lakes. Mamoon Chowdry, Solutions Architect (chowdry@amazon.com); Ben Willett, Solutions Architect (benwille@amazon.com).
  2. Reasons for building a data lake: exponential growth in data. Sources include transactions, ERP, sensor data, billing, web logs, social, and infrastructure logs.
  3. Reasons for building a data lake: diversified consumers, including data scientists, business analysts, external consumers, and applications.
  4. Reasons for building a data lake: multiple access mechanisms, such as API access, BI tools, and notebooks.
  5. Characteristics of a data lake: collect anything, dive in anywhere, flexible access, future proof.
  6. On-premises Hadoop clusters (server racks 1 through N, 20 nodes each, plus core switches): • A cluster of 1U machines • Typically 12 cores, 32/64 GB RAM, and 6-8 TB of HDD ($3-4K per node) • Networking switches and racks • Open-source distribution of Hadoop or a fixed licensing term from commercial distributions • Different node roles • HDFS uses local disk and is sized for 3x data replication
  7. Workload types running on the same cluster: • Large-scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce • Interactive queries: Apache Impala, Spark SQL, Presto, Apache Phoenix • Machine learning and data science: Spark ML, Apache Mahout • NoSQL: Apache HBase • Stream processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm • Search: Elasticsearch, Apache Solr • Job submission: client edge node, Apache Oozie • Data warehouses such as Pivotal Greenplum or Teradata
  8. Security • Authentication: Kerberos with local KDC or Active Directory, LDAP integration, local user management, Apache Knox • Authorization: open-source native authZ (e.g., HiveServer2 authZ or HDFS ACLs), Apache Ranger, Apache Sentry • Encryption: local disk encryption with LUKS, HDFS transparent data encryption, in-flight encryption for each framework (e.g., Hadoop MapReduce encrypted shuffle) • Configuration: different management tools depending on the vendor
  9. Swim lane of jobs over time: the cluster is over-utilized at peak periods and under-utilized the rest of the time.
  10. Role of a Hadoop administrator • Management of the cluster (failures, hardware replacement, restarting services, expanding the cluster) • Configuration management • Tuning of specific jobs or hardware • Managing development and test environments • Backing up data and disaster recovery
  11. On-prem: over-utilization and idle capacity • Tightly coupled compute and storage requires buying excess capacity • Can be over-utilized during peak hours and under-utilized at other times • Results in high costs and low efficiency
  12. On-prem: system management difficulties • Managing distributed applications and availability • Durable storage and disaster recovery • Adding new frameworks and doing upgrades • Multiple environments • Need a team to manage the cluster and procure hardware
  13. Why Amazon EMR? Low cost: pay an hourly rate. Open-source variety: latest versions of software. Managed: spend less time monitoring. Secure: easy-to-manage options. Flexible: customize the cluster. Easy to use: launch a cluster in minutes.
  14. Translate use cases to the right tools: • Low-latency SQL -> Athena, Presto, or Amazon Redshift • Data warehouse/reporting -> Spark, Hive, Glue, or Amazon Redshift • Management and monitoring -> EMR console or Ganglia metrics • HDFS -> Amazon S3 • Notebooks -> Zeppelin or Jupyter (via bootstrap action) • Query console -> Athena or Hue • Security -> Ranger (CloudFormation template), HiveServer2, or IAM roles. (Stack diagram: storage - S3 via EMRFS and HDFS; YARN cluster resource management; batch - MapReduce; interactive - Tez; in-memory - Spark; applications - Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto.)
  15. Many storage layers to choose from: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3, Amazon EMR, Amazon Elasticsearch Service.
  16. Decouple compute and storage by using Amazon S3 as your data layer. S3 is designed for 11 9's of durability and is massively scalable; intermediate results stay in EC2 instance memory, on local disk, or in HDFS, while the data of record lives in S3.
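A minimal PySpark sketch of the decoupling described on slide 16, assuming a hypothetical bucket and paths: the job reads raw data from S3 and writes curated Parquet back to S3 (EMRFS on EMR), so the cluster can be resized or terminated without losing the data of record.

```python
from pyspark.sql import SparkSession

# On EMR, s3:// paths go through EMRFS; no HDFS staging copy is needed.
spark = SparkSession.builder.appName("s3-data-layer").getOrCreate()

# Read raw events straight from S3.
events = spark.read.json("s3://my-data-lake/raw/events/")

# Curate a subset and write it back to S3 as partitioned Parquet.
purchases = (events
             .filter(events.event_type == "purchase")
             .select("user_id", "amount", "event_date"))
(purchases.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://my-data-lake/curated/purchases/"))
```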
  17. HBase on Amazon S3 for scalable NoSQL.
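Slide 17 names HBase on S3 without detail; a sketch of the EMR configuration that, per the documented "hbase" and "hbase-site" classifications, moves the HBase root directory onto S3 (the bucket name is hypothetical):

```python
# Documented EMR classifications for HBase-on-S3; bucket is a placeholder.
hbase_on_s3 = [
    {"Classification": "hbase",
     "Properties": {"hbase.emr.storageMode": "s3"}},
    {"Classification": "hbase-site",
     "Properties": {"hbase.rootdir": "s3://my-data-lake/hbase"}},
]
# Pass this list as the Configurations argument of boto3's
# emr.run_job_flow(...) when creating the cluster.
```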
  18. Options to submit jobs: • Amazon EMR Step API: submit a Spark application • AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission or build complex workflows • AWS Lambda: submit applications to the EMR Step API or directly to Spark on your cluster • Apache Oozie on your cluster: build DAGs of jobs.
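A minimal boto3 sketch of the first option on slide 18, the EMR Step API; the cluster ID and S3 path are hypothetical placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# command-runner.jar lets a step run spark-submit on the master node.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # hypothetical cluster ID
    Steps=[{
        "Name": "nightly-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-data-lake/jobs/etl.py"],
        },
    }],
)
print(response["StepIds"])
```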
  19. Performance and hardware considerations: transient or long-running clusters, instance types, cluster size, application settings, file formats, and Amazon S3 tuning. (Example layout: master node r4.2xlarge; core group c5.2xlarge; task group m5.2xlarge on EC2 Spot.)
  20. On-cluster UIs to quickly tune workloads: application management, a SQL editor, a workflow designer, a metastore browser, and notebooks to design and execute queries and workloads.
  21. Use Spot and Reserved Instances to lower costs: On-Demand for core nodes (standard EC2 On-Demand pricing) to meet SLAs at a predictable cost; Spot for task nodes (up to 80% off EC2 On-Demand pricing) to exceed SLAs at a lower cost.
  22. Instance fleets for advanced Spot provisioning (master node, core instance fleet, task instance fleet): • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity and price • Spot Block support.
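A hedged boto3 sketch of slide 22's instance fleets: each fleet lists candidate instance types with a Spot or On-Demand capacity target, and the task fleet falls back to On-Demand if Spot capacity cannot be provisioned. Names, types, and counts are illustrative, not the deck's setup.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="fleet-cluster",
    ReleaseLabel="emr-5.13.0",
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceFleets": [
            # One On-Demand master.
            {"InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
             "InstanceTypeConfigs": [{"InstanceType": "r4.2xlarge"}]},
            # Core capacity on On-Demand, since it hosts HDFS/local data.
            {"InstanceFleetType": "CORE", "TargetOnDemandCapacity": 2,
             "InstanceTypeConfigs": [{"InstanceType": "c5.2xlarge"},
                                     {"InstanceType": "m5.2xlarge"}]},
            # Task capacity on Spot, switching to On-Demand on timeout.
            {"InstanceFleetType": "TASK", "TargetSpotCapacity": 8,
             "LaunchSpecifications": {"SpotSpecification": {
                 "TimeoutDurationMinutes": 20,
                 "TimeoutAction": "SWITCH_TO_ON_DEMAND"}},
             "InstanceTypeConfigs": [{"InstanceType": "m5.2xlarge"},
                                     {"InstanceType": "m4.2xlarge"}]},
        ],
    },
)
```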
  23. Lower costs with Auto Scaling.
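Slide 23 gives no detail; one way to express it is EMR's automatic scaling policy API. A sketch, assuming a hypothetical cluster and instance group, that adds task capacity when available YARN memory runs low:

```python
import boto3

emr = boto3.client("emr")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",        # hypothetical IDs
    InstanceGroupId="ig-XXXXXXXXXXXX",
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [{
            "Name": "scale-out-on-memory-pressure",
            # Add two instances when available YARN memory stays under 15%.
            "Action": {"SimpleScalingPolicyConfiguration": {
                "ScalingAdjustment": 2,
                "CoolDown": 300,
            }},
            "Trigger": {"CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "MetricName": "YARNMemoryAvailablePercentage",
                "Period": 300,
                "Threshold": 15.0,
            }},
        }],
    },
)
```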
  24. Security: encryption.
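One concrete way to apply slide 24 is an EMR security configuration. A sketch assuming SSE-S3 for at-rest encryption and a PEM certificate bundle at a hypothetical S3 location for in-transit TLS:

```python
import json

import boto3

emr = boto3.client("emr")

config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            # SSE-S3 keeps the sketch free of KMS key setup.
            "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"}
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                # Hypothetical zip of certificates in S3.
                "S3Object": "s3://my-bucket/certs/my-certs.zip",
            }
        },
    }
}
emr.create_security_configuration(
    Name="datalake-encryption",
    SecurityConfiguration=json.dumps(config),
)
```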
  25. Security: authentication and authorization. (Diagram: an IAM user MyUser is mapped to the cluster through a tag user = MyUser, the EMR service role, the EC2 instance role, and an SSH key.)
  26. Security: authentication and authorization with Apache Ranger • Plug-ins for Hive, HBase, YARN, and HDFS • Row-level authorization for Hive (with data masking) • Full auditing capabilities with embedded search • Run Ranger on an edge node (see the AWS Big Data Blog).
  27. Security: governance and auditing • AWS CloudTrail for EMR APIs • Custom AMIs • S3 access logs for cluster S3 access • YARN and application logs • The Ranger UI for application-level auditing.
  28. FINRA: migrating from on-prem to AWS. Petabytes of data generated on-premises, brought to AWS, and stored in Amazon S3. Thousands of analytical queries performed on EMR and Amazon Redshift. Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing. (Architecture: data movement, data registration, and version management into Amazon S3; surveillance analytics with flexible interactive and predefined queries; web applications serving analysts and regulators.)
  29. Lower cost and higher scale than on-premises.
  30. FINRA saved 60% by moving to HBase on EMR.
  31. Real-time bidding pipeline: impressions, clicks, and activities are ingested through Amazon Kinesis into Amazon S3, feed ETL, attribution, and machine learning (learn, calibrate, and evaluate models), and the resulting models drive real-time bidding. • 2 petabytes processed daily • 2 million bid decisions per second • Runs 24x7 on 5 continents • Thousands of ML models trained per day.
  32. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.
  33. Why use Athena? • Decouples storage from compute • Serverless: no infrastructure or resources to manage • Pay only for data scanned • Schema-on-read: same data, many views • Encrypted • Standards-compliant, open storage formats • Built on powerful, community-supported OSS solutions.
  34. Simple pricing • DDL operations: free • SQL operations: free • Query concurrency: free • Data scanned: $5/TB • Standard S3 rates for storage, requests, and data transfer apply.
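A minimal boto3 sketch of the pay-per-scan model on slide 34: start a query, poll it to completion, and read back the bytes scanned that the $5/TB charge is computed from. Database, table, and bucket names are hypothetical.

```python
import time

import boto3

athena = boto3.client("athena")

# Start the query; results land in a (hypothetical) S3 output location.
qid = athena.start_query_execution(
    QueryString="SELECT event_date, COUNT(*) AS n "
                "FROM weblogs GROUP BY event_date",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    execution = athena.get_query_execution(QueryExecutionId=qid)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# DataScannedInBytes drives the $5/TB charge.
print(state, execution["QueryExecution"]["Statistics"]["DataScannedInBytes"])
```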
  35. Customers drive product decisions.
  36. Familiar technologies under the covers. For SQL queries, Presto: an in-memory distributed query engine, ANSI-SQL compatible with extensions. For DDL functionality, Hive: complex data types, a multitude of formats, support for data partitioning.
  37. Hive metadata definition • Hive Data Definition Language • Data Manipulation Language (INSERT, UPDATE) • Create Table As • User-defined functions • Hive-compatible SerDes (serializer/deserializer): CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail.
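As a hedged example of the Hive DDL slide 37 describes, here is a partitioned external table over JSON data, created through Athena. The table, columns, SerDe choice, and bucket are illustrative, not from the deck.

```python
import boto3

athena = boto3.client("athena")

# Partitioned external table over JSON weblogs; nothing is loaded or moved,
# the table is just metadata over the S3 location.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
  user_id string,
  url     string,
  status  int
)
PARTITIONED BY (event_date string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/raw/weblogs/'
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```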
  38. Presto SQL • ANSI SQL compliant • Complex joins, nested queries, and window functions • Complex data types (arrays, structs, maps) • Partitioning of data by any key (date, time, custom keys) • Presto built-in functions.
  39. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes. Fast at exabyte scale; elastic and highly available; on-demand, pay-per-query; high concurrency (multiple clusters access the same data); no ETL (query data in place using open file formats); full Amazon Redshift SQL support.
  40. Redshift architecture with Spectrum: a query such as SELECT COUNT(*) FROM s3.ext_table arrives over JDBC/ODBC, the Redshift cluster fans the scan out across Spectrum nodes 1..N, which read Amazon S3 (exabyte-scale object storage), with table definitions resolved from a data catalog (Apache Hive Metastore).
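A sketch of the Spectrum flow on slide 40 using a stock PostgreSQL driver against Redshift: register an external schema backed by the data catalog, then query the S3-resident table in place. The endpoint, credentials, schema names, and IAM role ARN are hypothetical placeholders.

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol, so a stock driver works.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# Register an external schema from the shared data catalog, then query
# the S3 data in place; nothing is loaded into Redshift.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS s3
    FROM DATA CATALOG DATABASE 'datalake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
""")
cur.execute("SELECT COUNT(*) FROM s3.ext_table")
print(cur.fetchone()[0])
```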
  41. Characteristics of a data lake (recap): collect anything, dive in anywhere, flexible access, future proof.
  42. Pop-up Loft. aws.amazon.com/activate: everything and anything startups need to get started on AWS.
