
Big Data Journey to the Cloud - Rohit Pujari, 5.30.18

We hope this session was valuable in teaching you more about Cloudera Enterprise on AWS, and how fast and easy it is to deploy a modern data management platform—in your cloud and on your terms.



  1. Migrating Big Data Workloads to AWS
     Rohit Pujari, Solutions Architect
     © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Architectural Principles
     1. Build decoupled systems
     2. Use the right tool for the job
     3. Use managed and serverless services
     4. Use log-centric design patterns
     5. Be cost-conscious
     6. AI/ML-enable your applications
  3. On-premises Hadoop clusters
     • A cluster of 1/2U machines; typically 12 cores, 32-128 GB RAM, and 12-24 TB of HDD
     • Networking switches and racks
     • Long-term Hadoop cluster with a fixed licensing term
     • HDFS uses local disk and has 50-200% storage overhead
     [Diagram: Server rack 1 (20 nodes) through Server rack N (20 nodes), plus core networking]
  4. On premises: Role of a big data administrator
     • Management of the cluster (failures, hardware replacement, restarting services, expanding the cluster)
     • Configuration management
     • Tuning of specific jobs or hardware
     • Managing development and test environments
     • Backing up data and disaster recovery
  5. On premises: System management challenges
     • Managing distributed applications and availability
     • Durable storage and disaster recovery
     • Adding new frameworks and doing upgrades
     • Multiple environments
     • Need a team to manage the cluster and procure hardware
  6. On premises: Workload types running on the same cluster
     • Large-scale ETL
     • Interactive queries
     • Machine learning and data science
     • NoSQL
     • Stream processing
     • Search
     • Data warehouses
  7. On premises: Swim lane of jobs
     [Chart: job swim lanes over time, showing over-utilized and under-utilized periods]
  8. On premises: Over-utilization and idle capacity
     • Tightly coupled compute and storage requires buying excess capacity
     • Can be over-utilized during peak hours and under-utilized at other times
     • Results in high costs and low efficiency
  9. Key migration considerations
     • Do not lift and shift
     • Deconstruct workloads and use the right tool for the job
     • Decouple storage and compute with Amazon Simple Storage Service (Amazon S3)
     • Design for cost and scalability
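To make the storage/compute decoupling concrete, here is a minimal PySpark sketch that reads and writes Parquet directly on Amazon S3 instead of local HDFS. The bucket, prefixes, and column names are hypothetical, and it assumes the cluster already has an S3 connector (s3a or EMRFS) and IAM credentials configured.

```python
from pyspark.sql import SparkSession

# Compute is transient; the durable copy of the data lives in S3, not in HDFS.
spark = SparkSession.builder.appName("decoupled-etl").getOrCreate()

# Hypothetical data lake locations -- replace with your own bucket and prefixes.
raw_path = "s3a://example-data-lake/raw/events/"
curated_path = "s3a://example-data-lake/curated/daily_counts/"

events = spark.read.parquet(raw_path)

daily_counts = events.groupBy("event_date", "event_type").count()

# Write results back to S3; the cluster can be resized or terminated afterwards
# without losing any data.
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(curated_path)

spark.stop()
```

Because both inputs and outputs live in S3, a differently sized cluster, or a different engine entirely, can pick up the curated data later without any data migration.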
  10. Foundational requirements
     • Secure: encryption in flight & at rest
     • Lower TCO: pay as per usage
     • Flexible: customize per workload
     • Full control: infrastructure as code
     • Scalable: compute & storage
     • Managed: reduced administration
  11. Benefits of a Data Lake - All Data is in One Place
     Analyze all of your data, from all of your sources, in one stored location.
     “Why is the data distributed in many locations? Where is the single source of truth?”
  12. Why Amazon S3 for a Data Lake?
     • Durable: designed for 11 9s of durability
     • Available: designed for 99.99% availability
     • High performance: multipart upload, Range GET, scalable throughput
     • Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments
     • Integrated partner tools: Cloudera EDH, Cloudera Altus, Cloudera Impala
     • Easy to use: simple REST API, AWS SDKs, simple management tools, event notification, lifecycle policies
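As a small illustration of the "easy to use" and "Range GET" points, here is a hedged boto3 sketch; the bucket and key names are hypothetical, and it assumes valid credentials and an existing bucket.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"            # hypothetical bucket name
key = "raw/events/2018-05-30.json"      # hypothetical object key

# Upload a small object. (For large files, upload_file() switches to
# multipart upload automatically.)
s3.put_object(Bucket=bucket, Key=key, Body=b'{"event": "page_view"}\n')

# Range GET: read only the first kilobyte of a potentially large object.
resp = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-1023")
print(resp["Body"].read())
```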
  13. Strong Security Controls
     • Security: Identity and Access Management (IAM) policies, bucket policies, Access Control Lists (ACLs), private VPC endpoints to Amazon S3, Amazon S3 object tagging to manage access policies
     • Encryption: SSL endpoints, server-side encryption (SSE-S3), server-side encryption with customer-provided or KMS-managed keys (SSE-C, SSE-KMS), client-side encryption
     • Compliance: bucket access logs, lifecycle management policies, Access Control Lists (ACLs), versioning and MFA deletes, certifications (HIPAA, PCI, SOC 1/2/3, etc.)
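As one concrete example of these controls, here is a minimal boto3 sketch that attaches a bucket policy denying non-TLS access and turns on versioning. The bucket name is hypothetical, and the policy is a starting point rather than a complete security baseline.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # hypothetical bucket name

# Bucket policy: deny any request that does not use TLS.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
            f"arn:aws:s3:::{bucket}",
            f"arn:aws:s3:::{bucket}/*",
        ],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

# Versioning (MFA delete can optionally be layered on top by the root account).
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)
```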
  14. Highest standards for privacy
     • Encrypt data in transit and at rest with keys managed by AWS Key Management Service (KMS), or manage your own encryption keys with AWS CloudHSM using FIPS 140-2 Level 3 validated HSMs
     • Meet data residency requirements: choose an AWS Region and AWS will not replicate your data elsewhere unless you choose to do so
     • Access services and tools that enable you to build GDPR-compliant infrastructure on top of AWS
     • Comply with local data privacy laws by controlling who can access content, its lifecycle, and its disposal
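For the at-rest encryption point, here is a minimal boto3 sketch that sets SSE-KMS as the bucket default and uploads one object under a specific key. The bucket name and KMS key alias are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"          # hypothetical bucket name
kms_key = "alias/example-data-lake"   # hypothetical customer-managed KMS key alias

# Default encryption: every new object is encrypted with SSE-KMS.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": kms_key,
            }
        }]
    },
)

# Explicit per-object SSE-KMS, equivalent to relying on the bucket default.
s3.put_object(
    Bucket=bucket,
    Key="curated/report.parquet",
    Body=b"...",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId=kms_key,
)
```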
  15. “CIOs and CISOs need to stop obsessing over unsubstantiated cloud security worries, and instead apply their imagination and energy to developing new approaches to cloud control, allowing them to securely, compliantly, and reliably leverage the benefits of this increasingly ubiquitous computing model.”
     Source: Clouds Are Secure: Are You Using Them Securely?
  16. What About HDFS & Data Tiering?
     • Use HDFS for the hottest datasets (e.g., iterative reads on the same datasets)
     • Use Amazon S3 Standard for frequently accessed data
     • Use Amazon S3 Standard-IA for less frequently accessed data
     • Use Amazon Glacier for archiving cold data
     • Use S3 Analytics to optimize the tiering strategy
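Tiering in S3 is usually expressed as a lifecycle configuration. Below is a hedged boto3 sketch that moves objects under a hypothetical prefix to Standard-IA after 30 days and to Glacier after 90 days; the bucket, prefix, and day counts are illustrative only.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake"  # hypothetical bucket name

# Tier colder data automatically: Standard -> Standard-IA -> Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-events",
            "Filter": {"Prefix": "raw/events/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 730},  # optional: delete after two years
        }]
    },
)
```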
  17. Benefits of a Data Lake - Quick Ingest
     Quickly ingest data without needing to force it into a predefined schema.
     “How can I collect data quickly from various sources and store it efficiently?”
  18. Data Ingestion into Amazon S3
     • AWS Direct Connect
     • AWS Snowball
     • ISV connectors
     • Kafka/Flume
     • Amazon Kinesis Firehose
     • Amazon S3 Transfer Acceleration
     • AWS Storage Gateway
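Of the options above, Kinesis Firehose is the simplest to show in a few lines. Here is a minimal boto3 sketch that sends one JSON record to a hypothetical delivery stream assumed to be already configured to deliver into the S3 data lake.

```python
import json
import boto3

firehose = boto3.client("firehose")
stream = "events-to-data-lake"  # hypothetical delivery stream targeting S3

event = {"user_id": 42, "event_type": "page_view", "ts": "2018-05-30T12:00:00Z"}

# Firehose buffers records and writes them to the S3 data lake in batches.
firehose.put_record(
    DeliveryStreamName=stream,
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```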
  19. Benefits of a Data Lake - Storage vs. Compute
     Separating your storage and compute allows you to scale each component as required.
     “How can I scale up with the volume of data being generated?”
  20. Benefits of a Data Lake - Schema on Read
     A data lake enables ad-hoc analysis by applying schemas on read, not write.
     “Is there a way I can apply multiple analytics and processing frameworks to the same data?”
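To illustrate schema on read, here is a small PySpark sketch that applies two different schemas to the same raw JSON files at read time; the S3 prefix, field names, and schemas are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

raw_path = "s3a://example-data-lake/raw/events/"  # hypothetical prefix of raw JSON

# Two views of the same raw files, each with its own schema applied at read
# time -- nothing was enforced when the data was written.
clickstream_schema = StructType([
    StructField("user_id", LongType()),
    StructField("event_type", StringType()),
    StructField("ts", StringType()),
])
audit_schema = StructType([
    StructField("user_id", LongType()),
    StructField("ip_address", StringType()),
])

clicks = spark.read.schema(clickstream_schema).json(raw_path)
audit = spark.read.schema(audit_schema).json(raw_path)

clicks.groupBy("event_type").count().show()
audit.select("user_id", "ip_address").distinct().show()
```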
  21. Benefits of an Amazon S3 Data Lake
     Fixed cluster data lake:
     • Limited to only the single tool contained on the cluster
     • Long turnaround cycles to add nodes to add storage capacity
     • Expensive to replicate data against node loss
     • Complexity in scaling local storage capacity
     • Long refresh cycles to add additional storage equipment
     Amazon S3 data lake:
     • Decouple storage and compute by using S3 object-based storage, rather than a fixed cluster, to hold the data lake
     • Flexibility to use any and all tools in the ecosystem; the right tool for the job
     • Catalog, transform, and query in place
     • Future-proof your architecture: as new use cases and tools emerge, you can plug and play the current best of breed
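The "catalog, transform, and query in place" bullet typically maps to the Glue Data Catalog plus Athena. Below is a hedged boto3 sketch that runs one Athena query against a hypothetical catalog table backed by S3; the database, table, and result-bucket names are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database/table registered in the Glue Data Catalog (for example
# by a crawler pointed at s3://example-data-lake/curated/events/).
query = """
    SELECT event_type, count(*) AS events
    FROM data_lake.curated_events
    WHERE event_date = DATE '2018-05-30'
    GROUP BY event_type
"""

# Query the data where it sits in S3; results land in another S3 prefix.
resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Query execution id:", resp["QueryExecutionId"])
```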
  22. Security & compliance is a shared responsibility
     Customers are responsible for security IN the cloud and have their choice of security configurations:
     • Customer applications & content
     • Platform, applications, identity & access management
     • Operating system, network, & firewall configuration
     • Client-side data encryption, server-side data encryption, network traffic protection
     AWS is responsible for the security OF the cloud:
     • AWS Foundation Services: compute, storage, database, networking
     • AWS Global Infrastructure: Regions, Availability Zones, edge locations
  23. Inherit global security and compliance controls
  24. Summary
     • Use Amazon S3 as the storage repository for your data lake
     • Decouple compute and storage to gain the flexibility to use all the analytics tools in the ecosystem
     • Use managed PaaS like Cloudera Altus, or serverless services, where possible to reduce operational overhead
     • Use granular encryption, roles, and access controls to build a secure, multi-tenant, centralized data platform
