Migrating Big Data Workloads to Amazon EMR - June 2017 AWS Online Tech Talks

Learning Objectives:
- Identify best practices to migrate from on-premises deployments to Amazon EMR
- Understand how to move data from HDFS to Amazon S3
- Use Amazon EC2 Spot instances to save costs

Businesses are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to Amazon EMR in order to save costs, increase availability, and improve performance. Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. This tech talk focuses on identifying the components and workflows in your current environment and provides best practices for migrating these workloads to Amazon EMR. We explain how to move from HDFS to Amazon S3 as a durable storage layer, and how to lower costs with Amazon EC2 Spot Instances and Auto Scaling. Additionally, we cover common security recommendations and tuning tips to accelerate the time to production.

  1. Migrating Big Data Workloads to Amazon EMR
     Anthony Nguyen, Senior Big Data Consultant (aanwin@amazon.com)
     June 1st, 2017

  2. Agenda
     • Deconstructing current big data environments
     • Identifying challenges with on-premises or unmanaged architectures
     • Migrating components to Amazon EMR and AWS analytics services
       - Choosing the right engine for the job
       - Building out an architecture
       - Architecting for cost and scalability
       - Security
     • Customer migration stories
     • Q+A

  3. Deconstructing current big data environments

  4. On-premises Hadoop clusters
     • A cluster of 1U machines, typically 12 cores, 32/64 GB RAM, and 6-8 TB of HDD ($3-4K each)
     • Networking switches and racks
     • Open-source distribution of Hadoop, or a fixed licensing term from commercial distributions
     • Different node roles
     • HDFS uses local disk and is sized for 3x data replication
     [Diagram: server racks 1 through N, 20 nodes each, Core]

  5. Workload types running on the same cluster
     • Large-scale ETL: Apache Spark, Apache Hive with Apache Tez or Apache Hadoop MapReduce
     • Interactive queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
     • Machine learning and data science: Spark ML, Apache Mahout
     • NoSQL: Apache HBase
     • Stream processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm
     • Search: Elasticsearch, Apache Solr
     • Job submission: client edge node, Apache Oozie
     • Data warehouses like Pivotal Greenplum or Teradata

  6. Security
     • Authentication: Kerberos with a local KDC or Active Directory, LDAP integration, local user management, Apache Knox
     • Authorization: open-source native authZ (e.g., HiveServer2 authZ or HDFS ACLs), Apache Ranger, Apache Sentry
     • Encryption: local disk encryption with LUKS, HDFS transparent data encryption, in-flight encryption for each framework (e.g., Hadoop MapReduce encrypted shuffle)
     • Configuration: different management tools depending on vendor

  7. Swim lane of jobs
     [Chart: periods of over-utilization and under-utilization]

  8. Role of a Hadoop administrator
     • Management of the cluster (failures, hardware replacement, restarting services, expanding the cluster)
     • Configuration management
     • Tuning of specific jobs or hardware
     • Managing development and test environments
     • Backing up data and disaster recovery

  9. Identifying challenges

  10. Over-utilization and idle capacity
     • Tightly coupled compute and storage requires buying excess capacity
     • Can be over-utilized during peak hours and under-utilized at other times
     • Results in high costs and low efficiency

  11. Management difficulties
     • Managing distributed applications and availability
     • Durable storage and disaster recovery
     • Adding new frameworks and doing upgrades
     • Multiple environments
     • Need a team to manage the cluster and procure hardware

  12. Migrating workloads to Amazon EMR

  13. Why Amazon EMR?
     • Low cost: pay an hourly rate
     • Open-source variety: latest versions of software
     • Managed: spend less time monitoring
     • Secure: easy-to-manage options
     • Flexible: customize the cluster
     • Easy to use: launch a cluster in minutes

  14. Key migration and TCO considerations
     • DO NOT LIFT AND SHIFT
     • Decouple storage and compute with S3
     • Deconstruct workloads and map them to open-source tools
     • Transient clusters and auto scaling
     • Choosing instance types and EC2 Spot Instances

  15. Translate use cases to the right tools
     • Storage: S3 (EMRFS), HDFS
     • Cluster resource management: YARN
     • Batch: MapReduce; interactive: Tez; in memory: Spark
     • Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto
     • Low-latency SQL -> Athena, Presto, or Amazon Redshift
     • Data warehouse / reporting -> Spark, Hive, Glue, or Amazon Redshift
     • Management and monitoring -> EMR console or Ganglia metrics
     • HDFS -> S3
     • Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
     • Query console -> Athena or Hue
     • Security -> Ranger (CloudFormation template), HiveServer2, or IAM roles

  16. Many storage layers to choose from
     Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3, Amazon EMR

  17. Decouple compute and storage by using S3 as your data layer
     • S3 is designed for 11 9's of durability and is massively scalable
     • Intermediates stored on local disk or HDFS
     [Diagram: EC2 instance memory and local disk, HDFS on Amazon EMR, Amazon S3 as the data layer]
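A common way to move existing HDFS data into S3 is the `s3-dist-cp` tool, run as an EMR step. The sketch below builds such a step definition; the bucket name, paths, and cluster ID are placeholders, not values from the talk.

```python
def s3distcp_step(src_hdfs_path, dest_s3_path):
    """Build an EMR step definition that runs s3-dist-cp to copy HDFS data to S3."""
    return {
        "Name": "Copy HDFS to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar executes the named command on the cluster
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", src_hdfs_path,
                "--dest", dest_s3_path,
            ],
        },
    }

# Hypothetical paths for illustration
step = s3distcp_step("hdfs:///user/hive/warehouse/events",
                     "s3://my-data-lake/events/")
# The step would then be submitted with boto3, e.g.:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```

Running the copy as a step keeps the migration inside the cluster's lifecycle, so a transient cluster can copy its results to S3 before terminating.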
  18. HBase on S3 for scalable NoSQL

  19. S3 tips: partitions, compression, and file formats
     • Avoid key names in lexicographical order
     • Improve throughput and S3 list performance: use hashing/random prefixes or reverse the date-time
     • Compress data sets to minimize bandwidth from S3 to EC2
     • Make sure you use splittable compression, or have each file be the optimal size for parallelization on your cluster
     • Columnar file formats like Parquet can give increased read performance
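The prefix advice above can be sketched in a few lines: either hash part of the key into a short random-looking prefix, or reverse the date so that keys written at the same time do not share a sorted prefix. The key layout below is illustrative, not from the talk.

```python
import hashlib

def hashed_key(date_str, hour, filename):
    """Prepend a short hash so keys don't cluster in lexicographical order."""
    prefix = hashlib.md5(f"{date_str}-{hour}".encode()).hexdigest()[:4]
    return f"{prefix}/{date_str}/{hour}/{filename}"

def reversed_date_key(date_str, filename):
    """Alternative: reverse the date string to spread out hot recent dates."""
    return f"{date_str[::-1]}/{filename}"

# Example keys (placeholder dates and filenames)
k1 = hashed_key("2017-06-01", "00", "part-00000.gz")
k2 = reversed_date_key("2017-06-01", "part-00000.gz")
```

Note that you trade human-readable, range-scannable prefixes for more even request distribution, so this matters most for very high request rates.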
  20. TCO: transient or long-running clusters

  21. Options to submit jobs
     • Amazon EMR Step API: submit a Spark application
     • AWS Data Pipeline, Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
     • AWS Lambda: submit applications to the EMR Step API or directly to Spark on your cluster
     • Apache Oozie: build DAGs of jobs on your cluster
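The Step API option above can be sketched as a small builder that wraps a `spark-submit` invocation in an EMR step; the script location and arguments are placeholders.

```python
def spark_step(app_s3_path, *app_args):
    """Build an EMR step that runs a Spark application via spark-submit."""
    return {
        "Name": "Spark application",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                app_s3_path, *app_args,
            ],
        },
    }

# Hypothetical application and input path
s = spark_step("s3://my-bucket/jobs/etl.py", "--input", "s3://my-bucket/raw/")
# Submission (requires credentials and a running cluster):
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[s])
```

The same step dict works from a scheduler or an AWS Lambda function, which is how the Lambda-based submission pattern on this slide is usually wired up.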
  22. Performance and hardware considerations
     • Transient or long running
     • Instance types
     • Cluster size
     • Application settings
     • File formats and S3 tuning
     [Example cluster: master node r3.2xlarge; slave group (core) c4.2xlarge; slave group (task) m4.2xlarge on EC2 Spot]

  23. On-cluster UIs to quickly tune workloads
     • Manage applications
     • SQL editor, workflow designer, metastore browser
     • Notebooks
     • Design and execute queries and workloads

  24. Use Spot and Reserved Instances to lower costs
     • Spot for task nodes: up to 80% off EC2 On-Demand pricing (exceed SLA at lower cost)
     • On-Demand for core nodes: standard Amazon EC2 On-Demand pricing (meet SLA at predictable cost)

  25. Instance fleets for advanced Spot provisioning
     • Provision from a list of instance types with Spot and On-Demand
     • Launch in the most optimal Availability Zone based on capacity/price
     • Spot Block support
     [Diagram: master node, core instance fleet, task instance fleet]
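An instance fleet like the task fleet above is expressed as a config block in the EMR `RunJobFlow` API. The sketch below builds one mixing Spot capacity across two instance types; the types, capacities, and bid percentage are example values.

```python
def task_instance_fleet(target_spot=8, target_on_demand=0):
    """Build a TASK instance fleet config for the EMR RunJobFlow API."""
    return {
        "Name": "Task fleet",
        "InstanceFleetType": "TASK",
        # EMR fills the targets from whichever listed types have capacity
        "TargetSpotCapacity": target_spot,
        "TargetOnDemandCapacity": target_on_demand,
        "InstanceTypeConfigs": [
            {"InstanceType": "m4.2xlarge", "WeightedCapacity": 1,
             "BidPriceAsPercentageOfOnDemandPrice": 100},
            {"InstanceType": "r3.2xlarge", "WeightedCapacity": 1,
             "BidPriceAsPercentageOfOnDemandPrice": 100},
        ],
    }

fleet = task_instance_fleet()
# Passed as one element of Instances.InstanceFleets in run_job_flow(...)
```

Listing several interchangeable types is what lets EMR pick the cheapest pool with available capacity, rather than failing when a single Spot pool is exhausted.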
  26. Lower costs with Auto Scaling
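EMR automatic scaling is driven by rules that pair a CloudWatch alarm with a scaling action. Below is an illustrative scale-out rule that adds nodes when available YARN memory drops below 15%; the thresholds and adjustment are example values, not recommendations.

```python
def scale_out_rule():
    """Build an EMR automatic-scaling rule (PutAutoScalingPolicy format)."""
    return {
        "Name": "ScaleOutOnLowMemory",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": 2,   # add two instances per trigger
                "CoolDown": 300,          # seconds before the rule can fire again
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Threshold": 15.0,
                "Statistic": "AVERAGE",
                "Unit": "PERCENT",
            }
        },
    }

rule = scale_out_rule()
# Attached via emr.put_auto_scaling_policy(...) together with min/max constraints
```

A matching scale-in rule (e.g., when available memory is high) is usually defined alongside it so the group shrinks back when the load passes.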
  27. Security – Encryption
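Encryption on EMR is configured through a security configuration, a JSON document attached at cluster launch. The sketch below enables S3 at-rest encryption (SSE-S3), local-disk encryption via KMS, and in-transit TLS; the KMS key ARN and certificate location are placeholders.

```python
import json

def security_configuration():
    """Build an EMR security configuration enabling at-rest and in-transit encryption."""
    return {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    # Placeholder ARN: substitute your own KMS key
                    "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
                },
            },
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    # Placeholder: zip of PEM certificates in S3
                    "S3Object": "s3://my-bucket/certs/my-certs.zip",
                }
            },
        }
    }

cfg_json = json.dumps(security_configuration())
# Registered once via emr.create_security_configuration(Name=..., SecurityConfiguration=cfg_json)
# and then referenced by name when launching clusters.
```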
  28. Security – Authentication and Authorization
     [Diagram: IAM user MyUser with tag user = MyUser, EMR role, EC2 role, SSH key]

  29. Security – Authentication and Authorization
     • LDAP for HiveServer2, Hue, Presto, Zeppelin
     • Kerberos for Spark, HBase, YARN, Hive, and authenticated UIs
     • EMRFS storage-based permissions
     • SQL standards-based and storage-based authorization
     • AWS Directory Service or a self-managed directory

  30. Security – Authentication and Authorization: Apache Ranger
     • Plug-ins for Hive, HBase, YARN, and HDFS
     • Row-level authorization for Hive (with data masking)
     • Full auditing capabilities with embedded search
     • Run Ranger on an edge node (see the AWS Big Data Blog)

  31. Security – Governance and Auditing
     • AWS CloudTrail for EMR APIs
     • S3 access logs for cluster S3 access
     • YARN and application logs
     • Ranger UI for application-level auditing

  32. Customer Examples

  33. FINRA: migrating from on-prem to AWS
     • Petabytes of data generated on-premises, brought to AWS, and stored in S3
     • Thousands of analytical queries performed on EMR and Amazon Redshift
     • Stringent security requirements met by leveraging VPC, VPN, encryption at rest and in transit, CloudTrail, and database auditing
     [Diagram: flexible interactive queries, predefined queries, and surveillance analytics over data management, data movement, data registration, and version management on Amazon S3; web applications serving analysts and regulators]

  34. Lower cost and higher scale than on-premises

  35. [Architecture diagram: Impressions, Clicks, and Activities flow through Amazon Kinesis into S3; ETL, Attribution, and Machine Learning feed models that are learned, calibrated, and evaluated for Real-Time Bidding]
     • 2 petabytes processed daily
     • 2 million bid decisions per second
     • Runs 24x7 on 5 continents
     • Thousands of ML models trained per day

  36. FINRA saved 60% by moving to HBase on EMR

  37. Anthony Nguyen (aanwin@amazon.com)
     aws.amazon.com/emr
     blogs.aws.amazon.com/bigdata
     Q+A. Thank you!
