SlideShare une entreprise Scribd logo
1  sur  74
Télécharger pour lire hors ligne
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Design Patterns and Best Practices
for Data Analytics with Amazon EMR
J o n a t h a n F r i t z , P r i n c i p a l P r o d u c t M a n a g e r – A m a z o n E M R
A n y a B i d a , S e n i o r M e m b e r o f T e c h n i c a l S t a f f - S a l e s f o r c e
A B D 3 0 5
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Overview
- Intro and architectures
- Using Amazon EC2 Spot and Auto Scaling
- Security overview
- Ad-hoc and advanced workflows
- Apache Spark and Amazon EMR at Salesforce
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Easy to use
Launch a cluster in minutes
What is Amazon EMR?
Low cost
Pay per-second
Open-source variety
Latest versions of software
Managed
Spend less time monitoring
Secure
Easy to enable options
Flexible
Full customization and control
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Open-source applications
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop
HBase/Phoenix
Presto
Streaming
Flink
Zeppelin - Notebooks
MXNet
Hue - SQL
Ganglia - Monitoring
Livy – Job Server
Connectorsto
AWSservices
Amazon EMR
service
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use the AWS Glue Data Catalog
• Support for Spark, Hive
and Presto
• Auto-generate schema
and partitions
• Managed table updates
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
HBase for random access at massive scale
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Real-time and batch processing
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
New – Deep learning with GPU instances
Use P3 instances
with GPUs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Lower your costs
T i p s t o
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Transient or long-running clusters
• Amazon Linux AMI with preinstalled customizations for faster cluster creation
• Auto Scaling to minimize costs for long-running clusters
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
EC2 Spot and instance fleets
• EMR will select optimal EC2 AZ
• Provision across instance types
• Switch to on-demand
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use Auto Scaling
Scaling options
Threshold
CloudWatch or custom metric
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Auto Scaling
• EMR scales-in at YARN task completion
• Selectively removes nodes with no running tasks
• yarn.resourcemanager.decommissioning.timeout
• Default timeout is one hour
• Spark scale-in contributions
• Spark specific blacklisting of tasks
• Unregistering cached data and shuffle blocks
• Advanced error handling
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Secure your cluster
T i p s t o
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Encryption
• Spark
• Tez
• MapReduce
• Presto
• HBase
• Hive
• Pig
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Authentication LDAP
HiveServer2
Presto Coordinator
Spark Thrift Server
Hue Server
Zeppelin Server
AWS credentials
EMR Step (EMR API)
EC2 key pair
SSH as “hadoop”
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
New – Authentication with Kerberos
Microsoft
Active Directory
KDC
Users
YARN RMDoAs
Service principals for
all cluster nodes
Master Node
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Authorization
• Storage-based
• EMRFS/S3
• HDFS
• HiveServer2 and Presto (SQL-based)
• HBase
• YARN queues
• Fine-grained access control by cluster tag (IAM)
• Apache Ranger on edge node (using CloudFormation)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
New – EMRFS fine-grained authorization
Context
User: aduser
Group: analyst
IAM role: analytics_prod
Can map IAM roles to user, group, or S3 prefix
Context
User: aduser2
Group: dev
IAM role: analytics_dev
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Security configuration demo
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Submit workflows
T i p s t o
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use Livy as an ad-hoc Spark job server
Custom
Application
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Oozie and Airflow for DAGs of jobs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Select Amazon EMR customers
Apache Spark and Amazon EMR at Salesforce
abida@salesforce.com @anyabida1
Anya Bida, Senior Member of Technical Staff at Salesforce
Fastest Growing Top 5
Enterprise Software Company
$5.4BFY15
$4.1BFY14
$3.1BFY13
$6.7BFY16
$2.3BFY12
$1.7BFY11
$2.56BFY18Q2 revenue
$8.4BFY17 revenue
2009 • 2010 • 2011
2012 • 2013 • 2014
2015 • 2016 • 2017
September
2016
2011 • 2012 • 2013
2014 • 2015 • 2016 • 2017
The world’s most
innovative companies
“Innovator of
the Decade”
Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use AWS Identity and Access Management (IAM) roles
Isolate Environments
Complete ML Pipeline
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time
Selecting a tool
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time
Supports each step of our ML pipeline
Scales for small & large jobs
Good ML Libraries
Active user base
Ability to deploy production ready code
Selecting a tool
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time
Supports each step of our ML pipeline
Scales for small & large jobs
Good ML Libraries
Active user base
Ability to deploy production ready code
We wanted Spark…now how to deploy it?
EC2
• Support for batch / streaming
• Integrates with our tooling
• Spin up / down clusters
• Larger / smaller clusters
• Support for different versions of Hadoop, Spark
• Storage & compute options
We wanted Spark…now how to deploy it?
EC2
• Support for batch / streaming
• Integrates with our tooling
• Spin up / down clusters
• Larger / smaller clusters
• Support for different versions of Hadoop, Spark
• Storage & Compute options
Need: Management
We wanted Spark…now how to deploy it?
Need: Management
EC2
• Support for batch / streaming
• Integrates with our tooling
• Spin up / down clusters
• Larger / smaller clusters
• Support for different versions of Hadoop, Spark
• Storage & Compute options
Provision EMR
Simplest approach
Input bucket
Monitoring
aws emr create-cluster
--applications Name=Hadoop
Name=Spark Name=Ganglia
--tags
--ec2-attributes
--release-label
--log-uri
…
--name
--instance-groups
--region
Provision EMR
Simplest approach
Input bucket
Monitoring
aws emr create-cluster
--applications Name=Hadoop
Name=Spark Name=Ganglia
# tag for cost analysis by project
--tags
--ec2-attributes
--release-label
# send logs to S3
--log-uri
…
# use naming conventions for service
discovery
# <region>-<project>-<version>-<env>
--name
# CORE nodes used for writes to HDFS
# TASK nodes used for compute - try spot
instances here for starters
--instance-groups
--region
Simplest approach
Input bucket
EMR cluster
Output bucket
Logs
bucket
Spark primer
Apache Spark
Spark Primer
Apache Spark
Driver Program
SparkContext
Node
Executor Cache
TaskTask
Cluster Manager
Node
Executor Cache
TaskTask
Spark on Amazon EMR
Apache Spark
Master Node Core Node
Cluster Manager
ResourceManager
NameNode
NodeManager
Datanode
Task Node
NodeManagerNodeManager
Spark on Amazon EMR
Apache Spark
Master Node Core Node
Executor Cache
TaskTask
Cluster Manager
ResourceManager
NameNode
Task Node
Executor Cache
TaskTask
Driver Program
SparkContext
Executor Cache
TaskTask
NodeManager
Datanode
NodeManagerNodeManager
Properties related to Dynamic Allocation
Property Value
Spark.dynamicAllocation.enabled true
Spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 5
spark.dynamicAllocation.maxExecutors 17
spark.dynamicAllocation.initalExecutors 0
sparkdynamicAllocation.executorIdleTime 60s
spark.dynamicAllocation.schedulerBacklogTimeout 5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s
Optional
Where do I find Metrics?
Ganglia
windowing, dashboarding
Cloudwatch
YARNMemoryAvailablePercentage
Logs?
Anatomy of a Spark Job
High Performance Spark, Karau & Warren, O’Reilly
Spark Context/Spark Session Object
Actions (e.g., collect, saveAsTextFile)
Wide transformations; (sort, groupByKey)
Computation to evaluate one partition
(combine narrow transforms)
Spark
Application
Job
Stage Stage
Task Task
Simplest approach
Input bucket
EMR cluster
Output bucket
Reads from S3
• Jar files too!
Write Intermediate files
• MEM or Disk?
• Local? HDFS? Amazon S3?
Writes to S3
• Data available after cluster is
terminated
Cache Persist Checkpoint Local Checkpoint
local mem cache MEM MEM MEM
local disk DISK DISK
HDFS / S3 Specify dir
If exec is decommed, are
writes available?
No No Yes No
If job finishes are writes
available?
No No Yes No
Preserve lineage graph? Yes Yes No No
RDD Re-use
Cache Persist Checkpoint Local Checkpoint
local mem cache MEM MEM MEM
local disk DISK DISK
HDFS / S3 Specify dir
If exec is decommed, are
writes available?
No No Yes No
If job finishes are writes
available?
No No Yes No
Preserve lineage graph? Yes Yes No No
RDD Re-use
Persist to improve speed, Checkpoint to improve fault tolerance
Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use IAM Roles
Isolate Environments
Monitor multiple viewpoints
https://light.co/camera
Understand resource allocation
Understanding Memory Management in Spark For Fun And Profit Shivnath Babu (Duke University, Unravel Data Systems)
Mayuresh Kunjir (Duke University)
Node Memory
Container Memory
8Gb
Node Memory
Container
Memory
8Gb
Can my 8Gb container launch on this cluster?
Node
Memory
Node
Memory
Node
Memory
4Gb
used
8Gb
total
8Gb
Can my 8Gb container launch on this cluster?
Scale-out Rule: Num Containers Pending
Node
Memory
Node
Memory
Node
Memory
4Gb
used
8Gb
total
8Gb
Monitor multiple viewpoints
Connectivity viewer
https://www.linkedin.com/in/vaibhavt/
Vaibhav Tandon
Monitor multiple viewpoints
Connectivity viewer
https://www.linkedin.com/in/vaibhavt/
Vaibhav Tandon
Monitor multiple viewpoints
Connectivity viewer
https://www.linkedin.com/in/vaibhavt/
Vaibhav Tandon
Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use IAM Roles
Isolate Environments
Use IAM roles
Cluster Manager
Scheduler
IAM
IAM Roles
• User has an IAM Role
Use IAM roles
Cluster Manager
Scheduler
IAM
IAM Roles
• User has an IAM Role
• Job has an IAM Role
IAM
Use IAM roles
Cluster Manager
Scheduler
IAM
IAM Roles
• User has an IAM Role
• Job has an IAM Role
• IAM Roles determine read /
write access to data
IAM
Out
Logs
IAM
In
Use IAM roles
Every user, service, & job should have specific, auditable permissions.
New: EMRFS fine-grained access control!!
Cluster Manager
Scheduler
IAM
IAM Roles
• User has an IAM Role
• Job has an IAM Role
• IAM Roles determine read /
write access to data
IAM
Out
Logs
IAM
aws emr create-cluster
…
--service-role
In
Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use IAM Roles
Isolate Environments
Isolate environments
Need: Build and release? Multitenancy?
Cluster Manager
Scheduler
Out
Logs
In
Cluster Manager
Scheduler
virtual private cloud Out
Logs
In
Isolate environments
Need: Build and release? Multitenancy?
Cluster Manager
Scheduler
VPC subnet
virtual private cloud
VPC subnet
Out
Logs
In
Isolate environments
Need: Build and release? Multitenancy?
Cluster Manager
Scheduler
VPC subnet
virtual private cloud
security group
security group
security group
VPC subnet
Out
Logs
In
Isolate environments
Need: Build and release? Multitenancy?
Cluster Manager
Scheduler
VPC subnet
virtual private cloud
security group
security group
security group
VPC subnet
Out
Logs
In
aws emr create-cluster
…
--ec2-attributes '{
"KeyName":"",
"InstanceProfile":"",
"ServiceAccessSecurityGroup":"",
"SubnetId":"”,
"EmrManagedSlaveSecurityGroup":"",
"EmrManagedMasterSecurityGroup":""}'}'
Isolate environments
Need: Build and release? Multitenancy?
Cluster Manager
Scheduler
In
Out
Logs
VPC subnet
virtual private cloud
security group
IAM
security group
security group
VPC subnet
IAM
Environment
Isolate environments
Need: Build and release? Multitenancy?
VPC subnet
virtual private cloud
security group
security group
security group
VPC subnet
IAM
Environment
Dev Staging
Canary Prod
Isolate environments
Need: Build and release? Multitenancy?
VPC subnet
virtual private cloud
security group
security group
security group
VPC subnet
IAM
Environment
Dev Staging
Canary Prod
Automation
• Use Cloudformation or
Terraform
• Upgrades use the same
provisioning script + DNS
Upsert
Isolate environments
Need: Build and release? Multitenancy?
VPC subnet
virtual private cloud
security group
security group
security group
VPC subnet
IAM
Environment
Availability
Zone
region
Dev Staging
Canary Prod
Isolate environments
Need: Build and release? Multitenancy?
VPC subnet
virtual private cloud
security group
security group
security group
VPC subnet
IAM
Environment
Availability
Zone
region
Dev Staging
Canary Prod
Availability
Zone
region
Dev Staging
Canary Prod
Availability
Zone
Availability
Zone
Isolate environments
Need: Build and release? Multitenancy?
Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use IAM Roles
Isolate Environments
Did we just automate ourselves
out of our jobs?
Nope. Now we have time to take on new projects and grow…
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!
J o n a t h a n F r i t z – j o n f r i t z @ a m a z o n . c o m
A n y a B i d a – a b i d a @ s a l e s f o r c e . c o m
a w s . a m a z o n . c o m / e m r
a w s . a m a z o n . c o m / b l o g s / b i g - d a t a
a w s . a m a z o n . c o m / b l o g s / a i
Extra slides
`spark.history.fs.cleaner.enabled=true`

Contenu connexe

Tendances

Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryDatabricks
 
Jasper reports in 3 easy steps
Jasper reports in 3 easy stepsJasper reports in 3 easy steps
Jasper reports in 3 easy stepsIvaylo Zashev
 
Making The Move To Java 17 (JConf 2022)
Making The Move To Java 17 (JConf 2022)Making The Move To Java 17 (JConf 2022)
Making The Move To Java 17 (JConf 2022)Alex Motley
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingMichelle Holley
 
DPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingDPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingMichelle Holley
 
UserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SP
UserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SPUserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SP
UserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SPAndré Déo
 
Criando e consumindo webservice REST com PHP e JSON
Criando e consumindo webservice REST com PHP e JSONCriando e consumindo webservice REST com PHP e JSON
Criando e consumindo webservice REST com PHP e JSONMarcio Junior Vieira
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Sematext Group, Inc.
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Jmeter Tester Certification
Jmeter Tester CertificationJmeter Tester Certification
Jmeter Tester CertificationVskills
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxScyllaDB
 
Yum package manager
Yum package managerYum package manager
Yum package managerLinuxConcept
 
DoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKDoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKMarian Marinov
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsHisaki Ohara
 
Load-testing 101 for Startups with Artillery.io
Load-testing 101 for Startups with Artillery.ioLoad-testing 101 for Startups with Artillery.io
Load-testing 101 for Startups with Artillery.ioHassy Veldstra
 

Tendances (20)

Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Jasper reports in 3 easy steps
Jasper reports in 3 easy stepsJasper reports in 3 easy steps
Jasper reports in 3 easy steps
 
Automated deployment
Automated deploymentAutomated deployment
Automated deployment
 
Making The Move To Java 17 (JConf 2022)
Making The Move To Java 17 (JConf 2022)Making The Move To Java 17 (JConf 2022)
Making The Move To Java 17 (JConf 2022)
 
DPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet ProcessingDPDK: Multi Architecture High Performance Packet Processing
DPDK: Multi Architecture High Performance Packet Processing
 
Shell scripting
Shell scriptingShell scripting
Shell scripting
 
DPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet ProcessingDPDK & Layer 4 Packet Processing
DPDK & Layer 4 Packet Processing
 
UserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SP
UserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SPUserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SP
UserParameter vs Zabbix Sender - 1º ZABBIX MEETUP DO INTERIOR-SP
 
Criando e consumindo webservice REST com PHP e JSON
Criando e consumindo webservice REST com PHP e JSONCriando e consumindo webservice REST com PHP e JSON
Criando e consumindo webservice REST com PHP e JSON
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Jmeter Tester Certification
Jmeter Tester CertificationJmeter Tester Certification
Jmeter Tester Certification
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
Mmap failure analysis
Mmap failure analysisMmap failure analysis
Mmap failure analysis
 
Yum package manager
Yum package managerYum package manager
Yum package manager
 
DoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDKDoS and DDoS mitigations with eBPF, XDP and DPDK
DoS and DDoS mitigations with eBPF, XDP and DPDK
 
Intel DPDK Step by Step instructions
Intel DPDK Step by Step instructionsIntel DPDK Step by Step instructions
Intel DPDK Step by Step instructions
 
Load-testing 101 for Startups with Artillery.io
Load-testing 101 for Startups with Artillery.ioLoad-testing 101 for Startups with Artillery.io
Load-testing 101 for Startups with Artillery.io
 
DPDK In Depth
DPDK In DepthDPDK In Depth
DPDK In Depth
 

Similaire à Design patterns and best practices for data analytics with amazon emr (ABD305)

Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueAmazon Web Services
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueAmazon Web Services
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSAmazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Amazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...
The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...
The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...Amazon Web Services
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...Amazon Web Services
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAmazon Web Services
 
Accelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMRAccelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMRAmazon Web Services
 
DAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDS
DAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDSDAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDS
DAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDSAmazon Web Services
 

Similaire à Design patterns and best practices for data analytics with amazon emr (ABD305) (20)

Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS Glue
 
ABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS GlueABD215_Serverless Data Prep with AWS Glue
ABD215_Serverless Data Prep with AWS Glue
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWS
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...
The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...
The Future of Research Computing on AWS - AWS Public Sector Summit Singapore ...
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
STG329_ProtectWise optimizes performance of Cassandra and Kafka workloads wit...
 
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
A Practitioner’s Guide on Migrating to, and Running on Amazon Aurora - DAT315...
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
 
Accelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMRAccelerate Data Analytics at Scale with Amazon EMR
Accelerate Data Analytics at Scale with Amazon EMR
 
DAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDS
DAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDSDAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDS
DAT309_Best Practices for Migrating from Oracle and SQL Server to Amazon RDS
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Dernier

怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制vexqp
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATIONLakpaYanziSherpa
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schscnajjemba
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 

Dernier (20)

怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 

Design patterns and best practices for data analytics with amazon emr (ABD305)

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Design Patterns and Best Practices for Data Analytics with Amazon EMR J o n a t h a n F r i t z , P r i n c i p a l P r o d u c t M a n a g e r – A m a z o n E M R A n y a B i d a , S e n i o r M e m b e r o f T e c h n i c a l S t a f f - S a l e s f o r c e A B D 3 0 5
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Overview - Intro and architectures - Using Amazon EC2 Spot and Auto Scaling - Security overview - Ad-hoc and advanced workflows - Apache Spark and Amazon EMR at Salesforce
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Easy to use Launch a cluster in minutes What is Amazon EMR? Low cost Pay per-second Open-source variety Latest versions of software Managed Spend less time monitoring Secure Easy to enable options Flexible Full customization and control
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Open-source applications Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop HBase/Phoenix Presto Streaming Flink Zeppelin - Notebooks MXNet Hue - SQL Ganglia - Monitoring Livy – Job Server Connectorsto AWSservices Amazon EMR service
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use the AWS Glue Data Catalog • Support for Spark, Hive and Presto • Auto-generate schema and partitions • Managed table updates
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. HBase for random access at massive scale
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time and batch processing
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New – Deep learning with GPU instances Use P3 instances with GPUs
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lower your costs T i p s t o
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Transient or long-running clusters • Amazon Linux AMI with preinstalled customizations for faster cluster creation • Auto Scaling to minimize costs for long-running clusters
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. EC2 Spot and instance fleets • EMR will select optimal EC2 AZ • Provision across instance types • Switch to on-demand
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Auto Scaling Scaling options Threshold CloudWatch or custom metric
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Auto Scaling • EMR scales-in at YARN task completion • Selectively removes nodes with no running tasks • yarn.resourcemanager.decommissioning.timeout • Default timeout is one hour • Spark scale-in contributions • Spark specific blacklisting of tasks • Unregistering cached data and shuffle blocks • Advanced error handling
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Secure your cluster T i p s t o
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Encryption • Spark • Tez • MapReduce • Presto • HBase • Hive • Pig
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Authentication LDAP HiveServer2 Presto Coordinator Spark Thrift Server Hue Server Zeppelin Server AWS credentials EMR Step (EMR API) EC2 key pair SSH as “hadoop”
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New – Authentication with Kerberos Microsoft Active Directory KDC Users YARN RMDoAs Service principals for all cluster nodes Master Node
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Authorization • Storage-based • EMRFS/S3 • HDFS • HiveServer2 and Presto (SQL-based) • HBase • YARN queues • Fine-grained access control by cluster tag (IAM) • Apache Ranger on edge node (using CloudFormation)
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New – EMRFS fine-grained authorization Context User: aduser Group: analyst IAM role: analytics_prod Can map IAM roles to user, group, or S3 prefix Context User: aduser2 Group: dev IAM role: analytics_dev
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security configuration demo
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Submit workflows T i p s t o
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Livy as an ad-hoc Spark job server Custom Application
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Oozie and Airflow for DAGs of jobs
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Select Amazon EMR customers
  • 25. Apache Spark and Amazon EMR at Salesforce abida@salesforce.com @anyabida1 Anya Bida, Senior Member of Technical Staff at Salesforce
  • 26. Fastest Growing Top 5 Enterprise Software Company $5.4BFY15 $4.1BFY14 $3.1BFY13 $6.7BFY16 $2.3BFY12 $1.7BFY11 $2.56BFY18Q2 revenue $8.4BFY17 revenue 2009 • 2010 • 2011 2012 • 2013 • 2014 2015 • 2016 • 2017 September 2016 2011 • 2012 • 2013 2014 • 2015 • 2016 • 2017 The world’s most innovative companies “Innovator of the Decade”
  • 27. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use AWS Identity and Access Management (IAM) roles Isolate Environments
  • 28. Complete ML Pipeline ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time
  • 29. Selecting a tool ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time Supports each step of our ML pipeline Scales for small & large jobs Good ML Libraries Active user base Ability to deploy production ready code
  • 30. Selecting a tool ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time Supports each step of our ML pipeline Scales for small & large jobs Good ML Libraries Active user base Ability to deploy production ready code
  • 31. We wanted Spark…now how to deploy it? EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & compute options
  • 32. We wanted Spark…now how to deploy it? EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & Compute options Need: Management
  • 33. We wanted Spark…now how to deploy it? Need: Management EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & Compute options
  • 34. Provision EMR Simplest approach Input bucket Monitoring aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia --tags --ec2-attributes --release-label --log-uri … --name --instance-groups --region
  • 35. Provision EMR Simplest approach Input bucket Monitoring aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia # tag for cost analysis by project --tags --ec2-attributes --release-label # send logs to S3 --log-uri … # use naming conventions for service discovery # <region>-<project>-<version>-<env> --name # CORE nodes used for writes to HDFS # TASK nodes used for compute - try spot instances here for starters --instance-groups --region
  • 36. Simplest approach Input bucket EMR cluster Output bucket Logs bucket
  • 38. Spark Primer Apache Spark Driver Program SparkContext Node Executor Cache TaskTask Cluster Manager Node Executor Cache TaskTask
  • 39. Spark on Amazon EMR Apache Spark Master Node Core Node Cluster Manager ResourceManager NameNode NodeManager Datanode Task Node NodeManagerNodeManager
  • 40. Spark on Amazon EMR Apache Spark Master Node Core Node Executor Cache TaskTask Cluster Manager ResourceManager NameNode Task Node Executor Cache TaskTask Driver Program SparkContext Executor Cache TaskTask NodeManager Datanode NodeManagerNodeManager
  • 41. Properties related to Dynamic Allocation Property Value Spark.dynamicAllocation.enabled true Spark.shuffle.service.enabled true spark.dynamicAllocation.minExecutors 5 spark.dynamicAllocation.maxExecutors 17 spark.dynamicAllocation.initalExecutors 0 sparkdynamicAllocation.executorIdleTime 60s spark.dynamicAllocation.schedulerBacklogTimeout 5s spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s Optional
  • 42. Where do I find Metrics? Ganglia windowing, dashboarding Cloudwatch YARNMemoryAvailablePercentage Logs?
  • 43. Anatomy of a Spark Job High Performance Spark, Karau & Warren, O’Reilly Spark Context/Spark Session Object Actions (e.g., collect, saveAsTextFile) Wide transformations; (sort, groupByKey) Computation to evaluate one partition (combine narrow transforms) Spark Application Job Stage Stage Task Task
  • 44. Simplest approach Input bucket EMR cluster Output bucket Reads from S3 • Jar files too! Write Intermediate files • MEM or Disk? • Local? HDFS? Amazon S3? Writes to S3 • Data available after cluster is terminated
  • 45. Cache Persist Checkpoint Local Checkpoint local mem cache MEM MEM MEM local disk DISK DISK HDFS / S3 Specify dir If exec is decommed, are writes available? No No Yes No If job finishes are writes available? No No Yes No Preserve lineage graph? Yes Yes No No RDD Re-use
  • 46. Cache Persist Checkpoint Local Checkpoint local mem cache MEM MEM MEM local disk DISK DISK HDFS / S3 Specify dir If exec is decommed, are writes available? No No Yes No If job finishes are writes available? No No Yes No Preserve lineage graph? Yes Yes No No RDD Re-use Persist to improve speed, Checkpoint to improve fault tolerance
  • 47. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 49. Understand resource allocation Understanding Memory Management in Spark For Fun And Profit Shivnath Babu (Duke University, Unravel Data Systems) Mayuresh Kunjir (Duke University) Node Memory Container Memory 8Gb Node Memory Container Memory 8Gb
  • 50. Can my 8Gb container launch on this cluster? Node Memory Node Memory Node Memory 4Gb used 8Gb total 8Gb
  • 51. Can my 8Gb container launch on this cluster? Scale-out Rule: Num Containers Pending Node Memory Node Memory Node Memory 4Gb used 8Gb total 8Gb
  • 52. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  • 53. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  • 54. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  • 55. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 56. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role
  • 57. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role IAM
  • 58. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role • IAM Roles determine read / write access to data IAM Out Logs IAM In
  • 59. Use IAM roles Every user, service, & job should have specific, auditable permissions. New: EMRFS fine-grained access control!! Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role • IAM Roles determine read / write access to data IAM Out Logs IAM aws emr create-cluster … --service-role In
  • 60. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 61. Isolate environments Need: Build and release? Multitenancy? Cluster Manager Scheduler Out Logs In
  • 62. Cluster Manager Scheduler virtual private cloud Out Logs In Isolate environments Need: Build and release? Multitenancy?
  • 63. Cluster Manager Scheduler VPC subnet virtual private cloud VPC subnet Out Logs In Isolate environments Need: Build and release? Multitenancy?
  • 64. Cluster Manager Scheduler VPC subnet virtual private cloud security group security group security group VPC subnet Out Logs In Isolate environments Need: Build and release? Multitenancy?
  • 65. Cluster Manager Scheduler VPC subnet virtual private cloud security group security group security group VPC subnet Out Logs In aws emr create-cluster … --ec2-attributes '{ "KeyName":"", "InstanceProfile":"", "ServiceAccessSecurityGroup":"", "SubnetId":"”, "EmrManagedSlaveSecurityGroup":"", "EmrManagedMasterSecurityGroup":""}'}' Isolate environments Need: Build and release? Multitenancy?
  • 66. Cluster Manager Scheduler In Out Logs VPC subnet virtual private cloud security group IAM security group security group VPC subnet IAM Environment Isolate environments Need: Build and release? Multitenancy?
  • 67. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Dev Staging Canary Prod Isolate environments Need: Build and release? Multitenancy?
  • 68. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Dev Staging Canary Prod Automation • Use Cloudformation or Terraform • Upgrades use the same provisioning script + DNS Upsert Isolate environments Need: Build and release? Multitenancy?
  • 69. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Availability Zone region Dev Staging Canary Prod Isolate environments Need: Build and release? Multitenancy?
  • 70. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Availability Zone region Dev Staging Canary Prod Availability Zone region Dev Staging Canary Prod Availability Zone Availability Zone Isolate environments Need: Build and release? Multitenancy?
  • 71. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 72. Did we just automate ourselves out of our jobs? Nope. Now we have time to take on new projects and grow…
  • 73. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU! J o n a t h a n F r i t z – j o n f r i t z @ a m a z o n . c o m A n y a B i d a – a b i d a @ s a l e s f o r c e . c o m a w s . a m a z o n . c o m / e m r a w s . a m a z o n . c o m / b l o g s / b i g - d a t a a w s . a m a z o n . c o m / b l o g s / a i