© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data freedom: migrating big data workloads to AWS
Giorgio Nobile - Solutions Architect
Agenda
• Deconstructing current big data environments
• Identifying challenges with on-premises or unmanaged architectures
• Migrating components to Amazon EMR and AWS analytics services
- Choosing the right engine for the job
- Building out an architecture
- Architecting for cost and scalability
- Security
• Customer migration stories
Deconstructing on-premises big data environments
On-premises Hadoop clusters
[Diagram: Server rack 1, Server rack 2, ..., Server rack N (20 nodes each) and a Core element]
• A cluster of 1U machines
• Typically 12 Cores, 32/64 GB RAM,
and 6 - 8 TB of HDD ($3-4K)
• Networking switches and racks
• Open-source distribution of Hadoop
or a fixed licensing term by
commercial distributions
• Different node roles
• HDFS uses local disk and is sized for
3x data replication
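The 3x replication figure above translates directly into usable capacity. A rough sketch of that arithmetic follows; the function name and the 25% overhead reserve are illustrative assumptions, not values from the deck.

```python
def usable_hdfs_tb(nodes: int, raw_tb_per_node: float,
                   replication: int = 3,
                   overhead_fraction: float = 0.25) -> float:
    """Estimate usable HDFS capacity in TB for a cluster.

    overhead_fraction is a hypothetical share of raw disk kept free for
    shuffle/temp data, logs, and the OS; tune it for a real cluster.
    """
    if nodes <= 0 or raw_tb_per_node <= 0 or replication <= 0:
        raise ValueError("nodes, raw_tb_per_node, and replication must be positive")
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    raw_tb = nodes * raw_tb_per_node
    # Every block is stored `replication` times, so divide the net raw space.
    return raw_tb * (1 - overhead_fraction) / replication


if __name__ == "__main__":
    # One rack of 20 nodes with 8 TB of HDD each.
    print(round(usable_hdfs_tb(nodes=20, raw_tb_per_node=8), 1), "TB usable")
```

Under these assumptions, a 20-node rack with 160 TB of raw disk holds roughly a quarter of that as unique data.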
Workload types running on the same cluster
• Large Scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache
Hadoop MapReduce
• Interactive Queries: Apache Impala, Spark SQL, Presto, Apache Phoenix
• Machine Learning and Data Science: Spark ML, Apache Mahout
• NoSQL: Apache HBase
• Stream Processing: Apache Kafka, Spark Streaming, Apache Flink, Apache
NiFi, Apache Storm
• Search: Elasticsearch, Apache Solr
• Job Submission: Client Edge Node, Apache Oozie
• Data warehouses like Pivotal Greenplum or Teradata
Security
• Authentication: Kerberos with local KDC or Active
Directory, LDAP integration, local user management,
Apache Knox
• Authorization: Open-source native authZ (e.g.,
HiveServer2 authZ or HDFS ACLs), Apache Ranger,
Apache Sentry
• Encryption: local disk encryption with LUKS, HDFS
transparent data encryption, in-flight encryption for
each framework (e.g., Hadoop MapReduce encrypted
shuffle)
• Configuration: Different tools for management based
on vendor
Role of a Hadoop administrator
• Management of the cluster (failures, hardware
replacement, restarting services, expanding cluster)
• Configuration management
• Tuning of specific jobs or hardware
• Managing development and test environments
• Backing up data and disaster recovery
Swim lane of jobs
[Diagram: job swim lanes with over-utilized and under-utilized periods]
Identifying on premises challenges
On premises: Over-utilization and idle capacity
• Tightly coupled compute and storage requires buying excess capacity
• Can be over-utilized during peak hours and under-utilized at other
times
• Results in high costs and low efficiency
On premises: System management difficulties
• Managing distributed applications and availability
• Durable storage and disaster recovery
• Adding new frameworks and doing upgrades
• Multiple environments
• Need team to manage cluster and procure hardware
Migrating workloads to Amazon EMR
Why Amazon EMR?
• Low Cost: Pay an hourly rate
• Open-Source Variety: Latest versions of software
• Managed: Spend less time monitoring
• Secure: Easy-to-manage options
• Flexible: Customize the cluster
• Easy to Use: Launch a cluster in minutes
Key migration and TCO considerations
• Decouple storage and compute with Amazon S3
• Transient clusters and auto scaling
• Deconstruct workloads and map to open-source tools
• Choosing instance types and EC2 Spot Instances
Translate use cases to the right tools
- Low-latency SQL -> Athena or Presto or Amazon Redshift
- Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift
- Management and monitoring -> EMR console or Ganglia metrics
- HDFS -> Amazon S3
- Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action)
- Query console -> Athena or Hue
- Security -> Ranger (CF template) or HiveServer2 or IAM roles
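[Diagram: Amazon EMR stack shown alongside other AWS analytics services]
• Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop; HBase/Phoenix; Presto
• Engines: MapReduce (batch), Tez (interactive), Spark (in memory), Flink (streaming)
• Cluster resource management: YARN
• Storage: S3 (EMRFS), HDFS
• Other services shown: Athena, Glue, Amazon Redshift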
Many storage layers to choose from
• Amazon DynamoDB
• Amazon RDS
• Amazon Kinesis
• Amazon Redshift
• Amazon S3
• Amazon EMR
• Amazon Elasticsearch
Decouple compute and storage by using Amazon S3 as your data layer
• S3 is designed for 11 9’s of durability and is massively scalable
• Intermediates stored on local disk or HDFS
[Diagram: an EC2 instance (memory, local disk, HDFS) and Amazon EMR clusters sharing Amazon S3]
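When S3 replaces HDFS as the data layer, job input and output paths have to be rewritten. The helper below is a minimal sketch of that mapping; the function name, bucket, and prefix are hypothetical, and it only rewrites the URI string, it does not copy any data.

```python
from urllib.parse import urlparse


def hdfs_to_s3_uri(hdfs_uri: str, bucket: str, prefix: str = "") -> str:
    """Map an hdfs:// path to the s3:// URI it would have after migration.

    bucket and prefix are placeholders for wherever the data was copied.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_uri}")
    # Drop the NameNode host/port and keep only the file path.
    path = parsed.path.lstrip("/")
    parts = [p for p in (prefix.strip("/"), path) if p]
    return f"s3://{bucket}/" + "/".join(parts)


if __name__ == "__main__":
    print(hdfs_to_s3_uri("hdfs://namenode:8020/data/events/2018/part-0000.parquet",
                         bucket="my-data-lake", prefix="raw"))
```

On Amazon EMR, paths with the `s3://` scheme are served through EMRFS, so a job can often be pointed at the new location by changing only its path arguments.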
HBase on Amazon S3 for scalable NoSQL
Amazon S3 tips: partitions, compression, and file
formats
• Avoid key names in lexicographical order
• Improve throughput and S3 list performance
• Use hashing/random prefixes or reverse the date-time
• Compress data set to minimize bandwidth from S3 to EC2
• Make sure you use splittable compression or have each file be the
optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance
on reads
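The two key-naming tips above (hashed prefixes and reversed date-times) can be sketched as follows. The function names and the 4-character prefix length are illustrative assumptions. Amazon S3 has since raised its per-prefix request rates, so check current S3 performance guidance before adopting this layout.

```python
import hashlib


def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash so keys spread out instead of sorting together."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def reversed_timestamp_key(timestamp: str, name: str) -> str:
    """Reverse a compact date-time string so the fastest-changing digits lead."""
    return f"{timestamp[::-1]}/{name}"


if __name__ == "__main__":
    print(hashed_key("logs/2018/06/15/events.parquet"))
    print(reversed_timestamp_key("20180615123000", "events.parquet"))
```

The hash is derived from the key itself, so a reader that knows the logical key can recompute the full object name without a lookup table.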
TCO – Transient or long-running clusters
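A back-of-the-envelope comparison shows why the choice matters for TCO. This is only a sketch: the node count, the hourly rate, and the daily job duration are made-up inputs, and real pricing depends on instance type and Region.

```python
def monthly_cluster_cost(nodes: int, hourly_rate: float,
                         hours_per_day: float, days: int = 30) -> float:
    """Cost of a cluster that runs `hours_per_day` hours on each of `days` days."""
    if nodes < 0 or hourly_rate < 0 or not 0 <= hours_per_day <= 24 or days < 0:
        raise ValueError("invalid input")
    return nodes * hourly_rate * hours_per_day * days


if __name__ == "__main__":
    rate = 0.50  # hypothetical combined EC2 + EMR price per node-hour, in USD
    long_running = monthly_cluster_cost(10, rate, hours_per_day=24)
    transient = monthly_cluster_cost(10, rate, hours_per_day=4)
    print(f"long-running: ${long_running:.2f}, transient: ${transient:.2f}")
```

A transient cluster only pays off when the data lives outside the cluster (for example in Amazon S3), since its local HDFS disappears at termination.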
Options to submit jobs
• Amazon EMR Step API: Submit a Spark application
• AWS Data Pipeline, or Airflow, Luigi, or other schedulers on EC2: Create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: Use AWS Lambda to submit applications to EMR Step API or directly to Spark on your cluster
• Oozie: Use Oozie on your cluster to build DAGs of jobs
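As an illustration of the Step API option, here is a minimal sketch using boto3's `add_job_flow_steps`. The step name, S3 path, cluster ID, and Region are placeholders, and the `spark-submit` arguments are one plausible choice rather than the deck's configuration. The same two functions could be called from an AWS Lambda handler.

```python
from typing import List, Optional


def spark_step(name: str, app_s3_uri: str,
               app_args: Optional[List[str]] = None) -> dict:
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", "cluster", app_s3_uri]
    args.extend(app_args or [])
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def submit_step(cluster_id: str, step: dict, region: str = "eu-west-1") -> str:
    """Submit the step to a running cluster and return the new step ID."""
    import boto3  # imported here so spark_step() works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    step = spark_step("nightly-etl", "s3://my-bucket/jobs/etl.py",
                      ["--date", "2018-06-15"])
    print(step)
    # submit_step("j-XXXXXXXXXXXXX", step)  # needs AWS credentials and a real cluster ID
```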
Performance and hardware
Considerations
• Transient or long running
• Instance types
• Cluster size
• Application settings
• File formats and Amazon S3 tuning
[Diagram: Master Node (r5.2xlarge); Slave Group - Core (c5.4xlarge); Slave Group - Task (m5.2xlarge, EC2 Spot)]
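The cluster layout on this slide could be expressed as the `InstanceGroups` list that boto3's `run_job_flow` accepts under `Instances`. This is a sketch: the instance types come from the slide, while the group names, node counts, and optional bid price are assumptions.

```python
from typing import List, Optional


def instance_groups(core_count: int, task_count: int,
                    task_bid_price: Optional[str] = None) -> List[dict]:
    """Build master, core, and task instance groups for an EMR cluster."""
    master = {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
              "InstanceType": "r5.2xlarge", "InstanceCount": 1}
    core = {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
            "InstanceType": "c5.4xlarge", "InstanceCount": core_count}
    task = {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
            "InstanceType": "m5.2xlarge", "InstanceCount": task_count}
    if task_bid_price is not None:
        # Maximum Spot price in USD, passed as a string (for example "0.20").
        task["BidPrice"] = task_bid_price
    return [master, core, task]


if __name__ == "__main__":
    for group in instance_groups(core_count=4, task_count=10):
        print(group)
```

Keeping core nodes On-Demand protects HDFS and shuffle data, while the task group can shrink or disappear without data loss.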
On-cluster UIs to quickly tune workloads
Manage applications
SQL editor, Workflow designer,
Metastore browser
Notebooks
Design and execute
queries and workloads
Use Spot and Reserved Instances to lower costs
• On-Demand for core nodes: Standard Amazon EC2 pricing for On-Demand capacity. Meet SLA at predictable cost.
• Spot for task nodes: Up to 80% off Amazon EC2 On-Demand pricing. Exceed SLA at lower cost.
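To see what the Spot discount means for a mixed cluster, the arithmetic can be sketched as below. The node counts and the $1.00 hourly rate are invented for illustration, and 0.8 is the "up to 80%" upper bound from the slide, not a guaranteed discount.

```python
def hourly_cluster_cost(core_nodes: int, task_nodes: int,
                        on_demand_rate: float,
                        spot_discount: float = 0.8) -> float:
    """Hourly cost with On-Demand core nodes and Spot task nodes."""
    if not 0 <= spot_discount <= 1:
        raise ValueError("spot_discount must be between 0 and 1")
    core_cost = core_nodes * on_demand_rate
    task_cost = task_nodes * on_demand_rate * (1 - spot_discount)
    return core_cost + task_cost


if __name__ == "__main__":
    mixed = hourly_cluster_cost(core_nodes=4, task_nodes=10, on_demand_rate=1.00)
    all_on_demand = hourly_cluster_cost(4, 10, 1.00, spot_discount=0.0)
    print(f"mixed: ${mixed:.2f}/h, all On-Demand: ${all_on_demand:.2f}/h")
```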
Instance fleets for advanced Spot provisioning
Master Node Core Instance Fleet Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
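A task instance fleet that provisions from a list of instance types might be described like this for the `InstanceFleets` parameter of boto3's `run_job_flow`. It is a hedged sketch: the fleet name, instance types, capacity, and timeout are assumptions.

```python
from typing import List


def task_fleet(spot_capacity: int, instance_types: List[str],
               timeout_minutes: int = 20) -> dict:
    """Build a Spot-only task fleet that falls back to On-Demand on timeout."""
    if spot_capacity < 1 or not instance_types:
        raise ValueError("need a positive capacity and at least one instance type")
    return {
        "Name": "TaskFleet",
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": spot_capacity,
        "InstanceTypeConfigs": [
            {
                "InstanceType": instance_type,
                "WeightedCapacity": 1,
                # Cap the Spot price at the On-Demand price for each type.
                "BidPriceAsPercentageOfOnDemandPrice": 100.0,
            }
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


if __name__ == "__main__":
    print(task_fleet(10, ["m5.2xlarge", "m4.2xlarge", "c5.4xlarge"]))
```

For the Availability Zone selection mentioned above, `run_job_flow` accepts several subnets through `Ec2SubnetIds` when instance fleets are used, and EMR picks among them.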
Lower costs with Auto Scaling
[Diagram: EMR job -> Amazon CloudWatch metric -> Amazon CloudWatch alarm -> AWS Auto Scaling policy -> AWS Auto Scaling activity]
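The metric, alarm, and policy chain can be sketched as an EMR automatic scaling policy, in the shape boto3's `put_auto_scaling_policy` expects for `AutoScalingPolicy`. The rule name, threshold, adjustment size, and cooldown are illustrative assumptions; only a scale-out rule is shown.

```python
def scale_out_policy(min_nodes: int, max_nodes: int,
                     threshold_pct: float = 15.0) -> dict:
    """Add two nodes when available YARN memory drops below threshold_pct."""
    if not 0 < min_nodes <= max_nodes:
        raise ValueError("need 0 < min_nodes <= max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "ScaleOutOnLowYarnMemory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,   # nodes added per scaling activity
                        "CoolDown": 300,          # seconds between activities
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": threshold_pct,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(scale_out_policy(min_nodes=2, max_nodes=20))
```

A production policy would normally pair this with a scale-in rule so that capacity is released once the job queue drains.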
Security – Encryption
Security – Authentication and authorization
• IAM user: MyUser
• Tag: user = MyUser
• EMR role
• EC2 role
• SSH key
Security – Authentication and authorization
• LDAP for HiveServer2, Hue, Presto,
Zeppelin
• Kerberos for Spark, HBase, YARN,
Hive, and authenticated UIs
• EMRFS storage-based permissions
• SQL standards-based and storage-based authorization
[Diagram: AWS Directory Service, Self-Managed Directory]
Security – Authentication and authorization
Apache Ranger
• Plug-ins for Hive, HBase, YARN, and HDFS
• Row-level authorization for Hive (with data-masking)
• Full auditing capabilities with embedded search
• Run Ranger on an edge node – visit the AWS Big Data Blog
Security – Governance and auditing
• AWS CloudTrail for EMR APIs
• Custom AMIs
• S3 access logs for cluster S3 access
• YARN and application logs
• Ranger UI for application-level auditing
Customer Examples
FINRA: Migrating from on-prem to AWS
• Petabytes of data generated on-premises, brought to AWS, and stored in Amazon S3
• Thousands of analytical queries performed on EMR and Amazon Redshift
• Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing
[Diagram: Analysts and Regulators; Web Applications; Flexible Interactive Queries, Predefined Queries, Surveillance Analytics; Data Management (Data Movement, Data Registration, Version Management); Amazon S3]
Lower cost and higher scale than on-premises
FINRA saved 60% by moving to HBase on EMR
[Diagram: real-time bidding pipeline. Labels: Impressions, Clicks, Activities; Amazon Kinesis; Amazon S3; ETL; Attribution; Machine Learning; Learn, Calibrate, Evaluate; Models; Real Time Bidding]
• 2 petabytes processed daily
• 2 million bid decisions per second
• Runs 24x7 on 5 continents
• Thousands of ML models
trained per day
Thank you!
aws.amazon.com/emr
blogs.aws.amazon.com/bigdata

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Data freedom: how to migrate Big Data workloads to AWS

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data freedom, migrating BigData workloads to AWS. Giorgio Nobile - Solutions Architect
  • 2. Agenda • Deconstructing current big data environments • Identifying challenges with on-premises or unmanaged architectures • Migrating components to Amazon EMR and AWS analytics services - Choosing the right engine for the job - Building out an architecture - Architecting for cost and scalability - Security • Customer migration stories
  • 3. Deconstructing on-premises BigData environments
  • 4. On-premises Hadoop clusters • A cluster of 1U machines • Typically 12 cores, 32/64 GB RAM, and 6-8 TB of HDD ($3-4K) • Networking switches and racks • Open-source distribution of Hadoop or a fixed licensing term by commercial distributions • Different node roles • HDFS uses local disk and is sized for 3x data replication [Diagram labels: Server rack 1 (20 nodes), Server rack 2 (20 nodes), Server rack N (20 nodes), Core]
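The sizing rule on slide 4 (HDFS on local disk, sized for 3x replication) can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the 25% free-space headroom are assumptions, not figures from the deck. The node count and disk size come from the slide's own ranges (20 nodes per rack, 6 TB per node).

```python
import math


def usable_hdfs_tb(nodes: int, disk_tb_per_node: float, replication: int = 3) -> float:
    """Usable HDFS capacity: raw disk divided by the replication factor."""
    raw_tb = nodes * disk_tb_per_node
    return raw_tb / replication


def nodes_needed(data_tb: float, disk_tb_per_node: float,
                 replication: int = 3, headroom: float = 0.25) -> int:
    """Nodes required to hold data_tb, keeping a fraction of raw disk free.

    headroom is a hypothetical operating margin, not a number from the slides.
    """
    raw_required_tb = data_tb * replication / (1.0 - headroom)
    return math.ceil(raw_required_tb / disk_tb_per_node)


# One rack of 20 nodes with 6 TB each: 120 TB raw, 40 TB usable.
rack_usable_tb = usable_hdfs_tb(20, 6.0)

# Storing 40 TB with 25% headroom takes more than one rack.
rack_nodes = nodes_needed(40, 6.0)
```

The point of the exercise is that every terabyte of data costs three terabytes of disk, plus margin, and that disk can only be bought together with CPU and RAM.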
  • 5. Workload types running on the same cluster • Large Scale ETL: Apache Spark, Apache Hive with Apache Tez, or Apache Hadoop MapReduce • Interactive Queries: Apache Impala, Spark SQL, Presto, Apache Phoenix • Machine Learning and Data Science: Spark ML, Apache Mahout • NoSQL: Apache HBase • Stream Processing: Apache Kafka, Spark Streaming, Apache Flink, Apache NiFi, Apache Storm • Search: Elasticsearch, Apache Solr • Job Submission: Client Edge Node, Apache Oozie • Data warehouses like Pivotal Greenplum or Teradata
  • 6. Security • Authentication: Kerberos with local KDC or Active Directory, LDAP integration, local user management, Apache Knox • Authorization: Open-source native authZ (e.g., HiveServer2 authZ or HDFS ACLs), Apache Ranger, Apache Sentry • Encryption: local disk encryption with LUKS, HDFS transparent data encryption, in-flight encryption for each framework (e.g., Hadoop MapReduce encrypted shuffle) • Configuration: Different tools for management based on vendor
  • 7. Role of a Hadoop administrator • Management of the cluster (failures, hardware replacement, restarting services, expanding cluster) • Configuration management • Tuning of specific jobs or hardware • Managing development and test environments • Backing up data and disaster recovery
  • 8. Swim lane of jobs [Chart labels: Over-utilized, Under-utilized]
  • 9. Identifying on-premises challenges
  • 10. On premises: Over-utilization and idle capacity • Tightly coupled compute and storage requires buying excess capacity • Can be over-utilized during peak hours and under-utilized at other times • Results in high costs and low efficiency
  • 11. On premises: System management difficulties • Managing distributed applications and availability • Durable storage and disaster recovery • Adding new frameworks and doing upgrades • Multiple environments • Need a team to manage the cluster and procure hardware
  • 12. Migrating workloads to Amazon EMR
  • 13. Why Amazon EMR? • Low Cost: pay an hourly rate • Open-Source Variety: latest versions of software • Managed: spend less time monitoring • Secure: easy-to-manage options • Flexible: customize the cluster • Easy to Use: launch a cluster in minutes
  • 14. Key migration and TCO considerations • Decouple storage and compute with Amazon S3 • Transient clusters and auto scaling • Deconstruct workloads and map to open-source tools • Choosing instance types and EC2 Spot Instances
  • 15. Translate use cases to the right tools - Low-latency SQL -> Athena or Presto or Amazon Redshift - Data warehouse/Reporting -> Spark or Hive or Glue or Amazon Redshift - Management and monitoring -> EMR console or Ganglia metrics - HDFS -> Amazon S3 - Notebooks -> Zeppelin Notebook or Jupyter (via bootstrap action) - Query console -> Athena or Hue - Security -> Ranger (CF template) or HiveServer2 or IAM roles [Diagram labels: Storage: S3 (EMRFS), HDFS; Cluster Resource Management: YARN; Batch: MapReduce; Interactive: Tez; In Memory: Spark; Streaming: Flink; Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto; Athena; Glue; Amazon Redshift]
  • 16. Many storage layers to choose from: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3, Amazon EMR, Amazon Elasticsearch
  • 17. Decouple compute and storage by using Amazon S3 as your data layer • S3 is designed for 11 nines (99.999999999%) of durability and is massively scalable • Intermediates stored on local disk or HDFS [Diagram labels: Amazon S3, Amazon EMR, Amazon EMR, EC2 Instance Memory, Local, HDFS]
  • 18. HBase on Amazon S3 for scalable NoSQL
  • 19. Amazon S3 tips: partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance • Use hashing/random prefixes or reverse the date-time • Compress data sets to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased performance on reads
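The key-naming tip on slide 19 (hashed prefixes or a reversed date-time, so that keys do not sort lexicographically) might be sketched as follows. The helper names, the 4-character prefix length, the timestamp format, and the sample key are all hypothetical. "Reverse the date-time" is interpreted here as reversing the timestamp string so its fastest-changing digits come first, which is one possible reading of the slide.

```python
import hashlib
from datetime import datetime


def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so keys spread across prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


def reversed_timestamp_key(ts: datetime, name: str) -> str:
    """Put the reversed timestamp first so sequential writes do not share a prefix."""
    stamp = ts.strftime("%Y%m%d%H%M%S")
    return f"{stamp[::-1]}/{name}"


# Hypothetical object names, for illustration only.
spread_key = hashed_key("logs/2018-06-01/part-0000.gz")
time_key = reversed_timestamp_key(datetime(2018, 6, 1, 12, 30, 45), "part-0000.gz")
```

Because the hash is derived from the key itself, a reader can recompute the full object name without a lookup table. The trade-off is that listing "all objects for a day" no longer maps to a single prefix.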
  • 20. TCO – Transient or long-running clusters
  • 21. Options to submit jobs • Amazon EMR Step API: submit a Spark application • AWS Data Pipeline; Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows • AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster • Oozie: use Oozie on your cluster to build DAGs of jobs
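As a rough illustration of the first option on slide 21, the sketch below builds the step definition that the EMR Step API accepts for a Spark application. It only constructs the payload, so it runs without AWS credentials. The script location, step name, and cluster ID in the comment are placeholders, and the helper function is hypothetical. `command-runner.jar` and the `HadoopJarStep` layout follow the documented EMR step format.

```python
from typing import Dict, Sequence


def spark_step(name: str, script_uri: str, args: Sequence[str] = (),
               action_on_failure: str = "CONTINUE") -> Dict:
    """Build one EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *args],
        },
    }


# Placeholder bucket and script path, not taken from the deck.
step = spark_step("nightly-etl", "s3://example-bucket/jobs/etl.py", ["--date", "2018-06-01"])

# With boto3 installed and credentials configured, the step could be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```

The same payload could be sent from an AWS Lambda function or a scheduler task, which is why the slide groups those options together.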
  • 22. Performance and hardware considerations • Transient or long running • Instance types • Cluster size • Application settings • File formats and Amazon S3 tuning [Diagram labels: Master Node r5.2xlarge; Slave Group - Core c5.4xlarge; Slave Group - Task m5.2xlarge (EC2 Spot)]
  • 23. On-cluster UIs to quickly tune workloads • Manage applications • SQL editor, Workflow designer, Metastore browser • Notebooks: design and execute queries and workloads
  • 24. Use Spot and Reserved Instances to lower costs • On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity; meet SLA at predictable cost • Spot for task nodes: up to 80% off Amazon EC2 On-Demand pricing; exceed SLA at lower cost
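The split on slide 24 (On-Demand for core nodes, Spot for task nodes) can be turned into a quick cost estimate. This is a simplified sketch: the $1.00 hourly price and the node counts are made up, it assumes every node uses the same instance price, and it applies the slide's "up to 80%" figure as a flat discount, which real Spot pricing will not match exactly.

```python
def hourly_cluster_cost(core_nodes: int, task_nodes: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost with core nodes On-Demand and task nodes on discounted Spot."""
    core_cost = core_nodes * on_demand_price
    task_cost = task_nodes * on_demand_price * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical cluster: 4 core nodes, 10 task nodes, $1.00 per node-hour.
all_on_demand = hourly_cluster_cost(4, 10, 1.0, spot_discount=0.0)
with_spot = hourly_cluster_cost(4, 10, 1.0, spot_discount=0.8)
savings_fraction = 1.0 - with_spot / all_on_demand
```

Under these assumptions the cluster drops from about $14 to about $6 per hour, a saving of a little over half, while the core nodes that hold HDFS data stay on capacity that cannot be reclaimed.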
  • 25. Instance fleets for advanced Spot provisioning • Provision from a list of instance types with Spot and On-Demand • Launch in the optimal Availability Zone based on capacity/price • Spot Block support [Diagram labels: Master Node, Core Instance Fleet, Task Instance Fleet]
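A task instance fleet like the one on slide 25 could be described with a configuration along these lines. The sketch only builds the dictionary that would go under `Instances.InstanceFleets` in an EMR `run_job_flow` request. The fleet name, instance types, capacities, and timeout are illustrative assumptions rather than recommendations from the deck.

```python
from typing import Dict, Sequence


def task_fleet(instance_types: Sequence[str], spot_capacity: int,
               on_demand_capacity: int = 0, timeout_minutes: int = 20) -> Dict:
    """Build a TASK instance fleet that draws Spot capacity from several types."""
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_capacity,
        "TargetSpotCapacity": spot_capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": instance_type, "WeightedCapacity": 1}
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # If Spot capacity is not fulfilled in time, fall back to On-Demand.
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


fleet = task_fleet(["m5.2xlarge", "c5.4xlarge", "r5.2xlarge"], spot_capacity=10)
```

Listing several instance types gives EMR more Spot pools to choose from, which is the practical meaning of "provision from a list of instance types".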
  • 26. Lower costs with Auto Scaling [Diagram labels: EMR Job, Amazon CloudWatch metric, Amazon CloudWatch alarm, AWS Auto Scaling policy, AWS Auto Scaling activity]
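The components named on slide 26 (a CloudWatch metric, an alarm on it, and a scaling policy that acts on the alarm) map onto an EMR automatic scaling policy roughly as sketched below. The thresholds, capacity limits, cooldown, and rule name are assumptions chosen for illustration. `YARNMemoryAvailablePercentage` is one of the metrics EMR publishes to CloudWatch, but the deck does not say which metric it has in mind.

```python
from typing import Dict


def scale_out_rule(threshold_percent: float = 15.0, add_instances: int = 2,
                   cooldown_seconds: int = 300) -> Dict:
    """Rule: add instances when available YARN memory stays below the threshold."""
    return {
        "Name": "scale-out-on-low-yarn-memory",  # hypothetical rule name
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": add_instances,
                "CoolDown": cooldown_seconds,
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "EvaluationPeriods": 1,
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold_percent,
                "Unit": "PERCENT",
            }
        },
    }


# Capacity bounds are illustrative; they cap how far the rules can scale.
policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [scale_out_rule()],
}
```

A matching scale-in rule with a negative `ScalingAdjustment` and a `GREATER_THAN` comparison would normally accompany this one, so that the cluster shrinks again once the job finishes.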
  • 27. Security – Encryption
  • 28. Security – Authentication and authorization [Diagram labels: IAM user: MyUser; Tag: user = MyUser; EMR role; EC2 role; SSH key]
  • 29. Security – Authentication and authorization • LDAP for HiveServer2, Hue, Presto, Zeppelin • Kerberos for Spark, HBase, YARN, Hive, and authenticated UIs • EMRFS storage-based permissions • SQL standards-based and storage-based authorization [Diagram labels: AWS Directory Service, Self-Managed Directory]
  • 30. Security – Authentication and authorization: Apache Ranger • Plug-ins for Hive, HBase, YARN, and HDFS • Row-level authorization for Hive (with data-masking) • Full auditing capabilities with embedded search • Run Ranger on an edge node – visit the AWS Big Data Blog
  • 31. Security – Governance and auditing • AWS CloudTrail for EMR APIs • Custom AMIs • S3 access logs for cluster S3 access • YARN and application logs • Ranger UI for application-level auditing
  • 32. Customer Examples
  • 33. FINRA: Migrating from on-prem to AWS • Petabytes of data generated on-premises, brought to AWS, and stored in Amazon S3 • Thousands of analytical queries performed on EMR and Amazon Redshift • Stringent security requirements met by leveraging VPC, VPN, encryption at-rest and in-transit, CloudTrail, and database auditing [Diagram labels: Flexible Interactive Queries, Predefined Queries, Surveillance Analytics, Data Management, Data Movement, Data Registration, Version Management, Amazon S3, Web Applications, Analysts; Regulators]
  • 34. Lower cost and higher scale than on-premises
  • 35. FINRA saved 60% by moving to HBase on EMR
  • 36. 2 petabytes processed daily • 2 million bid decisions per second • Runs 24 x 7 on 5 continents • Thousands of ML models trained per day [Diagram labels: Learn, Models, Models, Impressions, Clicks, Activities, Calibrate, Evaluate, Real Time Bidding, Amazon S3, ETL, Attribution, Machine Learning, Amazon S3, Amazon Kinesis]
  • 37. Thank you! aws.amazon.com/emr blogs.aws.amazon.com/bigdata