Analytics at Scale with Apache Spark on AWS with Jonathan Fritz


Organizations from small startups to large enterprises are rapidly adopting Apache Spark on Amazon EMR in Amazon Web Services (AWS) to run streaming analytics, data science, machine learning, and batch processing workloads. These customers can create big data architectures within minutes, decouple compute and storage by using Amazon S3 as a highly scalable, durable, and secure data lake, lower costs with Amazon EC2 Spot Instances and Auto Scaling, and use a wide range of encryption and access control features. In this session, we discuss how customers are using Spark on AWS and common architectures for running performant Spark clusters at scale and at low cost with Amazon EMR.


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Amazon EMR
June 6, 2017
Analytics at Scale with Apache Spark on AWS

Agenda
• Integration with Amazon S3 and other AWS services
• Lower costs with Amazon EC2 Spot Instances and Auto Scaling
• Spark Security Tips
• Customer Stories

What is Amazon EMR?
• Low Cost: pay an hourly rate
• Open-Source Variety: latest versions of software
• Managed: spend less time monitoring
• Secure: easy-to-manage options
• Flexible: customize the cluster
• Easy to Use: launch a cluster in minutes

Many storage layers to choose from
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon Redshift
Amazon S3
Amazon EMR

Use Spot and Reserved Instances to lower costs
• On-Demand for core nodes: standard Amazon EC2 pricing for On-Demand capacity. Meet SLA at predictable cost.
• Spot for task nodes: up to 80% off EC2 On-Demand pricing. Exceed SLA at lower cost.
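
To make the pricing split above concrete, here is a rough worked example in Python. The hourly rate, the node counts, and the flat 80% discount are illustrative assumptions (the slide says "up to 80%", and actual Spot prices vary by instance type, Region, and time). `cluster_hourly_cost` is a hypothetical helper, not part of any AWS SDK, and the separate Amazon EMR per-instance charge is left out.

```python
def cluster_hourly_cost(on_demand_rate, core_nodes, task_nodes, spot_discount):
    """Estimate the hourly EC2 cost of a cluster with On-Demand core nodes
    and Spot task nodes of the same instance type.

    on_demand_rate: assumed On-Demand price per instance-hour (USD)
    spot_discount:  assumed fraction off the On-Demand price (0.0 to 1.0)
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    core_cost = core_nodes * on_demand_rate
    task_cost = task_nodes * on_demand_rate * (1.0 - spot_discount)
    return core_cost + task_cost


# Hypothetical numbers: $0.40 per instance-hour, 4 core nodes, 16 task nodes.
all_on_demand = cluster_hourly_cost(0.40, core_nodes=4, task_nodes=16, spot_discount=0.0)
mixed = cluster_hourly_cost(0.40, core_nodes=4, task_nodes=16, spot_discount=0.8)
print(f"All On-Demand:              ${all_on_demand:.2f}/hour")
print(f"On-Demand core + Spot task: ${mixed:.2f}/hour")
```

Under these assumed numbers the all-On-Demand cluster costs $8.00 per hour and the mixed cluster $2.88 per hour. The core nodes, which hold HDFS data, stay on predictable pricing, while the interruptible task nodes carry the discount.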

Instance fleets for advanced Spot provisioning
Master Node, Core Instance Fleet, Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price
• Spot Block support
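
As a minimal sketch of what such a configuration might look like, the Python below builds the `InstanceFleets` list that boto3's EMR `run_job_flow` call accepts inside its `Instances` argument. The helper name `build_instance_fleets`, the instance types, the capacities, and the 20-minute Spot timeout are all assumptions chosen for illustration; nothing here calls AWS.

```python
import json


def build_instance_fleets(core_types, task_types, core_on_demand, task_spot,
                          spot_timeout_minutes=20):
    """Return an InstanceFleets list for the Instances argument of
    boto3's EMR run_job_flow call. Nothing is sent to AWS here."""

    def type_configs(instance_types):
        # Each entry offers EMR one candidate instance type; a weight of 1
        # means every instance counts as one unit toward the target capacity.
        return [{"InstanceType": t, "WeightedCapacity": 1} for t in instance_types]

    return [
        {
            "Name": "master",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": core_types[0]}],
        },
        {
            # Core fleet: On-Demand only, for predictable capacity.
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": core_on_demand,
            "InstanceTypeConfigs": type_configs(core_types),
        },
        {
            # Task fleet: Spot only, falling back to On-Demand if Spot
            # capacity is not fulfilled within the timeout.
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": task_spot,
            "InstanceTypeConfigs": type_configs(task_types),
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": spot_timeout_minutes,
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                }
            },
        },
    ]


# Illustrative instance types and capacities only.
fleets = build_instance_fleets(
    core_types=["m4.xlarge", "m4.2xlarge"],
    task_types=["m4.xlarge", "c4.2xlarge", "r4.xlarge"],
    core_on_demand=4,
    task_spot=16,
)
print(json.dumps(fleets, indent=2))
```

When this list is passed to `run_job_flow`, supplying several subnets in `Ec2SubnetIds` is what allows EMR to pick among Availability Zones. For Spot Blocks, the `SpotSpecification` also accepts a `BlockDurationMinutes` field, omitted here.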

Security – Authentication and Authorization
Tag: user = MyUser
IAM user: MyUser
EMR role
EC2 role
SSH key
Application authN
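
The tag shown on this slide suggests tag-based authorization: an IAM user may act only on clusters whose `user` tag matches. The sketch below builds one such IAM policy document in Python. The function name `tag_scoped_emr_policy` and the particular list of allowed actions are assumptions for illustration, not taken from the talk; the `elasticmapreduce:ResourceTag/<key>` condition key is the standard way to match a cluster tag.

```python
import json


def tag_scoped_emr_policy(tag_key, tag_value):
    """Build an IAM policy document that allows a few EMR actions only on
    clusters carrying the tag tag_key = tag_value."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Illustrative subset of EMR actions; adjust to your needs.
                "Action": [
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListSteps",
                    "elasticmapreduce:AddJobFlowSteps",
                    "elasticmapreduce:TerminateJobFlows",
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        f"elasticmapreduce:ResourceTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


# Matches the slide's example: tag user = MyUser, attached to IAM user MyUser.
policy = tag_scoped_emr_policy("user", "MyUser")
print(json.dumps(policy, indent=2))
```

The printed JSON could be attached to the IAM user as a policy; cluster-level access for the EMR service and the EC2 instances is still governed separately by the EMR role and EC2 role.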

[Diagram: Impressions, Clicks, and Activities; Amazon Kinesis; S3; ETL, Attribution, and Machine Learning; Models (Learn, Calibrate, Evaluate); Real Time Bidding]
• 2 Petabytes Processed Daily
• 2 Million Bid Decisions Per Second
• Runs 24 x 7 on 5 Continents
• Thousands of ML Models Trained per Day

DataXu Workflow
[Diagram: CDN; Real Time Bidding; Retargeting Platform; Amazon Kinesis; Attribution & ML; S3; Data Pipeline; ETL (Spark SQL); Reporting; Data Visualization; Zeppelin notebooks]
Event Data (Facts)
• Impressions
• Activities
• Attributions
Reference Data (Dimensions)
Application Logs
Exceptions Data
Reporting Data

Architecture
• Recommendation API (Python, R, Flask)
• Zillow Group Data Lake (S3 / Kinesis)
• Property Featurization (Spark EMR)
• User Profiles (Spark EMR)
• Ranking (Spark EMR)
• Wedge Counting Collaborative Filtering (Spark EMR)
• Property Aggregate Features (Spark EMR)
• Data Collection Systems (Java/Python/SQL)

Ad hoc environment
Scale cluster to accommodate more users
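
One way a cluster could grow as more users arrive is an EMR automatic scaling policy on an instance group. The sketch below builds a policy in the shape accepted by boto3's EMR `put_auto_scaling_policy` call. The helper name `scale_out_policy`, the rule name, the 15% threshold, the step size, and the cooldown are assumptions for illustration and are not taken from the talk. Note that these policies apply to instance groups, not to instance fleets.

```python
def scale_out_policy(min_nodes, max_nodes, threshold_pct=15.0, step=2):
    """Build an EMR auto scaling policy that adds `step` instances when the
    available YARN memory stays below `threshold_pct` percent."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            {
                "Name": "scale-out-on-low-yarn-memory",  # hypothetical name
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": step,
                        "CoolDown": 300,  # seconds between scaling activities
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": threshold_pct,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    }


# Illustrative bounds: keep between 2 and 20 instances in the group.
policy = scale_out_policy(min_nodes=2, max_nodes=20)
print(policy["Rules"][0]["Name"], policy["Constraints"])
```

A matching scale-in rule (negative `ScalingAdjustment`, triggered when available memory is high) would normally accompany this so the cluster shrinks again when users leave.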

Some of our customers running Spark on EMR
Internet of Things (IoT)

Thank you!
jonfritz@amazon.com
aws.amazon.com/emr
aws.amazon.com/blogs/big-data/
