SlideShare une entreprise Scribd logo
1  sur  35
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practices for Distributed
Machine Learning & Predictive Analytics
on AWS
K e i t h S t e w a r d , P h . D .
S p e c i a l i s t S A , A W S
A B D 4 0 3
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What we’ll cover:
1. Predictive Analytics: What is it? Why is it important?
2. Predictive Analytics: How can you do it at scale and with agility on
AWS?
• How to do a Data Lake with Amazon S3 and AWS Glue?
• How to Train a Predictive Machine Learning Model in a distributed fashion, using
Amazon EMR with Apache Spark ML + Apache Zeppelin?
• How to perform distributed SQL queries on big data using Amazon Athena?
• How to visualize data using Amazon QuickSight?
3. Demo: Use predictive analytics to identify target customers.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
1. Predictive Analytics:
What is it?
Why is it important?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Wikipedia Definitions:
Analytics: the discovery, interpretation, and
communication of meaningful patterns in data.
Predictive Analytics: encompasses a variety of
statistical techniques from predictive modeling, machine
learning, and data mining that analyze current and
historical facts to make predictions about future or
otherwise unknown events.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Predictive Analytics Use Cases:
Customer predictions
• “Which customers are likely to be the most profitable?”
• “How much revenue should I expect this customer to generate?”
• “Which customers are likely to churn?”
• “Among all of our customers, which are likely to respond to a given
offer?”
• “What’s the probability that a given customer X will respond to a
given offer Y”
Product or Service predictions
• “What products should we offer or develop?”
• “What items are likely to be purchased together?” (basket analysis)
Business Operations predictions
• “Are the metrics for service X nominal or anomalous?”
• “Is equipment X likely to fail within time Y?”
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why is Predictive Analytics Important?
• Companies have been accumulating “Big Data” about customers,
products/services, and operations for many years now.
• Big Data technologies have provided proven solutions to address
this data management need.
• But… Increasing pressure to turn data into insights about trends,
classifications, detect anomalies, and provide feedback loops to
improve their business
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Why is Predictive Analytics Important?
• Desire to evolve from backwards-looking monthly or quarterly
reports to real-time alerts, and now to predicting the future
Business’
Evolving
Needs for
Data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
2. Predictive Analytics:
How can you do it at scale
and with agility on AWS?
Predictive Analytics Architecture
Storage
Processing,
ML &
Inferencing
Visualization
Plethora of Tools
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Kinesis-enabled
app
Lambda ML
SQS
ElastiCache
DynamoDB
Streams
Amazon Elasticsearch
Service
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
Amazon Elasticsearch
Service
Apache Kafka
Amazon SQS
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
Amazon DynamoDB
Streams
HotHotWarm
FastSlowFast
BatchMessageInteractiveStreamML
SearchSQLNoSQLCacheFileQueueStream
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Amazon QuickSight
Apps & Services
Analysis&visualizationNotebooksIDEAPI
Reference architecture
LoggingIoTApplicationsTransportMessaging
ETL
AWS
Glue
Amazon
Athena
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon
EMR
Amazon S3
FastSlow
File
Amazon QuickSight
Reference architecture
ETL
AWS
Glue
Amazon
Athena
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon
EMR
Amazon S3
FastSlow
File
Amazon QuickSight
Reference architecture
ETL
AWS
Glue
Amazon
Athena
Store anything (object storage)
Scalable / Elastic
99.999999999% durability
Effectively infinite inbound bandwidth
Extremely low cost: $0.023/GB-Mo; $23/TB-Mo
Data layer for virtually all AWS services
Amazon S3
Designed for 11 9s
of durability
Designed for
99.99% availability
Durable Available High performance
§ Multiple upload
§ Range GET
§ Store as much as you need
§ Scale storage and compute
independently
§ No minimum usage commitments
Scalable
§ Amazon EMR
§ Amazon Redshift
§ Amazon DynamoDB
Integrated
§ Simple REST API
§ AWS SDKs
§ Read-after-create consistency
§ Event notification
§ Lifecycle policies
Easy to use
Why Amazon S3 for data lake?
Data Catalog
Managed ETL Engine
Job Scheduler
Built on Apache Spark
Integrated with S3, RDS, Redshift
Integrates with any JDBC data store
AWS Glue
Glue data catalog
Discover and organize your data sets
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Glue data catalog
Manage table metadata through a Hive
metastore API or Hive SQL. Supported by
tools such as Hive, Presto, Spark, etc.
We added a few extensions:
§ Search metadata for data discovery
§ Connection info – JDBC URLs, credentials
§ Classification for identifying and parsing files
§ Versioning of table metadata as schemas
evolve and other metadata are updated
Populate using Hive DDL, bulk import, or
automatically through crawlers.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Crawlers: auto-populate data catalog
Automatic schema inference:
• Built-in classifiers detect file type and
extract schema: record structure and
data types.
• Add your own or share with others in the
Glue community - It's all Grok and Python.
Auto-detects Hive-style partitions,
grouping similar files into one table.
Run crawlers on schedule to discover
new data and schema changes.
Serverless – only pay when crawls run.
Scalable Hadoop clusters as a service
Hadoop, Hive, Spark, Presto, Hbase, etc.
Easy to use; fully managed
On demand, reserved, spot pricing
HDFS, S3, and Amazon EBS filesystems
Integrates with AWS Glue
End to end security
Amazon EMR
EMR: Just some of the organizations using
Amazon EMR / Spark ML offers these Models:
• Classification
• Logistic regression
• Decision tree classifier
• Random forest classifier
• Gradient-boosted tree classifier
• Multilayer perceptron classifier
• One-vs-Rest classifier (a.k.a. One-vs-All)
• Naive Bayes
• Regression
• Linear regression
• Generalized linear regression
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression
• Survival regression
• Isotonic regression
• Linear methods
• Decision trees
• Tree Ensembles
• Random Forests
• Gradient-Boosted Trees (GBTs)
What are Decision Trees?
Weather predictors for Golf
Decision Trees: Training Data
Bank loan
write-off
predictions
Fast Business Analytics / Data Visualization
Service running in AWS
Access multiple AWS data sources (S3, Athena,
Aurora, Redshift, etc) or data you upload
Perform ad hoc analyses to gain insights from
your data
Supports 100’s of thousands of users
Smart visualizations dynamically optimized for
your data
Generate fast, interactive visualizations on
large datasets
Amazon
QuickSight
Amazon QuickSight is a Business Analytics Service that lets business users
quickly and easily visualize, explore, and share insights from their data.
QuickSight Overview
Integrated with AWS
Redshift, RDS, Athena, S3, IAM, Roles, etc…
Cloud Native
Fully managed, serverless analytics at scale
Super Fast and Easy to Use
Backed by SPICE and a beautiful UI
Cost Effective
Starts at $9 per user per month
QuickSight is deeply integrated
with AWS data sources like
Redshift, RDS, S3, Athena and
others, as well as third-party
sources like Excel, Salesforce, as
well as on-premises databases.
Deep Integration with
AWS Data Sources
Amazon RDS,
Aurora
Amazon
Redshift
Amazon
Athena
Amazon S3
Files
QuickSight is powered by SPICE, a super-fast calculation engine that delivers unprecedented
performance and scale, delivering insights at the speed of thought.
SPICE
SPICE
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
3. Demo:
Use predictive analytics to
identify target customers.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Our Scenario: We are LuxCars, Inc.
The Challenge:
q We’re bringing an expensive new luxury car (the
“Luxmobile”) to market.
q Want to focus marketing on customers likely to be
wealthy enough to afford our luxury automobile.
q Have demographic data for our target geography, but we
don’t have income data.
q Need to predict salary from the demographic data.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Our Scenario: We are LuxCars Inc.
Strategy:
1. Collect U.S. Census Income data & put into the Cloud.
2. Schematize the data.
3. Train a Spark ML model on the data.
4. Do batch predictions of income for people in the target geography.
5. Compute mean incomes per sub-geography (“income density”)
6. Visualize those predictions.
7. Share results with our Marketing Director.
Demo: Predictive Analytics architecture
Amazon EMR
+ Zeppelin
+ Spark ML
EC2 instance
U.S.
Census
(income
data)
S3
bucket
Amazon
QuickSight
AWS Glue
Amazon
Athena
AWS cloud
download
+ unpack
schematize
Marketing
Director
Model
(decision tree
classifier)
Demo: Predictive Analytics architecture
Let’s Build It!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!
D on ’t forg et t o p r ovid e eva lua t ion s for t h is
p res en t a t ion !

Contenu connexe

Tendances

WPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdf
WPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdfWPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdf
WPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdfAmazon Web Services
 
GPSBUS202_Driving Customer Value with Big Data Analytics
GPSBUS202_Driving Customer Value with Big Data AnalyticsGPSBUS202_Driving Customer Value with Big Data Analytics
GPSBUS202_Driving Customer Value with Big Data AnalyticsAmazon Web Services
 
AMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdf
AMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdfAMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdf
AMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdfAmazon Web Services
 
ABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsAmazon Web Services
 
GPSTEC201_Building an Artificial Intelligence Practice for Consulting Partners
GPSTEC201_Building an Artificial Intelligence Practice for Consulting PartnersGPSTEC201_Building an Artificial Intelligence Practice for Consulting Partners
GPSTEC201_Building an Artificial Intelligence Practice for Consulting PartnersAmazon Web Services
 
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...Amazon Web Services
 
BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...
BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...
BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...Amazon Web Services
 
LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...
LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...
LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...Amazon Web Services
 
Build a Website & Mobile App for your first 10 million users
Build a Website & Mobile App for your first 10 million usersBuild a Website & Mobile App for your first 10 million users
Build a Website & Mobile App for your first 10 million usersAmazon Web Services
 
RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...
RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...
RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...Amazon Web Services
 
Analyzing Streaming Data in Real-time with Amazon Kinesis
Analyzing Streaming Data in Real-time with Amazon KinesisAnalyzing Streaming Data in Real-time with Amazon Kinesis
Analyzing Streaming Data in Real-time with Amazon KinesisAmazon Web Services
 
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...Amazon Web Services
 
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...Amazon Web Services
 
ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...
ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...
ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...Amazon Web Services
 
ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...
ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...
ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...Amazon Web Services
 
ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...
ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...
ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...Amazon Web Services
 
MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...
MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...
MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...Amazon Web Services
 
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...Amazon Web Services
 
ABD307_Deep Analytics for Global AWS Marketing Organization
ABD307_Deep Analytics for Global AWS Marketing OrganizationABD307_Deep Analytics for Global AWS Marketing Organization
ABD307_Deep Analytics for Global AWS Marketing OrganizationAmazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 

Tendances (20)

WPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdf
WPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdfWPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdf
WPS301-Navigating HIPAA and HITRUST_QuickStart Guide to Account Gov Strat.pdf
 
GPSBUS202_Driving Customer Value with Big Data Analytics
GPSBUS202_Driving Customer Value with Big Data AnalyticsGPSBUS202_Driving Customer Value with Big Data Analytics
GPSBUS202_Driving Customer Value with Big Data Analytics
 
AMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdf
AMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdfAMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdf
AMF302-Alexa Wheres My Car A Test Drive of the AWS Connected Car Reference.pdf
 
ABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data ApplicationsABD202_Best Practices for Building Serverless Big Data Applications
ABD202_Best Practices for Building Serverless Big Data Applications
 
GPSTEC201_Building an Artificial Intelligence Practice for Consulting Partners
GPSTEC201_Building an Artificial Intelligence Practice for Consulting PartnersGPSTEC201_Building an Artificial Intelligence Practice for Consulting Partners
GPSTEC201_Building an Artificial Intelligence Practice for Consulting Partners
 
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
STG314-Case Study Learn How HERE Uses JFrog Artifactory w Amazon EFS Support ...
 
BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...
BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...
BAP202_Amazon Connect Delivers Personalized Customer Experiences for Your Clo...
 
LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...
LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...
LFS301-SAGE Bionetworks, Digital Mammography DREAM Challenge and How AWS Enab...
 
Build a Website & Mobile App for your first 10 million users
Build a Website & Mobile App for your first 10 million usersBuild a Website & Mobile App for your first 10 million users
Build a Website & Mobile App for your first 10 million users
 
RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...
RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...
RET303_Drive Warehouse Efficiencies with the Same AWS IoT Technology that Pow...
 
Analyzing Streaming Data in Real-time with Amazon Kinesis
Analyzing Streaming Data in Real-time with Amazon KinesisAnalyzing Streaming Data in Real-time with Amazon Kinesis
Analyzing Streaming Data in Real-time with Amazon Kinesis
 
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
ATC302_How to Leverage AWS Machine Learning Services to Analyze and Optimize ...
 
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
Easy and Scalable Log Analytics with Amazon Elasticsearch Service - ABD326 - ...
 
ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...
ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...
ABD303_Developing an Insights Platform—the Sysco Journey from Disparate Syste...
 
ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...
ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...
ABD302_Real-Time Data Exploration and Analytics with Amazon Elasticsearch Ser...
 
ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...
ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...
ABD208_Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores...
 
MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...
MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...
MAE304-Turners Cloud Archive for CNN's Video Library and Global Multiplatform...
 
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
ABD214_Real-time User Insights for Mobile and Web Applications with Amazon Pi...
 
ABD307_Deep Analytics for Global AWS Marketing Organization
ABD307_Deep Analytics for Global AWS Marketing OrganizationABD307_Deep Analytics for Global AWS Marketing Organization
ABD307_Deep Analytics for Global AWS Marketing Organization
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 

Similaire à Best Practices for Distributed Machine Learning and Predictive Analytics Using Amazon EMR and Open-Source Tools - ABD403 - re:Invent 2017

Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Amazon Web Services
 
ABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseAmazon Web Services
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCAmazon Web Services LATAM
 
Deploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech TalksDeploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech TalksAmazon Web Services
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansAmazon Web Services
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with ZopaAmazon Web Services
 
A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...
A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...
A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...Amazon Web Services
 
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTHow TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...Amazon Web Services
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...Amazon Web Services
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design PatternsJohn Yeung
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Amazon Web Services
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Amazon Web Services
 
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
Big Data Meets AI - Driving Insights and Adding Intelligence to Your SolutionsAmazon Web Services
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS SummitApplying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS SummitAmazon Web Services
 
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksReal-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksAmazon Web Services
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017Amazon Web Services
 

Similaire à Best Practices for Distributed Machine Learning and Predictive Analytics Using Amazon EMR and Open-Source Tools - ABD403 - re:Invent 2017 (20)

Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
ABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For Enterprise
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 
Deploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech TalksDeploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...
A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...
A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Mas...
 
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTHow TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
 
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and...
 
Deep Dive on Big Data
Deep Dive on Big Data Deep Dive on Big Data
Deep Dive on Big Data
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
Visualize your data in Data Lake with AWS Athena and AWS Quicksight Hands-on ...
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
 
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
Big Data Meets AI - Driving Insights and Adding Intelligence to Your Solutions
 
ABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWSABD201-Big Data Architectural Patterns and Best Practices on AWS
ABD201-Big Data Architectural Patterns and Best Practices on AWS
 
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS SummitApplying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
Applying AWS Purpose-Built Database Strategy - SRV307 - Anaheim AWS Summit
 
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech TalksReal-time Analytics using Data from IoT Devices - AWS Online Tech Talks
Real-time Analytics using Data from IoT Devices - AWS Online Tech Talks
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 

Plus de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Plus de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Best Practices for Distributed Machine Learning and Predictive Analytics Using Amazon EMR and Open-Source Tools - ABD403 - re:Invent 2017

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices for Distributed Machine Learning & Predictive Analytics on AWS K e i t h S t e w a r d , P h . D . S p e c i a l i s t S A , A W S A B D 4 0 3
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What we’ll cover: 1. Predictive Analytics: What is it? Why is it important? 2. Predictive Analytics: How can you do it at scale and with agility on AWS? • How to do a Data Lake with Amazon S3 and AWS Glue? • How to Train a Predictive Machine Learning Model in a distributed fashion, using Amazon EMR with Apache Spark ML + Apache Zeppelin? • How to perform distributed SQL queries on big data using Amazon Athena? • How to visualize data using Amazon QuickSight? 3. Demo: Use predictive analytics to identify target customers.
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Predictive Analytics: What is it? Why is it important?
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Wikipedia Definitions: Analytics: the discovery, interpretation, and communication of meaningful patterns in data. Predictive Analytics: encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events.
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Predictive Analytics Use Cases: Customer predictions • “Which customers are likely to be the most profitable?” • “How much revenue should I expect this customer to generate?” • “Which customers are likely to churn?” • “Among all of our customers, which are likely to respond to a given offer?” • “What’s the probability that a given customer X will respond to a given offer Y” Product or Service predictions • “What products should we offer or develop?” • “What items are likely to be purchased together?” (basket analysis) Business Operations predictions • “Are the metrics for service X nominal or anomalous?” • “Is equipment X likely to fail within time Y?”
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why is Predictive Analytics Important? • Companies have been accumulating “Big Data” about customers, products/services, and operations for many years now. • Big Data technologies have provided proven solutions to address this data management need. • But… Increasing pressure to turn data into insights about trends, classifications, detect anomalies, and provide feedback loops to improve their business
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Why is Predictive Analytics Important? • Desire to evolve from backwards-looking monthly or quarterly reports to real-time alerts, and now to predicting the future Business’ Evolving Needs for Data
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 2. Predictive Analytics: How can you do it at scale and with agility on AWS?
  • 10. Plethora of Tools Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Data Pipeline Amazon Kinesis Kinesis-enabled app Lambda ML SQS ElastiCache DynamoDB Streams Amazon Elasticsearch Service
  • 11. Amazon SQS apps Streaming Amazon Kinesis Analytics Amazon KCL apps AWS Lambda Amazon Redshift COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon Machine Learning Presto Amazon EMR Amazon Elasticsearch Service Apache Kafka Amazon SQS Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon S3 Amazon ElastiCache Amazon RDS Amazon DynamoDB Streams HotHotWarm FastSlowFast BatchMessageInteractiveStreamML SearchSQLNoSQLCacheFileQueueStream Amazon EC2 Amazon EC2 Mobile apps Web apps Devices Messaging Message Sensors & IoT platforms AWS IoT Data centers AWS Direct Connect AWS Import/Export Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS DOCUMENTS FILES MESSAGES STREAMS Amazon QuickSight Apps & Services Analysis&visualizationNotebooksIDEAPI Reference architecture LoggingIoTApplicationsTransportMessaging ETL AWS Glue Amazon Athena
  • 12. COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon EMR Amazon S3 FastSlow File Amazon QuickSight Reference architecture ETL AWS Glue Amazon Athena
  • 13. COLLECT STORE CONSUMEPROCESS / ANALYZE Amazon EMR Amazon S3 FastSlow File Amazon QuickSight Reference architecture ETL AWS Glue Amazon Athena
  • 14. Store anything (object storage) Scalable / Elastic 99.999999999% durability Effectively infinite inbound bandwidth Extremely low cost: $0.023/GB-Mo; $23/TB-Mo Data layer for virtually all AWS services Amazon S3
  • 15. Designed for 11 9s of durability Designed for 99.99% availability Durable Available High performance § Multiple upload § Range GET § Store as much as you need § Scale storage and compute independently § No minimum usage commitments Scalable § Amazon EMR § Amazon Redshift § Amazon DynamoDB Integrated § Simple REST API § AWS SDKs § Read-after-create consistency § Event notification § Lifecycle policies Easy to use Why Amazon S3 for data lake?
  • 16. Data Catalog Managed ETL Engine Job Scheduler Built on Apache Spark Integrated with S3, RDS, Redshift Integrates with any JDBC data store AWS Glue
  • 17. Glue data catalog Discover and organize your data sets
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Glue data catalog Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc. We added a few extensions: § Search metadata for data discovery § Connection info – JDBC URLs, credentials § Classification for identifying and parsing files § Versioning of table metadata as schemas evolve and other metadata are updated Populate using Hive DDL, bulk import, or automatically through crawlers.
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Crawlers: auto-populate data catalog Automatic schema inference: • Built-in classifiers detect file type and extract schema: record structure and data types. • Add your own or share with others in the Glue community - It's all Grok and Python. Auto-detects Hive-style partitions, grouping similar files into one table. Run crawlers on schedule to discover new data and schema changes. Serverless – only pay when crawls run.
  • 20. Scalable Hadoop clusters as a service Hadoop, Hive, Spark, Presto, Hbase, etc. Easy to use; fully managed On demand, reserved, spot pricing HDFS, S3, and Amazon EBS filesystems Integrates with AWS Glue End to end security Amazon EMR
  • 21. EMR: Just some of the organizations using
  • 22. Amazon EMR / Spark ML offers these Models: • Classification • Logistic regression • Decision tree classifier • Random forest classifier • Gradient-boosted tree classifier • Multilayer perceptron classifier • One-vs-Rest classifier (a.k.a. One-vs-All) • Naive Bayes • Regression • Linear regression • Generalized linear regression • Decision tree regression • Random forest regression • Gradient-boosted tree regression • Survival regression • Isotonic regression • Linear methods • Decision trees • Tree Ensembles • Random Forests • Gradient-Boosted Trees (GBTs)
  • 23. What are Decision Trees? Weather predictors for Golf
  • 24. Decision Trees: Training Data Bank loan write-off predictions
  • 25. Fast Business Analytics / Data Visualization Service running in AWS Access multiple AWS data sources (S3, Athena, Aurora, Redshift, etc) or data you upload Perform ad hoc analyses to gain insights from your data Supports 100’s of thousands of users Smart visualizations dynamically optimized for your data Generate fast, interactive visualizations on large datasets Amazon QuickSight
  • 26. Amazon QuickSight is a Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data.
  • 27. QuickSight Overview Integrated with AWS Redshift, RDS, Athena, S3, IAM, Roles, etc… Cloud Native Fully managed, serverless analytics at scale Super Fast and Easy to Use Backed by SPICE and a beautiful UI Cost Effective Starts at $9 per user per month
  • 28. QuickSight is deeply integrated with AWS data sources like Redshift, RDS, S3, Athena and others, as well as third-party sources like Excel, Salesforce, as well as on-premises databases. Deep Integration with AWS Data Sources Amazon RDS, Aurora Amazon Redshift Amazon Athena Amazon S3 Files
  • 29. QuickSight is powered by SPICE, a super-fast calculation engine that delivers unprecedented performance and scale, delivering insights at the speed of thought. SPICE SPICE
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 3. Demo: Use predictive analytics to identify target customers.
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Our Scenario: We are LuxCars, Inc. The Challenge: q We’re bringing an expensive new luxury car (the “Luxmobile”) to market. q Want to focus marketing on customers likely to be wealthy enough to afford our luxury automobile. q Have demographic data for our target geography, but we don’t have income data. q Need to predict salary from the demographic data.
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Our Scenario: We are LuxCars Inc. Strategy: 1. Collect U.S. Census Income data & put into the Cloud. 2. Schematize the data. 3. Train a Spark ML model on the data. 4. Do batch predictions of income for people in the target geography. 5. Compute mean incomes per sub-geography (“income density”) 6. Visualize those predictions. 7. Share results with our Marketing Director.
  • 33. Demo: Predictive Analytics architecture Amazon EMR + Zeppelin + Spark ML EC2 instance U.S. Census (income data) S3 bucket Amazon QuickSight AWS Glue Amazon Athena AWS cloud download + unpack schematize Marketing Director Model (decision tree classifier)
  • 34. Demo: Predictive Analytics architecture Let’s Build It!
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU! D on ’t forg et t o p r ovid e eva lua t ion s for t h is p res en t a t ion !