SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Lyon Data Science – 07/01/2016
julsimon@amazon.fr
@julsimon
Building prediction models with
Amazon Redshift and Amazon ML
Julien Simon, Technical Evangelist, AWS
Navigating the seven seas of Big Data
Universal Pictures
You
Your boss
Your infrastructure
You need a better boat, not a bigger one
Let’s face it, Big Data is great fun until:
a) terabytes of data are lost forever
b) tons of money are spent on non-scalable systems
c) months are wasted on plumbing
d) all of the above…
How about simple, cost-efficient, managed services that
non-experts could use to build Big Data apps?
Collect Store Analyze Consume
A
iOS
 Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ElasticSearch
Amazon

S3
Apache
Kafka
Amazon

Glacier
Amazon

Kinesis
Amazon

DynamoDB
Amazon
Redshift
Impala
Pig
Amazon ML
Streaming
Amazon

Kinesis
AWS
Lambda
AmazonElasticMapReduce
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamProcessing
Batch
Interactive
Logging
StreamStorage
IoT
Applications
FileStorage
Analysis&Visualization
Hot
Cold
Warm
Hot
Slow
Hot
ML
Fast
Fast
Amazon
QuickSight
Transactional Data
File Data
Stream Data
Notebooks
Predictions
Apps & APIs
Mobile
Apps
IDE
Search Data
ETL
AWS re:Invent 2015 | (BDT310) Big Data Architectural Patterns and Best Practices on AWS
Interactive &
Batch analytics
Producer Amazon S3
Amazon EMR
Hive
Pig
Spark
Amazon
ML
process
store
Consume
Amazon
Redshift
Amazon EMR
Presto
Impala
Spark
Batch
Interactive
Batch Prediction
Real-time Prediction
AWS re:Invent 2015 | (BDT310)
Big Data Architectural Patterns and Best Practices on AWS
Real-time analytics
Producer
Apache
Kafka
Kinesis
Client Library
AWS Lambda
Spark
Streaming
Apache
Storm
Amazon
SNS
Amazon
ML
Notifications
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Alert
App state
Real-time Prediction
KPI
process
store
DynamoDB
Streams
Amazon
Kinesis
AWS re:Invent 2015 | (BDT310)
Big Data Architectural Patterns and Best Practices on AWS
Amazon Redshift
an enterprise-level, petabyte scale, fully managed
data warehousing service
Column-oriented database
Optimized for OLAP and BI workloads
Based on PostgreSQL 8.0.2
•  SQL is all you need to know
•  PostgreSQL ODBC and JDBC drivers are supported
•  https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/c_redshift-
and-postgres-sql.html
Free tier: https://aws.amazon.com/fr/redshift/free-trial/
Available on-demand from $0.25 / hour / node (us-east-1)
As low as $0.094 / hour / node (us-east-1, 3-year RI)
https://aws.amazon.com/redshift
Amazon Redshift architecture
https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/c_high_level_system_architecture.html
Parallel processing
Columnar data storage
Data compression
Query optimization
Compiled code
Workload management
https://docs.aws.amazon.com/fr_fr/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes
Instance types
Case study: Photobox
Maxime Mezin, Data & Photo Science Director:
“L’entrepôt de données ne comportait que les données du site e-commerce liées aux ventes.
Alors que nous avions la volonté d’intégrer des données du service clients et des données
d’analyse (…) Nous avions atteint la limite du stockage de la base Oracle, et cela ne marchait
pas très bien en termes de performances”
“Avec Redshift, la rapidité d’exécution des traitements a été multipliée par 10. Sans parler de
la vitesse de chargement des données”
“On paie en fonction de la quantité de données que l’on va stocker. Chez Google, cela était
plus compliqué”
•  2 Redshift clusters : 1 for historical data, 1 for real-time processing (SSD)
•  TCO divided by 7 (90K€à13K€)
http://www.lemagit.fr/etude/Photobox-consolide-et-analyse-ses-donnees-avec-AWS-RedShift
Case study: Financial Times
•  BI analysis of customer usage, to decide which stories to cover
•  Conventional data warehouse built using Microsoft technologies
•  Scalability issues, impossible to perform real-time analytics à Amazon Redshift PoC
•  Amazon Redshift performed so quickly that some analysts thought it was malfunctioning J
John O’Donovan, CTO:
“Some of the queries we’re running are 98 percent faster, and most things are running 90
percent faster (…) and the ability to try Redshift out before having to invest a significant
amount of capital was a huge bonus.”
“Amazon Redshift is the single source of truth for our user data.”
“Being able to explore near-real-time data improves our decision making massively. We can
make decisions based on what’s happening now rather than what happened three or four
days ago.”
https://aws.amazon.com/solutions/case-studies/financial-times/
Amazon Redshift performance
No indexes, no partitioning, etc.
Distribution key
•  How data is spread across nodes
•  EVEN (default), ALL, KEY
Sort key
•  How data is sorted inside of disk blocks
•  Compound and interleaved keys are possible
Both are crucial to query performance! Universal Pictures
DEMO #1
Demo gods, I’m your humble servant,
please be good to me
Create tables:
no key (1), sort key (2),
sort key + distribution key (3)
Load 1 billion lines (~45GB) from Amazon S3
Run queries (A, B, C) on a 4-node cluster
… while resizing the cluster to 16 nodes J
932
529
233
0
100
200
300
400
500
600
700
800
900
1000
4 8 16
Load 1B lines
241
111
50,00
153
73
42,50
24,8
13,2
6,00
0
50
100
150
200
250
300
4 8 16
A1
A2
A3
27,1
15,1
6,80
18,6
9,8
5,10
14,2
7,2
3,70
0
5
10
15
20
25
30
4 8 16
B1
B2
B3
18,5
9,2
4,20
1,2 0,72 0,65
0,48 0,49 0,450
2
4
6
8
10
12
14
16
18
20
4 8 16
C1
C2
C3
Amazon Machine Learning
A managed service for building ML models
and generating predictions
Integration with Amazon S3, Redshift and RDS
Data transformation, visualization and exploration
Model evaluation and interpretation tools
API for inspection and automation
API for batch and real-time predictions
$0.42 / hour for analysis and model building (eu-west-1)
$0.10 per 1000 batch predictions
$0.0001 per real-time prediction
https://aws.amazon.com/machine-learning
Amazon ML prediction algorithms
•  Binary attributes à binary classification
•  Categorical attributes à multi-class classification
•  Numeric attributes à linear regression
•  Code samples on
https://github.com/awslabs/machine-learning-samples
Case study: BuildFax
“Amazon Machine Learning
democratizes the process
of building predictive
models. It's easy and fast to
use, and has machine-
learning best practices
encapsulated in the
product, which lets us
deliver results significantly
faster than in the past”
Joe Emison, Founder &
Chief Technology Officer
https://aws.amazon.com/solutions/case-studies/buildfax/
Other Amazon ML use cases
ICAO classifies and
categorizes international
aviation incidents daily
Plans to train ML models to predict
fuel efficiency based on vehicle
metadata, to push near-real-time data
to customers
Securely sorts millions of mail pieces
for clients every day for tremendous
cost savings then brings it to the USPS
for delivery
DEMO #2
Demo gods, I know I’m pushing it,
but please don’t let me down now
Load data from Amazon S3 to Redshift and explore
Train and evaluate a regression model with Amazon ML
Perform batch prediction from data stored in Amazon S3
Load data from Amazon Redshift
Train and evaluate a regression model with Amazon ML
Create a real-time prediction API
Perform real-time prediction from Java app
(https://raw.githubusercontent.com/juliensimon/aws/master/ML/MLSample.java)
Thank you! Let’s keep in touch J
Want to start a local AWS User Group? https://aws.amazon.com/fr/usergroups/
Many events and meetups in 2016 all across France
à Please follow us at @aws_actus @julsimon
AWS Summit Paris : 31/05/2016 (free!) https://aws.amazon.com/fr/paris16/
14/01
13/01
BONUS SLIDES
https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
Row storage vs columnar storage
Amazon Redshift & Machine Learning resources
Documentation
https://aws.amazon.com/documentation/redshift/
https://aws.amazon.com/documentation/machine-learning/
Big Data videos from AWS re:Invent 2015
https://blogs.aws.amazon.com/bigdata/post/Tx3D3UYOXB9XG6Z/Videos-now-available-for-
AWS-re-Invent-2015-Big-Data-Analytics-sessions
If you’re going to watch only one: https://www.youtube.com/watch?v=K7o5OlRLtvU
More Amazon Redshift resources
Articles by Werner Vogel, CTO, Amazon.com
http://www.allthingsdistributed.com/2012/11/amazon-redshift.html
http://www.allthingsdistributed.com/2013/02/amazon-redshift-resilience.html
http://www.allthingsdistributed.com/2013/05/amazon-redshift-designing-for-security.html
Tuning Amazon Redshift
https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/t_Sorting_data.html
https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/t_Distributing_data.html
http://blogs.aws.amazon.com/bigdata/post/Tx31034QG0G3ED1/Top-10-Performance-Tuning-
Techniques-for-Amazon-Redshift
Creating and deleting an Amazon Redshift cluster
$ aws redshift create-cluster --cluster-identifier CLUSTER_NAME
--node-type dc1.large --number-of-nodes 4
--db-name DATABASE_NAME
--master-username USER_NAME
--master-user-password USER_PASSWORD
--publicly-accessible ß probably not what you want!
$ aws redshift delete-cluster --cluster-identifier CLUSTER_NAME
--skip-final-cluster-snapshot ß data will be lost!
Connecting to Amazon Redshift with psql
$ psql -h xxx.redshift.amazonaws.com -p 5439
-d DB_NAME -U USER_NAME
Force SSL:
$ psql -h xxx.redshift.amazonaws.com -p 5439
-U USER_NAME "dbname=DB_NAME sslmode=require”
Loading data from Amazon S3 to Amazon Redshift
COPY command example
$ copy TABLE_NAME
from 's3://BUCKET_NAME/FOLDER_NAME/'
region 'eu-west-1’
credentials ‘aws_access_key_id=MY_ACCESS_KEY;
aws_secret_access_key=MY_SECRET_KEY’
delimiter ',’ bzip2 maxerror 1000;
View last 10 load errors
select * from stl_load_errors order by starttime desc limit 10;
Resizing an Amazon Redshift cluster
$ aws redshift modify-cluster
--cluster-identifier mycluster
--number-of-nodes 8
Listing Amazon ML models
$ aws machinelearning describe-ml-models
--query "Results[*].{Name:Name, Id:MLModelId,
Type:MLModelType}"

Contenu connexe

Tendances

AWS Cloud Kata 2014 | Jakarta - Startup Best Practices
AWS Cloud Kata 2014 | Jakarta - Startup Best PracticesAWS Cloud Kata 2014 | Jakarta - Startup Best Practices
AWS Cloud Kata 2014 | Jakarta - Startup Best Practices
Amazon Web Services
 
AWS Summit Tel Aviv - Startup Track - Architecting for High Availability
AWS Summit Tel Aviv - Startup Track - Architecting for High AvailabilityAWS Summit Tel Aviv - Startup Track - Architecting for High Availability
AWS Summit Tel Aviv - Startup Track - Architecting for High Availability
Amazon Web Services
 
AWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to ProfitabilityAWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to Profitability
Amazon Web Services
 
AWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVPAWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVP
Amazon Web Services
 
AWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to ScaleAWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to Scale
Amazon Web Services
 

Tendances (20)

AWS Customer Presentation - NASDAQ OMX
AWS Customer Presentation - NASDAQ OMX AWS Customer Presentation - NASDAQ OMX
AWS Customer Presentation - NASDAQ OMX
 
Cloudonomics
CloudonomicsCloudonomics
Cloudonomics
 
Introduction to AWS
Introduction to AWSIntroduction to AWS
Introduction to AWS
 
AWS AWSome Day London October 2015
AWS AWSome Day London October 2015 AWS AWSome Day London October 2015
AWS AWSome Day London October 2015
 
Picking the right AWS backend for your Java application
Picking the right AWS backend for your Java applicationPicking the right AWS backend for your Java application
Picking the right AWS backend for your Java application
 
Innovating in the Cloud from #NimbusIGNITE
Innovating in the Cloud from #NimbusIGNITEInnovating in the Cloud from #NimbusIGNITE
Innovating in the Cloud from #NimbusIGNITE
 
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM...
 
#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...
#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...
#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...
 
AWS Cloud Kata 2014 | Jakarta - Startup Best Practices
AWS Cloud Kata 2014 | Jakarta - Startup Best PracticesAWS Cloud Kata 2014 | Jakarta - Startup Best Practices
AWS Cloud Kata 2014 | Jakarta - Startup Best Practices
 
AWS Summit Tel Aviv - Startup Track - Architecting for High Availability
AWS Summit Tel Aviv - Startup Track - Architecting for High AvailabilityAWS Summit Tel Aviv - Startup Track - Architecting for High Availability
AWS Summit Tel Aviv - Startup Track - Architecting for High Availability
 
Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale Modern Data Architectures for Business Insights at Scale
Modern Data Architectures for Business Insights at Scale
 
AWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to ProfitabilityAWS Cloud Kata | Taipei - Getting to Profitability
AWS Cloud Kata | Taipei - Getting to Profitability
 
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...
 
SRV301 Getting the Most Bang for your Buck with #EC2 #Winning
SRV301 Getting the Most Bang for your Buck with #EC2 #WinningSRV301 Getting the Most Bang for your Buck with #EC2 #Winning
SRV301 Getting the Most Bang for your Buck with #EC2 #Winning
 
Scale, baby, scale! (June 2016)
Scale, baby, scale! (June 2016)Scale, baby, scale! (June 2016)
Scale, baby, scale! (June 2016)
 
Workshop : Wild Rydes Takes Off - The Dawn of a New Unicorn
Workshop : Wild Rydes Takes Off - The Dawn of a New UnicornWorkshop : Wild Rydes Takes Off - The Dawn of a New Unicorn
Workshop : Wild Rydes Takes Off - The Dawn of a New Unicorn
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
AWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVPAWS Cloud Kata | Taipei - Getting to MVP
AWS Cloud Kata | Taipei - Getting to MVP
 
AWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to ScaleAWS Cloud Kata | Taipei - Getting to Scale
AWS Cloud Kata | Taipei - Getting to Scale
 
(ISM402) Cost Optimization at Scale
(ISM402) Cost Optimization at Scale(ISM402) Cost Optimization at Scale
(ISM402) Cost Optimization at Scale
 

En vedette

A product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningA product-focused introduction to Machine Learning
A product-focused introduction to Machine Learning
Satpreet Singh
 

En vedette (20)

Amazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer ChurnAmazon Machine Learning Case Study: Predicting Customer Churn
Amazon Machine Learning Case Study: Predicting Customer Churn
 
Build a Recommendation Engine using Amazon Machine Learning in Real-time
Build a Recommendation Engine using Amazon Machine Learning in Real-timeBuild a Recommendation Engine using Amazon Machine Learning in Real-time
Build a Recommendation Engine using Amazon Machine Learning in Real-time
 
Amazon Machine Learning for Developers
Amazon Machine Learning for DevelopersAmazon Machine Learning for Developers
Amazon Machine Learning for Developers
 
Amazon Machine Learning: Empowering Developers to Build Smart Applications
Amazon Machine Learning: Empowering Developers to Build Smart ApplicationsAmazon Machine Learning: Empowering Developers to Build Smart Applications
Amazon Machine Learning: Empowering Developers to Build Smart Applications
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
A product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningA product-focused introduction to Machine Learning
A product-focused introduction to Machine Learning
 
(BDT302) Real-World Smart Applications With Amazon Machine Learning
(BDT302) Real-World Smart Applications With Amazon Machine Learning(BDT302) Real-World Smart Applications With Amazon Machine Learning
(BDT302) Real-World Smart Applications With Amazon Machine Learning
 
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
 
Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)Introduction to Machine Learning (case studies)
Introduction to Machine Learning (case studies)
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
Real-Time Streaming: Intro to Amazon Kinesis
Real-Time Streaming: Intro to Amazon KinesisReal-Time Streaming: Intro to Amazon Kinesis
Real-Time Streaming: Intro to Amazon Kinesis
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
 
Building Scalable Services with Amazon API Gateway - Technical 201
Building Scalable Services with Amazon API Gateway - Technical 201Building Scalable Services with Amazon API Gateway - Technical 201
Building Scalable Services with Amazon API Gateway - Technical 201
 
Slashing Big Data Complexity: How Comcast X1 Syndicates Streaming Analytics w...
Slashing Big Data Complexity: How Comcast X1 Syndicates Streaming Analytics w...Slashing Big Data Complexity: How Comcast X1 Syndicates Streaming Analytics w...
Slashing Big Data Complexity: How Comcast X1 Syndicates Streaming Analytics w...
 
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
Stream Data Analytics with Amazon Kinesis Firehose & Redshift - AWS August We...
 
Building a Real-Time Geospatial-Aware Recommendation Engine
 Building a Real-Time Geospatial-Aware Recommendation Engine Building a Real-Time Geospatial-Aware Recommendation Engine
Building a Real-Time Geospatial-Aware Recommendation Engine
 
Getting Started with Amazon Kinesis
Getting Started with Amazon KinesisGetting Started with Amazon Kinesis
Getting Started with Amazon Kinesis
 
AWS Real-Time Event Processing
AWS Real-Time Event ProcessingAWS Real-Time Event Processing
AWS Real-Time Event Processing
 

Similaire à Building prediction models with Amazon Redshift and Amazon ML

Similaire à Building prediction models with Amazon Redshift and Amazon ML (20)

데이터 기반 의사결정을 통한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)
데이터 기반 의사결정을 통한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)데이터 기반 의사결정을 통한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)
데이터 기반 의사결정을 통한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)
 
클라우드 기반 데이터 분석 및 인공 지능을 위한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)
클라우드 기반 데이터 분석 및 인공 지능을 위한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)클라우드 기반 데이터 분석 및 인공 지능을 위한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)
클라우드 기반 데이터 분석 및 인공 지능을 위한 비지니스 혁신 - 윤석찬 (AWS 테크에반젤리스트)
 
Building prediction models with Amazon Redshift and Amazon Machine Learning -...
Building prediction models with Amazon Redshift and Amazon Machine Learning -...Building prediction models with Amazon Redshift and Amazon Machine Learning -...
Building prediction models with Amazon Redshift and Amazon Machine Learning -...
 
Why Scale Matters and How the Cloud is Really Different (at scale)
Why Scale Matters and How the Cloud is Really Different (at scale)Why Scale Matters and How the Cloud is Really Different (at scale)
Why Scale Matters and How the Cloud is Really Different (at scale)
 
Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS Build Data Lakes and Analytics on AWS
Build Data Lakes and Analytics on AWS
 
B3 - Business intelligence apps on aws
B3 - Business intelligence apps on awsB3 - Business intelligence apps on aws
B3 - Business intelligence apps on aws
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Launching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWSLaunching Your First Big Data Project on AWS
Launching Your First Big Data Project on AWS
 
AWS Summit Seoul 2015 - AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
AWS Summit Seoul 2015 -  AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...AWS Summit Seoul 2015 -  AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
AWS Summit Seoul 2015 - AWS 최신 서비스 살펴보기 - Aurora, Lambda, EFS, Machine Learn...
 
Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS Building a Big Data & Analytics Platform using AWS
Building a Big Data & Analytics Platform using AWS
 
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
AWS re:Invent 2016: Big Data Mini Con State of the Union (BDM205)
 
AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)
 
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWSAWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
 
Scale, baby, scale!
Scale, baby, scale!Scale, baby, scale!
Scale, baby, scale!
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Opening Keynote
Opening KeynoteOpening Keynote
Opening Keynote
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
 
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsDay 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of Things
 

Plus de Julien SIMON

Plus de Julien SIMON (20)

An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)Building Machine Learning Models Automatically (June 2020)
Building Machine Learning Models Automatically (June 2020)
 
Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)Starting your AI/ML project right (May 2020)
Starting your AI/ML project right (May 2020)
 
Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)Scale Machine Learning from zero to millions of users (April 2020)
Scale Machine Learning from zero to millions of users (April 2020)
 
An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)An Introduction to Generative Adversarial Networks (April 2020)
An Introduction to Generative Adversarial Networks (April 2020)
 
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
AIM410R1 Deep learning applications with TensorFlow, featuring Fannie Mae (De...
 
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
AIM361 Optimizing machine learning models with Amazon SageMaker (December 2019)
 
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
AIM410R Deep Learning Applications with TensorFlow, featuring Mobileye (Decem...
 
A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)A pragmatic introduction to natural language processing models (October 2019)
A pragmatic introduction to natural language processing models (October 2019)
 
Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)Building smart applications with AWS AI services (October 2019)
Building smart applications with AWS AI services (October 2019)
 
Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)Build, train and deploy ML models with SageMaker (October 2019)
Build, train and deploy ML models with SageMaker (October 2019)
 
The Future of AI (September 2019)
The Future of AI (September 2019)The Future of AI (September 2019)
The Future of AI (September 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...Train and Deploy Machine Learning Workloads with AWS Container Services (July...
Train and Deploy Machine Learning Workloads with AWS Container Services (July...
 
Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)Optimize your Machine Learning Workloads on AWS (July 2019)
Optimize your Machine Learning Workloads on AWS (July 2019)
 
Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)Deep Learning on Amazon Sagemaker (July 2019)
Deep Learning on Amazon Sagemaker (July 2019)
 
Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)Automate your Amazon SageMaker Workflows (July 2019)
Automate your Amazon SageMaker Workflows (July 2019)
 
Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)Build, train and deploy ML models with Amazon SageMaker (May 2019)
Build, train and deploy ML models with Amazon SageMaker (May 2019)
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Building prediction models with Amazon Redshift and Amazon ML

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lyon Data Science – 07/01/2016 julsimon@amazon.fr @julsimon Building prediction models with Amazon Redshift and Amazon ML Julien Simon, Technical Evangelist, AWS
  • 2. Navigating the seven seas of Big Data
  • 4. You need a better boat, not a bigger one Let’s face it, Big Data is great fun until: a) terabytes of data are lost forever b) tons of money are spent on non-scalable systems c) months are wasted on plumbing d) all of the above… How about simple, cost-efficient, managed services that non-experts could use to build Big Data apps?
  • 5. Collect Store Analyze Consume A iOS Android Web Apps Logstash Amazon RDS Amazon DynamoDB Amazon ElasticSearch Amazon
 S3 Apache Kafka Amazon
 Glacier Amazon
 Kinesis Amazon
 DynamoDB Amazon Redshift Impala Pig Amazon ML Streaming Amazon
 Kinesis AWS Lambda AmazonElasticMapReduce Amazon ElastiCache SearchSQLNoSQLCache StreamProcessing Batch Interactive Logging StreamStorage IoT Applications FileStorage Analysis&Visualization Hot Cold Warm Hot Slow Hot ML Fast Fast Amazon QuickSight Transactional Data File Data Stream Data Notebooks Predictions Apps & APIs Mobile Apps IDE Search Data ETL AWS re:Invent 2015 | (BDT310) Big Data Architectural Patterns and Best Practices on AWS
  • 6. Interactive & Batch analytics Producer Amazon S3 Amazon EMR Hive Pig Spark Amazon ML process store Consume Amazon Redshift Amazon EMR Presto Impala Spark Batch Interactive Batch Prediction Real-time Prediction AWS re:Invent 2015 | (BDT310) Big Data Architectural Patterns and Best Practices on AWS
  • 7. Real-time analytics Producer Apache Kafka Kinesis Client Library AWS Lambda Spark Streaming Apache Storm Amazon SNS Amazon ML Notifications Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES Alert App state Real-time Prediction KPI process store DynamoDB Streams Amazon Kinesis AWS re:Invent 2015 | (BDT310) Big Data Architectural Patterns and Best Practices on AWS
  • 8. Amazon Redshift an enterprise-level, petabyte scale, fully managed data warehousing service Column-oriented database Optimized for OLAP and BI workloads Based on PostgreSQL 8.0.2 •  SQL is all you need to know •  PostgreSQL ODBC and JDBC drivers are supported •  https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/c_redshift- and-postgres-sql.html Free tier: https://aws.amazon.com/fr/redshift/free-trial/ Available on-demand from $0.25 / hour / node (us-east-1) As low as $0.094 / hour / node (us-east-1, 3-year RI) https://aws.amazon.com/redshift
  • 9. Amazon Redshift architecture https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/c_high_level_system_architecture.html Parallel processing Columnar data storage Data compression Query optimization Compiled code Workload management
  • 11. Case study: Photobox Maxime Mezin, Data & Photo Science Director: “L’entrepôt de données ne comportait que les données du site e-commerce liées aux ventes. Alors que nous avions la volonté d’intégrer des données du service clients et des données d’analyse (…) Nous avions atteint la limite du stockage de la base Oracle, et cela ne marchait pas très bien en termes de performances” “Avec Redshift, la rapidité d’exécution des traitements a été multipliée par 10. Sans parler de la vitesse de chargement des données” “On paie en fonction de la quantité de données que l’on va stocker. Chez Google, cela était plus compliqué” •  2 Redshift clusters : 1 for historical data, 1 for real-time processing (SSD) •  TCO divided by 7 (90K€à13K€) http://www.lemagit.fr/etude/Photobox-consolide-et-analyse-ses-donnees-avec-AWS-RedShift
  • 12. Case study: Financial Times •  BI analysis of customer usage, to decide which stories to cover •  Conventional data warehouse built using Microsoft technologies •  Scalability issues, impossible to perform real-time analytics à Amazon Redshift PoC •  Amazon Redshift performed so quickly that some analysts thought it was malfunctioning J John O’Donovan, CTO: “Some of the queries we’re running are 98 percent faster, and most things are running 90 percent faster (…) and the ability to try Redshift out before having to invest a significant amount of capital was a huge bonus.” “Amazon Redshift is the single source of truth for our user data.” “Being able to explore near-real-time data improves our decision making massively. We can make decisions based on what’s happening now rather than what happened three or four days ago.” https://aws.amazon.com/solutions/case-studies/financial-times/
  • 13. Amazon Redshift performance No indexes, no partitioning, etc. Distribution key •  How data is spread across nodes •  EVEN (default), ALL, KEY Sort key •  How data is sorted inside of disk blocks •  Compound and interleaved keys are possible Both are crucial to query performance! Universal Pictures
  • 14. DEMO #1 Demo gods, I’m your humble servant, please be good to me Create tables: no key (1), sort key (2), sort key + distribution key (3) Load 1 billion lines (~45GB) from Amazon S3 Run queries (A, B, C) on a 4-node cluster … while resizing the cluster to 16 nodes J
  • 15. 932 529 233 0 100 200 300 400 500 600 700 800 900 1000 4 8 16 Load 1B lines 241 111 50,00 153 73 42,50 24,8 13,2 6,00 0 50 100 150 200 250 300 4 8 16 A1 A2 A3 27,1 15,1 6,80 18,6 9,8 5,10 14,2 7,2 3,70 0 5 10 15 20 25 30 4 8 16 B1 B2 B3 18,5 9,2 4,20 1,2 0,72 0,65 0,48 0,49 0,450 2 4 6 8 10 12 14 16 18 20 4 8 16 C1 C2 C3
  • 16. Amazon Machine Learning A managed service for building ML models and generating predictions Integration with Amazon S3, Redshift and RDS Data transformation, visualization and exploration Model evaluation and interpretation tools API for inspection and automation API for batch and real-time predictions $0.42 / hour for analysis and model building (eu-west-1) $0.10 per 1000 batch predictions $0.0001 per real-time prediction https://aws.amazon.com/machine-learning
  • 17. Amazon ML prediction algorithms •  Binary attributes à binary classification •  Categorical attributes à multi-class classification •  Numeric attributes à linear regression •  Code samples on https://github.com/awslabs/machine-learning-samples
  • 18. Case study: BuildFax “Amazon Machine Learning democratizes the process of building predictive models. It's easy and fast to use, and has machine- learning best practices encapsulated in the product, which lets us deliver results significantly faster than in the past” Joe Emison, Founder & Chief Technology Officer https://aws.amazon.com/solutions/case-studies/buildfax/
  • 19. Other Amazon ML use cases ICAO classifies and categorizes international aviation incidents daily Plans to train ML models to predict fuel efficiency based on vehicle metadata, to push near-real-time data to customers Securely sorts millions of mail pieces for clients every day for tremendous cost savings then brings it to the USPS for delivery
  • 20. DEMO #2 Demo gods, I know I’m pushing it, but please don’t let me down now Load data from Amazon S3 to Redshift and explore Train and evaluate a regression model with Amazon ML Perform batch prediction from data stored in Amazon S3 Load data from Amazon Redshift Train and evaluate a regression model with Amazon ML Create a real-time prediction API Perform real-time prediction from Java app (https://raw.githubusercontent.com/juliensimon/aws/master/ML/MLSample.java)
  • 21. Thank you! Let’s keep in touch J Want to start a local AWS User Group? https://aws.amazon.com/fr/usergroups/ Many events and meetups in 2016 all across France à Please follow us at @aws_actus @julsimon AWS Summit Paris : 31/05/2016 (free!) https://aws.amazon.com/fr/paris16/ 14/01 13/01
  • 24. Amazon Redshift & Machine Learning resources Documentation https://aws.amazon.com/documentation/redshift/ https://aws.amazon.com/documentation/machine-learning/ Big Data videos from AWS re:Invent 2015 https://blogs.aws.amazon.com/bigdata/post/Tx3D3UYOXB9XG6Z/Videos-now-available-for- AWS-re-Invent-2015-Big-Data-Analytics-sessions If you’re going to watch only one: https://www.youtube.com/watch?v=K7o5OlRLtvU
  • 25. More Amazon Redshift resources Articles by Werner Vogel, CTO, Amazon.com http://www.allthingsdistributed.com/2012/11/amazon-redshift.html http://www.allthingsdistributed.com/2013/02/amazon-redshift-resilience.html http://www.allthingsdistributed.com/2013/05/amazon-redshift-designing-for-security.html Tuning Amazon Redshift https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/t_Sorting_data.html https://docs.aws.amazon.com/fr_fr/redshift/latest/dg/t_Distributing_data.html http://blogs.aws.amazon.com/bigdata/post/Tx31034QG0G3ED1/Top-10-Performance-Tuning- Techniques-for-Amazon-Redshift
  • 26. Creating and deleting an Amazon Redshift cluster $ aws redshift create-cluster --cluster-identifier CLUSTER_NAME --node-type dc1.large --number-of-nodes 4 --db-name DATABASE_NAME --master-username USER_NAME --master-user-password USER_PASSWORD --publicly-accessible ß probably not what you want! $ aws redshift delete-cluster --cluster-identifier CLUSTER_NAME --skip-final-cluster-snapshot ß data will be lost!
  • 27. Connecting to Amazon Redshift with psql $ psql -h xxx.redshift.amazonaws.com -p 5439 -d DB_NAME -U USER_NAME Force SSL: $ psql -h xxx.redshift.amazonaws.com -p 5439 -U USER_NAME "dbname=DB_NAME sslmode=require”
  • 28. Loading data from Amazon S3 to Amazon Redshift COPY command example $ copy TABLE_NAME from 's3://BUCKET_NAME/FOLDER_NAME/' region 'eu-west-1’ credentials ‘aws_access_key_id=MY_ACCESS_KEY; aws_secret_access_key=MY_SECRET_KEY’ delimiter ',’ bzip2 maxerror 1000; View last 10 load errors select * from stl_load_errors order by starttime desc limit 10;
  • 29. Resizing an Amazon Redshift cluster $ aws redshift modify-cluster --cluster-identifier mycluster --number-of-nodes 8
  • 30. Listing Amazon ML models $ aws machinelearning describe-ml-models --query "Results[*].{Name:Name, Id:MLModelId, Type:MLModelType}"