Amazon SageMaker: Infinitely Scalable Machine Learning Algorithms

Amazon SageMaker
Algorithms
Edo Liberty, Director of Amazon AI Labs
Zohar Karnin, Bing Xiang, Baris Cuskon, Ramesh Nallapati, Phillip
Gautier, Madhav Jha, Ran Ding, Tim Januschowski, David Selinas,
Bernie Wang, Jan Gasthaus, Laurence Rouesnel, Amir Sadoughi, Piali
Das, Julio Delgado Mangas, Yury Astashonok, Can Balioglu, Saswata
Chakravarty, and Alex Smola

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Amazon SageMaker?
Exploration Training
Hosting

Machine Learning

Large Scale Machine Learning

Our customers use ML at a massive scale!
“We collect 160M events daily in the ML pipeline and run training over the last 15 days and need it to complete in one hour. Effectively there's 100M features in the model.” Valentino Volonghi, CTO
“We process 3 million ad requests a second, 100,000 features per request. That’s 250 trillion per day. Not your run of the mill data science problem!” Bill Simmons, CTO
“Our data warehouse is 100TB and we are processing 2TB daily. We're running mostly gradient boosting (trees), LDA and K-Means clustering and collaborative filtering.” Shahar Cizer Kobrinsky, VP Architecture

Scalable Training Challenges
• Competency
• Handoff
• Production Readiness
• Model Selection
• Model Freshness
• Ephemeral data
• Pause/Resume
• Incremental Training
• Stability
• Predictability
• Elasticity
• Cost
• Time
• Accuracy
• Scale
• Data Access

Cost vs. Time
[Figure: cost ($ to $$$$) vs. time (Minutes, Hours, Days, Weeks, Months); curve: Single Machine]

Cost vs. Time
[Figure: cost ($ to $$$$) vs. time; curves: Single Machine, Ideal Case]

Cost vs. Time
[Figure: cost ($ to $$$$) vs. time; curves: Single Machine, Distributed with Strong Machines, Ideal Case]

Model Selection

Incremental Training

Production Readiness
[Figure: investment vs. data/model size; labels: Acceptable effort, Required effort, Infeasible region]

Architecture and Design Choices

Streaming
State

Stability + Predictability
[Figure axes: Data Size vs. Memory; Data Size vs. Time/Cost]

Incremental Training

Cost vs. Time
[Diagram: one GPU with its state]

Cost vs. Time
[Diagram: three GPUs, each with its own state]

Cost vs. Time
[Diagram: three GPUs, each with local state, plus one shared state]

Production Readiness + Handoff
SageMaker Training Container Management
Optimized Machine Learning Base Container
SageMaker Algorithms SDK
Algorithms Logic

[Figure: investment vs. data/model size; labels: Acceptable effort, Required effort, Infeasible region]

[Figure: investment vs. data/model size; labels: Acceptable effort, Required effort; no infeasible region]

Streaming Machine Learning: A Scientific Challenge

Streaming Median Example
"Frugal Streaming for Estimating Quantiles: One (or two) memory suffices", Qiang Ma, S. Muthukrishnan, Mark Sandler
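As a rough illustration of what a streaming median can look like, here is a minimal Python sketch in the spirit of the one-memory-unit estimator from the paper cited above. The function name `frugal_median`, the `start` and `step` parameters, and the synthetic stream are assumptions made for this example; it is not the SageMaker implementation.

```python
def frugal_median(stream, start=0.0, step=1.0):
    """Estimate the median of a stream while storing a single number.

    The estimate moves one step up when an item is larger than it and
    one step down when an item is smaller, so it drifts toward the point
    where both happen equally often.
    """
    estimate = start
    for x in stream:
        if x > estimate:
            estimate += step
        elif x < estimate:
            estimate -= step
    return estimate


# Hypothetical deterministic stream that visits the integers 0..1000
# in a scrambled order, many times over.
stream = ((i * 912) % 1001 for i in range(20000))
print(frugal_median(stream))
```

The estimate only needs constant memory, at the price of being approximate and of depending on the step size relative to the data scale.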

sampling
sketching
"Optimal Quantile Approximation in Streams", Zohar Karnin, Kevin Lang, Edo Liberty

Amazon SageMaker Algorithms

Linear Learner
Regression:
Estimate a real valued function
Binary Classification:
Predict a 0/1 class

Linear Learner
Train
Fit thresholds and select
Select model with best validation performance
>8x speedup over naïve parallel training!
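The train-then-select idea on this slide might be sketched as follows: several linear models are updated side by side during a single pass over the data, and the one with the lowest validation error is kept. This is only an illustrative sketch under stated assumptions. The helper names (`train_many`, `select_best`, `make_data`), the learning rates, and the synthetic data are all hypothetical, and the threshold-fitting step for classification is not shown.

```python
import random


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_many(stream, dim, learning_rates):
    """One SGD pass over the stream, updating one model per learning rate."""
    models = [[0.0] * dim for _ in learning_rates]
    for x, y in stream:
        for w, lr in zip(models, learning_rates):
            err = predict(w, x) - y
            for i in range(dim):
                w[i] -= lr * err * x[i]
    return models


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def select_best(models, validation):
    """Keep the model with the best (lowest) validation error."""
    return min(models, key=lambda w: mse(w, validation))


def make_data(n, rng):
    # Hypothetical synthetic regression task: y = 2*x0 - 3*x1 + small noise
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 2 * x[0] - 3 * x[1] + rng.gauss(0, 0.01)))
    return data


rng = random.Random(0)
train, valid = make_data(2000, rng), make_data(200, rng)
models = train_many(train, dim=2, learning_rates=[0.0001, 0.01, 0.1])
best = select_best(models, valid)
print(best, mse(best, valid))
```

Because every candidate model sees each example as it streams by, the data is read once regardless of how many candidates are explored.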

Linear Learner
Regression (mean squared error)
SageMaker Other
1.02 1.06
1.09 1.02
0.332 0.183
0.086 0.129
83.3 84.5
Classification (F1 Score)
SageMaker Other
0.980 0.981
0.870 0.930
0.997 0.997
0.978 0.964
0.914 0.859
0.470 0.472
0.903 0.908
0.508 0.508
30 GB datasets for web-spam and web-url classification
[Figure: cost in dollars (0 to 1.2) vs. billable time in minutes (0 to 30); series: sagemaker-url, sagemaker-spam, other-url, other-spam]

Factorization Machines
                  Log loss   F1 score   Seconds
SageMaker         0.494      0.277      820
Other (10 iter)   0.516      0.190      650
Other (20 iter)   0.507      0.254      1300
Other (50 iter)   0.481      0.313      3250
Click prediction, 1 TB advertising dataset, m4.4xlarge machines, perfect scaling.
[Figure: cost in dollars ($0 to $200) vs. billable time in hours (1 to 8); points for 10, 20, 30, 40, and 50 machines]

K-Means Clustering

K-Means Clustering
Method            Accurate?   Passes        Efficient tuning   Comments
Lloyd's [1]       Yes*        5-10          No
K-Means++ [2]     Yes         k+5 to k+10   No                 scikit-learn
K-Means|| [3]     Yes         7-12          No                 spark.ml
Online [4]        No          1             No
Streaming [5,6]   No          1             No                 Impractical
Webscale [7]      No          1             No                 spark streaming
Coresets [8]      No          1             Yes                Impractical
SageMaker         Yes         1             Yes
[1] Lloyd, IEEE TIT, 1982
[2] Arthur et al., ACM-SIAM, 2007
[3] Bahmani et al., VLDB, 2012
[4] Liberty et al., 2015
[5] Shindler et al., NIPS, 2011
[6] Guha et al., IEEE Trans. Knowl. Data Eng., 2003
[7] Sculley, WWW, 2010
[8] Feldman et al.
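To make the "Passes = 1" column concrete, the following is a minimal sketch of a classic sequential (online) k-means update, in which each point is seen once and pulls its nearest center toward itself. It illustrates the single-pass family in general and is not the SageMaker algorithm; the function name `online_kmeans` and the toy stream are assumptions for this example.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def online_kmeans(stream, k):
    """Single-pass k-means: the first k points seed the centers, then each
    later point moves its nearest center by a step of 1/count."""
    centers, counts = [], []
    for x in stream:
        if len(centers) < k:
            centers.append(list(x))
            counts.append(1)
            continue
        j = min(range(k), key=lambda c: squared_distance(centers[c], x))
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = [c + eta * (xi - c) for c, xi in zip(centers[j], x)]
    return centers


# Hypothetical toy stream alternating between two well-separated groups.
stream = [
    ((i % 3) * 0.1, 0.0) if i % 2 == 0 else (10.0 + (i % 3) * 0.1, 10.0)
    for i in range(1000)
]
print(sorted(online_kmeans(stream, k=2)))
```

A simple update like this depends heavily on the order and seeding of the stream, which is consistent with the table marking plain online methods as not accurate.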

[Figure: billable time in minutes (0 to 8) vs. number of clusters (10, 100, 500); series: sagemaker, other]
K-Means Clustering
Dataset (size)         k     SageMaker   Other
Text (1.2GB)           10    1.18E3      1.18E3
                       100   1.00E3      9.77E2
                       500   9.18E2      9.03E2
Images (9GB)           10    3.29E2      3.28E2
                       100   2.72E2      2.71E2
                       500   2.17E2      Failed
Videos (27GB)          10    2.19E2      2.18E2
                       100   2.03E2      2.02E2
                       500   1.86E2      1.85E2
Advertising (127GB)    10    1.72E7      Failed
                       100   1.30E7      Failed
                       500   1.03E7      Failed
Synthetic (1100GB)     10    3.81E7      Failed
                       100   3.51E7      Failed
                       500   2.81E7      Failed
Running Time vs. Number of Clusters
~10x Faster!

Principal Component Analysis (PCA)

Principal Component Analysis (PCA)
More than 10x faster at a fraction of the cost!
Throughput and Scalability
[Figure: Mb/sec/machine (0 to 120) vs. number of machines (8, 10, 20); series: other, sagemaker-deterministic, sagemaker-randomized]
Cost vs. Time
[Figure: cost in dollars (0 to 5) vs. billable time in minutes (0 to 45); series: other, sagemaker-deterministic, sagemaker-randomized]

Time Series Forecasting
                                                            Mean absolute percentage error   P90 loss
Dataset                                                     DeepAR   R                       DeepAR   R
traffic (hourly occupancy rate of 963 Bay Area freeways)    0.14     0.27                    0.13     0.24
electricity (electricity use of 370 homes over time)        0.07     0.11                    0.08     0.09
pageviews (page view hits of websites), 10k                 0.32     0.32                    0.44     0.31
pageviews (page view hits of websites), 180k                0.32     0.34                    0.29     NA
One hour on p2.xlarge, $1
[Diagram labels: Input, Network]
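For readers unfamiliar with the two metrics in the table, here is a hedged sketch of how they are commonly computed. The slide does not state the exact normalization used, so the quantile-loss scaling below (twice the pinball loss divided by the sum of absolute actuals) is an assumption, as are the function names and the toy numbers.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def quantile_loss(actual, forecast_q, q=0.9):
    """Normalized pinball loss for forecasts of the q-th quantile.

    Under-forecasts are weighted by q and over-forecasts by (1 - q), so a
    P90 forecast is penalized much more for being too low than too high.
    """
    total = 0.0
    for a, f in zip(actual, forecast_q):
        if a >= f:
            total += 2 * q * (a - f)
        else:
            total += 2 * (1 - q) * (f - a)
    return total / sum(abs(a) for a in actual)


# Hypothetical toy values, not taken from the benchmark above.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
print(mape(actual, forecast), quantile_loss(actual, forecast, q=0.9))
```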

Topic Modeling: Learning topics in a large document corpus
What are Topic Models?
• Unsupervised ML algorithm
• A topic is a distribution over words in a vocabulary
• Topics are represented by the 10 most likely words in the distribution
• Words in a document are drawn from a mixture of topics
• Documents can be soft-tagged with topics
Use cases:
• Discovering topics in the corpus automatically
• Indexing documents by topics
• Searching for similar documents
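The definitions above can be made concrete with a tiny sketch: each topic is a probability distribution over a vocabulary, a topic is summarized by its most likely words, and a document's words are drawn from a mixture of topics. The vocabulary, topic probabilities, and function names here are all hypothetical and chosen only for illustration.

```python
import random

vocab = ["game", "team", "season", "key", "encryption", "secure"]

# Hypothetical topics: each is a probability distribution over `vocab`.
topics = {
    "sports": [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],
    "security": [0.02, 0.02, 0.01, 0.36, 0.34, 0.25],
}


def top_words(dist, n=3):
    """Summarize a topic by its n most likely words."""
    ranked = sorted(zip(dist, vocab), reverse=True)
    return [word for _, word in ranked[:n]]


def sample_document(mixture, length, rng):
    """Draw each word by first picking a topic from the document's
    mixture, then picking a word from that topic's distribution."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights)[0]
        words.append(rng.choices(vocab, weights=topics[topic])[0])
    return words


print(top_words(topics["sports"]))
print(sample_document({"sports": 0.7, "security": 0.3}, 10, random.Random(0)))
```

The mixture weights play the role of the soft tags: a document that is 70% "sports" and 30% "security" is tagged with both topics to different degrees.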

SageMaker Neural Topic Model (NTM)
• Based on Variational Autoencoders
• Encoder network: q(z|x): BOW -> latent variables z
• Decoder network: P(x|z): latent variables z -> word distribution
• Latent variables z represent topic distribution for the document
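The encoder and decoder roles described above can be sketched as an untrained forward pass. This is a simplified illustration, not the NTM architecture: the single linear layers, the softmax applied to z to obtain topic proportions, and the class name `TinyNTM` are assumptions, and no training loop or loss is shown.

```python
import math
import random


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyNTM:
    """Untrained variational-autoencoder-style topic model (forward pass only)."""

    def __init__(self, vocab_size, num_topics, seed=0):
        self.rng = random.Random(seed)
        self.w_mu = rand_matrix(num_topics, vocab_size, self.rng)
        self.w_logvar = rand_matrix(num_topics, vocab_size, self.rng)
        self.w_dec = rand_matrix(vocab_size, num_topics, self.rng)

    def encode(self, bow):
        # q(z|x): map bag-of-words counts to the mean and log-variance of z
        return matvec(self.w_mu, bow), matvec(self.w_logvar, bow)

    def sample_z(self, mu, logvar):
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0, 1)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        # P(x|z): turn z into topic proportions, then into a word distribution
        theta = softmax(z)
        return theta, softmax(matvec(self.w_dec, theta))


model = TinyNTM(vocab_size=6, num_topics=2)
bow = [3, 1, 0, 0, 2, 0]  # hypothetical word counts for one document
mu, logvar = model.encode(bow)
theta, word_probs = model.decode(model.sample_z(mu, logvar))
print(theta, word_probs)
```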
[Figure: perplexity on 20NG data, lower is better (axis 0 to 1400); methods: LDA-Mean Field, LDA-Gibbs, NVDM, GSM, ProdLDA, NTM]
[Figure: topic coherence (NPMI) on 20NG data, higher is better (axis 0 to 0.3); methods: LDA-Mean Field, LDA-Gibbs, NVDM, GSM, ProdLDA, NTM]
NTM offers a good balance between perplexity and topic coherence

NTM: Representative topics
Human-assigned topic label: top words from topics in the 20 Newsgroups data
Religion: jesus, scripture, christian, religion, belief, islam, god, christianity, atheism, christ
Sports: scoring, team, season, playoff, win, scorer, detroit, game, league, nhl
Computer Hardware: ide, scsi, controller, scsi-2, drive, simms, scsi-1, isa, motherboard, floppy
Computer Security: encryption, escrow, encrypted, rsa, crypto, secure, algorithm, key, nsa, clipper
Mechanics: tire, noise, rear, engine, lock, brake, inch, radar, mile, detector
Human-assigned topic label: top words from topics in the WikiText-103 dataset
Navy: admiral, fleet, cruiser, hm, austro, battleship, dreadnought, ship, battlecruisers, squadron
Biology: protein, genetic, enzyme, dna, gene, disease, rna, molecule, organism, bacteria
Games: enix, video, remix, xbox, remixes, d, remixed, downloads, playstation, nintendo
Films: film, filming, script, animated, filmed, animation, episode, screenplay, disney, movie
Music: liner, recording, guitarist, beatles, orchestra, musician, opera, studio, band, concert

Object2Vec: Learning embeddings of high-dimensional objects
• Learns embeddings of entity pairs
  • Token pairs
  • Sequence pairs
  • Token-sequence pairs
• Preserves semantic relationship between entities in each pair in the embedding space
• Learned embeddings can be used:
  • For nearest neighbor search
  • For clustering and visualization
  • As features in downstream tasks
[Diagram: Left Input -> Left Encoder and Right Input -> Right Encoder, both feeding a Comparator that predicts the Label]
• Inputs can be tokens, or sequences of tokens
• Encoders can be layers of pooled embeddings/CNNs/RNNs; left and right can be asymmetric
• Comparator: combination of Hadamard product, absolute difference, concatenation, followed by FF network
• Can be trained using cross-entropy loss or MSE
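The encoder and comparator pieces described above might look roughly like the sketch below. It assumes pooled (averaged) embeddings as the encoders and a single linear layer with a sigmoid as the feed-forward part; which comparator features are actually combined is not specified in the slide, so using all of them together is an assumption, as are all names and sizes.

```python
import math
import random


def pooled_embedding(token_ids, table):
    """Encoder: average the embedding vectors of the input tokens."""
    dim = len(table[0])
    return [sum(table[t][i] for t in token_ids) / len(token_ids)
            for i in range(dim)]


def comparator_features(u, v):
    """Concatenate both embeddings with their Hadamard product and
    element-wise absolute difference."""
    hadamard = [a * b for a, b in zip(u, v)]
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return u + v + hadamard + abs_diff


def feed_forward(features, weights, bias):
    """One linear layer plus sigmoid, standing in for the FF network."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


rng = random.Random(0)
dim, vocab_size = 3, 5
# Separate tables for left and right inputs, since the two sides can be asymmetric.
left_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]
right_table = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab_size)]

left = pooled_embedding([0, 2, 4], left_table)   # a sequence of tokens
right = pooled_embedding([1], right_table)       # a single token
features = comparator_features(left, right)
weights = [rng.gauss(0, 0.1) for _ in range(len(features))]
prob = feed_forward(features, weights, bias=0.0)
print(len(features), prob)
```

In a trained model, the tables and weights would be learned from labeled pairs with a loss such as cross-entropy or MSE; here they are random and serve only to show the data flow.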

Object2Vec: Benchmarking
Prediction of relationship between token pairs: Movie recommendation
[Figure: MovieLens Ratings Prediction; RMSE (axis 0.88 to 1.02)]
Prediction of relationship between sequence pairs: Natural Language Inference
[Figure: Stanford Natural Language Inference; accuracy (axis 50 to 90); InferSent vs. Object2Vec with CNN and RNN encoders]
Prediction of similarity between embeddings of pairs of sequences: Sentence similarity
[Figure: Semantic Text Similarity; Pearson correlation (axis 0.5 to 0.75) on STS'12, STS'13, STS'14, STS'15, STS'16; series: PooledEmbeddings, InferSent, Object2Vec]
Prediction of relationship between sequences and tokens: Multi-label document classification

Pipe Mode (Made available May 23rd)
[Figures for PCA and K-Means: Throughput, Job Startup Time, Job Execution Time]

From Amazon SageMaker Notebooks
Parameters
Hardware
Start Training
Host model

From Amazon EMR
Start Training
Parameters
Hardware
Apply Model

Input Data
profile=<your_profile>
arn_role=<your_arn_role>
training_image=382416733822.dkr.ecr.us-east-1.amazonaws.com/kmeans:1
# Training job names may contain only alphanumeric characters and hyphens.
training_job_name=clustering-text-documents-`date '+%Y-%m-%d-%H-%M-%S'`
aws --profile "$profile" \
  --region us-east-1 \
  sagemaker create-training-job \
  --training-job-name "$training_job_name" \
  --algorithm-specification TrainingImage=$training_image,TrainingInputMode=File \
  --hyper-parameters k=10,feature_dim=1024,mini_batch_size=1000 \
  --role-arn "$arn_role" \
  --input-data-config '{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://kmeans_demo/train", "S3DataDistributionType": "ShardedByS3Key"}}, "CompressionType": "None", "RecordWrapperType": "None"}' \
  --output-data-config S3OutputPath=s3://training_output/$training_job_name \
  --resource-config InstanceCount=2,InstanceType=ml.c4.8xlarge,VolumeSizeInGB=50 \
  --stopping-condition MaxRuntimeInSeconds=3600
From Command Line
Hardware
Algorithm

Thank you
