3. Greenplum Integrated Analytics
Data Transformation
Traditional BI
Machine Learning
Graph
Data Science Productivity Tools
Geospatial
Text
Deep Learning
Build
Manage
Deploy
4. Agenda
■ Machine learning
■ Deep learning
■ Model management
■ Deployment and orchestration of models
6. Scalable, In-Database Machine Learning
Apache MADlib: Big Data Machine Learning in SQL
• Open source: https://github.com/apache/madlib
• Downloads and docs: http://madlib.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/
• Open source, top-level Apache project
• For PostgreSQL and Greenplum Database
• Powerful machine learning, graph, statistics and analytics for data scientists
7. History
The MADlib project was initiated in 2011 by EMC/Greenplum architects and
Professor Joe Hellerstein from the University of California, Berkeley.
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
8. Functions (Apache MADlib 1.15.1)
Data Types and Transformations
Array and Matrix Operations
Matrix Factorization
• Low Rank
• Singular Value Decomposition (SVD)
Norms and Distance Functions
Sparse Vectors
Encoding Categorical Variables
Path Functions
Pivot
Sessionize
Stemming
Graph
All Pairs Shortest Path (APSP)
Breadth-First Search
Hyperlink-Induced Topic Search (HITS)
Average Path Length
Closeness Centrality
Graph Diameter
In-Out Degree
PageRank and Personalized PageRank
Single Source Shortest Path (SSSP)
Weakly Connected Components
Model Selection
Cross Validation
Prediction Metrics
Train-Test Split
Statistics
Descriptive Statistics
• Cardinality Estimators
• Correlation and Covariance
• Summary
Inferential Statistics
• Hypothesis Tests
Probability Functions
Supervised Learning
Neural Networks
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Regression Models
• Clustered Variance
• Cox-Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Naïve Bayes
• Ordinal Regression
• Robust Variance
Tree Methods
• Decision Tree
• Random Forest
Time Series Analysis
• ARIMA
Unsupervised Learning
Association Rules (Apriori)
Clustering (k-Means)
Principal Component Analysis (PCA)
Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
Columns to Vector
Conjugate Gradient
Linear Solvers
• Dense Linear Systems
• Sparse Linear Systems
Mini-Batching
PMML Export
Term Frequency for Text
Vector to Columns
Nearest Neighbors
• k-Nearest Neighbors
Sampling
Balanced
Random
Stratified
Comprehensive and mature data science library
9. Why MADlib on Greenplum?
• Better parallelism
• Better scalability
• Higher predictive accuracy
• Top level ASF project
“Apache MADlib Comes of Age”, Frank McQuillan, Oct. 2017,
https://content.pivotal.io/blog/apache-madlib-comes-of-age
10. Greenplum Database with MADlib
[Architecture diagram: SQL enters at the master host (with a standby master); the interconnect links the master to segment hosts Node1, Node2, Node3, … NodeN, each with local storage. External systems shown: other RDBMSes, Spark, GemFire, cloud object storage, HDFS, Kafka, ETL, and Spring Cloud Data Flow.]
In-database functions: machine learning, statistics, math, graph, and utilities
Massively Parallel Processing
11. Iterative Model Execution
Stored Procedure for Model (on the master):
  model = init(…)
  WHILE model not converged
    model = SELECT model.aggregation(…) FROM data table
  ENDWHILE
The model is broadcast to Segment 1, Segment 2, … Segment n. The aggregation has three parts:
1. Transition Function: operates on tuples or mini-batches to update transition state (model)
2. Merge Function: combines transition states
3. Final Function: transforms transition state into output value
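The loop and the three aggregate functions above can be sketched in plain Python. This is an illustrative sketch only, not MADlib code: the function names, the one-parameter model y = w * x, the learning rate, and the toy rows are all assumptions made for the example, and Python lists stand in for segments.

```python
from functools import reduce

def transition(state, row):
    """Fold one tuple (x, y) into a segment-local state for the model y = w * x."""
    grad_sum, count, w = state
    x, y = row
    return (grad_sum + 2.0 * (w * x - y) * x, count + 1, w)

def merge(a, b):
    """Combine the transition states of two segments."""
    return (a[0] + b[0], a[1] + b[1], a[2])

def final(state, learning_rate):
    """Turn the merged state into the updated model (one gradient step)."""
    grad_sum, count, w = state
    return w - learning_rate * grad_sum / count

def fit(segments, learning_rate=0.05, tol=1e-9, max_iter=1000):
    w = 0.0  # model = init(...)
    for _ in range(max_iter):  # WHILE model not converged
        # The current model w is "broadcast" by seeding every segment's state with it.
        states = [reduce(transition, rows, (0.0, 0, w)) for rows in segments]
        new_w = final(reduce(merge, states), learning_rate)
        if abs(new_w - w) < tol:
            return new_w
        w = new_w
    return w

# Hypothetical data following y = 3 * x, spread over three "segments".
segments = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)], [(4.0, 12.0)]]
slope = fit(segments)
```

Each pass over the data is one aggregate query in the slide's pseudocode; only the small transition states move between segments, never the rows themselves.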
14. SVM Scale with Data Size
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Support Vector Machines
15. PageRank Scale with Graph Size
Greenplum cluster:
● 1 master
● 4 segment hosts with 6 segments per host
Normal random graphs with a mean degree of 50 edges per vertex
(i.e., 5B edges in the largest case)
[Figure: log-log plot; one axis has ticks from 1K to 100M, the other from 1 s to 1M s; the largest graph has 5B edges]
“Graph Processing on Greenplum Database using Apache MADlib”, Frank McQuillan, Jan 2018,
https://content.pivotal.io/blog/graph-processing-on-greenplum-database-using-apache-madlib
16. But modeling is only part of the story...
“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
- Jeffrey Heer, Professor of Computer Science at the University of Washington and Co-founder of Trifacta
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, Aug. 17, 2014
https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
19. Deep Learning
• Type of machine learning inspired by the biology of the brain
• Artificial neural networks with multiple layers between input and output
20. Example Deep Learning Algorithms
Multilayer perceptron (MLP): “The Original”
Recurrent neural network (RNN): e.g., machine translation
Convolutional neural network (CNN): e.g., image classification
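As a rough illustration of "multiple layers between input and output", here is a minimal forward pass for a multilayer perceptron in plain Python. The layer sizes, weights, and the choice of ReLU and softmax are assumptions picked for the example; this is not the MADlib or Keras implementation.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def dense(weights, bias, inputs):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(layers, inputs):
    """layers is a list of (weights, bias); ReLU on hidden layers, softmax on the output."""
    activation = inputs
    for weights, bias in layers[:-1]:
        activation = relu(dense(weights, bias, activation))
    weights, bias = layers[-1]
    return softmax(dense(weights, bias, activation))

# Hypothetical weights: two hidden layers, then a two-class output layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
]
probs = mlp_forward(layers, [2.0, 1.0])
```

The output `probs` is a list of class probabilities that sums to 1. Training (adjusting the weights) is not shown.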
22. Graphics Processing Units (GPUs)
• Great at performing a lot of simple computations such as matrix operations
• Well suited to deep learning algorithms
24. Moving Data Greenplum <-> Single Server
Greenplum: data preparation, feature generation, machine learning, geospatial, etc.
Single server: deep learning
Large data transfer between the two: suboptimal
25. Integrated Deep Learning with Greenplum
[Architecture diagram: SQL enters at the master host (with a standby master); the interconnect links the master to segment hosts Node1, Node2, Node3, … NodeN; each segment host has GPU 1 … GPU N.]
In-database functions: machine learning, statistics, math, graph, and utilities
Massively Parallel Processing
26. Deep Learning on a Cluster
Num | Approach | Description
1 | Distributed deep learning | Train single model architecture across the cluster. Data distributed (usually randomly) across segments.
2 | Data parallel models | Train same model architecture in parallel on different data groups (e.g., build separate models per country).
3 | Hyperparameter tuning | Train same model architecture in parallel with different hyperparameter settings and incorporate cross validation. Same data on each segment.
4 | Neural architecture search | Train different model architectures in parallel. Same data on each segment.
Current
work
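Approach 3 (hyperparameter tuning) can be sketched as follows. This is a toy illustration under stated assumptions: threads stand in for segments, the model is a one-parameter regression y = w * x, the only hyperparameter is the learning rate, and a single hold-out set replaces full cross validation. All names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def train(learning_rate, data, epochs=50):
    """Fit y = w * x by gradient descent and return the learned weight."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

def validation_loss(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def tune(learning_rates, train_data, valid_data):
    """Train one model per setting in parallel; return (loss, learning_rate, weight) of the best."""
    def run(learning_rate):
        w = train(learning_rate, train_data)
        return validation_loss(w, valid_data), learning_rate, w
    with ThreadPoolExecutor(max_workers=len(learning_rates)) as pool:
        results = list(pool.map(run, learning_rates))
    return min(results)

# Same data for every setting, as in the table row for approach 3.
train_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
valid_data = [(4.0, 8.0)]
best_loss, best_lr, best_w = tune([0.001, 0.01, 0.1], train_data, valid_data)
```

Each worker trains the same architecture on the same data, and the setting with the lowest validation loss wins.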
35. Data Science Process
[Process diagram. Stages shown: Model Evaluation, Operationalization, Model Building, Feature Engineering, Data Review, User Feedback, Problem Definition, Setup]
36. Model Operationalization
Model Operationalization is the process of deploying data science models to production for ongoing use by other software.
37. Common Challenges With Operationalizing Models
Common challenges with model operationalization:
● Handling production data
● Engineering for scale and performance
● Model transportation
● Managing and orchestrating deployed models
● Data scientists are not developers or platform experts
38. Patterns For Operationalizing Models
BATCH TRAINING, BATCH INFERENCE
~40% of today’s use cases
Example: Tax Return Fraud: Score database of tax returns - on a nightly basis - to flag likely fraudulent returns for audit
PostgreSQL/Greenplum with MADlib supports this pattern
EVENT DRIVEN TRAINING, EVENT DRIVEN INFERENCE
<5% of today’s use cases
Example: Online Advertising: Maximize Click Thru Rate by algorithmically selecting and testing advertisement placement in real time
Highly specialized – low number of enterprise use cases
BATCH TRAINING, EVENT DRIVEN INFERENCE
~55% of today’s use cases (growing)
Example: Real Time Transaction Fraud: Train a ML model on historical data to classify - in real time - whether or not new credit/debit transactions are likely to be fraudulent
PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern
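The batch training with event driven inference pattern can be sketched as below. This is a deliberately simple stand-in, not MADlib or MADlib Flow code: the "model" is a three-sigma threshold on transaction amounts, and the function names, field names, and amounts are hypothetical.

```python
from statistics import mean, pstdev

def train_batch(historical_amounts):
    """Batch training: learn an anomaly threshold from historical transaction amounts."""
    mu = mean(historical_amounts)
    sigma = pstdev(historical_amounts)
    return {"threshold": mu + 3.0 * sigma}

def score_event(model, event):
    """Event driven inference: classify a single transaction as it arrives."""
    return event["amount"] > model["threshold"]

def event_stream(events):
    """Stand-in for a message queue delivering one event at a time."""
    for event in events:
        yield event

# Train once on history (batch), then score each new event individually.
model = train_batch([20.0, 25.0, 30.0, 22.0, 28.0])
flags = []
for event in event_stream([{"id": 1, "amount": 27.0}, {"id": 2, "amount": 500.0}]):
    flags.append((event["id"], score_event(model, event)))
```

In the batch inference pattern, the same `score_event` call would instead run over a whole table on a schedule, such as nightly.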
39. AI For The PostgreSQL Community
Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack
Experimentation: Initial code development and testing, model experimentation on samples.
Modeling at Scale: Heavy compute tasks such as model training across big data.
Deployment: Production deployment of models to feed downstream applications and reports.
Artificial Intelligence: Closed Loop Machine Learning
40. Model Deployment With MADlib Flow
User Operations
1. ML Training: Train ML model in Postgres or Greenplum using Apache MADlib
2. madlibflow --deploy: Set configs in .yml and deploy model from Greenplum to Docker, PCF or Kubernetes
Automated Backend Operations
3. Docker pull: Pull docker containers with optimized Postgres and MADlib
4. Pull Model: Extract model and feature table schema layout from Greenplum database
5. Load Model: Load model and feature table schema into optimized Postgres
6. Deploy: Deploy docker container to target environment
41. Containerized Deployment Of Models
$ madlibflow --deploy --target kubernetes --type model
Key benefits of MADlib Flow
● Easy to deploy and lightweight
● Highly scalable REST and Streaming
● End-to-end SQL workflow
● Low latency inference/predictions
● Feature Transformations
Single command to deploy a MADlib trained model from GPDB/Postgres to Docker, PCF or Kubernetes
Containerized deployment of Apache MADlib Machine Learning workflows for low latency event driven inference and scale
43. MADlib Flow : Hello World!
Let us demonstrate a Linear Regression Model deployment
Dependent variable:
● patient has had a second heart attack within 1 year
Independent variables:
● patient completed a treatment on anger control
● anxiety scale score
Workflow:
Create schema -> Load data -> Train model -> Deploy model -> Test -> Batch prediction
51. Looking Ahead
MADlib 1.16
● Initial deep learning release for image classification (Keras/TensorFlow)
● Postgres 11 support
● Improve speed of k-nearest neighbors via approximate method
MADlib 2.0
● More deep learning capabilities
○ Improved model performance
○ Hyperparameter tuning
● Model repositories and management for streamlined data science workflows
● New and improved SQL interface for MADlib functions
MADlib Flow
● Support for PL/Python and PL/R
● Native deployment to Pivotal Cloud Foundry as a buildpack
● Beta release in May ’19
● Metrics collector
55. Distributed Deep Learning Methods
• Open area of research*
• Methods we have investigated so far:
– Simple averaging
– Ensembling
– Elastic averaging stochastic gradient descent (EASGD)
* Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf
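Two of the methods listed above, simple averaging and EASGD, can be sketched on a toy problem. This is an assumption-laden illustration, not the MADlib implementation: weights are plain Python lists, each worker's loss is (w - target)^2 with a made-up target, and the EASGD step uses the commonly described synchronous form in which each worker is pulled toward a shared center variable and the center is pulled toward the workers.

```python
def simple_average(models):
    """Average corresponding weights of models trained independently on each segment."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

def easgd_round(workers, center, grads, lr, rho):
    """One synchronous EASGD round.

    Each worker takes a gradient step plus an elastic pull toward the center;
    the center moves toward the workers by the same elastic force.
    """
    new_workers = [
        [w - lr * (g + rho * (w - c)) for w, g, c in zip(ws, gs, center)]
        for ws, gs in zip(workers, grads)
    ]
    new_center = [
        c + lr * rho * sum(ws[j] - c for ws in workers)
        for j, c in enumerate(center)
    ]
    return new_workers, new_center

# Hypothetical setup: two workers, local loss (w - target)^2 on each.
targets = [1.0, 3.0]
workers = [[0.0], [0.0]]
center = [0.0]
for _ in range(500):
    grads = [[2.0 * (ws[0] - t)] for ws, t in zip(workers, targets)]
    workers, center = easgd_round(workers, center, grads, lr=0.1, rho=0.5)

averaged = simple_average([[1.0, 2.0], [3.0, 4.0]])
```

In this toy case the center settles at the mean of the two targets, while each worker stays somewhat closer to its own local optimum. The elastic term trades off local fit against agreement across workers.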
56. Some Results with CIFAR-10
• 60k 32x32 color images in 10 classes, with 6k images per class
• 50k training images and 10k test images
https://www.cs.toronto.edu/~kriz/cifar.html
57. MADlib Flow Benefits
■ Experimentation -> Modeling at scale -> Deployment, all in SQL
■ Single platform from model development to deployment using Postgres/Greenplum
■ Low latency inference
■ Easy to deploy both feature generation code and model
■ Join data from event messages with feature cache objects using ANSI SQL
■ Continuously generate features and feed them into the feature engine
■ Multiple versions of models can be deployed for accuracy measurement
■ Same tool can deploy to multiple container environments: PKS, AKS, GKE, etc.