© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Sridhar Paladugu
Frank McQuillan
AI on Greenplum Using
Apache MADlib and MADlib Flow
Greenplum Integrated Analytics
Data Transformation
Traditional BI
Machine
Learning
Graph
Data Science
Productivity Tools
Geospatial
Text
Deep
Learning
Build
Manage
Deploy
■ Machine learning
■ Deep learning
■ Model management
■ Deployment and
orchestration of models
Agenda
1. Machine Learning with
Apache MADlib
Scalable, In-Database
Machine Learning
• Open source https://github.com/apache/madlib
• Downloads and docs http://madlib.apache.org/
• Wiki https://cwiki.apache.org/confluence/display/MADLIB/
Apache MADlib: Big Data Machine Learning in SQL
Open source,
top level
Apache project
For PostgreSQL
and Greenplum
Database
Powerful machine
learning, graph,
statistics and analytics
for data scientists
History
The MADlib project was initiated in 2011 by EMC/Greenplum architects and
Professor Joe Hellerstein from the University of California, Berkeley.
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a
noun.
1- dude, you got skills.
2- dude, you got mad skills.
Functions
Data Types and Transformations
Array and Matrix Operations
Matrix Factorization
• Low Rank
• Singular Value Decomposition (SVD)
Norms and Distance Functions
Sparse Vectors
Encoding Categorical Variables
Path Functions
Pivot
Sessionize
Stemming
Apache MADlib 1.15.1
Graph
All Pairs Shortest Path (APSP)
Breadth-First Search
Hyperlink-Induced Topic Search (HITS)
Average Path Length
Closeness Centrality
Graph Diameter
In-Out Degree
PageRank and Personalized PageRank
Single Source Shortest Path (SSSP)
Weakly Connected Components
Model Selection
Cross Validation
Prediction Metrics
Train-Test Split
Statistics
Descriptive Statistics
• Cardinality Estimators
• Correlation and Covariance
• Summary
Inferential Statistics
• Hypothesis Tests
Probability Functions
Supervised Learning
Neural Networks
Support Vector Machines (SVM)
Conditional Random Field (CRF)
Regression Models
• Clustered Variance
• Cox-Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Naïve Bayes
• Ordinal Regression
• Robust Variance
Tree Methods
• Decision Tree
• Random Forest
Time Series Analysis
• ARIMA
Unsupervised Learning
Association Rules (Apriori)
Clustering (k-Means)
Principal Component Analysis (PCA)
Topic Modelling (Latent Dirichlet Allocation)
Utility Functions
Columns to Vector
Conjugate Gradient
Linear Solvers
• Dense Linear Systems
• Sparse Linear Systems
Mini-Batching
PMML Export
Term Frequency for Text
Vector to Columns
Nearest Neighbors
• k-Nearest Neighbors
Sampling
Balanced
Random
Stratified
Comprehensive and mature
data science library
Why MADlib on Greenplum?
• Better parallelism
• Better scalability
• Higher predictive accuracy
• Top level ASF project
“Apache MADlib Comes of Age”, Frank McQuillan, Oct. 2017,
https://content.pivotal.io/blog/apache-madlib-comes-of-age
Greenplum Database with MADlib
Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Node1
Segment Host
Node2
Segment Host
Node3
Segment Host
NodeN
Local Storage
Other RDBMSes
Spark
GemFire
Cloud Object Storage
HDFS
Kafka
ETL
Spring Cloud Data Flow
In-Database Functions: machine learning, statistics, math, graph, and utilities
Massively Parallel Processing
Iterative Model Execution
Master
model = init(…)
WHILE model not converged
    model = SELECT model.aggregation(…) FROM data table
ENDWHILE
Stored Procedure for Model
…
Broadcast
Segment 2
Segment n
…
1. Transition Function: operates on tuples or mini-batches to update the transition state (model)
2. Merge Function: combines transition states
3. Final Function: transforms the transition state into the output value
Segment 1
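The transition / merge / final pattern can be sketched in plain Python. This is only an illustration under assumptions, not MADlib's implementation: the statistic chosen (one-variable least squares), the function names, and the sample rows are all hypothetical. Each inner list stands in for the rows stored on one segment.

```python
from functools import reduce

# Transition state: sufficient statistics for y = intercept + slope * x,
# stored as (n, sum_x, sum_y, sum_xx, sum_xy). Hypothetical layout.
EMPTY_STATE = (0, 0.0, 0.0, 0.0, 0.0)

def transition(state, row):
    """Fold one (x, y) tuple into a segment-local transition state."""
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

def merge(state_a, state_b):
    """Combine the transition states produced by two segments."""
    return tuple(a + b for a, b in zip(state_a, state_b))

def final(state):
    """Transform the merged state into the output value (intercept, slope)."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

def run_aggregate(segments):
    """Run transition per segment, then merge the states, then finalize."""
    local_states = [reduce(transition, rows, EMPTY_STATE) for rows in segments]
    return final(reduce(merge, local_states, EMPTY_STATE))

# Rows generated from y = 1 + 2x, split across three pretend segments.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
intercept, slope = run_aggregate(segments)
```

An iterative method would repeat such an aggregate inside the WHILE loop, passing the previous model into each pass until it converges.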
Familiar SQL Interface
Train (build a predictive model)
Predict (use model on new data)
Familiar SQL Interface From house pricing model
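The train and predict screenshots are not part of this transcript, so the following is a hedged sketch of what such calls typically look like. It builds the SQL text for `madlib.linregr_train` and `madlib.linregr_predict`. The table and column names (`houses`, `houses_linregr`, `price`, `tax`, `bath`, `size`) are assumptions borrowed from the usual MADlib house-pricing example, and the Python helper names are hypothetical.

```python
def feature_array(features):
    """Build the ARRAY[...] expression, with a leading 1 for the intercept."""
    return "ARRAY[1, " + ", ".join(features) + "]"

def train_sql(source, model, dependent, features):
    """SQL text that trains a linear regression model into table `model`."""
    return (
        "SELECT madlib.linregr_train("
        f"'{source}', '{model}', '{dependent}', '{feature_array(features)}');"
    )

def predict_sql(source, model, features):
    """SQL text that scores every row of `source` with the trained model."""
    return (
        f"SELECT s.*, madlib.linregr_predict(m.coef, {feature_array(features)}) "
        f"AS predicted FROM {source} s, {model} m;"
    )

# Assumed names for illustration only.
features = ["tax", "bath", "size"]
print(train_sql("houses", "houses_linregr", "price", features))
print(predict_sql("houses", "houses_linregr", features))
```

The helpers interpolate identifiers directly into the string, so they suit fixed, trusted names only; the generated statements would be run through psql or any PostgreSQL client.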
SVM Scale with Data Size
Greenplum cluster:
● 1 master
● 4 segment hosts with
6 segments per host
Support Vector Machines
PageRank Scale with Graph Size
Greenplum cluster:
● 1 master
● 4 segment hosts with
6 segments per host
Normal random graphs with a
mean degree of 50 edges per vertex
(i.e., 5B edges in the largest case)
[Figure: PageRank runtime versus graph size, log-log scale; axis ticks from (1K) to (100M) and from (1s) to (1M s); the largest case has 5B edges]
“Graph Processing on Greenplum Database using Apache MADlib”, Frank McQuillan, Jan 2018,
https://content.pivotal.io/blog/graph-processing-on-greenplum-database-using-apache-madlib
But modeling is only part of the story...
“It’s an absolute myth that you can send an algorithm
over raw data and have insights pop up.”
- Jeffrey Heer, Professor of Computer Science at the University of Washington and
Co-founder of Trifacta
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, Aug. 17, 2014
https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Feature Engineering
Example
data science
workflow
2. Deep Learning
Deep Learning
• Type of machine
learning inspired by
biology of the brain
• Artificial neural
networks with
multiple layers
between input and
output
Example Deep Learning Algorithms
Multilayer
perceptron (MLP)
“The Original”
Recurrent
neural network (RNN)
E.g., machine translation
Convolutional
neural network (CNN)
E.g., image classification
Convolutional Neural Networks (CNN)
• Effective for computer vision
• Fewer parameters than fully
connected networks
• Translational invariance
• Classic networks: LeNet-5,
AlexNet, VGG
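The "fewer parameters" point can be checked with simple arithmetic. The sketch below compares one 3x3 convolutional layer with a fully connected layer that maps the same input to the same output size. The layer shapes (a 32x32x3 input and 32 filters) are assumptions chosen for illustration, not figures from the deck.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

# Assumed shapes: 32x32 RGB input, 32 output feature maps of the same size.
height, width, channels, filters = 32, 32, 3, 32

conv = conv2d_params(3, 3, channels, filters)
dense = dense_params(height * width * channels, height * width * filters)
print(conv, dense)
```

With these shapes the convolutional layer needs 896 parameters, while the fully connected equivalent needs about 100 million, because the convolution reuses the same small kernel at every image position.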
Graphics Processing Units (GPUs)
• Great at performing a
lot of simple
computations such as
matrix operations
• Well suited to deep
learning algorithms
GPU N
…
Single Server
Host
Node 1
GPU 1
Moving Data Greenplum <-> Single Server
Deep learning
Data preparation, feature generation,
machine learning, geospatial, etc.
Large
data
transfer
Suboptimal
Integrated Deep Learning with Greenplum
Standby
Master
…
Master
Host
SQL
Interconnect
Segment Host
Node1
Segment Host
Node2
Segment Host
Node3
Segment Host
NodeN
GPU N
…
GPU 1 GPU N
…
GPU 1 GPU N
…
GPU 1
…
GPU N
…
GPU 1
In-Database Functions: machine learning, statistics, math, graph, and utilities
Massively Parallel Processing
Deep Learning on a Cluster
Num | Approach | Description
1 | Distributed deep learning | Train a single model architecture across the cluster. Data distributed (usually randomly) across segments.
2 | Data parallel models | Train the same model architecture in parallel on different data groups (e.g., build separate models per country).
3 | Hyperparameter tuning | Train the same model architecture in parallel with different hyperparameter settings and incorporate cross validation. Same data on each segment.
4 | Neural architecture search | Train different model architectures in parallel. Same data on each segment.
Current
work
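As a rough illustration of approach 3 (hyperparameter tuning), the sketch below expands a search space into configurations and spreads them over segments round-robin. This is not how MADlib schedules work; the search space, the segment count, and the function names are all assumptions.

```python
from itertools import product

def build_grid(search_space):
    """Expand {name: [values]} into a list of {name: value} configurations."""
    names = sorted(search_space)
    combos = product(*(search_space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]

def assign_round_robin(configs, num_segments):
    """Give each segment a share of the configurations to train in parallel."""
    schedule = {segment: [] for segment in range(num_segments)}
    for i, config in enumerate(configs):
        schedule[i % num_segments].append(config)
    return schedule

# Hypothetical search space: 3 learning rates x 2 batch sizes = 6 configs.
search_space = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [32, 64]}
configs = build_grid(search_space)
schedule = assign_round_robin(configs, num_segments=4)

for segment, assigned in schedule.items():
    print(segment, assigned)
```

Because every segment holds the same data in this approach, each configuration can be trained independently and the results compared afterwards.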
Data Loading and Formatting
Testing Infrastructure
• Google Cloud Platform (GCP)
• Type n1-highmem-32 (32 vCPUs, 208 GB memory)
• NVIDIA Tesla P100 GPUs
• Greenplum database config
– Tested up to 20 segment (worker node) clusters
– 1 GPU per segment
6-layer CNN - Runtime (CIFAR-10)
Method: Model weight averaging
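Model weight averaging can be sketched as an element-wise mean over the weights each segment produced. The sketch assumes every segment trains the same architecture on its local data and that weights are plain lists of floats per layer; the data layout and function name are hypothetical, and the deck does not show the actual implementation.

```python
def average_weights(segment_weights):
    """Element-wise mean of per-segment weights.

    segment_weights: one entry per segment, each a list of layers,
    each layer a flat list of floats with identical shapes across segments.
    """
    num_segments = len(segment_weights)
    averaged = []
    # zip(*...) groups the same layer from every segment together.
    for layer_group in zip(*segment_weights):
        averaged.append(
            [sum(values) / num_segments for values in zip(*layer_group)]
        )
    return averaged

# Three pretend segments, each holding a two-layer model.
segment_weights = [
    [[1.0, 2.0], [10.0]],
    [[3.0, 4.0], [20.0]],
    [[5.0, 6.0], [30.0]],
]
merged = average_weights(segment_weights)
print(merged)
```

In a training loop the averaged weights would be sent back to every segment as the starting point for the next iteration.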
3. Model Management
Try and try and try...
• Data scientists typically try many different types of
models with many different parameter combinations
Model Persistence in MADlib 1.x
One model at a time
Model Persistence in MADlib 2.0
Multiple models at a time in
model library
4. MADlib Flow
Data Science Process
Model Evaluation
Operationalization
Model Building
Feature Engineering
Data
Review
User Feedback
Problem
Definition
Setup
Model Operationalization
Model Evaluation
Operationalization
Model Building
Feature Engineering
Data
Review
User Feedback
Problem
Definition
Setup
Model Operationalization
is the process of deploying data
science models to production
for ongoing use by other
software
Common Challenges With Operationalizing Models
Model Evaluation
Operationalization
Model Building
Feature Engineering
Data
Review
User Feedback
Problem
Definition
Setup
Common challenges with model
operationalization:
● Handling production data
● Engineering for scale and
performance
● Model transportation
● Managing and orchestrating
deployed models
● Data Scientists are not
developers or platform
experts
BATCH TRAINING
BATCH INFERENCE
~40% of today’s use cases
Tax Return Fraud: Score database of
tax returns - on a nightly basis - to flag
likely fraudulent returns for audit
EVENT DRIVEN TRAINING
EVENT DRIVEN INFERENCE
<5% of today’s use cases
Online Advertising: Maximize Click
Thru Rate by algorithmically selecting
and testing advertisement placement in
real time
BATCH TRAINING
EVENT DRIVEN
INFERENCE
~55% of today’s use cases (growing)
Real Time Transaction Fraud: Train
a ML model on historical data to
classify - in real time - whether or not
new credit/debit transactions are likely
to be fraudulent
EXAMPLE
Patterns For Operationalizing Models
EXAMPLE EXAMPLE
PostgreSQL/Greenplum
with MADlib supports
this pattern
PostgreSQL/Greenplum
with MADlib & MADlib
Flow supports this
pattern
Highly specialized – low
number of enterprise use
cases
AI For The PostgreSQL Community
Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack
Experimentation
Initial code development and testing,
model experimentation on samples.
Modeling at Scale
Heavy compute tasks such as model
training across big data
Deployment
Production deployment of models to feed
downstream applications and reports
Artificial
Intelligence
: Closed
Loop
Machine
Learning
Model Deployment With MADlib Flow
1. ML Training: train an ML model in Postgres or Greenplum using Apache MADlib
2. madlibflow --deploy: set configs in .yml and deploy the model from Greenplum to Docker, PCF or Kubernetes
3. Docker pull: pull Docker containers with optimized Postgres and MADlib
4. Pull Model: extract the model and feature table schema layout from the Greenplum database
5. Load Model: load the model and feature table schema into optimized Postgres
6. Deploy: deploy the Docker container to the target environment
Automated Backend Operations
User Operations
Containerized Deployment Of Models
$ madlibflow --deploy --target kubernetes --type model
Key benefits of MADlib Flow
● Easy to deploy & light weight
● Highly scalable REST and Streaming
● End-to-end SQL workflow
● Low latency inference/predictions
● Feature Transformations
Single command to deploy a MADlib
trained model from GPDB/Postgres to
Docker, PCF or Kubernetes
Containerized deployment of Apache MADlib Machine Learning workflows for low
latency event driven inference and scale
MADlib Flow Components
MADlib Flow : Hello World!
Let us demonstrate a Linear Regression Model deployment
Dependent Variable:
● patient has had a second heart attack within 1 year
Independent Variables:
● patient completed a treatment on anger control
● anxiety scale score
Workflow:
Create schema
Load data
Train model
Deploy model
Test
Batch prediction
Model Deployment
Deployment manifest
$ madlibflow --name patient-lr --type model --action deploy --target kubernetes --inputJson config.json
Model Deployment
Greenplum Database
Feature Engine
Credit/Debit Card Transaction
(Input)
Message
{
“transaction_ts”: ,
“credit_card_number”: ,
“transaction_amt”:,
“merchant_id”:
}
Approved Credit/Debit Card
Transaction
(Output)
Message
{
“transaction_ts”: ,
“transaction_amt”:,
“credit_card_number”:,
“num_transactions_30days”:,
“max_transactions_30days”:,
“merchant_id”:,
“num_fraud_cases”:,
“avg_transaction_amount_30days”:,
“fraud_risk_score”: 0.92,
“approved”: True
}
Accounts
credit_card_number
num_transactions_30days
max_transactions_30days
Merchants
merchant_id
num_fraud_cases
avg_transaction_amount_30days
Cache
(Gemfire, PCC, Redis, etc.)
Cache Abstraction
Cache Abstraction
SELECT mch.*
,acct.*
,log(msg.transaction_amt + 1) AS log_transaction_amt
FROM message msg
JOIN merchants mch ON
msg.merchant_id=mch.merchant_id
JOIN accounts acct ON
msg.credit_card_number=acct.credit_card_number;
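The join above can be mirrored in a few lines of Python to show what the feature engine computes for one incoming message. This is only a sketch: the in-memory dictionaries stand in for the cache, and all sample values and the function name are hypothetical. Single-argument `log()` in PostgreSQL is base 10, so the sketch uses `math.log10`.

```python
import math

# Stand-ins for the cached Accounts and Merchants tables (made-up values).
accounts = {
    "4111-0000": {
        "credit_card_number": "4111-0000",
        "num_transactions_30days": 12,
        "max_transactions_30days": 250.0,
    }
}
merchants = {
    "m-42": {
        "merchant_id": "m-42",
        "num_fraud_cases": 3,
        "avg_transaction_amount_30days": 80.5,
    }
}

def build_features(message, accounts, merchants):
    """Join one message with cached rows, like the SQL inner joins."""
    acct = accounts.get(message["credit_card_number"])
    mch = merchants.get(message["merchant_id"])
    if acct is None or mch is None:
        return None  # an inner join yields no row when a key is missing
    features = {**mch, **acct}
    # PostgreSQL log(x) with one argument is base 10.
    features["log_transaction_amt"] = math.log10(message["transaction_amt"] + 1)
    return features

message = {
    "transaction_ts": "2019-03-20T10:00:00",
    "credit_card_number": "4111-0000",
    "transaction_amt": 99.0,
    "merchant_id": "m-42",
}
features = build_features(message, accounts, merchants)
print(features)
```

The resulting feature row is what would be passed to the deployed model for scoring.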
MADlib REST
Cache Loader
Automated deployment
of scalable low latency
end-to-end ML pipelines
(“Data Science Ops.”)
No code conversion -
engineer features and
populate cache in SQL
Join data from the
incoming message with
cached data
Accounts Merchants
SELECT create_accounts(); SELECT create_merchants();
Example Flow for Fraud Detection
5. Learn More!
• Download
– http://madlib.apache.org/
• ~40 Jupyter notebooks
– https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
• Wednesday March 20 @PostgresConf
#ScaleMatters
Backup Slides
MADlib 2.0
● More deep learning
capabilities
○ Improved model
performance
○ Hyperparameter
tuning
● Model repositories and
management for
streamlined data science
workflows
● New and improved SQL
interface for MADlib
functions
MADlib Flow
● Support for PL/Python and
PL/R
● Native deployment to
Pivotal Cloud Foundry as a
buildpack.
● Beta Release in May’19
● Metrics collector.
MADlib 1.16
● Initial deep learning
release for image
classification
(Keras/TensorFlow)
● Postgres 11 support
● Improve speed of
k-nearest neighbors via
approximate method
Looking Ahead
Apache MADlib Resources
• Web site
– http://madlib.apache.org/
• Wiki
– https://cwiki.apache.org/confluence/display/MADLIB/Apache+MADlib
• User docs
– http://madlib.apache.org/docs/latest/index.html
• Jupyter notebooks
– https://github.com/apache/madlib-site/tree/asf-site/community-artifacts
• Technical docs
– http://madlib.apache.org/design.pdf
• Pivotal commercial site
– http://pivotal.io/madlib
• Mailing lists and JIRAs
– https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/
– http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/
– https://issues.apache.org/jira/browse/MADLIB
• PivotalR
– https://cran.r-project.org/web/packages/PivotalR/index.html
• Github
– https://github.com/apache/madlib
– https://github.com/pivotalsoftware/PivotalR
Execution Flow
Client
Database
Server
Master
Segment 1
Segment 2
Segment n
…
SQL
Stored
Procedure
Result
Set
String
Aggregation
psql
…
Artificial Intelligence Landscape
Deep
Learning
Distributed Deep Learning Methods
• Open area of research*
• Methods we have investigated so far:
– Simple averaging
– Ensembling
– Elastic averaging stochastic gradient descent
(EASGD)
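Of the methods listed, elastic averaging is the least self-explanatory, so here is a toy sketch of one synchronous EASGD step as the update rule is commonly stated: each worker follows its own gradient while being pulled toward a shared center variable, and the center moves toward the workers. The deck does not describe how the method was implemented; the scalar parameter, the quadratic objective, and the values of `eta` and `rho` are assumptions for illustration.

```python
def easgd_step(workers, center, grad, eta, rho):
    """One synchronous EASGD update for scalar parameters.

    workers: current parameter value held by each worker
    center:  shared center variable
    grad:    function returning the gradient at a parameter value
    eta:     learning rate, rho: strength of the elastic pull
    """
    new_workers = [x - eta * (grad(x) + rho * (x - center)) for x in workers]
    # The center update uses the workers' values from before this step.
    new_center = center + eta * rho * sum(x - center for x in workers)
    return new_workers, new_center

# Toy objective f(x) = 0.5 * (x - target)^2, whose gradient is x - target.
target = 3.0

def grad(x):
    return x - target

workers = [0.0, 1.0, 5.0, 8.0]  # four workers with different starting points
center = 0.0
for _ in range(500):
    workers, center = easgd_step(workers, center, grad, eta=0.1, rho=0.5)

print(center, workers)
```

On this toy problem the workers and the center all settle at the target value; in real training each worker's gradient would come from its own mini-batches.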
* Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf
Some Results with CIFAR-10
• 60k 32x32 color
images in 10 classes,
with 6k images per
class
• 50k training images
and 10k test images
https://www.cs.toronto.edu/~kriz/cifar.html
■ Experimentation -> Modeling at scale -> Deployment all in SQL
■ Single platform from model development to Deployment using Postgres/Greenplum
■ Low latency inference
■ Easy to deploy both feature generation code and model
■ Join data from event message with Feature cache objects using ANSI SQL
■ Continuously generate features and feed them into the feature engine.
■ Multiple versions of Models can be deployed for accuracy measurement.
■ Same tool can deploy to multiple Container Environments, PKS, AKS, GKE, etc.
MADlib Flow Benefits
Model Training
Model Testing

More Related Content

What's hot

Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformJean-Paul Azar
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingKubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingTheofilos Papapanagiotou
 
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019VMware Tanzu
 
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...NETWAYS
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep DiveRed_Hat_Storage
 
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...Lviv Startup Club
 
Kubernetes
KubernetesKubernetes
Kuberneteserialc_w
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdKohei Tokunaga
 
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon Web Services
 
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...confluent
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connectconfluent
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with PythonGokhan Atil
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 

What's hot (20)

Kafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platformKafka Tutorial - basics of the Kafka streaming platform
Kafka Tutorial - basics of the Kafka streaming platform
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Greenplum Roadmap
Greenplum RoadmapGreenplum Roadmap
Greenplum Roadmap
 
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model ServingKubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
Kubecon 2023 EU - KServe - The State and Future of Cloud-Native Model Serving
 
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
Greenplum and Kafka: Real-time Streaming to Greenplum - Greenplum Summit 2019
 
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
OSMC 2022 | VictoriaMetrics: scaling to 100 million metrics per second by Ali...
 
Ceph Block Devices: A Deep Dive
Ceph Block Devices:  A Deep DiveCeph Block Devices:  A Deep Dive
Ceph Block Devices: A Deep Dive
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
Sergii Bielskyi "Using Kafka and Azure Event hub together for streaming Big d...
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Kubernetes
KubernetesKubernetes
Kubernetes
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
Amazon RDS with Amazon Aurora | AWS Public Sector Summit 2016
 
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
Disaster Recovery with MirrorMaker 2.0 (Ryanne Dolan, Cloudera) Kafka Summit ...
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connect
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 

Similar to AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019

Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...VMware Tanzu
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning InfrastructureSigOpt
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017Joshua Patterson
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesAlice Zheng
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018Adam Gibson
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLDESMOND YUEN
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...Srivatsan Ramanujam
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsGanesan Narayanasamy
 

Similar to AI on Greenplum Using
 Apache MADlib and MADlib Flow - Greenplum Summit 2019 (20)

Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
Machine Learning, Graph, Text and Geospatial on Postgres and Greenplum - Gree...
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
An Analytics Platform for Connected Vehicles
An Analytics Platform for Connected VehiclesAn Analytics Platform for Connected Vehicles
An Analytics Platform for Connected Vehicles
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017GOAI: GPU-Accelerated Data Science DataSciCon 2017
GOAI: GPU-Accelerated Data Science DataSciCon 2017
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
A Pipeline for Distributed Topic and Sentiment Analysis of Tweets on Pivotal ...
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systemsTraditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
Traditional Machine Learning and Deep Learning on OpenPOWER/POWER systems
 

AI on Greenplum Using Apache MADlib and MADlib Flow - Greenplum Summit 2019

  • 2. © Copyright 2019 Pivotal Software, Inc. All rights Reserved. Sridhar Paladugu Frank McQuillan AI on Greenplum Using Apache MADlib and MADlib Flow
  • 3. Greenplum Integrated Analytics Data Transformation Traditional BI Machine Learning Graph Data Science Productivity Tools Geospatial Text Deep Learning Build Manage Deploy
  • 4. ■ Machine learning ■ Deep learning ■ Model management ■ Deployment and orchestration of models Agenda
  • 5. 1. Machine Learning with Apache MADlib
  • 6. Scalable, In-Database Machine Learning • Open source https://github.com/apache/madlib • Downloads and docs http://madlib.apache.org/ • Wiki https://cwiki.apache.org/confluence/display/MADLIB/ Apache MADlib: Big Data Machine Learning in SQL Open source, top level Apache project For PostgreSQL and Greenplum Database Powerful machine learning, graph, statistics and analytics for data scientists
  • 7. History MADlib project was initiated in 2011 by EMC/Greenplum architects and Professor Joe Hellerstein from University of California, Berkeley. UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills.
  • 8. Functions (Apache MADlib 1.15.1). Data Types and Transformations: Array and Matrix Operations; Matrix Factorization (Low Rank, Singular Value Decomposition (SVD)); Norms and Distance Functions; Sparse Vectors; Encoding Categorical Variables; Path Functions; Pivot; Sessionize; Stemming. Graph: All Pairs Shortest Path (APSP); Breadth-First Search; Hyperlink-Induced Topic Search (HITS); Average Path Length; Closeness Centrality; Graph Diameter; In-Out Degree; PageRank and Personalized PageRank; Single Source Shortest Path (SSSP); Weakly Connected Components. Model Selection: Cross Validation; Prediction Metrics; Train-Test Split. Statistics: Descriptive Statistics (Cardinality Estimators, Correlation and Covariance, Summary); Inferential Statistics (Hypothesis Tests); Probability Functions. Supervised Learning: Neural Networks; Support Vector Machines (SVM); Conditional Random Field (CRF); Regression Models (Clustered Variance, Cox-Proportional Hazards Regression, Elastic Net Regularization, Generalized Linear Models, Linear Regression, Logistic Regression, Marginal Effects, Multinomial Regression, Naïve Bayes, Ordinal Regression, Robust Variance); Tree Methods (Decision Tree, Random Forest). Time Series Analysis: ARIMA. Unsupervised Learning: Association Rules (Apriori); Clustering (k-Means); Principal Component Analysis (PCA); Topic Modelling (Latent Dirichlet Allocation). Utility Functions: Columns to Vector; Conjugate Gradient; Linear Solvers (Dense Linear Systems, Sparse Linear Systems); Mini-Batching; PMML Export; Term Frequency for Text; Vector to Columns. Nearest Neighbors: k-Nearest Neighbors. Sampling: Balanced; Random; Stratified. Comprehensive and mature data science library
  • 9. Why MADlib on Greenplum? • Better parallelism • Better scalability • Higher predictive accuracy • Top level ASF project “Apache MADlib Comes of Age”, Frank McQuillan, Oct. 2017, https://content.pivotal.io/blog/apache-madlib-comes-of-age
  • 10. Greenplum Database with MADlib Standby Master … Master Host SQL Interconnect Segment Host Node1 Segment Host Node2 Segment Host Node3 Segment Host NodeN Local Storage Other RDBMSes Spark GemFire Cloud Object Storage HDFS Kafka ETL Spring Cloud Data Flow In-Database Functions Machine learning & statistics & math & graph & utilities Massively Parallel Processing
  • 11. Iterative Model Execution. Stored Procedure for Model (Master): model = init(…); WHILE model not converged: model = SELECT model.aggregation(…) FROM data table; ENDWHILE. Broadcast to Segment 1, Segment 2, … Segment n. 1. Transition Function: operates on tuples or mini-batches to update transition state (model). 2. Merge Function: combines transition states. 3. Final Function: transforms transition state into output value.
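The transition, merge, and final steps on this slide follow the familiar user-defined aggregate pattern. The following is only a minimal sketch of that pattern in plain Python, not MADlib's actual implementation: the function names, the one-variable least-squares model, and the three hand-made "segments" are all assumptions chosen for illustration.

```python
from functools import reduce

# Hypothetical transition state for 1-D least squares:
# (n, sum_x, sum_y, sum_xx, sum_xy).
EMPTY_STATE = (0, 0.0, 0.0, 0.0, 0.0)


def transition(state, row):
    """Fold one (x, y) tuple into the segment-local state."""
    n, sx, sy, sxx, sxy = state
    x, y = row
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)


def merge(a, b):
    """Combine the states produced by two segments."""
    return tuple(u + v for u, v in zip(a, b))


def final(state):
    """Turn the merged state into the output value: (slope, intercept)."""
    n, sx, sy, sxx, sxy = state
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def aggregate(segments):
    """Run transition on each segment, then merge and finalize centrally."""
    local_states = [reduce(transition, rows, EMPTY_STATE) for rows in segments]
    return final(reduce(merge, local_states, EMPTY_STATE))


# Toy data y = 2x + 1, spread over three pretend segments.
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
slope, intercept = aggregate(segments)
```

Because the state is a set of sums, the result does not depend on how rows are split across segments. For iterative methods, the WHILE loop on the master would repeat such a pass, broadcasting the updated model before each iteration.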
  • 12. Familiar SQL Interface Train (build a predictive model) Predict (use model on new data)
  • 13. Familiar SQL Interface From house pricing model
  • 14. SVM Scale with Data Size Greenplum cluster: ● 1 master ● 4 segment hosts with 6 segments per host Support Vector Machines
  • 15. PageRank Scale with Graph Size Greenplum cluster: ● 1 master ● 4 segment hosts with 6 segments per host Normal random graphs with a mean degree of 50 edges per vertex (i.e., 5B edges in the largest case) [Plot, log-log scale: axis ticks from 1K to 100M and from 1 s to 1M s] “Graph Processing on Greenplum Database using Apache MADlib”, Frank McQuillan, Jan 2018, https://content.pivotal.io/blog/graph-processing-on-greenplum-database-using-apache-madlib
  • 16. But modeling is only part of the story... “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.” - Jeffrey Heer, Professor of Computer Science at the University of Washington and Co-founder of Trifacta “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”, Aug. 17, 2014 https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
  • 18. 2. Deep Learning
  • 19. Deep Learning • Type of machine learning inspired by biology of the brain • Artificial neural networks with multiple layers between input and output
  • 20. Example Deep Learning Algorithms Multilayer perceptron (MLP) “The Original” Recurrent neural network (RNN) E.g., machine translation Convolutional neural network (CNN) E.g., image classification
  • 21. Convolutional Neural Networks (CNN) • Effective for computer vision • Fewer parameters than fully connected networks • Translational invariance • Classic networks: LeNet-5, AlexNet, VGG
  • 22. Graphics Processing Units (GPUs) • Great at performing a lot of simple computations such as matrix operations • Well suited to deep learning algorithms
  • 24. Moving Data Greenplum <-> Single Server. Deep learning. Data preparation, feature generation, machine learning, geospatial, etc. Large data transfer. Suboptimal.
  • 25. Integrated Deep Learning with Greenplum Standby Master … Master Host SQL Interconnect Segment Host Node1 Segment Host Node2 Segment Host Node3 Segment Host NodeN GPU N … GPU 1 GPU N … GPU 1 GPU N … GPU 1 … GPU N … GPU 1 In-Database Functions Machine learning & statistics & math & graph & utilities Massively Parallel Processing
  • 26. Deep Learning on a Cluster. Approaches: 1. Distributed deep learning: Train single model architecture across the cluster. Data distributed (usually randomly) across segments. 2. Data parallel models: Train same model architecture in parallel on different data groups (e.g., build separate models per country). 3. Hyperparameter tuning: Train same model architecture in parallel with different hyperparameter settings and incorporate cross validation. Same data on each segment. 4. Neural architecture search: Train different model architectures in parallel. Same data on each segment. Current work
  • 27. Data Loading and Formatting
  • 28. Testing Infrastructure • Google Cloud Platform (GCP) • Type n1-highmem-32 (32 vCPUs, 208 GB memory) • NVIDIA Tesla P100 GPUs • Greenplum database config – Tested up to 20 segment (worker node) clusters – 1 GPU per segment
  • 29. 6-layer CNN - Runtime (CIFAR-10) Method: Model weight averaging
  • 30. 3. Model Management
  • 31. Try and try and try... • Data scientists typically try many different types of models with many different parameter combinations
  • 32. Model Persistence in MADlib 1.x One model at a time
  • 33. Model Persistence in MADlib 2.0 Multiple models at a time in model library
  • 34. 4. MADlib Flow
  • 35. Data Science Process Model Evaluation Operationalization Model Building Feature Engineering Data Review User Feedback Problem Definition Setup
  • 36. Model Operationalization. Model Operationalization is the process of deploying data science models to production for ongoing use by other software
  • 37. Common Challenges With Operationalizing Models. Common challenges with model operationalization: ● Handling production data ● Engineering for scale and performance ● Model transportation ● Managing and orchestrating deployed models ● Data Scientists are not developers or platform experts
  • 38. Patterns For Operationalizing Models. BATCH TRAINING, BATCH INFERENCE (~40% of today’s use cases). Example: Tax Return Fraud: Score database of tax returns - on a nightly basis - to flag likely fraudulent returns for audit. EVENT DRIVEN TRAINING, EVENT DRIVEN INFERENCE (<5% of today’s use cases). Example: Online Advertising: Maximize Click Thru Rate by algorithmically selecting and testing advertisement placement in real time. BATCH TRAINING, EVENT DRIVEN INFERENCE (~55% of today’s use cases, growing). Example: Real Time Transaction Fraud: Train a ML model on historical data to classify - in real time - whether or not new credit/debit transactions are likely to be fraudulent. Notes: PostgreSQL/Greenplum with MADlib supports this pattern. PostgreSQL/Greenplum with MADlib & MADlib Flow supports this pattern. Highly specialized – low number of enterprise use cases.
  • 39. AI For The PostgreSQL Community. Standardized end-to-end Data Science in SQL with the Greenplum/Postgres stack. Experimentation: Initial code development and testing, model experimentation on samples. Modeling at Scale: Heavy compute tasks such as model training across big data. Deployment: Production deployment of models to feed downstream applications and reports. Artificial Intelligence: Closed Loop Machine Learning
  • 40. Model Deployment With MADlib Flow. 1. ML Training: Train ML model in Postgres or Greenplum using Apache MADlib. 2. madlibflow --deploy: Set configs in .yml and deploy model from Greenplum to Docker, PCF or Kubernetes. 3. Docker pull: Pull docker containers with optimized Postgres and MADlib. 4. Pull Model: Extract model and feature table schema layout from Greenplum database. 5. Load Model: Load model and feature table schema into optimized Postgres. 6. Deploy: Deploy docker container to target environment. Legend: User Operations; Automated Backend Operations.
  • 41. Containerized Deployment Of Models $ madlibflow --deploy --target kubernetes --type model Key benefits of MADlib Flow ● Easy to deploy & lightweight ● Highly scalable REST and Streaming ● End-to-end SQL workflow ● Low latency inference/predictions ● Feature Transformations Single command to deploy a MADlib trained model from GPDB/Postgres to Docker, PCF or Kubernetes Containerized deployment of Apache MADlib Machine Learning workflows for low latency event driven inference and scale
  • 43. MADlib Flow: Hello World! Let us demonstrate a Linear Regression Model deployment. Dependent variable: ● patient has had a second heart attack within 1 year. Independent variables: ● patient completed a treatment on anger control ● anxiety scale score. Workflow: Create schema, Load data, Train model, Deploy model, Test, Batch prediction
  • 44. Model Deployment Deployment manifest $ madlibflow --name patient-lr --type model --action deploy --target kubernetes --inputJson config.json
  • 46. Example Flow for Fraud Detection. Credit/Debit Card Transaction (Input) Message: { "transaction_ts": , "credit_card_number": , "transaction_amt": , "merchant_id": }. Approved Credit/Debit Card Transaction (Output) Message: { "transaction_ts": , "transaction_amt": , "credit_card_number": , "num_transactions_30days": , "max_transactions_30days": , "merchant_id": , "num_fraud_cases": , "avg_transaction_amount_30days": , "fraud_risk_score": 0.92, "approved": True }. Greenplum Database tables: Accounts (credit_card_number, num_transactions_30days, max_transactions_30days); Merchants (merchant_id, num_fraud_cases, avg_transaction_amount_30days). Components: Feature Engine, MADlib REST, Cache Loader, Cache Abstraction, Cache (Gemfire, PCC, Redis, etc.). Feature query: SELECT mch.*, acct.*, log(msg.transaction_amt + 1) AS log_transaction_amt FROM message msg JOIN merchants mch ON msg.merchant_id = mch.merchant_id JOIN accounts acct ON msg.credit_card_number = acct.credit_card_number; Accounts: SELECT create_accounts(); Merchants: SELECT create_merchants(); Automated deployment of scalable low latency end-to-end ML pipelines (“Data Science Ops.”). No code conversion - engineer features and populate cache in SQL. Join data from the incoming message with cached data.
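The join between the incoming message and the cached account and merchant rows can be sketched outside SQL as well. This is an illustrative Python sketch only: the in-memory dictionaries stand in for the cache, and the card number, merchant id, and feature values are hypothetical. It uses a base-10 logarithm on the assumption that the slide's query runs with PostgreSQL semantics, where single-argument `log()` is base 10.

```python
import math

# Hypothetical cached feature tables, keyed the way the slide's joins are keyed.
ACCOUNTS = {
    "4111-0000": {"num_transactions_30days": 12, "max_transactions_30days": 3},
}
MERCHANTS = {
    "m-42": {"num_fraud_cases": 1, "avg_transaction_amount_30days": 80.0},
}


def build_features(message, accounts=ACCOUNTS, merchants=MERCHANTS):
    """Join one incoming message with cached account and merchant rows."""
    # A missing key raises KeyError here; an SQL inner join would instead
    # return no row for that message.
    acct = accounts[message["credit_card_number"]]
    mch = merchants[message["merchant_id"]]
    features = dict(message)
    features.update(mch)
    features.update(acct)
    # Assumed PostgreSQL semantics: log(x) is base 10, hence log10 here.
    features["log_transaction_amt"] = math.log10(message["transaction_amt"] + 1)
    return features


message = {
    "transaction_ts": "2019-03-19T10:00:00",
    "credit_card_number": "4111-0000",
    "transaction_amt": 99.0,
    "merchant_id": "m-42",
}
features = build_features(message)
```

The resulting dictionary carries the same kinds of fields as the output message on the slide, minus the score and approval flag, which would come from the deployed model.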
  • 47. 5. Learn More!
  • 48. • Download – http://madlib.apache.org/ • ~40 Jupyter notebooks – https://github.com/apache/madlib-site/tree/asf-site/community-artifacts • Wednesday March 20 @PostgresConf
  • 49. #ScaleMatters
  • 50. Backup Slides
  • 51. Looking Ahead. MADlib 1.16: ● Initial deep learning release for image classification (Keras/TensorFlow) ● Postgres 11 support ● Improve speed of k-nearest neighbors via approximate method. MADlib 2.0: ● More deep learning capabilities ○ Improved model performance ○ Hyperparameter tuning ● Model repositories and management for streamlined data science workflows ● New and improved SQL interface for MADlib functions. MADlib Flow: ● Support for PL/Python and PL/R ● Native deployment to Pivotal Cloud Foundry as a buildpack ● Beta release in May ’19 ● Metrics collector
  • 52. Apache MADlib Resources • Web site – http://madlib.apache.org/ • Wiki – https://cwiki.apache.org/confluence/display/MADLIB/Apache+MADlib • User docs – http://madlib.apache.org/docs/latest/index.html • Jupyter notebooks – https://github.com/apache/madlib-site/tree/asf-site/community-artifacts • Technical docs – http://madlib.apache.org/design.pdf • Pivotal commercial site – http://pivotal.io/madlib • Mailing lists and JIRAs – https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/ – http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/ – https://issues.apache.org/jira/browse/MADLIB • PivotalR – https://cran.r-project.org/web/packages/PivotalR/index.html • Github – https://github.com/apache/madlib – https://github.com/pivotalsoftware/PivotalR
  • 53. Execution Flow Client Database Server Master Segment 1 Segment 2 Segment n … SQL Stored Procedure Result Set String Aggregation psql …
  • 55. Distributed Deep Learning Methods • Open area of research* • Methods we have investigated so far: – Simple averaging – Ensembling – Elastic averaging stochastic gradient descent (EASGD) * Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis https://arxiv.org/pdf/1802.09941.pdf
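Simple averaging, the first method listed above, can be illustrated with a toy model. This is a hedged sketch rather than the MADlib implementation: it swaps the neural network for a two-parameter linear model trained by plain SGD, and the function names, learning rate, round count, and round-robin data split are all assumptions made for the example.

```python
def local_sgd_pass(weights, rows, lr):
    """One SGD pass for y ~ w*x + b over a single segment's rows."""
    w, b = weights
    for x, y in rows:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


def average(models):
    """Simple (unweighted) averaging of the per-segment weights."""
    k = len(models)
    return tuple(sum(m[i] for m in models) / k for i in range(2))


def train(segments, rounds=1000, lr=0.1):
    """Alternate local training on each segment with averaging centrally."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        # "Broadcast" the current weights; each segment trains on its shard.
        local = [local_sgd_pass(weights, rows, lr) for rows in segments]
        weights = average(local)
    return weights


# Noise-free toy data y = 3x + 0.5, dealt round-robin to three segments.
data = [(i / 10.0, 3.0 * (i / 10.0) + 0.5) for i in range(12)]
segments = [data[k::3] for k in range(3)]
w, b = train(segments)
```

On this noise-free data the averaged weights settle near the generating values (3 and 0.5). With real networks and noisy data, how well averaging works is exactly the open research question the slide points to.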
  • 56. Some Results with CIFAR-10 • 60k 32x32 color images in 10 classes, with 6k images per class • 50k training images and 10k test images https://www.cs.toronto.edu/~kriz/cifar.html
  • 57. MADlib Flow Benefits ■ Experimentation -> Modeling at scale -> Deployment, all in SQL ■ Single platform from model development to deployment using Postgres/Greenplum ■ Low latency inference ■ Easy to deploy both feature generation code and model ■ Join data from event message with feature cache objects using ANSI SQL ■ Continuously generate the features and feed them into the feature engine ■ Multiple versions of models can be deployed for accuracy measurement ■ Same tool can deploy to multiple container environments (PKS, AKS, GKE, etc.)