Srivatsan Ramanujam
Senior Data Scientist
Greenplum

© Copyright 2011 EMC Corporation. All rights reserved.

Agenda
• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture

• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout

• PyMADlib
– Overview
– Demo in IPython Notebook

• Future Directions
– GPHD and HAWQ

Greenplum Overview

Products

Greenplum Database - Architecture
• MPP (Massively Parallel Processing), Shared-Nothing Architecture
• Master Servers: query planning & dispatch (SQL, MapReduce)
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.

MADlib

MADlib: The Origin

UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.

• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions
from UW Madison, UFlorida and UC Berkeley.

Current Modules

Data Modeling
• Supervised Learning
– Naive Bayes Classification
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Decision Tree
– Random Forest
– Support Vector Machines
– Cox-Proportional Hazards Regression
– Conditional Random Field
• Unsupervised Learning
– Association Rules
– k-Means Clustering
– Low-rank Matrix Factorization
– SVD Matrix Factorization
– Parallel Latent Dirichlet Allocation

Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile

Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling

Inferential Statistics
• Hypothesis tests

MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net

How does it work? A Linear Regression Example
• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?

# select y, x1, x2 from unm limit 6;

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

– Column y: vector of dependent variables y
– Columns x1, x2: design matrix X

Reminder: Linear-Regression Model
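• Model: $y_i = c^T x_i + \varepsilon_i$, where $x_i$ is row $i$ of the design matrix $X$ and $\varepsilon_i$ is the residual
• If the residuals are i.i.d. Gaussian with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals

• First-order conditions for the following quadratic objective (in c)
  $f(c) = \sum_i (y_i - c^T x_i)^2$
yield the minimizer
  $\hat{c} = (X^T X)^{-1} X^T y$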
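
The step from the quadratic objective to its minimizer can be written out as a short derivation. This is a sketch under the assumption that $X^T X$ is invertible (that is, $X$ has full column rank); the symbols $f$ and $\hat{c}$ are notation introduced here, not taken from the slide.

```latex
\begin{align*}
f(c) &= \sum_i (y_i - c^T x_i)^2 = (y - Xc)^T (y - Xc) \\
\nabla_c f(c) &= -2\, X^T (y - Xc) = 0 \\
X^T X\, c &= X^T y \\
\hat{c} &= (X^T X)^{-1} X^T y
\end{align*}
```

The third line is the system of normal equations; only the two products $X^T X$ and $X^T y$ are needed to solve it.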

Linear Regression: Streaming Algorithm
• How to compute $(X^T X)^{-1} X^T y$ with a single table scan?
– Accumulate $X^T X$ and $X^T y$ during the scan

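
The single-scan idea can be sketched in plain Python: each row contributes an outer product to $X^T X$ and a scaled vector to $X^T y$, so one pass over the table is enough. The helper names `accumulate` and `solve` are hypothetical and are not MADlib's internals; the six rows are the sample from the `unm` query shown earlier, with a constant 1 prepended for the intercept c0.

```python
# Sketch only: helper names are hypothetical, not MADlib's implementation.

def accumulate(rows, k):
    """One pass over (y, x) rows; returns the running sums X^T X and X^T y."""
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for t in range(col, n + 1):
                m[r][t] -= f * m[col][t]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


table = [
    (10.14, 0.00, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]
# A generator stands in for the table scan: every row is visited exactly once.
rows = ((y, [1.0, x1, x2]) for y, x1, x2 in table)
xtx, xty = accumulate(rows, 3)
coef = solve(xtx, xty)  # [c0, c1, c2]
print(coef)
```

The accumulated state has a fixed size (a k × k matrix and a k-vector), independent of the number of rows, which is what makes the single scan possible.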
Linear Regression: Parallel Computation
• Segment 1 computes $X_1^T y_1$
• Segment 2 computes $X_2^T y_2$
• Master combines the partial results: $X^T y = X_1^T y_1 + X_2^T y_2$

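
The segment/master split can be illustrated with a small Python sketch: each segment accumulates its own partial sums over its local rows, and the master adds the partial results element-wise. The function names `partial_sums` and `merge` and the split of the six sample rows into two segments are assumptions made for illustration; they do not reflect how GPDB actually distributes data.

```python
# Sketch only: names and the row split are hypothetical.

def partial_sums(rows, k):
    """Segment-side work: accumulate X_s^T X_s and X_s^T y_s over local rows."""
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master-side work: add two partial results element-wise."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


segment_1 = [
    (10.14, [1.0, 0.00, 0.3]),
    (11.93, [1.0, 0.69, 0.6]),
    (13.57, [1.0, 1.10, 0.9]),
]
segment_2 = [
    (14.17, [1.0, 1.39, 1.2]),
    (15.25, [1.0, 1.61, 1.5]),
    (16.15, [1.0, 1.79, 1.8]),
]

merged = merge(partial_sums(segment_1, 3), partial_sums(segment_2, 3))
single = partial_sums(segment_1 + segment_2, 3)  # same sums from one scan
```

Because the merge is a plain sum, it does not matter how the rows are split across segments; `merged` and `single` agree up to floating-point rounding.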
Performance Comparison: Test Setup on AWB
• AWB (Analytics Workbench)
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3

• Mahout v0.7
• MADlib v0.5
– With small LMF change to allow 4-byte integer values

• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations

Performance & Scalability Results (summary)

• Whitepaper coming out shortly!

Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation

[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Minutes.]

Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Minutes.]

K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]. x-axis: log(Number of Rows); y-axis: Time in Min.]

K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. x-axis: Number of GPDB Segments; y-axis: Time in Min.]

PyMADlib : Python + MADlib = Awesome!

Motivation
• SQL is great for many things, but it’s not nearly enough

• Undeniably the most straightforward way to query data

• But not necessarily designed for data science

MADlib is a godsend!
• Empowers data scientists to run canned machine learning
routines – focus less on coding, more on science
• In-database, explicitly parallel.

• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries

Then which interface is favored by and familiar
to data scientists?

• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”

Wait, don’t we already have this (PL/R,
PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user

• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable

• SAS HPA = $$$$$

The challenge
• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL

• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science

• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science

• Want to leverage both the performance benefits of MADlib and the
usability of languages like Python, SAS, and R

Simple solution: Translate Python code into SQL
• Python → SQL, sent over ODBC/JDBC
– To the database: SQL to execute MADlib
– Back to Python: model output

• All data stays in the DB, and all model estimation and heavy lifting is done in the DB by MADlib
• Only strings of SQL and model output are transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language: Python.
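
The translation step can be sketched as a small Python function that turns a Python-level call into the SQL string sent to the database. This is not PyMADlib's actual API: the helper `linregr_sql` and the output table name `unm_model` are hypothetical, and the `madlib.linregr_train` call follows the signature of later MADlib releases, so treat it as an assumption for the MADlib v0.5 used in this talk.

```python
import re

# Hypothetical helper, not part of PyMADlib. It only builds a SQL string;
# sending it over ODBC/JDBC is left to whatever database driver is in use.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def linregr_sql(source_table, out_table, dependent, independents):
    """Build the SQL that asks MADlib to fit a linear regression in-database."""
    cols = ", ".join(_check(c) for c in independents)
    return (
        "SELECT madlib.linregr_train("
        f"'{_check(source_table)}', '{_check(out_table)}', "
        f"'{_check(dependent)}', 'ARRAY[1, {cols}]');"
    )


sql = linregr_sql("unm", "unm_model", "y", ["x1", "x2"])
print(sql)
```

Only this string crosses the connection on the way in, and only the fitted coefficients come back, which is the point of the design: the table rows never leave the database.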

Demo

PyMADlib Tutorial –
IPython Notebook Viewer Link

http://nbviewer.ipython.org/5275846

Where do I get it?

$ pip install pymadlib

I don’t have GPDB or MADlib – What do I do ?
• Greenplum Database Community Edition is freely
available for single node installations on multiple
platforms
– Written permission may be requested from EMC/Greenplum
for research use for multi-node installations

• MADlib is free and open-source
– Downloadable for multiple platforms from
https://github.com/madlib/madlib

• PyMADlib is also free and open-source 
– Downloadable from https://github.com/vatsan/pymadlib

Future Directions

Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key technological advantages of the industry-leading Greenplum Database with the scalability and convenience of Hadoop

• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes
+ range of scalar and aggregate functions

• ACID Compliant

HAWQ – Architecture

Performance: HAWQ [1] vs. Hive vs. Impala [2]

All experiments were run on a 60 node deployment with Analytics Workbench [3]

[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/

HAWQ: Deep Scalable Analytics
What’s inside the box?

• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means

• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also
supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib

Questions?
@being_bayesian
vatsan.cs@utexas.edu
https://github.com/vatsan/pymadlib

Appendix

Datasets
The following datasets were used in comparing the performance of
MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings

PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learning library.

  • 1. Srivatsan Ramanujam Senior Data Scientist Greenplum © Copyright 2011 EMC Corporation. All rights reserved. 1
  • 2. Agenda • Greenplum UAP overview – Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance – GPDB Architecture • MADlib – – – – Overview Algorithms Working Mechanism Performance Comparison with Mahout • PyMADlib – Overview – Demo in IPython Notebook • Future Directions – GPHD and HAWQ © Copyright 2011 EMC Corporation. All rights reserved. 2
  • 3. Greenplum Overview © Copyright 2011 EMC Corporation. All rights reserved. 3
  • 4. Products © Copyright 2011 EMC Corporation. All rights reserved. 4
  • 5. Greenplum Database - Architecture MPP (Massively Parallel Processing) Shared-Nothing Architecture Master Servers ... SQL MapReduce ... Query planning & dispatch Network Interconnect Segment Servers ... ... Query processing & data storage External Sources Loading, streaming, etc. © Copyright 2011 EMC Corporation. All rights reserved. 5
  • 6. MADlib © Copyright 2011 EMC Corporation. All rights reserved. 6
  • 7. MADlib: The Origin UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills. • First mention of MAD analytics was at VLDB’09 – MAD Skills: New Analysis Practices for Big Data – Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf • MADlib project initiated in late 2010 – Maintained by Greenplum/EMC with significant contributions from UW Madison, UFlorida and UC Berkeley. © Copyright 2011 EMC Corporation. All rights reserved. 7
  • 8. Current Modules Data Modeling Supervised Learning • • • • • • • • • Naive Bayes Classification Linear Regression Logistic Regression Multinomial Logistic Regression Decision Tree Random Forest Support Vector Machines Cox-Proportional Hazards Regression Conditional Random Field Unsupervised Learning • Association Rules • k-Means Clustering • Low-rank Matrix Factorization • SVD Matrix Factorization • Parallel Latent Dirichlet Allocation Descriptive Statistics Sketch-based Estimators • CountMin (CormodeMuthukrishnan) • FM (Flajolet-Martin) • MFV (Most Frequent Values) Profile Quantile Support Array Operations Conjugate Gradient Sparse Vectors Probability Functions Random Sampling Inferential Statistics Hypothesis tests © Copyright 2011 EMC Corporation. All rights reserved. 8
  • 9. MADlib – User Doc • Check out the user guide with examples at: http://doc.madlib.net © Copyright 2011 EMC Corporation. All rights reserved. 9
  • 10. How does it work ? : A Linear Regression Example • Finding linear dependencies between variables – y ≈ c0 + c1 · x1 + c2 · x2 ? # select y, x1, x2 Vector of dependent variables y © Copyright 2011 EMC Corporation. All rights reserved. from unm limit 6; y | x1 | x2 -------+------+----10.14 | 0 | 0.3 11.93 | 0.69 | 0.6 13.57 | 1.1 | 0.9 14.17 | 1.39 | 1.2 15.25 | 1.61 | 1.5 16.15 | 1.79 | 1.8 Design matrix X 10
  • 11. Reminder: Linear-Regression Model • If the residuals are i.i.d. Gaussian with standard deviation σ: – maximum likelihood ⇔ minimum sum of squared residuals • First-order conditions for the following quadratic objective (in c) yield the minimizer
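The equations on this slide were images and did not survive transcription. A standard reconstruction, consistent with the model on slide 10 and the closed form on slide 12, might look like the following. The notation is an assumption made here, not taken from the slide: x_i is the i-th feature row with a leading 1 for the intercept, X stacks those rows, n is the number of rows, and ĉ is the estimate.

```latex
\begin{align}
  y_i &= c^{T} x_i + \varepsilon_i,
      \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \\
  \log L(c) &= -\frac{n}{2}\log(2\pi\sigma^2)
      - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - c^{T}x_i\bigr)^2 \\
  f(c) &= \sum_{i=1}^{n}\bigl(y_i - c^{T}x_i\bigr)^2 = \lVert y - Xc \rVert^2 \\
  \nabla f(c) &= -2\,X^{T}(y - Xc) = 0
      \iff X^{T}X\,c = X^{T}y \\
  \hat{c} &= (X^{T}X)^{-1}X^{T}y
\end{align}
```

Only the sum of squares depends on c in the log-likelihood, which is why maximizing the likelihood and minimizing the squared residuals pick the same c. The last line assumes X^T X is invertible.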
  • 12. Linear Regression: Streaming Algorithm • How to compute (XᵀX)⁻¹ Xᵀy with a single table scan? – Xᵀ · X = XᵀX – Xᵀ · y = Xᵀy
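The single-scan idea can be sketched in plain Python: XᵀX and Xᵀy are sums over rows, so they can be accumulated one row at a time and the small system solved at the end. This is an illustrative sketch, not MADlib's implementation. The function names `accumulate` and `solve` are hypothetical, and the data are the six rows shown on slide 10 with a leading 1 added for the intercept.

```python
def accumulate(rows, k):
    """One pass over (y, x) rows; returns (XtX, Xty) as running sums."""
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def solve(a, b):
    """Solve a * c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("XtX is singular; columns are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


# Rows from slide 10 as (y, [1, x1, x2]); the leading 1 models the intercept c0.
ROWS = [
    (10.14, [1.0, 0.00, 0.3]),
    (11.93, [1.0, 0.69, 0.6]),
    (13.57, [1.0, 1.10, 0.9]),
    (14.17, [1.0, 1.39, 1.2]),
    (15.25, [1.0, 1.61, 1.5]),
    (16.15, [1.0, 1.79, 1.8]),
]

xtx, xty = accumulate(ROWS, 3)   # the only pass over the data
coef = solve(xtx, xty)           # [c0, c1, c2]
```

The state kept during the scan is only a k × k matrix and a length-k vector, so its size depends on the number of features and not on the number of rows.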
  • 13. Linear Regression: Parallel Computation • Xᵀy is computed in parts – Segment 1: X₁ᵀy₁ – Segment 2: X₂ᵀy₂ – Master: Xᵀy = X₁ᵀy₁ + X₂ᵀy₂
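Because the partial results are plain sums, the master only has to add them. The sketch below illustrates this with two hypothetical segments holding the slide-10 rows. `local_state` and `merge` are names invented for this example and are not MADlib functions.

```python
import math
from functools import reduce


def local_state(rows):
    """Partial sums (Xk^T Xk, Xk^T yk) over the rows held by one segment."""
    k = len(rows[0][1])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for y, x in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master-side combine: the partial sums add element by element."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


# Slide-10 rows as (y, [1, x1, x2]), split across two hypothetical segments.
ROWS = [
    (10.14, [1.0, 0.00, 0.3]),
    (11.93, [1.0, 0.69, 0.6]),
    (13.57, [1.0, 1.10, 0.9]),
    (14.17, [1.0, 1.39, 1.2]),
    (15.25, [1.0, 1.61, 1.5]),
    (16.15, [1.0, 1.79, 1.8]),
]
segment_1, segment_2 = ROWS[:3], ROWS[3:]

combined = reduce(merge, [local_state(segment_1), local_state(segment_2)])
single_scan = local_state(ROWS)

# The merged state matches a scan of the whole table up to rounding.
same = all(
    math.isclose(p, q, rel_tol=1e-9)
    for p, q in zip(combined[1], single_scan[1])
)
```

The same `merge` works for any number of segments, since addition is associative and the order in which partial states arrive does not matter beyond floating-point rounding.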
  • 14. Performance Comparison: Test Setup on AWB • AWB (Analytics Workbench) – 1000-node cluster located in Las Vegas – Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage – 8000+ Map task capacity, 5000+ Reduce task capacity – GPHD 1.1, GPDB 4.2.3 • Mahout v0.7 • MADlib v0.5 – With a small LMF change to allow 4-byte integer values • Test matrix – Data size (# rows/records, # columns/features) – Algorithms – Algorithm parameters (e.g. convergence threshold, # iterations) – GPDB segment / MR (Map-Reduce) task configurations
  • 15. Performance & Scalability Results (summary) • Whitepaper coming out shortly!
  • 16. Logistic Regression • Mahout only has a sequential (i.e. single-node) IGD implementation • [Figure: "MADlib & Mahout Logistic Regression Scalability Across Number of Attributes"; series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]; x-axis: log(Number of Rows); y-axis: Time in Minutes]
  • 17. Logistic Regression • [Figure: "MADlib Scalability Across Number of GPDB Segments"; x-axis: Number of GPDB Segments; y-axis: Time in Minutes]
  • 18. K-Means Clustering • [Figure: "MADlib & Mahout K-means Scalability Across Number of Rows"; series: Census data, 48 attributes [Mahout] and Census data, 48 attributes [MADlib]; x-axis: log(Number of Rows); y-axis: Time in Min]
  • 19. K-Means Clustering • [Figure: "MADlib K-means Scalability Across Number of GPDB Segments"; x-axis: Number of GPDB Segments; y-axis: Time in Min]
  • 20. PyMADlib : Python + MADlib = Awesome!
  • 21. Motivation • SQL is great for many things, but it’s not nearly enough • Undeniably the most straightforward way to query data • But not necessarily designed for data science
  • 22. MADlib is a godsend! • Empowers data scientists to run canned machine learning routines – focus less on coding, more on science • In-database, explicitly parallel. • So why do we need anything else? – UI is still all in SQL – Need to tap into rich visualization libraries
  • 23. Then which interface is favored by and familiar to data scientists? • Depends on who you ask • Left survey is for “higher level languages,” and right survey is for “lower level languages”
  • 24. Wait, don’t we already have this (PL/R, PL/Python, SAS HPA)? • PL/X’s are wonderful, but: – It still requires non-trivial knowledge of SQL to use effectively – Mostly limited to explicitly parallel jobs – Primarily a SQL interface to the end user • Need an interface that is: – Less SQL, more R/Python/SAS – Implicitly parallelized – More scalable • SAS HPA = $$$$$
  • 25. The challenge • MADlib – Open source – Extremely powerful/scalable – Growing algorithm breadth – SQL • Python/R – Open source – Memory limited – High algorithm breadth – Language/interface purpose-designed for data science • SAS – High user loyalty – Non-HPA is memory limited, HPA requires investment – High algorithm breadth – Language/interface purpose-designed for data science • Want to leverage both the performance benefits of MADlib and the usability of languages like Python, SAS, and R
  • 26. Simple solution: Translate Python code into SQL • [Diagram: Python → SQL; the SQL to execute MADlib goes out over ODBC/JDBC and the model output comes back] • All data stays in the DB, and all model estimation and heavy lifting is done in the DB by MADlib • Only strings of SQL and model output are transferred across ODBC/JDBC • Best of both worlds: the number-crunching power of MADlib along with the rich set of visualizations of Matplotlib, NetworkX and all your other favorite Python libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL database, while you program in your favorite language – Python.
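A rough sketch of the translation step: a Python call is turned into a SQL string, and only that string would travel over the connection. This is not PyMADlib's actual API. `linregr_sql` is a hypothetical helper, and the SQL template assumes the aggregate form `madlib.linregr(y, ARRAY[...])` used by early MADlib releases, so check the documentation of the installed version before relying on it. The table `unm` and its columns come from slide 10.

```python
import re

# Plain SQL identifiers only, so user input cannot inject arbitrary SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _ident(name):
    """Return name unchanged if it is a plain identifier, else raise."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError("not a plain SQL identifier: %r" % (name,))
    return name


def linregr_sql(table, dependent, independents, schema="madlib"):
    """Build the SQL text for an in-database linear regression.

    Assumption: the regression is exposed as an aggregate named `linregr`
    in the given schema; the constant 1 adds the intercept term.
    """
    columns = ", ".join(_ident(c) for c in independents)
    return "SELECT ({schema}.linregr({y}, ARRAY[1, {cols}])).* FROM {table}".format(
        schema=_ident(schema),
        y=_ident(dependent),
        cols=columns,
        table=_ident(table),
    )


query = linregr_sql("unm", "y", ["x1", "x2"])
# The string in `query` would then be passed to a DB-API cursor's execute();
# only the query text and the small result row cross the connection.
```

Validating identifiers before formatting them into the query is a design choice of this sketch: table and column names cannot be passed as bound parameters, so they need their own check.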
  • 27. Demo • PyMADlib Tutorial – IPython Notebook Viewer link: http://nbviewer.ipython.org/5275846
  • 28. Where do I get it? • $ pip install pymadlib
  • 29. I don’t have GPDB or MADlib – What do I do? • Greenplum Database Community Edition is freely available for single node installations on multiple platforms – Written permission may be requested from EMC/Greenplum for research use for multi-node installations • MADlib is free and open-source – Downloadable for multiple platforms from https://github.com/madlib/madlib • PyMADlib is also free and open-source – Downloadable from https://github.com/vatsan/pymadlib
  • 30. Future Directions
  • 31. Greenplum HD • HAWQ – Parallel SQL query engine that combines the key technological advantages of the industry-leading Greenplum Database with the scalability and convenience of Hadoop • SQL Standards Compliant – Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes + range of scalar and aggregate functions • ACID Compliant
  • 32. HAWQ – Architecture
  • 33. Performance: HAWQ [1] vs. Hive vs. Impala [2] • All experiments were run on a 60-node deployment with Analytics Workbench [3] • [1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf • [2] https://github.com/cloudera/impala/ • [3] http://www.analyticsworkbench.com/
  • 34. HAWQ: Deep Scalable Analytics What’s inside the box? • Linear Regression • Logistic Regression • Multinomial Logistic Regression • K-Means • Association Rules • Latent Dirichlet Allocation • Users can connect to HAWQ via popular programming languages and it also supports JDBC and ODBC. • Most tools will work out of the box with HAWQ, including PyMADlib
  • 36. Appendix
  • 37. Datasets • The following datasets were used in comparing the performance of MADlib with Mahout: – KDD Cup 2009 Orange marketing churn data (16.5 MB): about 500,000 records and 15,000 numerical and categorical attributes – Census 2000 data (1.7 GB): about 14 million records and 48 numerical and categorical attributes – Enron data (1.9 GB): about 700,000 documents with a vocabulary size of 200,000 – KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB): about 1 million users, 600,000 songs, and 250 million ratings – Netflix Prize 2009 data (52.7 MB): about 400,000 users, 900 movies, and 4.5 million ratings

Editor's notes

  1. Special thanks to Grace Gee (Engineer, SOAR Program, Greenplum)