PyMADlib - A Python wrapper for MADlib: in-database, parallel, machine learning library
2. Agenda
• Greenplum UAP overview
– Products: GPDB, GPHD, Chorus, Analytics Labs, Data Computing Appliance
– GPDB Architecture
• MADlib
– Overview
– Algorithms
– Working Mechanism
– Performance Comparison with Mahout
• PyMADlib
– Overview
– Demo in IPython Notebook
• Future Directions
– GPHD and HAWQ
© Copyright 2011 EMC Corporation. All rights reserved.
5. Greenplum Database - Architecture
• MPP (Massively Parallel Processing), Shared-Nothing Architecture
• Master Servers: query planning & dispatch (SQL, MapReduce)
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.
7. MADlib: The Origin
UrbanDictionary.com:
mad (adj.): an adjective used to enhance a noun.
1- dude, you got skills.
2- dude, you got mad skills.
• First mention of MAD analytics was at VLDB’09
– MAD Skills: New Analysis Practices for Big Data
– Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph Hellerstein, Caleb Welton
– http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf
• MADlib project initiated in late 2010
– Maintained by Greenplum/EMC with significant contributions
from UW Madison, UFlorida and UC Berkeley.
8. Current Modules
Data Modeling
• Supervised Learning
– Naive Bayes Classification
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Decision Tree
– Random Forest
– Support Vector Machines
– Cox-Proportional Hazards Regression
– Conditional Random Field
• Unsupervised Learning
– Association Rules
– k-Means Clustering
– Low-rank Matrix Factorization
– SVD Matrix Factorization
– Parallel Latent Dirichlet Allocation
Descriptive Statistics
• Sketch-based Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
• Profile
• Quantile
Support
• Array Operations
• Conjugate Gradient
• Sparse Vectors
• Probability Functions
• Random Sampling
Inferential Statistics
• Hypothesis tests
9. MADlib – User Doc
• Check out the user guide with examples at: http://doc.madlib.net
10. How does it work ? : A Linear Regression Example
• Finding linear dependencies between variables
– y ≈ c0 + c1 · x1 + c2 · x2 ?
# select y, x1, x2 from unm limit 6;
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8
• Vector of dependent variables y: the y column
• Design matrix X: the x1 and x2 columns
11. Reminder: Linear-Regression Model
• If residuals i.i.d. Gaussians with standard deviation σ:
– max likelihood ⇔ min sum of squared residuals
• First-order conditions for the following quadratic objective (in c)
yield the minimizer
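For reference, the standard least-squares derivation behind these bullets can be written out as follows. This is a sketch in the notation of the neighboring slides (y, X, c); the symbols n, x_i, and ε_i are assumptions introduced here, and X^T X is assumed to be invertible.

```latex
% Model: each observation is a linear function of its features plus a residual
y_i = c^{T} x_i + \varepsilon_i, \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}, \quad i = 1, \dots, n

% Maximizing the Gaussian likelihood is equivalent to minimizing
% the sum of squared residuals
\hat{c} = \arg\min_{c} \sum_{i=1}^{n} \left( y_i - c^{T} x_i \right)^2
        = \arg\min_{c} \, \lVert y - X c \rVert^2

% First-order condition: gradient with respect to c set to zero
-2 X^{T} (y - X \hat{c}) = 0
\;\Longrightarrow\; X^{T} X \, \hat{c} = X^{T} y
\;\Longrightarrow\; \hat{c} = (X^{T} X)^{-1} X^{T} y
```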
12. Linear Regression: Streaming Algorithm
• How to compute with a single table scan?
– (X^T X)^{-1} X^T y: computed from the products X^T X and X^T y
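The single-scan idea can be sketched in Python: because X^T X and X^T y are both sums over rows, they can be accumulated one row at a time and the small linear system solved at the end. This is a minimal illustration with NumPy, not MADlib's implementation. The function name `fit_single_scan` is hypothetical, and the six rows are the ones returned by the example query on the earlier slide.

```python
import numpy as np

# Rows (y, x1, x2) from the example query "select y, x1, x2 from unm limit 6".
ROWS = [
    (10.14, 0.00, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]


def fit_single_scan(rows):
    """Accumulate X^T X and X^T y row by row, then solve for c."""
    k = len(rows[0])              # 1 intercept + (k - 1) features
    xtx = np.zeros((k, k))        # running sum of x x^T
    xty = np.zeros(k)             # running sum of x * y
    for y, *features in rows:     # one pass over the data
        x = np.array([1.0, *features])   # leading 1.0 is the intercept term c0
        xtx += np.outer(x, x)
        xty += x * y
    # Solve (X^T X) c = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(xtx, xty)


coef = fit_single_scan(ROWS)
print(coef)  # [c0, c1, c2]
```

The accumulated state has a fixed size (a k-by-k matrix and a k-vector) no matter how many rows are scanned, which is what makes a single pass sufficient.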
13. Linear Regression: Parallel Computation
• X^T y
– Segment 1: X1^T y1
– Segment 2: X2^T y2
– Master: X^T y (sum of the per-segment results)
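The parallel step can be sketched the same way: each segment computes partial products over its own rows, and the master adds them up. This is an illustration under assumptions, not MADlib's code. The names `segment_state` and `merge` are hypothetical, and the split of the six example rows into two "segments" is arbitrary.

```python
import numpy as np

# Rows (y, x1, x2) from the example query on the earlier slide.
ROWS = [
    (10.14, 0.00, 0.3),
    (11.93, 0.69, 0.6),
    (13.57, 1.10, 0.9),
    (14.17, 1.39, 1.2),
    (15.25, 1.61, 1.5),
    (16.15, 1.79, 1.8),
]


def segment_state(rows):
    """Partial products X_k^T X_k and X_k^T y_k for one segment's rows."""
    X = np.array([[1.0, x1, x2] for _, x1, x2 in rows])  # 1.0 = intercept
    y = np.array([row[0] for row in rows])
    return X.T @ X, X.T @ y


def merge(a, b):
    """Master-side combine: partial products simply add."""
    return a[0] + b[0], a[1] + b[1]


# Pretend the table is distributed across two segments.
segment_1, segment_2 = ROWS[:3], ROWS[3:]
xtx, xty = merge(segment_state(segment_1), segment_state(segment_2))

# The master solves the small k-by-k system once.
coef_parallel = np.linalg.solve(xtx, xty)
print(coef_parallel)
```

Because the merge is plain addition, the result does not depend on how the rows are distributed across segments.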
14. Performance Comparison : Test Setup on AWB
• AWB
– 1000-node cluster located in Las Vegas
– Over 24,000 processors, 48 TB of Memory, and 24 PB of raw disk
storage
– 8000+ Map Task Capacity, 5000+ Reduce Task Capacity
– GPHD 1.1, GPDB 4.2.3
• Mahout v0.7
• MADlib v0.5
– With small LMF change to allow 4-byte integer values
• Test matrix
– Data size (# rows/records, # columns/features)
– Algorithms
– Algorithm parameters (e.g. convergence threshold, # iterations)
– GPDB segment / MR (Map-Reduce) task configurations
15. Performance & Scalability Results (summary)
• Whitepaper coming out shortly!
16. Logistic Regression
• Mahout only has sequential (i.e. single node) IGD implementation
[Figure: MADlib & Mahout Logistic Regression Scalability Across Number of Attributes. Time in Minutes vs. log(Number of Rows). Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]
17. Logistic Regression
[Figure: MADlib Scalability Across Number of GPDB Segments. Time in Minutes vs. Number of GPDB Segments]
18. K-Means Clustering
[Figure: MADlib & Mahout K-means Scalability Across Number of Rows. Time in Minutes vs. log(Number of Rows). Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]
19. K-Means Clustering
[Figure: MADlib K-means Scalability Across Number of GPDB Segments. Time in Minutes vs. Number of GPDB Segments]
20. PyMADlib : Python + MADlib = Awesome!
21. Motivation
• SQL is great for many things, but it’s not nearly enough
• Undeniably the most straightforward way to query data
• But not necessarily designed for data science
22. MADlib is a godsend!
• Empowers data scientists to run canned machine learning
routines – focus less on coding, more on science
• In-database, explicitly parallel.
• So why do we need anything else?
– UI is still all in SQL
– Need to tap into rich visualization libraries
23. Then which interface is favored by and familiar
to data scientists?
• Depends on who you ask
• Left survey is for “higher level languages,” and right survey is for “lower level languages”
24. Wait, don’t we already have this (PL/R,
PL/Python, SAS HPA)?
• PL/X’s are wonderful, but:
– It still requires non-trivial knowledge of SQL to use effectively
– Mostly limited to explicitly parallel jobs
– Primarily a SQL interface to the end user
• Need an interface that is:
– Less SQL, more R/Python/SAS
– Implicitly parallelized
– More scalable
• SAS HPA = $$$$$
25. The challenge
• MADlib
– Open source
– Extremely powerful/scalable
– Growing algorithm breadth
– SQL
• Python/R
– Open source
– Memory limited
– High algorithm breadth
– Language/interface purpose-designed for data science
• SAS
– High user loyalty
– Non-HPA is memory limited, HPA requires investment
– High algorithm breadth
– Language/interface purpose-designed for data science
• Want to leverage both the performance benefits of MADlib and the
usability of languages like Python, SAS, and R
26. Simple solution: Translate Python code into SQL
Python → SQL → ODBC/JDBC → database
– Sent: SQL to execute MADlib
– Returned: model output
• All data stays in the DB, and all model estimation and heavy lifting is done in the DB by
MADlib
• Only strings of SQL and model output are transferred across ODBC/JDBC
• Best of both worlds: the number-crunching power of MADlib along with the rich set of
visualizations of Matplotlib, NetworkX and all your other favorite Python
libraries. Let MADlib do all the heavy lifting on your Greenplum/PostgreSQL
database, while you program in your favorite language: Python.
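The translation idea can be sketched as a function that turns Python arguments into a SQL string. This is an illustration only, not PyMADlib's actual code. The helper name `build_linregr_sql` is hypothetical, the table `unm` and its columns come from the earlier linear regression slide, and the query assumes the aggregate form `madlib.linregr(y, x)` found in MADlib 0.x releases (newer releases use a different calling convention).

```python
def build_linregr_sql(table, dependent, independents):
    """Return SQL that asks MADlib to fit a linear regression in-database.

    Hypothetical helper: identifiers are interpolated as-is, so they must
    come from trusted code, never from user input.
    """
    # Leading 1 adds the intercept term c0 to the feature array.
    features = ", ".join(["1"] + list(independents))
    return (
        f"SELECT (madlib.linregr({dependent}, ARRAY[{features}])).* "
        f"FROM {table};"
    )


sql = build_linregr_sql("unm", "y", ["x1", "x2"])
print(sql)
```

Only this short string would cross the database connection; the rows of `unm` never leave the database, and what comes back is the small model result rather than the data.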
28. Where do I get it ?
$ pip install pymadlib
29. I don’t have GPDB or MADlib – What do I do ?
• Greenplum Database Community Edition is freely
available for single node installations on multiple
platforms
– Written permission may be requested from EMC/Greenplum
for research use for multi-node installations
• MADlib is free and open-source
– Downloadable for multiple platforms from
https://github.com/madlib/madlib
• PyMADlib is also free and open-source
– Downloadable from https://github.com/vatsan/pymadlib
31. Greenplum HD
• HAWQ – Parallel SQL query engine that combines the key
technological advantages of industry-leading Greenplum
Database with scalability and convenience of Hadoop
• SQL Standards Compliant
– Supports Correlated Sub-queries, Window Functions, Roll-ups, Cubes
+ range of scalar and aggregate functions
• ACID Compliant
33. Performance: HAWQ [1] vs. Hive vs. Impala [2]
All experiments were run on a 60-node deployment with Analytics Workbench [3]
[1] http://www.greenplum.com/sites/default/files/2013_0301_hawq_sql_engine_hadoop_1.pdf
[2] https://github.com/cloudera/impala/
[3] http://www.analyticsworkbench.com/
34. HAWQ: Deep Scalable Analytics
What’s inside the box?
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Users can connect to HAWQ via popular programming languages and it also
supports JDBC and ODBC.
• Most tools will work out of the box with HAWQ, including PyMADlib
37. Datasets
The following datasets were used in comparing the performance of
MADlib with Mahout
– KDD Cup 2009 Orange marketing churn data (16.5 MB)
• About 500,000 records and 15,000 numerical and categorical attributes
– Census 2000 data (1.7 GB)
• About 14 million records and 48 numerical and categorical attributes
– Enron data (1.9 GB)
• About 700,000 documents with a vocabulary size of 200,000
– KDD Cup 2011 Yahoo! Music Webscope data (4.16 GB)
• About 1 million users, 600,000 songs, and 250 million ratings
– Netflix Prize 2009 data (52.7 MB)
• About 400,000 users, 900 movies, and 4.5 million ratings
Editor's notes: Special thanks to Grace Gee (Engineer, SOAR Program, Greenplum)