34. Functions
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Predictive Analytics Library
Aug 2015
@MADlib_analytic
35. Architecture
Diagram components:
• User Interface
• High-level Iteration Layer (iteration controller, …)
• Functions for Inner Loops (implements ML logic)
• Low-level Abstraction Layer (array operations, C++ to DB type-bridge, …)
• RDBMS Built-in Functions
• C API (Greenplum, PostgreSQL, HAWQ)
Languages and libraries shown: SQL, Python, C++, Eigen
37. What are our customers saying about us?
k-means clustering:
• finding items that are similar within an n-dimensional space
• Lloyd’s local-search heuristic works well in practice
• Two fundamental steps:
1. Assign each point to its closest centroid
2. Move each centroid to the barycenter/mean of all points currently assigned to it
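The two steps above can be sketched in a few lines of plain Python. This is a minimal illustration of Lloyd's heuristic, not MADlib's implementation or API: the function names (`lloyd_kmeans`, `squared_distance`, `mean_point`), the random choice of initial centroids, the squared Euclidean distance, and the stopping rule (stop when no assignment changes) are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Barycenter (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: run Lloyd's heuristic and return (centroids, assignment)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Assumed stopping rule: no point changed cluster since the last pass.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 2: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = mean_point(members)
    return centroids, assignment
```

Calling `lloyd_kmeans(points, k=2)` on a list of coordinate tuples returns the final centroids and, for each input point, the index of its cluster. Because this is a local-search heuristic, the result can depend on the initial centroids; changing `seed` may yield a different clustering on less well-separated data.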