Data Science Tools for Big Data Analysis


Data Science as a Commodity:

How to use MADlib, R, and other Publicly
Available and Open Source Tools for Data
Science
Pivotal OSS Meetups
Sarah Aerni
Pivotal Senior Data Scientist
@itweetsarah
saerni@gopivotal.com

January 28, 2014
© Copyright 2014 Pivotal. All rights reserved.

2

What we will cover in today’s Meetup
•  What is data science, big data, buzzword, buzzword?
•  What are some examples of data science in action?
•  What do I do at Pivotal?
•  Who are our data scientists?
•  Why is open source software important for data science?

•  What do I do with loads of data?
•  How can I create good models?
•  What types of open source tools can I use to build models?
•  How can I build a quick app?
•  What can I do to get started analyzing text data?

•  Which tools exist to create visualizations of my data that I can understand?
•  What tools does our team use? For NLP? For optimization? For regression?


3

What we will not cover #notdatascience


4

Instead: Practical Data Science Tools #useful

– Kaushik Das
http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science


5

“At companies where there is no
framework for operationalization
of the models, PowerPoint is
where models go to die!”
– Hulya Farinas
http://venturebeat.com/2013/12/03/how-to-revolutionize-healthcare-get-data-scientists-and-app-developers-together/


6

“The use of statistical and machine
learning techniques on big multistructured data — in a distributed
computing environment — to identify
correlations and causal relationships,
classify and predict events, identify
patterns and anomalies, and infer
probabilities, interest, and sentiment.”
– Annika Jimenez
http://blog.gopivotal.com/news-2/annika-jimenez-on-disruptive-data-science-at-the-strata-conference


7

DATA
IS THE NEW
CENTER OF GRAVITY

Data > Application!

“BIG DATA IS THE NEW NORMAL”
“‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”

8

What Can “Small Data” Scientists Bring on Their
“Big Data” Journey?

http://factspy.net/the-difference-between-geeks-vs-nerds/


9

What Can “Small Data” Scientists Bring on Their
“Big Data” Journey?
[Figure: tools and approaches placed between “Small Data” and “Big Data”. Labels shown: databases, in-memory model building, flat files, command-line tools, MapReduce, HDFS, cloud computing, distributed computing.]

Many tools and approaches are being adapted to big data technologies
10

Basic DS Tools: From Command-line to GUI
•  Quick-and-dirty tricks using command-line tools
–  Fast feedback - interactive
–  Fast to process
–  Easy to write, hard to read
–  Background processing (screen)

Ian Huston, Alex Kagoshima, Ronert Obst

11


12

•  Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
•  Python and R
–  RStudio
–  IPython (IPython Notebook)


13

Favorite Python and R packages and resources
•  Python
–  NumPy
–  SciPy
–  scikit-learn – machine learning package
–  statsmodels
–  pandas
–  PyMC
–  IPython (IPython Notebook)
–  matplotlib


14

Favorite python and R packages, resources, and more
•  R
–  ggplot
–  reshape
–  plyr
–  Shiny
–  Good support for time series analyses
–  RStudio (weave)
–  foreach, parallel
–  Task Views
–  parboost


15

What do I do at Pivotal?
A New Platform for a New Era
DATA-DRIVEN APPLICATION DEVELOPMENT

App Fabric: “The new Middleware”
Data Fabric: “The new Database”
Cloud Fabric: “The new OS”
...ETC: “The new Hardware”

16

Pivotal Big Data Technology: HAWQ
Think of it as multiple PostgreSQL servers
Master

Segments/Workers
Rows are distributed across segments by
a particular field (or randomly)
Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
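
The row distribution described on this slide can be illustrated with a small Python sketch. It is a toy model, not how HAWQ or Greenplum actually place rows: the segment count, the `distribute` helper, and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
import random
import zlib

NUM_SEGMENTS = 4  # assumed number of segments, for illustration only


def distribute(rows, key=None, num_segments=NUM_SEGMENTS, seed=0):
    """Assign each row (a dict) to a segment.

    With a key, rows are placed by a hash of that field, so equal key
    values always land on the same segment. Without a key, rows are
    placed randomly.
    """
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        if key is None:
            seg = rng.randrange(num_segments)
        else:
            seg = zlib.crc32(str(row[key]).encode("utf-8")) % num_segments
        segments[seg].append(row)
    return segments


# Made-up rows; the column names are hypothetical.
rows = [{"house_id": i, "price": 100_000 + 1_000 * i} for i in range(20)]
by_key = distribute(rows, key="house_id")
at_random = distribute(rows)
for seg_id, seg_rows in enumerate(by_key):
    print(seg_id, [r["house_id"] for r in seg_rows])
```

Hashing on a field keeps rows with the same key together, which is what makes joins and grouped computations on that field local to a segment.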


17

Performance Through Parallelism
•  Automatic parallelization
–  Load and query like any database
–  Automatically distributed tables across nodes
•  Analytics-oriented query optimization
•  Scalable MPP architecture
–  All nodes can scan and process in parallel
–  Linear scalability by adding nodes


18

Data Science Tools for Big Data
COMMERCIAL

OPEN SOURCE (OR FREE)

PL/R, PL/Python, PL/Java


19

Making sense of your “big data”
•  Large volumes of data may be difficult to understand
–  ~100 tables
–  Tens of thousands of columns


20

•  How do you build models that use all the data? Score all the data?


21

•  Where do you focus your effort?
–  Getting a rapid grasp of relevant fields is important
–  Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data first
–  Look for columns with little or no variation or only null values


22

•  These functions exist in MADlib
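
To make the idea of spotting uninformative columns concrete, here is a minimal Python sketch. It is not MADlib's summary function; the `profile_columns` helper, the sample rows, and the "at most one distinct non-null value" rule are assumptions chosen for illustration.

```python
def profile_columns(rows):
    """Return per-column null count, distinct count, and a low-information flag."""
    columns = sorted({name for row in rows for name in row})
    report = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        non_null = [v for v in values if v is not None]
        distinct = len(set(non_null))
        report[name] = {
            "nulls": len(values) - len(non_null),
            "distinct": distinct,
            # Only nulls, or a single repeated value: little to learn from it.
            "low_information": distinct <= 1,
        }
    return report


# Made-up rows standing in for one wide table.
rows = [
    {"site": "A", "depth": 10.5, "flag": 1, "notes": None},
    {"site": "B", "depth": 11.0, "flag": 1, "notes": None},
    {"site": "A", "depth": None, "flag": 1, "notes": None},
]
report = profile_columns(rows)
candidates_to_drop = [c for c, stats in report.items() if stats["low_information"]]
print(candidates_to_drop)
```

In a database the same counts would be computed in one pass over the table rather than by pulling rows into memory.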

23

MADlib In-Database Functions
Predictive Modeling Library
Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Linear Systems
•  Sparse and Dense Solvers

Descriptive Statistics
Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation
Summary

Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions

24

MADlib in Action: Regression on
Billions of Rows

•  Input Data
–  10s of millions of rows from data collected at multiple drill testing sites
–  Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit and more

•  Data Massaging and Review
–  Rapid summarization of many columns of data to identify outliers and missing data and remove them from the analysis
–  Used window functions to construct a moving average (smoothing) of all the features and the dependent variable

•  Model
–  Linear regression on the complete dataset
–  K-means clustering to determine similarities of sites
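
The moving-average smoothing mentioned above can be sketched in Python as a trailing window, comparable in spirit to a SQL window such as `AVG(x) OVER (ORDER BY t ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`. The window size and the readings are made up, and the original analysis may have used a different window definition.

```python
def moving_average(values, window):
    """Trailing moving average: mean of the current value and up to
    window - 1 preceding values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


rate_of_penetration = [10.0, 12.0, 11.0, 30.0, 12.0, 11.0]  # made-up readings
smoothed = moving_average(rate_of_penetration, window=3)
print(smoothed)
```

The spike of 30.0 is spread over three smoothed values instead of dominating a single row, which is the point of smoothing before fitting a regression.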

Rashmi Raghu

Drilling into the San Andreas Fault at Parkfield California.
Credit: Stephen H. Hickman, USGS

25

Linear Regression: Streaming Algorithm
•  Finding linear dependencies between variables
•  How to compute with a single scan?


26

Linear Regression: Parallel Computation
$X^T y = \sum_i x_i^T y_i$


27

$\underbrace{X_1^T y_1}_{\text{Segment 1}} + \underbrace{X_2^T y_2}_{\text{Segment 2}} = \underbrace{X^T y}_{\text{Master}}$
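
A minimal Python sketch of the single-scan, per-segment computation follows. It relies on the standard normal equations, where the coefficients solve $(X^T X)\,\beta = X^T y$, so each segment accumulates both $X^T X$ and $X^T y$ and the master adds them up. This is an illustration under those assumptions, not MADlib's implementation; the helper names and the data are made up.

```python
def partial_sums(rows):
    """One scan over a segment's rows: accumulate X^T X and X^T y."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for a in range(k):
            xty[a] += x[a] * y
            for b in range(k):
                xtx[a][b] += x[a] * x[b]
    return xtx, xty


def combine(parts):
    """Master step: add up the per-segment partial sums."""
    k = len(parts[0][1])
    xtx = [[sum(p[0][a][b] for p in parts) for b in range(k)] for a in range(k)]
    xty = [sum(p[1][a] for p in parts) for a in range(k)]
    return xtx, xty


def solve(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


# Made-up data following y = 2 + 3 * x exactly; each row is ([1, x], y).
segment_1 = [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)]
segment_2 = [([1.0, 2.0], 8.0), ([1.0, 3.0], 11.0)]

xtx, xty = combine([partial_sums(segment_1), partial_sums(segment_2)])
coef = solve(xtx, xty)
print(coef)
```

Because the sums are additive, splitting the rows across any number of segments gives the same totals as one scan over all the rows.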
29

Performing a linear regression on 10 million
rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of
the VLDB Endowment 5.12 (2012): 1700-1711.


30

Calling MADlib Functions: Fast Training, Scoring
•  MADlib allows users to easily create models without moving data out of the system
–  Model generation
–  Model validation
–  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open source lets you tweak and extend methods, or build your own

MADlib model function

SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]');

–  'houses': table containing training data
–  'houses_linregr': table in which to save results
–  'price': column containing dependent variable
–  'ARRAY[1, tax, bath, size]': features included in the model

31

MADlib model function (grouped)

SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]',
                            'bedroom');

–  'houses': table containing training data
–  'houses_linregr': table in which to save results
–  'price': column containing dependent variable
–  'ARRAY[1, tax, bath, size]': features included in the model
–  'bedroom': create multiple output models (one for each value of bedroom)
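
What "one model for each value of bedroom" means can be sketched in Python: partition the rows by the grouping column and fit an independent model on each partition. To keep the sketch short it fits a one-feature line (price against size) rather than the full feature array; the helper names and the rows are hypothetical, and this is not MADlib's grouping code.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_group(rows, group_col, x_col, y_col):
    """Fit one independent model per distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append((row[x_col], row[y_col]))
    return {value: fit_line(points) for value, points in groups.items()}


# Made-up rows; column names mirror the houses example.
houses = [
    {"bedroom": 2, "size": 800, "price": 90_000},
    {"bedroom": 2, "size": 1000, "price": 110_000},
    {"bedroom": 3, "size": 1200, "price": 150_000},
    {"bedroom": 3, "size": 1600, "price": 210_000},
]
models = fit_by_group(houses, "bedroom", "size", "price")
for bedroom, (intercept, slope) in sorted(models.items()):
    print(bedroom, intercept, slope)
```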



32

MADlib model scoring function

SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size],
                              m.coef
                             ) AS predict
FROM houses, houses_linregr m;

–  houses: table with data to be scored
–  houses_linregr: table containing model
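
Scoring with a linear model amounts to a dot product between each row's feature vector and the fitted coefficients. The Python sketch below shows that arithmetic; the coefficient values and rows are invented for illustration, and the `predict` helper is a stand-in, not MADlib's function.

```python
def predict(features, coef):
    """Linear-model score: dot product of the feature vector and the coefficients."""
    if len(features) != len(coef):
        raise ValueError("features and coef must have the same length")
    return sum(f * c for f, c in zip(features, coef))


# Hypothetical fitted coefficients for [intercept, tax, bath, size].
coef = [20_000.0, 10.0, 15_000.0, 50.0]

# Made-up rows to score.
houses = [
    {"tax": 590, "bath": 1.0, "size": 770},
    {"tax": 1050, "bath": 2.0, "size": 1410},
]

# Pair every row with the single model row, as the cross join in the query does.
scored = [
    dict(h, predict=predict([1, h["tax"], h["bath"], h["size"]], coef))
    for h in houses
]
for row in scored:
    print(row)
```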

33

PivotalR: Bringing MADlib and HAWQ to a familiar
R interface
•  Challenge: Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
•  Simple solution: Translate R code into SQL

PivotalR

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                                  + bath
                                  + size
                            , data=d)

SQL Code

SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]');

http://gopivotal.github.io/PivotalR/

Woo Jung

34


# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()!

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size
                            , data=d)

Woo Jung

35

PivotalR Design Overview
•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R functions

[Figure: PivotalR (R → SQL, no data here) sends the SQL to execute through RPostgreSQL to the database with MADlib (data lives here); computation results are returned to R.]

•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database
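
The "translate into SQL, run it in the database" idea can be illustrated with a deliberately tiny Python sketch that turns a formula string into the `madlib.linregr_train` call shown earlier. The `formula_to_sql` helper is hypothetical; PivotalR's real translation layer is far more general, and this sketch handles only plain `y ~ a + b` formulas with no escaping of identifiers.

```python
def formula_to_sql(formula, source_table, output_table):
    """Turn 'y ~ a + b' into a madlib.linregr_train call (illustrative only)."""
    response, _, rhs = formula.partition("~")
    response = response.strip()
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    if not response or not terms:
        raise ValueError("expected a formula like 'y ~ a + b'")
    # The leading 1 is the intercept term.
    features = "ARRAY[1, " + ", ".join(terms) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{output_table}', '{response}', '{features}');"
    )


sql = formula_to_sql("price ~ tax + bath + size", "houses", "houses_linregr")
print(sql)
```

Only the SQL text crosses the connection; the rows themselves stay in the database, which is the design point of the slide.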

Woo Jung

36

PivotalR: Current Features
MADlib Functionality
•  Linear Regression
•  Logistic Regression
•  Elastic Net
•  ARIMA
•  Marginal Effects
•  Cross Validation
•  Bagging
•  summary on model objects
•  Automated Indicator Variable Coding: as.factor
•  predict

And more ... (SQL wrapper)
•  + - * / %% %/% ^
•  $ [ [[ $<- [<- [[<-
•  == != > < >= <=
•  & | !
•  is.na
•  by
•  merge
•  sort
•  dim names
•  preview content
•  db.data.frame as.db.data.frame
•  c mean sum sd var min max length colMeans colSums
•  db.connect db.disconnect db.list db.objects db.existsObject delete
37


Woo Jung

38

http://www.rstudio.com/shiny/

Woo Jung

39

Shiny Showcase: Example Web Apps in R
•  Users can choose input parameters with sliders, drop-downs, and text fields.
•  HTML/JavaScript knowledge not required.




41

http://d3js.org/

42

D3 Data-Driven Documents

http://d3js.org/


44

PyMADlib
•  Python wrapper for MADlib

http://nbviewer.ipython.org/gist/vatsan/5275846


46

Procedural Languages in Big Data Science
•  HAWQ & PL/X can take advantage of “data parallel” (embarrassingly parallel) tasks by performing the analyses in parallel
•  Little or no effort is required to break up the problem into a number of parallel tasks, and there is no dependency (or communication) between those parallel tasks
•  Examples of “data parallel” problems:
–  Counting words in documents
–  Genome-Wide Association Study
–  Studying network anomalies
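
The word-counting example is a convenient way to show what "no dependency between tasks" looks like in code. In this Python sketch each document is counted on its own and the results are merged only at the end. The thread pool merely stands in for the database segments (threads in CPython do not speed up CPU-bound work), and the documents are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(document):
    """Independent task: count the words of a single document."""
    return Counter(document.lower().split())


documents = [  # made-up documents standing in for Doc1 ... DocM
    "big data is the new normal",
    "data science tools for big data",
    "open source tools",
]

# Each document is processed on its own; no task talks to any other.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_document = list(pool.map(count_words, documents))

# Optional final step: merge the independent results.
total = sum(per_document, Counter())
print(total.most_common(3))
```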

http://gopivotal.github.io/gp-r/


[Figure: SQL & R run on the master servers, which connect through the network interconnect to the segment servers. Each segment works on its own documents: Doc1 ... DocM, Stem1 ... StemM, Count1 ... CountM.]

47

Structure of input table for PL/R function
Columns             Description
Network ID          ID of the network. 300K in total.
Topology            Array of integers defining the topology tree.
Network Readings    Array of readings from network terminal points over (say) a week.

[Figure: example network topology with nodes labeled 0, A, B, C, D and terminal readings.]

•  Topology: Hubs connected to multiple terminal points
•  Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments
•  Detecting anomalies within subnetworks on future observations

Vivek Ramamurthy

48

Performance Analysis
Number of networks    Time/network (ms)    Total time (seconds)
500                   6.604                3.30
1000                  3.637                3.64
5000                  2.822                14.11
10,000                2.356                23.56
50,000                2.160                108.02
100,000               2.142                214.20
150,000               2.162                324.29
200,000               2.142                428.48
250,000               2.138                534.69
300,000               2.132                639.85

[Figure: Execution time v/s number of networks. x-axis: Number of networks (in thousands); y-axis: Time (seconds).]

Vivek Ramamurthy

49

R package used                  optim        quadprog     Rsymphony    Rglpk
Single network in R (time)      ~60 s        6.3 s        0.145 s      0.181 s
300K networks in PL/R (time)    ~84 hrs      5.87 hrs     10.7 min     14.6 min
Time per network in PL/R        1005.2 ms    70.44 ms     2.13 ms      2.92 ms
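
The totals in this table follow directly from the per-network times. The short Python check below multiplies each per-network time by 300,000 networks; the figures are copied from the table, and small differences from the reported totals come from rounding.

```python
NETWORKS = 300_000

# Time per network in PL/R (ms), copied from the table above.
ms_per_network = {
    "optim": 1005.2,
    "quadprog": 70.44,
    "Rsymphony": 2.13,
    "Rglpk": 2.92,
}

totals_hours = {}
for package, ms in ms_per_network.items():
    total_seconds = NETWORKS * ms / 1000.0
    totals_hours[package] = total_seconds / 3600.0
    print(f"{package:10s} {total_seconds / 60:9.1f} min  {total_seconds / 3600:6.2f} hrs")
```

Switching from `optim` to a dedicated LP solver changes the total from days to minutes, which is the comparison the slide is making.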

Vivek Ramamurthy


51

COIN-OR: Computational Infrastructure for Operations Research
http://www.coin-or.org/
–  Libraries for linear and non-linear programming, integer programming
–  SYMPHONY: Callable library in COIN-OR for solving mixed integer linear programs

GLPK: GNU Linear Programming Kit
http://www.gnu.org/software/glpk/
–  Used for large-scale LPs, MIPs and related problems

Vivek Ramamurthy

52

Natural language processing
Common tasks/tools in NLP

Data sources
•  Text sources: documents, books, emails
•  Speech: phone logs, conversations

NLP processing pipeline
•  Sentence detection
•  Tokenization
•  Morphological stemming
•  Stop word removal
•  Word-sense disambiguation
•  Part-of-Speech tagging
•  Syntactic parsing
•  Semantic role labeling
•  Entity recognition
•  Reference resolution
•  Event processing

Applications
•  Word clouds
•  Topic modeling
•  Sentiment analysis
•  Machine translation
•  Document classification
•  Document summarization
•  Language generation
•  Search
•  Question answering
•  Information Extraction
•  …

Niels Kasch

53

Open source tools for common NLP tasks
RELEVANT NLP TOOLS

OPEN SOURCE SOFTWARE

WORD CLOUDS

TOPIC MODELING / TEXT CLASSIFICATION

INFORMATION EXTRACTION

Niels Kasch



56

WORD CLOUDS
Relevant NLP tools: tokenization, stemming/lemmatization, stop word removal
Open source software:
•  GPText
•  Apache UIMA
•  OpenNLP (Java)
•  NLTK (Python)
•  WordNet
•  Pytagcloud

TOPIC MODELING / TEXT CLASSIFICATION
Relevant NLP tools: tokenization, stemming/lemmatization, stop word removal, language detection
Open source software:
•  MADlib (PLDA)
•  gensim (LSA & LDA package for Python)
•  https://code.google.com/p/language-detection/

INFORMATION EXTRACTION
Relevant NLP tools: sentence detection, tokenization, language detection, relationship extraction, syntactic parsing, entity extraction
Open source software:
•  GPText and MADlib
•  OpenNLP
•  NLTK
•  Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)
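
As a rough illustration of the word-cloud preprocessing steps (tokenization, stop word removal, stemming), here is a standard-library Python sketch. It does not use any of the tools listed above: the stop-word list is a tiny made-up sample and `crude_stem` is a toy suffix stripper standing in for a real stemmer, such as those in NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_frequencies(text):
    """Tokenize, drop stop words, stem, and count: the input to a word cloud."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)


text = "The models are modeling topics in documents, and the topic models scale."
freqs = term_frequencies(text)
print(freqs.most_common(3))
```

The resulting term frequencies are what a word-cloud renderer would scale font sizes by.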

Niels Kasch


58

Topic Analysis – MADlib pLDA
Natural Language Processing - GPText

[Figure: processing pipeline with the stages social media data, filter relevant content, align data, tokenizer, stemming and frequency filtering, and prepare dataset for topic modeling, feeding a MADlib topic model that produces a topic graph, topic composition, and topic clouds.]

Srivatsan Ramanujam

59

Is there more? What’s next?
blog.gopivotal.com/tag/data-science
blog.gopivotal.com/tag/data-science-tech


60
