BUILT FOR THE SPEED OF BUSINESS
Data Science as a Commodity: How to use MADlib, R, and other Publicly Available and Open Source Tools for Data Science
Pivotal OSS Meetups
Sarah Aerni
Pivotal Senior Data Scientist
@itweetsarah
saerni@gopivotal.com

January 28, 2014
© Copyright 2014 Pivotal. All rights reserved.

2
What we will cover in today’s Meetup
•  What is data science, big data, buzzword, buzzword?
•  What are some examples of data science in action?
•  What do I do at Pivotal?
•  Who are our data scientists?
•  Why is open source software important for data science?
•  What do I do with loads of data?
•  How can I create good models?
•  What types of open source tools can I use to build models?
•  How can I build a quick app?
•  What can I do to get started analyzing text data?
•  Which tools exist to create visualizations of my data that I can understand?
•  What tools does our team use? For NLP? For optimization? For regression?

3
What we will not cover #notdatascience

4
Instead: Practical Data Science Tools #useful

– Kaushik Das
http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science

6
Instead: Practical Data Science Tools #useful
“At companies where there is no
framework for operationalization
of the models, PowerPoint is
where models go to die!”
– Hulya Farinas
http://venturebeat.com/2013/12/03/how-torevolutionize-healthcare-get-data-scientists-andapp-developers-together/

“The use of statistical and machine
learning techniques on big multistructured data — in a distributed
computing environment — to identify
correlations and causal relationships,
classify and predict events, identify
patterns and anomalies, and infer
probabilities, interest, and sentiment.”
– Annika Jimenez
http://blog.gopivotal.com/news-2/annika-jimenez-ondisruptive-data-science-at-the-strata-conference

7
DATA
IS THE NEW
CENTER OF GRAVITY

Data > Application!

“BIG DATA IS THE NEW NORMAL”
“‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”

8
What Can “Small Data” Scientists Bring on Their
“Big Data” Journey?

http://factspy.net/the-differencebetween-geeks-vs-nerds/

9
What Can “Small Data” Scientists Bring on Their
“Big Data” Journey?
[Figure: small-data tools versus big-data tools]
Small Data: databases, flat files, command-line tools, in-memory model building
Big Data: MapReduce, HDFS, cloud computing, distributed computing
Many tools and approaches are being adapted to big data technologies
12
Basic DS Tools: From Command-line to GUI
•  Quick-and-dirty tricks using command-line tools
–  Fast feedback (interactive)
–  Fast to process
–  Easy to write, hard to read
–  Background processing (screen)
•  Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
•  Python and R
–  RStudio
–  IPython (IPython Notebook)

Ian Huston, Alex Kagoshima, Ronert Obst
13
Favorite Python and R packages and resources
•  Python
–  NumPy
–  SciPy
–  scikit-learn (machine learning package)
–  statsmodels
–  pandas
–  PyMC
–  IPython (IPython Notebook)
–  matplotlib

Ian Huston, Alex Kagoshima, Ronert Obst
14
Favorite Python and R packages, resources, and more
•  R
–  ggplot
–  reshape
–  plyr
–  Shiny
–  Good support for time series analyses
–  RStudio (weave)
–  foreach, parallel
–  task views
–  parboost

Ian Huston, Alex Kagoshima, Ronert Obst
15
What do I do at Pivotal?
A New Platform for a New Era
DATA-DRIVEN APPLICATION DEVELOPMENT

App Fabric: “The new Middleware”
Data Fabric: “The new Database”
Cloud Fabric: “The new OS”
...ETC: “The new Hardware”
16
Pivotal Big Data Technology: HAWQ
Think of it as multiple PostgreSQL servers: a master plus segments (workers).
Rows are distributed across segments by a particular field (or randomly).
Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
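
The row distribution described on this slide can be sketched in a few lines of Python. This is only an illustration: `assign_segment` and `distribute` are hypothetical helpers, CRC32 is an arbitrary stand-in for whatever hash the database really uses, and the sample rows are made up.

```python
import random
import zlib
from collections import defaultdict


def assign_segment(key, num_segments):
    """Map a distribution-key value to a segment index (hypothetical helper)."""
    # CRC32 is chosen only because it is deterministic across runs;
    # the real database uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, num_segments, key=None, seed=0):
    """Spread rows over segments by a key column, or randomly if key is None."""
    rng = random.Random(seed)
    segments = defaultdict(list)
    for row in rows:
        if key is None:
            seg = rng.randrange(num_segments)
        else:
            seg = assign_segment(row[key], num_segments)
        segments[seg].append(row)
    return segments


# Made-up rows: 20 readings from 5 sites.
rows = [{"site_id": i % 5, "reading": i * 0.1} for i in range(20)]
by_key = distribute(rows, num_segments=4, key="site_id")
at_random = distribute(rows, num_segments=4)
```

With a key, all rows sharing a `site_id` land on the same segment, which is what lets per-key work run without moving rows between segments; random distribution only balances the load.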

17
Performance Through Parallelism
•  Automatic parallelization
–  Load and query like any database
–  Automatically distributed tables across nodes
•  Analytics-oriented query optimization
•  Scalable MPP architecture
–  All nodes can scan and process in parallel
–  Linear scalability by adding nodes

18
Data Science Tools for Big Data
COMMERCIAL

OPEN SOURCE (OR FREE)

PL/R, PL/Python, PL/Java

22
Making sense of your “big data”
•  Large volumes of data may be difficult to understand
–  ~100 tables
–  Tens of thousands of columns
•  How do you build models that use all the data? Score all the data?
•  Where do you focus your effort?
–  Getting a rapid grasp of relevant fields is important
–  Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data first
–  Look for columns with little or no variation or only null values
•  These functions exist in MADlib
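
To make the idea of "columns with little or no variation or only null values" concrete, here is a small Python sketch. It does not use MADlib (the slide only says such functions exist there); `profile_columns` and `uninformative_columns` are hypothetical names, and the rows are made up.

```python
def profile_columns(rows):
    """Return per-column null counts and distinct non-null value counts."""
    columns = sorted({name for row in rows for name in row})
    profile = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        non_null = [v for v in values if v is not None]
        profile[name] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return profile


def uninformative_columns(rows):
    """Columns that are entirely null or hold a single constant value."""
    return [name for name, stats in profile_columns(rows).items()
            if stats["distinct"] <= 1]


# Made-up sample rows.
rows = [
    {"depth": 10.0, "site": "A", "status": "ok", "notes": None},
    {"depth": 12.5, "site": "B", "status": "ok", "notes": None},
    {"depth": None, "site": "A", "status": "ok", "notes": None},
]
flagged = uninformative_columns(rows)  # columns worth dropping before modeling
```

Here `status` (constant) and `notes` (all null) are flagged. On tables with tens of thousands of columns the same summary would be computed inside the database rather than in client memory.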
23
MADlib In-Database Functions
Predictive Modeling Library

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Linear Systems
•  Sparse and Dense Solvers

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)
Correlation
Summary

Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
24
MADlib in Action: Regression on Billions of Rows

•  Input Data
–  10s of millions of rows from data collected at multiple drill testing sites
–  Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit, and more
•  Data Massaging and Review
–  Rapid summarization of many columns of data to identify outliers and missing data and remove them from analysis
–  Used window functions to construct a moving average (smoothing) of all the features and dependent variable
•  Model
–  Linear regression on the complete dataset
–  K-means clustering to determine similarities of sites
Rashmi Raghu
Drilling into the San Andreas Fault at Parkfield California.
Credit: Stephen H. Hickman, USGS
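
The moving-average smoothing mentioned on this slide was done with SQL window functions; the Python sketch below only illustrates the same computation on one column. It is roughly what `AVG(x) OVER (ORDER BY t ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` returns for a window of 3. The function name and the sample readings are assumptions, not taken from the talk.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early points average over fewer values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for v in values:
        if len(recent) == window:
            total -= recent[0]  # value about to fall out of the window
        recent.append(v)
        total += v
        smoothed.append(total / len(recent))
    return smoothed


# Made-up sensor readings.
rate_of_penetration = [4.0, 6.0, 5.0, 9.0, 7.0]
smoothed = moving_average(rate_of_penetration, window=3)
```

Keeping a running total means each new row costs constant work, which matters when the same smoothing is applied to every feature column.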

25
Linear Regression: Streaming Algorithm
•  Finding linear dependencies between variables
•  How to compute with a single scan?
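
One way to see why a single scan can be enough, assuming ordinary least squares solved through the normal equations (the slide itself does not spell out the derivation, so treat this as a sketch). Here $x_i$ is the $i$-th row of $X$, written as a row vector:

```latex
\hat{\beta} = (X^T X)^{-1} X^T y,
\qquad
X^T X = \sum_i x_i^T x_i,
\qquad
X^T y = \sum_i x_i^T y_i
```

Both sums can be accumulated one row at a time, so the scan only has to keep a $d \times d$ matrix and a length-$d$ vector in memory, where $d$ is the number of features.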

26
Linear Regression: Parallel Computation
$X^T y = \sum_i x_i^T y_i$

28
Linear Regression: Parallel Computation
$X_1^T y_1 + X_2^T y_2 = X^T y$

Segment 1 computes $X_1^T y_1$, Segment 2 computes $X_2^T y_2$, and the Master adds the partial results to obtain $X^T y$.
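
The segment-then-master pattern on this slide can be sketched in plain Python. This is an assumption-laden illustration, not MADlib's implementation: the function names are hypothetical, the final solve uses textbook Gaussian elimination on the normal equations, and the data is a made-up line y = 2 + 3t split across two "segments".

```python
def partial_sums(rows):
    """One segment's share of X^T X and X^T y, from a single pass over its rows."""
    d = len(rows[0][0])  # assumes the segment holds at least one row
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for a in range(d):
            xty[a] += x[a] * y
            for b in range(d):
                xtx[a][b] += x[a] * x[b]
    return xtx, xty


def combine(p, q):
    """Master step: add two segments' partial sums element by element."""
    (a1, b1), (a2, b2) = p, q
    xtx = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    xty = [u + v for u, v in zip(b1, b2)]
    return xtx, xty


def solve_normal_equations(xtx, xty):
    """Solve (X^T X) beta = X^T y by Gaussian elimination with partial pivoting."""
    n = len(xty)
    m = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = s / m[r][r]
    return beta


# Each row is ([1, t], y); the leading 1 is the intercept term.
segment_1 = [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)]
segment_2 = [([1.0, 2.0], 8.0), ([1.0, 3.0], 11.0)]
xtx, xty = combine(partial_sums(segment_1), partial_sums(segment_2))
beta = solve_normal_equations(xtx, xty)
```

Only the small partial sums travel to the master, never the rows themselves, so the cost of the final step does not grow with the number of rows.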
29
Performing a linear regression on 10 million
rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of
the VLDB Endowment 5.12 (2012): 1700-1711.

31
Calling MADlib Functions: Fast Training, Scoring
•  MADlib allows users to easily create models without moving data out of the system
–  Model generation
–  Model validation
–  Scoring (evaluation of) new data
•  All the data can be used in one model
•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
•  Open-source lets you tweak and extend methods, or build your own

MADlib model function

SELECT madlib.linregr_train('houses',                    -- table containing training data
                            'houses_linregr',            -- table in which to save results
                            'price',                     -- column containing dependent variable
                            'ARRAY[1, tax, bath, size]', -- features included in the model
                            'bedroom');                  -- create multiple output models (one for each value of bedroom)
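
What "one model for each value of bedroom" means can be illustrated outside the database. The sketch below fits a one-feature least-squares line per group; it is not what MADlib runs, the helper names are hypothetical, and the house rows are made-up values (only the column names `bedroom`, `size`, and `price` come from the slide).

```python
from collections import defaultdict


def fit_line(points):
    """Least-squares (intercept, slope) for (x, y) pairs; needs at least two distinct x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_by_group(rows, group_col, x_col, y_col):
    """Fit a separate line for every distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append((row[x_col], row[y_col]))
    return {g: fit_line(pts) for g, pts in groups.items()}


# Made-up rows: price = 10000 + 100*size for 2 bedrooms,
#               price = 20000 + 120*size for 3 bedrooms.
houses = [
    {"bedroom": 2, "size": 800, "price": 90000},
    {"bedroom": 2, "size": 1000, "price": 110000},
    {"bedroom": 2, "size": 1200, "price": 130000},
    {"bedroom": 3, "size": 1000, "price": 140000},
    {"bedroom": 3, "size": 1500, "price": 200000},
    {"bedroom": 3, "size": 2000, "price": 260000},
]
models = fit_by_group(houses, "bedroom", "size", "price")
```

Each group yields its own coefficients, so `models[2]` and `models[3]` recover the two different lines used to generate the rows.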

32
Calling MADlib Functions: Fast Training, Scoring
MADlib model function

SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]');

MADlib model scoring function

SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size],
                              m.coef
                             ) AS predict
FROM houses,            -- table with data to be scored
     houses_linregr m;  -- table containing model
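
For a linear model, scoring a row amounts to a dot product between the coefficient vector and the feature array. The sketch below shows that arithmetic; `linregr_predict_sketch` is a hypothetical stand-in, not the MADlib function, and the coefficients and row values are made up.

```python
def linregr_predict_sketch(features, coef):
    """Dot product of a feature array and a coefficient vector."""
    if len(features) != len(coef):
        raise ValueError("features and coef must have the same length")
    return sum(f * c for f, c in zip(features, coef))


# Hypothetical coefficients for the feature array [1, tax, bath, size].
coef = [2.0, 0.5, 10.0, 3.0]
row = {"tax": 100.0, "bath": 2.0, "size": 50.0}
prediction = linregr_predict_sketch(
    [1.0, row["tax"], row["bath"], row["size"]], coef
)
```

The leading `1.0` pairs with the intercept coefficient, mirroring the `ARRAY[1, tax, bath, size]` expression used for both training and scoring.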

33
PivotalR: Bringing MADlib and HAWQ to a familiar R interface
•  Challenge
Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics
•  Simple solution:
Translate R code into SQL

PivotalR

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                                  + bath
                                  + size
                            , data = d)

SQL Code

SELECT madlib.linregr_train('houses',
                            'houses_linregr',
                            'price',
                            'ARRAY[1, tax, bath, size]');

http://gopivotal.github.io/PivotalR/

Woo Jung
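
The "translate R code into SQL" idea can be illustrated with a toy translator. This is not how PivotalR works internally; `formula_to_linregr_sql` is a hypothetical helper that handles only additive formulas of plain column names and does no quoting or escaping, so it should not be fed untrusted input.

```python
def formula_to_linregr_sql(formula, source_table, output_table):
    """Turn 'y ~ a + b' into a linregr_train call like the one on the slide."""
    response, _, rhs = formula.partition("~")
    response = response.strip()
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    if not response or not terms:
        raise ValueError("expected a formula of the form 'y ~ a + b'")
    features = ", ".join(["1"] + terms)  # the leading 1 is the intercept
    return (
        f"SELECT madlib.linregr_train('{source_table}', "
        f"'{output_table}', '{response}', 'ARRAY[{features}]');"
    )


sql = formula_to_linregr_sql(
    "price ~ tax + bath + size", "houses", "houses_linregr"
)
```

The resulting string matches the SQL shown on the slide, which is the point of the design: the R session holds only the statement, while the table stays in the database.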
34
PivotalR: Bringing MADlib and HAWQ to a familiar R interface

PivotalR

# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()!

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size
                            , data = d)
http://gopivotal.github.io/PivotalR/

Woo Jung
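
Indicator coding with a baseline level, as in the `as.factor(state)` example, can be sketched as follows. The function name is hypothetical and the state values are made up; this shows the general technique, not PivotalR's own code.

```python
def indicator_code(values, baseline=None):
    """Return (non-baseline levels, 0/1 indicator rows) for a categorical column."""
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    if baseline not in levels:
        raise ValueError("baseline must be one of the observed levels")
    kept = [lvl for lvl in levels if lvl != baseline]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


# Made-up state codes; state 1 is the baseline, as on the slide.
levels, coded = indicator_code([1, 2, 3, 2, 1], baseline=1)
```

The baseline level gets an all-zero row, so its effect is absorbed into the intercept and every other level's coefficient reads as a shift relative to it.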
35
PivotalR Design Overview
•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R functions
•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database

[Diagram: PivotalR translates R into SQL and holds no data; RPostgreSQL sends the SQL to execute to the database with MADlib, where the data lives, and returns the computation results]
http://gopivotal.github.io/PivotalR/

Woo Jung
36
PivotalR: Current Features

MADlib Functionality
•  Linear Regression
•  Logistic Regression
•  Elastic Net
•  ARIMA
•  Marginal Effects
•  Cross Validation
•  Bagging
•  summary on model objects
•  Automated Indicator Variable Coding: as.factor
•  predict

And more ... (SQL wrapper)
•  + - * / %% %/% ^
•  $ [ [[ $<- [<- [[<-
•  == != > < >= <=
•  & | !
•  is.na
•  by
•  merge
•  sort
•  dim names
•  preview content
•  db.data.frame as.db.data.frame
•  c mean sum sd var min max length colMeans colSums
•  db.connect db.disconnect db.list db.objects db.existsObject delete
http://gopivotal.github.io/PivotalR/

37
http://gopivotal.github.io/PivotalR/

Woo Jung
38
http://www.rstudio.com/shiny/
http://gopivotal.github.io/PivotalR/

Woo Jung
39
Shiny Showcase: Example Web Apps in R
•  Users can choose input parameters with sliders, drop-downs, and text fields.
•  HTML/JavaScript knowledge not required.

http://www.rstudio.com/shiny/
41
http://d3js.org/
42
D3 Data-Driven Documents

http://d3js.org/
44
PyMADlib
•  Python wrapper for MADlib

http://nbviewer.ipython.org/gist/vatsan/5275846
46
Procedural Languages in Big Data Science
•  HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel (embarrassingly parallel tasks)
•  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks
•  Examples of “data parallel” problems:
–  Counting words in documents
–  Genome-Wide Association Study
–  Studying network anomalies

http://gopivotal.github.io/gp-r/

[Diagram: SQL & R code goes from the master servers over the network interconnect to the segment servers; each document is stemmed and counted independently (Doc1 → Stem1 → Count1, Doc2 → Stem2 → Count2, ..., DocM → StemM → CountM)]
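
The word-counting example is data parallel because each document can be processed without looking at any other. The Python sketch below shows that structure; a thread pool stands in for the segment servers (in CPython it demonstrates independence of the tasks rather than real CPU speedup), stemming is left out, and the documents are made up.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(document):
    """Count lowercase word occurrences in a single document."""
    return Counter(re.findall(r"[a-z']+", document.lower()))


# Made-up documents; each one is an independent unit of work.
documents = [
    "the drill bit",
    "The drill site and the bit",
    "network anomalies",
]

# No task needs another task's result, so the work can be mapped freely.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_words, documents))
```

Because `count_words` takes one document and returns one result, the same function could be applied per row inside the database, which is the role a PL/X function plays on the slide.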

47
Structure of input table for PL/R function

Column             Description
Network ID         ID of the network. 300K in total.
Topology           Array of integers defining the topology tree.
Network Readings   Array of readings from network terminal points over (say) a week.

[Diagram: example topology tree with nodes 0, A, B, C, D and terminal readings]

•  Topology: Hubs connected to multiple terminal points
•  Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments
•  Detecting anomalies within subnetworks on future observations

Vivek Ramamurthy
48
Performance Analysis

Number of networks   Time/network (ms)   Total time (seconds)
500                  6.604               3.30
1000                 3.637               3.64
5000                 2.822               14.11
10,000               2.356               23.56
50,000               2.160               108.02
100,000              2.142               214.20
150,000              2.162               324.29
200,000              2.142               428.48
250,000              2.138               534.69
300,000              2.132               639.85

[Chart: Execution time v/s number of networks; y-axis: Time (seconds), x-axis: Number of networks (in thousands)]

Vivek Ramamurthy
51
Performance Analysis

R package used                 optim       quadprog    Rsymphony   Rglpk
Single network in R (time)     ~60 s       6.3 s       0.145 s     0.181 s
300K networks in PL/R (time)   ~84 hrs     5.87 hrs    10.7 min    14.6 min
Time per network in PL/R       1005.2 ms   70.44 ms    2.13 ms     2.92 ms

COIN-OR: Computational Infrastructure for Operations Research
http://www.coin-or.org/
–  Libraries for linear and non-linear programming, integer programming
–  SYMPHONY: Callable library in COIN-OR for solving mixed integer linear programs

GLPK: GNU Linear Programming Kit
–  Used for large-scale LPs, MIPs and related problems
http://www.gnu.org/software/glpk/
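
The totals in the table follow from the per-network times by simple arithmetic, which the short Python check below reproduces. The per-network figures are the ones on the slide; the variable and function names are only for illustration.

```python
NETWORKS = 300_000

# Time per network in PL/R, in milliseconds, as reported on the slide.
ms_per_network = {
    "optim": 1005.2,
    "quadprog": 70.44,
    "Rsymphony": 2.13,
    "Rglpk": 2.92,
}


def total_seconds(ms_each, count=NETWORKS):
    """Total wall time in seconds for `count` networks at `ms_each` milliseconds each."""
    return ms_each * count / 1000.0


totals = {pkg: total_seconds(ms) for pkg, ms in ms_per_network.items()}
hours = {pkg: seconds / 3600.0 for pkg, seconds in totals.items()}
minutes = {pkg: seconds / 60.0 for pkg, seconds in totals.items()}
```

`hours["optim"]` comes out near 84 and `minutes["Rsymphony"]` near 10.7, matching the "300K networks in PL/R" row, so the choice of solver accounts for the gap between days and minutes.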

Vivek Ramamurthy
52
Natural language processing
Data sources
•  Text sources: documents, books, emails
•  Speech: phone logs, conversations

NLP processing pipeline
•  Sentence detection
•  Tokenization
•  Morphological stemming
•  Stop word removal
•  Word-sense disambiguation
•  Part-of-Speech tagging
•  Syntactic parsing
•  Semantic role labeling
•  Entity recognition
•  Reference resolution
•  Event processing
•  …

Applications
•  Word clouds
•  Topic modeling
•  Sentiment analysis
•  Machine translation
•  Document classification
•  Document summarization
•  Language generation
•  Search
•  Question answering
•  Information Extraction

Common tasks/tools in NLP

Niels Kasch
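
The first few pipeline stages (sentence detection, tokenization, stop word removal, stemming) can be sketched with the Python standard library alone. Everything here is a simplification for illustration: the stop word list is tiny, `naive_stem` is a crude suffix stripper rather than a real stemmer such as the ones shipped with NLTK, and the sample text is made up.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "are", "to", "in"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def pipeline(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    return [
        [naive_stem(t) for t in remove_stop_words(tokenize(s))]
        for s in split_sentences(text)
    ]


result = pipeline("The drills are running. Sensors reported anomalies in the network!")
```

The output contains rough stems such as `runn` and `anomalie`, which shows why the dedicated tools listed in this talk are preferable to hand-rolled rules for anything beyond a quick look at the data.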
56
Open source tools for common NLP tasks

WORD CLOUDS
Relevant NLP tools: tokenization, stemming/lemmatization, stop word removal
Open source software:
•  GPText
•  Apache UIMA
•  OpenNLP (Java)
•  NLTK (Python)
•  WordNet
•  Pytagcloud

TOPIC MODELING / TEXT CLASSIFICATION
Relevant NLP tools: tokenization, stemming/lemmatization, stop word removal, language detection
Open source software:
•  MADlib (PLDA)
•  gensim (LSA & LDA package for Python)
•  https://code.google.com/p/language-detection/
•  GPText and MADlib
•  OpenNLP
•  NLTK

INFORMATION EXTRACTION
Relevant NLP tools: sentence detection, tokenization, language detection, relationship extraction, syntactic parsing, entity extraction
Open source software:
•  Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)

Niels Kasch
58
Topic Analysis – MADlib pLDA

Natural Language Processing - GPText
[Diagram: stages shown are social media input, filter relevant content, tokenizer, stemming and frequency filtering, align data, and prepare dataset for topic modeling]

MADlib Topic Model
[Diagram: outputs shown are topic composition, topic clouds, and topic graph]

Srivatsan Ramanujam
59
Is there more? What’s next?
blog.gopivotal.com/tag/data-science
blog.gopivotal.com/tag/data-science-tech

60
BUILT FOR THE SPEED OF BUSINESS

More Related Content

What's hot

Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Big Data Spain
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
 
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesAnalyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesSrivatsan Ramanujam
 
Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...GetInData
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Srivatsan Ramanujam
 
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark Summit
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramSkillspeed
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at UberDataWorks Summit
 
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in ActionPivotal: Data Scientists on the Front Line: Examples of Data Science in Action
Pivotal: Data Scientists on the Front Line: Examples of Data Science in ActionEMC
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
Distributed Deep Learning At Scale On Apache Spark With BigDL
Distributed Deep Learning At Scale On Apache Spark With BigDLDistributed Deep Learning At Scale On Apache Spark With BigDL
Distributed Deep Learning At Scale On Apache Spark With BigDLYulia Tell
 
Hadoop,Big Data Analytics and More
Hadoop,Big Data Analytics and MoreHadoop,Big Data Analytics and More
Hadoop,Big Data Analytics and MoreTrendwise Analytics
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Predictive Analytics with Hadoop
Predictive Analytics with HadoopPredictive Analytics with Hadoop
Predictive Analytics with HadoopDataWorks Summit
 
Big data today and tomorrow
Big data today and tomorrowBig data today and tomorrow
Big data today and tomorrowmagda3695
 

What's hot (20)

Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...
 
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesAnalyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity Futures
 
Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...Monitoring environment based on satellite data with Python and PySpark - Albe...
Monitoring environment based on satellite data with Python and PySpark - Albe...
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)Python Powered Data Science at Pivotal (PyData 2013)
Python Powered Data Science at Pivotal (PyData 2013)
 
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
 
Run Your First Hadoop 2.x Program
Run Your First Hadoop 2.x ProgramRun Your First Hadoop 2.x Program
Run Your First Hadoop 2.x Program
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Data Science Tools for Big Data Analysis

  • 1. BUILT FOR THE SPEED OF BUSINESS
  • 2. Data Science as a Commodity: How to use MADlib, R, and other Publicly Available and Open Source Tools for Data Science
      Pivotal OSS Meetups
      Sarah Aerni, Pivotal Senior Data Scientist
      @itweetsarah, saerni@gopivotal.com
      January 28, 2014
      © Copyright 2014 Pivotal. All rights reserved.
  • 3. What we will cover in today’s Meetup
      • What is data science, big data, buzzword, buzzword?
      • What are some examples of data science in action?
      • What do I do at Pivotal?
      • Who are our data scientists?
      • Why is open source software important for data science?
      • What do I do with loads of data?
      • How can I create good models?
      • What types of open source tools can I use to build models?
      • How can I build a quick app?
      • What can I do to get started analyzing text data?
      • Which tools exist to create visualizations of my data that I can understand?
      • What tools does our team use? For NLP? For optimization? For regression?
  • 4. What we will not cover #notdatascience
  • 5. Instead: Practical Data Science Tools #useful
      – Kaushik Das
      http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science
  • 7. Instead: Practical Data Science Tools #useful
      “At companies where there is no framework for operationalization of the models, PowerPoint is where models go to die!”
      – Hulya Farinas
      http://venturebeat.com/2013/12/03/how-to-revolutionize-healthcare-get-data-scientists-and-app-developers-together/
      “The use of statistical and machine learning techniques on big multistructured data — in a distributed computing environment — to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.”
      – Annika Jimenez
      http://blog.gopivotal.com/news-2/annika-jimenez-on-disruptive-data-science-at-the-strata-conference
  • 8. DATA IS THE NEW CENTER OF GRAVITY
      Data > Application!
      “BIG DATA IS THE NEW NORMAL”
      “‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”
  • 9. What Can “Small Data” Scientists Bring on Their “Big Data” Journey?
      http://factspy.net/the-difference-between-geeks-vs-nerds/
  • 10. What Can “Small Data” Scientists Bring on Their “Big Data” Journey?
      [Diagram contrasting Small Data and Big Data. Recoverable labels: Databases, Flat files, In-memory model building, Command-line tools, MapReduce, HDFS, Cloud computing, Distributed computing.]
      Many tools and approaches are being adapted to big data technologies
  • 13. Basic DS Tools: From Command-line to GUI
      • Quick-and-dirty tricks using command-line tools
          – Fast feedback: interactive
          – Fast to process
          – Easy to write, hard to read
          – Background processing (screen)
      • Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster
      • Python and R
          – RStudio
          – IPython (IPython Notebook)
      Ian Huston, Alex Kagoshima, Ronert Obst
  • 14. Favorite Python and R packages and resources
      • Python
          – NumPy
          – SciPy
          – scikit-learn (machine learning package)
          – statsmodels
          – pandas
          – PyMC
          – IPython (IPython Notebook)
          – matplotlib
      Ian Huston, Alex Kagoshima, Ronert Obst
  • 15. Favorite Python and R packages, resources, and more
      • R
          – ggplot
          – reshape
          – plyr
          – Shiny
          – Good support for time series analyses
          – RStudio (weave)
          – foreach, parallel
          – taskviews
          – parboost
      Ian Huston, Alex Kagoshima, Ronert Obst
  • 16. What do I do at Pivotal? A New Platform for a New Era
      DATA-DRIVEN APPLICATION DEVELOPMENT
      • App Fabric: “The new Middleware”
      • Data Fabric: “The new Database”
      • Cloud Fabric: “The new OS”
      • ...ETC: “The new Hardware”
  • 17. Pivotal Big Data Technology: HAWQ
      • Think of it as multiple PostgreSQL servers: a master plus segments (workers)
      • Rows are distributed across segments by a particular field (or randomly)
      Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
  • 18. Performance Through Parallelism
      • Automatic parallelization
          – Load and query like any database
          – Tables are automatically distributed across nodes
      • Analytics-oriented query optimization
      • Scalable MPP architecture
          – All nodes can scan and process in parallel
          – Linear scalability by adding nodes
      Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database
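
The row distribution and parallel scan described on slides 17 and 18 can be pictured with a small sketch. This is only an illustration under stated assumptions: the CRC32 hash, the `site_id` and `depth` field names, and the helper names are hypothetical and are not how HAWQ or Greenplum actually assign rows.

```python
import zlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def parallel_sum(segments, value_field):
    """Each segment computes a partial sum; the master adds the partials."""
    partials = [sum(r[value_field] for r in seg_rows) for seg_rows in segments.values()]
    return sum(partials)


# Hypothetical sensor rows keyed by site.
rows = [{"site_id": i % 7, "depth": float(i)} for i in range(1000)]
segments = distribute(rows, "site_id", num_segments=4)
total = parallel_sum(segments, "depth")
```

Because every row with the same key lands on the same segment, per-key work such as grouping by `site_id` needs no data movement between segments in this toy model; only the small partial results travel to the master.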
  • 19. Data Science Tools for Big Data
      [Logo chart with two groups: COMMERCIAL and OPEN SOURCE (OR FREE). Recoverable labels: PL/R, PL/Python, PL/Java.]
  • 23. Making sense of your “big data”
      • Large volumes of data may be difficult to understand
          – ~100 tables
          – Tens of thousands of columns
      • How do you build models that use all the data? Score all the data?
      • Where do you focus your effort?
          – Getting a rapid grasp of relevant fields is important
          – Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data
          – Columns with little or no variation or only null values
      • These functions exist in MADlib
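
As a rough illustration of the column triage described above (finding columns that are all null or that never vary), here is a minimal single-pass sketch in plain Python. It is not MADlib's implementation; the function names and the sample column names are hypothetical, and it keeps exact sets of distinct values, which would not scale to the table sizes the slide mentions.

```python
def profile_columns(rows):
    """Single pass over rows (dicts): count nulls and collect distinct values per column."""
    stats = {}
    total = 0
    for row in rows:
        total += 1
        for col, value in row.items():
            s = stats.setdefault(col, {"nulls": 0, "distinct": set()})
            if value is None:
                s["nulls"] += 1
            else:
                s["distinct"].add(value)
    return total, stats


def uninformative_columns(rows):
    """Flag columns that are entirely null or hold at most one distinct non-null value."""
    total, stats = profile_columns(rows)
    flagged = {}
    for col, s in stats.items():
        if s["nulls"] == total:
            flagged[col] = "all null"
        elif len(s["distinct"]) <= 1:
            flagged[col] = "constant"
    return flagged


# Hypothetical sample: only "rop" carries any signal.
sample = [
    {"rop": 1.2, "rig": "A", "notes": None},
    {"rop": 3.4, "rig": "A", "notes": None},
    {"rop": 2.2, "rig": "A", "notes": None},
]
flagged = uninformative_columns(sample)
```

On real volumes the exact `set` would typically be replaced by an approximate distinct-count estimator, and the pass would run inside the database rather than in client code.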
  • 24. MADlib In-Database Functions. Predictive Modeling Library: Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber-White, clustered, marginal effects); Matrix Factorization •  Singular Value Decomposition (SVD) •  Low-Rank; Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation; Linear Systems •  Sparse and Dense Solvers. Descriptive Statistics: Sketch-based Estimators •  CountMin (Cormode-Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values); Correlation; Summary. Support Modules: Array Operations; Sparse Vectors; Random Sampling; Probability Functions © Copyright 2014 Pivotal. All rights reserved. 24
  • 25. MADlib in Action: Regression on Billions of Rows Ÿ  Input Data –  10s of millions of rows from data collected at multiple drill testing sites –  Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit and more Ÿ  Data Massaging and Review –  Rapid summarization of many columns of data - to identify outliers, missing data and remove them from analysis –  Used window functions to construct a moving average (smoothing) of all the features and dependent variable Ÿ  Model –  Linear regression on the complete dataset –  K-means clustering to determine similarities of sites Rashmi Raghu © Copyright 2014 Pivotal. All rights reserved. Drilling into the San Andreas Fault at Parkfield California. Credit: Stephen H. Hickman, USGS 25
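The smoothing step mentioned above (a moving average built with window functions) can be sketched in plain Python. This is an illustrative analogue of a trailing window such as "the current row and the two preceding rows", not the SQL used in the project; the function name and the sample readings are hypothetical.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average over up to `window` most recent values.

    The first few outputs average over fewer values, which mirrors how a
    trailing SQL window frame behaves at the start of a partition.
    """
    recent = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for v in values:
        if len(recent) == window:
            total -= recent[0]  # value about to fall out of the window
        recent.append(v)
        total += v
        smoothed.append(total / len(recent))
    return smoothed


# Hypothetical sensor readings (e.g. rate of penetration over time).
rate_of_penetration = [4.0, 6.0, 5.0, 9.0, 7.0]
smoothed = moving_average(rate_of_penetration, window=3)
```

The running total keeps the work per row constant, so the whole series is smoothed in one pass.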
  • 26. Linear Regression: Streaming Algorithm Ÿ  Finding linear dependencies between variables Ÿ  How to compute with a single scan? © Copyright 2014 Pivotal. All rights reserved. 26
  • 27. Linear Regression: Parallel Computation $X^T y = \sum_i x_i^T y_i$ © Copyright 2014 Pivotal. All rights reserved. 27
  • 28. Linear Regression: Parallel Computation of $X^T y$: $X_1^T y_1$ (Segment 1) + $X_2^T y_2$ (Segment 2) = $X^T y$ (Master) © Copyright 2013 Pivotal. All rights reserved. 28
  • 30. Performing a linear regression on 10 million rows in seconds Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711. © Copyright 2013 Pivotal. All rights reserved. 30
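The single-scan, parallel computation described on the preceding slides can be sketched as follows. Ordinary least squares needs only $X^T X$ and $X^T y$, and both are sums over rows, so each segment can accumulate its own partial sums and the master can add them. This is a minimal pure-Python illustration under those assumptions, not MADlib's implementation; the function names and the tiny data set are hypothetical.

```python
def partial_sums(rows):
    """Per-segment pass: accumulate X^T X and X^T y over (features, target) rows."""
    k = len(rows[0][0])
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for x, y in rows:
        for i in range(k):
            xty[i] += x[i] * y
            for j in range(k):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: add two segments' partial sums element-wise."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X^T X) beta = X^T y by Gaussian elimination with partial pivoting."""
    n = len(xty)
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (aug[i][n] - s) / aug[i][i]
    return beta


# Two hypothetical "segments", each holding part of the data; y = 2 + 3*x exactly.
# The leading 1.0 in each feature vector is the intercept term.
segment_1 = [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)]
segment_2 = [([1.0, 2.0], 8.0), ([1.0, 3.0], 11.0)]

combined = merge(partial_sums(segment_1), partial_sums(segment_2))
beta = solve(*combined)
```

Because the partial sums have a fixed size (k by k and k), the amount of data sent to the master does not grow with the number of rows, which is what makes the single-scan approach attractive.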
  • 31. Calling MADlib Functions: Fast Training, Scoring Ÿ  MADlib allows users to easily create models without moving data out of the system –  Model generation –  Model validation –  Scoring (evaluation of) new data Ÿ  All the data can be used in one model Ÿ  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature) Ÿ  Open-source lets you tweak and extend methods, or build your own. MADlib model function: SELECT madlib.linregr_train('houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]'); Arguments, in order: table containing training data; table in which to save results; column containing dependent variable; features included in the model © Copyright 2014 Pivotal. All rights reserved. 31
  • 32. Calling MADlib Functions: Fast Training, Scoring Ÿ  MADlib allows users to easily create models without moving data out of the system –  Model generation –  Model validation –  Scoring (evaluation of) new data Ÿ  All the data can be used in one model Ÿ  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature) Ÿ  Open-source lets you tweak and extend methods, or build your own. MADlib model function: SELECT madlib.linregr_train('houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]', 'bedroom'); Arguments, in order: table containing training data; table in which to save results; column containing dependent variable; features included in the model; grouping column, which creates multiple output models (one for each value of bedroom) © Copyright 2014 Pivotal. All rights reserved. 32
  • 33. Calling MADlib Functions: Fast Training, Scoring Ÿ  MADlib allows users to easily create models without moving data out of the system –  Model generation –  Model validation –  Scoring (evaluation of) new data Ÿ  All the data can be used in one model Ÿ  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature) Ÿ  Open-source lets you tweak and extend methods, or build your own. Training: SELECT madlib.linregr_train('houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]'); MADlib model scoring function: SELECT houses.*, madlib.linregr_predict(ARRAY[1, tax, bath, size], m.coef) AS predict FROM houses, houses_linregr m; Here houses is the table with data to be scored and houses_linregr is the table containing the model © Copyright 2014 Pivotal. All rights reserved. 33
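Scoring with a fitted linear model amounts to a dot product between each row's feature vector and the coefficient vector stored in the model table. The sketch below shows that arithmetic in plain Python; it is not MADlib's code, the function name is hypothetical, and the coefficients and house rows are made-up values used only to show the calculation.

```python
def predict_linear(features, coef):
    """Dot product of a feature vector with model coefficients."""
    if len(features) != len(coef):
        raise ValueError("features and coefficients must have the same length")
    return sum(f * c for f, c in zip(features, coef))


# Hypothetical coefficients for [intercept, tax, bath, size].
coef = [10.0, 0.5, 20.0, 0.125]

# Hypothetical rows to be scored.
houses = [
    {"tax": 100.0, "bath": 2.0, "size": 1000.0},
    {"tax": 200.0, "bath": 1.0, "size": 800.0},
]

# The leading 1.0 plays the same role as the 1 in ARRAY[1, tax, bath, size].
scored = [
    dict(h, predict=predict_linear([1.0, h["tax"], h["bath"], h["size"]], coef))
    for h in houses
]
```

Each output row keeps the original columns and gains a `predict` column, which parallels the shape of the result of the scoring query on the slide.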
  • 34. PivotalR: Bringing MADlib and HAWQ to a familiar R interface Ÿ  Challenge: want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics Ÿ  Simple solution: translate R code into SQL. PivotalR: d <- db.data.frame("houses"); houses_linregr <- madlib.lm(price ~ tax + bath + size, data = d) SQL code: SELECT madlib.linregr_train('houses', 'houses_linregr', 'price', 'ARRAY[1, tax, bath, size]'); http://gopivotal.github.io/PivotalR/ Woo Jung © Copyright 2014 Pivotal. All rights reserved. 34
  • 35. PivotalR: Bringing MADlib and HAWQ to a familiar R interface Ÿ  Challenge: want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics Ÿ  Simple solution: translate R code into SQL. PivotalR example: build a regression model with a different intercept term for each state (state=1 as baseline). Note that PivotalR supports automated indicator coding à la as.factor(). d <- db.data.frame("houses"); houses_linregr <- madlib.lm(price ~ as.factor(state) + tax + bath + size, data = d) http://gopivotal.github.io/PivotalR/ Woo Jung © Copyright 2014 Pivotal. All rights reserved. 35
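Indicator coding with a baseline level, as in the `as.factor(state)` example, turns a categorical column into one 0/1 column per non-baseline level; the baseline is absorbed by the intercept. The following plain-Python sketch shows the transformation only. It is not PivotalR's implementation, and the function name and sample values are hypothetical.

```python
def indicator_columns(values, baseline=None):
    """Return (kept_levels, rows) for 0/1 indicator coding of `values`.

    One indicator is produced for every level except the baseline, which
    defaults to the smallest level.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    if baseline not in levels:
        raise ValueError("baseline must be one of the observed levels")
    kept = [level for level in levels if level != baseline]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Hypothetical state codes; state 1 is the baseline.
states = [1, 2, 3, 2, 1]
kept_levels, coded = indicator_columns(states, baseline=1)
```

Rows for the baseline state are all zeros, so their intercept is the model's intercept term, while every other state gets its own shift.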
  • 36. PivotalR Design Overview •  Call MADlib’s in-DB machine learning functions directly from R •  Syntax is analogous to native R functions •  Data doesn’t need to leave the database •  All heavy lifting, including model estimation & computation, is done in the database. Diagram: PivotalR (R → SQL; no data here) sends SQL to execute through RPostgreSQL to the database w/ MADlib (data lives here), which returns computation results http://gopivotal.github.io/PivotalR/ Woo Jung © Copyright 2014 Pivotal. All rights reserved. 36
  • 37. PivotalR: Current Features •  MADlib Functionality: Linear Regression, Logistic Regression, Elastic Net, ARIMA, Marginal Effects, Cross Validation, Bagging, summary on model objects, Automated Indicator Variable Coding (as.factor), predict •  And more ... (SQL wrapper): + - * / %% %/% ^; dim, names; $ [ [[ $<- [<- [[<-; == != > < >= <=; & | !; by, merge, sort; is.na; db.data.frame, as.db.data.frame, preview, content; c, mean, sum, sd, var, min, max, length, colMeans, colSums; db.connect, db.disconnect, db.list, db.objects, db.existsObject, delete http://gopivotal.github.io/PivotalR/ © Copyright 2014 Pivotal. All rights reserved. 37
  • 38. http://gopivotal.github.io/PivotalR/ Woo Jung © Copyright 2014 Pivotal. All rights reserved. 38
  • 40. Shiny Showcase: Example Web Apps in R Ÿ  Users can choose input parameters with sliders, drop-downs, and text fields. Ÿ  HTML/JavaScript knowledge not required. http://www.rstudio.com/shiny/ © Copyright 2014 Pivotal. All rights reserved. 40
  • 42. http://d3js.org/ © Copyright 2014 Pivotal. All rights reserved. 42
  • 43. D3 Data-Driven Documents http://d3js.org/ © Copyright 2014 Pivotal. All rights reserved. 43
  • 45. PyMADlib Ÿ  Python wrapper for MADlib http://nbviewer.ipython.org/gist/vatsan/5275846 © Copyright 2014 Pivotal. All rights reserved. 45
  • 47. Procedural Languages in Big Data Science Ÿ  HAWQ & PL/X can take advantage of “data parallel” (embarrassingly parallel) tasks by performing analyses in parallel Ÿ  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks Ÿ  Examples of “data parallel” problems: –  Counting words in documents –  Genome-Wide Association Study –  Studying network anomalies http://gopivotal.github.io/gp-r/ Diagram: SQL & R, Master Servers, Network Interconnect, Segment Servers; each document is processed independently (Doc1 → Stem1 → Count1, Doc2 → Stem2 → Count2, …, DocM → StemM → CountM) © Copyright 2014 Pivotal. All rights reserved. 47
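The word-counting example of a data-parallel task can be sketched as follows. Each document is counted independently, with no communication between tasks, which is the property the slide describes. This is a structural illustration in plain Python, not PL/R or HAWQ code: a thread pool stands in for the segment servers, and the function name and documents are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(document):
    """Count lower-cased, whitespace-separated words in one document."""
    return Counter(document.lower().split())


# Hypothetical documents, one independent task each.
documents = {
    "doc1": "the drill bit the sensor",
    "doc2": "sensor data sensor",
}

# Each call to count_words depends only on its own document, so the tasks
# can be handed to separate workers without any coordination.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = pool.map(count_words, documents.values())
    counts = dict(zip(documents, results))
```

The thread pool only illustrates the shape of the computation; in the database setting each segment would run the same function on the rows it already stores.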
  • 48. Structure of input table for PL/R function. Columns and descriptions: Network ID – ID of the network (300K in total); Topology – array of integers defining the topology tree; Network Readings – array of readings from network terminal points over (say) a week Ÿ  Topology: hubs connected to multiple terminal points (diagram nodes: 0, A, B, C, D; terminal readings) Ÿ  Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments Ÿ  Detecting anomalies within subnetworks on future observations Vivek Ramamurthy © Copyright 2014 Pivotal. All rights reserved. 48
  • 49. Performance Analysis. Number of networks | Time/network (ms) | Total time (seconds): 500 | 6.604 | 3.30; 1,000 | 3.637 | 3.64; 5,000 | 2.822 | 14.11; 10,000 | 2.356 | 23.56; 50,000 | 2.160 | 108.02; 100,000 | 2.142 | 214.20; 150,000 | 2.162 | 324.29; 200,000 | 2.142 | 428.48; 250,000 | 2.138 | 534.69; 300,000 | 2.132 | 639.85. Figure: Execution time vs. number of networks (x-axis: number of networks, in thousands; y-axis: time, in seconds) Vivek Ramamurthy © Copyright 2014 Pivotal. All rights reserved. 49
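The three columns of the table are related by simple arithmetic: total time in seconds is roughly the number of networks times the per-network time in milliseconds, divided by 1000. The short check below uses only the figures reported on the slide; the variable names are ours.

```python
# (number of networks, time per network in ms, total time in seconds),
# copied from the table on the slide.
measurements = [
    (500, 6.604, 3.30),
    (1000, 3.637, 3.64),
    (5000, 2.822, 14.11),
    (10000, 2.356, 23.56),
    (50000, 2.160, 108.02),
    (100000, 2.142, 214.20),
    (150000, 2.162, 324.29),
    (200000, 2.142, 428.48),
    (250000, 2.138, 534.69),
    (300000, 2.132, 639.85),
]

# Relative difference between the reported total and n * (ms per network) / 1000.
relative_errors = [
    abs(n * ms / 1000.0 - total) / total for n, ms, total in measurements
]
largest_error = max(relative_errors)
```

The reported totals agree with the product to within rounding. The per-network time also levels off near 2.1 ms beyond about 50,000 networks, which would be consistent with a fixed startup cost being amortized, although the slide does not state the cause.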
  • 52. Performance Analysis. R package used | Single network in R (time) | 300K networks in PL/R (time) | Time per network in PL/R: optim | ~60 s | ~84 hrs | 1005.2 ms; quadprog | 6.3 s | 5.87 hrs | 70.44 ms; Rsymphony | 0.145 s | 10.7 min | 2.13 ms; Rglpk | 0.181 s | 14.6 min | 2.92 ms. COIN-OR: Computational Infrastructure for Operations Research http://www.coin-or.org/ –  Libraries for linear and non-linear programming, integer programming –  SYMPHONY: callable library in COIN-OR for solving mixed integer linear programs. GLPK: GNU Linear Programming Kit http://www.gnu.org/software/glpk/ –  Used for large-scale LPs, MIPs and related problems Vivek Ramamurthy © Copyright 2014 Pivotal. All rights reserved. 52
  • 53. Natural language processing: common tasks/tools in NLP. Data sources: text sources (documents, books, emails); speech (phone logs, conversations). NLP processing pipeline: sentence detection, tokenization, morphological stemming, stop word removal, word-sense disambiguation, part-of-speech tagging, syntactic parsing, semantic role labeling, entity recognition, reference resolution. Applications: event processing, word clouds, topic modeling, sentiment analysis, machine translation, document classification, document summarization, language generation, search, question answering, information extraction, … Niels Kasch © Copyright 2014 Pivotal. All rights reserved. 53
  • 57. Open source tools for common NLP tasks (relevant NLP tools → open source software). WORD CLOUDS: tokenization, stemming/lemmatization, stop word removal → GPText, Apache UIMA, OpenNLP (Java), NLTK (Python), WordNet, Pytagcloud. TOPIC MODELING / TEXT CLASSIFICATION: tokenization, stemming/lemmatization, stop word removal, language detection → MADlib (PLDA), gensim (LSA & LDA package for Python), https://code.google.com/p/language-detection/. INFORMATION EXTRACTION: sentence detection, tokenization, language detection, relationship extraction, syntactic parsing, entity extraction → GPText and MADlib, OpenNLP, NLTK, Stanford CoreNLP (incl. POS tagger, NER, parser, etc.) Niels Kasch © Copyright 2014 Pivotal. All rights reserved. 57
  • 59. Topic Analysis – MADlib pLDA. Natural Language Processing (GPText): social media data; filter relevant content; align data; tokenizer; stemming, frequency filtering; prepare dataset for topic modeling. MADlib Topic Model: topic composition, topic clouds, topic graph Srivatsan Ramanujam © Copyright 2014 Pivotal. All rights reserved. 59
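The text preparation steps named above (tokenizing, removing stop words, filtering by frequency, and building the dataset for topic modeling) can be sketched in plain Python. This is an illustration only and does not use GPText or MADlib; stemming is omitted, and the stop-word list, function names, threshold, and sample posts are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "is", "of", "and", "to"}


def tokenize(text):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def build_vocabulary(token_lists, min_docs=2):
    """Keep terms that appear in at least `min_docs` documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)


def bag_of_words(tokens, vocabulary):
    """Term counts for one document, ordered by the vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]


# Hypothetical social media posts.
posts = [
    "The service is great and the app is great",
    "Great app, terrible service",
    "Terrible battery",
]
tokens = [tokenize(p) for p in posts]
vocabulary = build_vocabulary(tokens, min_docs=2)
matrix = [bag_of_words(t, vocabulary) for t in tokens]
```

Each row of `matrix` is a document expressed as term counts over the filtered vocabulary, which is the kind of input a topic model such as LDA consumes.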
  • 60. Is there more? What’s next? blog.gopivotal.com/tag/data-science blog.gopivotal.com/tag/data-science-tech © Copyright 2014 Pivotal. All rights reserved. 60
  • 61. BUILT FOR THE SPEED OF BUSINESS