© Copyright 2013 Pivotal. All rights reserved.
All things Python @ Pivotal (Data Science)
Oct 15, 2015
POSH meetup
Srivatsan Ramanujam
Principal Data Scientist
Pivotal Labs
@being_bayesian
https://xkcd.com/353/
Joint work with Pivotal Data Science & MADlib team
About Me
Graduate School
Software Engineer
Analytics
Natural Language Scientist
Research Intern
Principal Data Scientist, Data Science R&D Lead
Machine Learning Engineer (Drug Discovery)
https://www.linkedin.com/pub/srivatsan-ramanujam/7/91b/888
Agenda
• Pivotal Data Science – Introduction
• Technology Stack
• Python on the client
• Python on our Big Data Platform (BDS)
– Data Parallelism
– Model Parallelism
• Python on our Cloud Platform (PCF)
• Putting it all together – demo!
Pivotal Data Science – Introduction
Pivotal Data Science
Our Charter:
Pivotal Data Science is Pivotal’s differentiated and
highly opinionated data-centric service delivery
organization (part of Pivotal Labs)
Our Goals:
Expedite customer time-to-value and ROI, by driving
business-aligned innovation and solutions assurance
within Pivotal’s Data Fabric technologies.
Drive customer adoption and autonomy across the full
spectrum of Pivotal Data technologies through best-in-
class data science and data engineering services, with a
deep emphasis on knowledge transfer.
Data Science Data Engineering
App Dev
Pivotal Data Science Knowledge Development
PIVOTAL DATA SCIENCE TEAM
• Annika Jimenez – Global head of Data Science Services (Sr. Director, Audience
and Advertising Analytics at Yahoo!, M.I.A. in International Management, UCSD)
• Kaushik Das – Mathematical Modeling in Energy, Retail and Telco (Director of
Analytics at M-Factor, M.S. in Mineral Engineering, UC Berkeley)
• Michael Brand – Text, Speech and Video Research for Retail, Finance and Gaming
(Chief Scientist at Verint Systems, M.S. in Applied Mathematics, Weizmann
Institute)
• Woo Jung – Bayesian Inference and Demand Analysis (Sr. Statistician at M-
Factor, M.S. in Statistics, Stanford)
• Noelle Sio – Digital Media Analytics and Mathematical Modeling (Sr. Analyst at
eHarmony, Fox Interactive Media (Myspace), M.S. in Applied Mathematics, Cal
Poly Pomona)
• Rashmi Raghu – Computational Methods and Analysis (Ph.D. in Mechanical
Engineering, Stanford)
• Jarrod Vawdrey – Marketing Analytics & SAS (Analytics Consultant at Aspen
Marketing, B.S. in Mathematics, Kennesaw State University)
• Sarah Aerni – Genomics and Machine Learning (Ph.D. in Biomedical Informatics,
Stanford)
• Srivatsan Ramanujam – NLP and Text Mining (Natural Language Scientist at
Sony, Salesforce.com, M.S. in Computer Sciences, UT Austin)
• Niels Kasch – Text Analytics and NLP (Ph.D. in Computer Science, UMBC)
• Regunathan Radhakrishnan – Machine Learning, Signal Processing, Multimedia
Content Analysis, Fingerprinting & Watermarking (Research Staff at Dolby
Laboratories, MERL, Ph.D. in Electrical Engineering, NYU-Poly, Brooklyn)
• Cao Yi – Optimization and Statistical Data Mining (Sr. Marketing Analyst at Energy
Market Company Singapore, Ph.D. in Operations Research, National University of
Singapore)
• Ian Huston – Numerical Modeling, Simulation, and Analysis (Ph.D. in Theoretical
Cosmology, Queen Mary, University of London)
• Michael Natusch – Director EMEA Data Science (Chief Analyst at Cumulus Analytics,
Ph.D. in Theoretical Condensed Matter Physics, University of Cambridge)
• Greg Whalen – Director APJ Data Science (VP, Global Development Center at
Experian, M.S. in Computer Science, Columbia University)
• Hulya Farinas – Optimization, Resource Allocation in Healthcare (Modeler at M-Factor,
IBM, Ph.D. in Operations Research, University of Florida)
• Derek Lin – Network Security, Fraud Detection, Speech and Language Processing,
(Principal Scientist at RSA, M.S. in Signal Processing, USC)
• Kee Siong Ng – Statistical Modeling in Energy, Retail and Healthcare (Consulting Lead
Data Scientist at Reliance, Ph.D. in Computer Science, Australian National University)
• Jin Yu – Stochastic Optimization, Robust Statistics in Machine Learning, Computer
Vision (Research Associate at U of Adelaide, Ph.D. in Machine Learning, Australian
National University)
• Gautam Muralidhar – PhD Biomed UT Austin, Image Processing, Signal Processing
• Ailey Crow – PhD Bio-physics, UC Berkeley, Image Processing, Bio Med
• Hong Ooi – Insurance and Finance Risk Modeling (Statistician at ANZ, Ph.D. in
Statistics, Australian National University)
• Mariann Micsinai – Next Generation Sequencing (Market Risk Management Associate
at Lehman Brothers, Ph.D. in Computational Biology, NYU / Yale)
• Victor Fang – Imaging and Graph Analytics, Machine Learning (Sr. Scientist at Riverain
Medical, Ph.D. in Computer Sciences, University of Cincinnati)
• Anirudh Kondaveeti – Trajectory Data Mining and Machine Learning (Ph.D. in
Computing & Dec. Systems Eng, Arizona State University)
• Alexander Kagoshima – Time Series, Statistics and Machine Learning (M.S. in
Economics/Computer Science, TU Berlin)
• Ronert Obst – Machine Learning, Bayesian Inference, Time Series (M.S. in Statistics,
LMU Munich)
Technology and Tools
Data Science Toolkit
KEY LANGUAGES
PLATFORM
KEY TOOLS
MLlib
PL/X
Modeling Tools
Visualization Tools
Platform
Data Lake
Business Levers
Apps
Pipeline of a Data Science Driven App
MLlib
PL/X
Model Building
Model Tuning
Continuous Model
Improvement
Data Feeds
Ingest Filter Enrich Sink
SpringXD
Greenplum
Python on the client
Data Science Lab – Sample Timeline
[Timeline chart: weeks 2 to 12 on the horizontal axis]
Activities: Data Review, Feature Creation, Optimization & Validation, Code QA & Scoring, Insights Presentation, Model and Code Handoff, Feature Review, Knowledge Transfer, Model Development, Model Review
Phases: Phase 2, Phase 3, Phase 4 Model Building, Phase 5 Model Enablement
Data Science Storytelling
• We primarily use Python on the client (laptop) for data exploration, visualization and data science storytelling.
• Complex statistical models and data wrangling run in the backend on our Big Data Suite (MPP databases like Greenplum and HAWQ).
• We typically use a connector like psycopg2 to talk to the backend database and a Jupyter notebook on the laptop to document our analysis.
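The client-to-backend workflow described above can be sketched as follows. This is a minimal sketch, not the team's actual code: in practice the connection would come from `psycopg2.connect(...)` against Greenplum or HAWQ. Here the standard-library `sqlite3` module stands in, because both modules follow the Python DB-API 2.0, and the `cars` table and its values are made up for illustration.

```python
import sqlite3


def run_query(conn, sql, params=()):
    """Run a query in the database and return (column names, rows)."""
    cur = conn.cursor()
    cur.execute(sql, params)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    cur.close()
    return columns, rows


# Stand-in for psycopg2.connect(...): an in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (body_style TEXT, highway_mpg REAL)")
# sqlite3 uses ? placeholders; psycopg2 uses %s instead.
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [("sedan", 30.0), ("sedan", 34.0), ("hatchback", 38.0)],
)

# The grouping and aggregation happen in the database;
# only the small summary comes back to the notebook.
columns, rows = run_query(
    conn,
    "SELECT body_style, AVG(highway_mpg) AS avg_mpg "
    "FROM cars GROUP BY body_style ORDER BY body_style",
)
```

The point of the design is that the notebook only ever holds the summary rows, which is what keeps a laptop viable as the client for an MPP backend.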
Python Distribution
• We love Anaconda - Python with “batteries included”
– Contains the libraries in the PyData stack that we often use for data science (numpy, scipy, sklearn, statsmodels, seaborn, matplotlib, nltk etc.)
• The conda package manager takes the pain out of Python package management (remember the dreaded “pip install numpy scipy matplotlib”?)
Notebooks
• Open source, interactive data science and scientific computing across more than 40 programming languages.
• Great for data science storytelling
• A living document: models and insights “don’t die in PowerPoint slides”.
https://jupyter.org/
Data science lab templates
Seaborn
• Based on Matplotlib with the aesthetics of ggplot2 (thank you Michael Waskom!)
• Intuitive interface, tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
http://stanford.edu/~mwaskom/software/seaborn/index.html
What about machine learning?
Source: the interwebs
Machine Learning in Python : Scikit Learn
http://scikit-learn.org/stable/
Scikit Learn Cheat Sheet
http://scikit-learn.org/stable/tutorial/machine_learning_map/
‘Cheat’ with care
Numerous other libraries
topic modeling for humans
PyMC
Python in-database
Data Parallelism through PL/X : X in Python, R, Java, C/C++ and pgSQL
• For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
• The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
• plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Architecture diagram: SQL arrives at the Master Host (with a Standby Master); the Interconnect links it to four Segment Hosts, each running two Segments]
What exactly does PL/Python do?
Input Conversion
PostgreSQL type    | Python type
boolean            | bool
smallint, int      | int
bigint             | long (py2.x), int (py3.x)
real, double       | float
numeric            | decimal
bytea              | str (py2.x), bytes (py3.x)
array              | list
record             | Python mapping (dict)
NULL               | None

Output Conversion
PostgreSQL type    | Python return value
boolean            | 0 and ‘’ are false
bytea              | retval -> str -> bytea
record             | retval can be a list, tuple or dict, but not a set
Everything else    | retval is converted to a Python str and the constructor for the corresponding Postgres datatype is invoked
User Defined Functions (UDFs) in PL/Python
• Procedural languages need to be installed on each database used.
• The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, think of it as a SQL User Defined Function with Python inside.
-- SQL wrapper
CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  # Normal Python
  if a > b:
      return a
  return b
$$ LANGUAGE plpythonu;  -- SQL wrapper
Returning Results
• Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
• Composite types can be returned by creating a composite type in the database:
CREATE TYPE named_value AS (
  name text,
  value integer
);
• Then you can return a list, tuple or dict (not a set) whose structure matches the composite type:
CREATE FUNCTION make_pair (name text, value integer)
  RETURNS named_value
AS $$
  return [ name, value ]
  # or alternatively, as tuple: return ( name, value )
  # or as dict: return { "name": name, "value": value }
  # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;
• For functions which return multiple rows, prefix “setof” before the return type
http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston
Returning more results
You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:
Sequence:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;
Generator:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  for i in range(3):
      yield (name, i)
$$ LANGUAGE plpythonu;
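Outside the database, the two function bodies above behave like ordinary Python. The sketch below is plain Python, and the names `pairs_as_sequence` and `pairs_as_generator` are hypothetical. It shows that each body produces one item per output row; PL/Python then maps each item onto the `named_value` composite type.

```python
def pairs_as_sequence(name):
    # Mirrors the "Sequence" UDF body: all rows are built up front.
    return ([name, 1], [name, 2], [name, 3])


def pairs_as_generator(name):
    # Mirrors the "Generator" UDF body: rows are produced one at a time.
    for i in range(3):
        yield (name, i)


seq_rows = [tuple(row) for row in pairs_as_sequence("a")]
gen_rows = list(pairs_as_generator("a"))
```

The generator form is usually preferable for large result sets, since rows are produced lazily rather than held in memory all at once.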
Accessing Packages
• On Greenplum DB: packages must be installed on the individual segment nodes.
– Can use the “parallel ssh” tool gpssh to install
– Currently Greenplum DB ships with Python 2.6 (!)
• Then just import as usual inside the UDF:
CREATE FUNCTION make_pair (name text)
  RETURNS SETOF named_value
AS $$
  import numpy as np
  # a generator yields several rows, so the return type must be SETOF
  return ((name, int(i)) for i in np.arange(3))
$$ LANGUAGE plpythonu;
Anaconda PL/Python coming in GPDB 5.0
UCI Auto MPG Dataset – A toy problem
Sample Data
• Sample Task: Aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
• Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm, with highway_mpg as the target.
• This is a data-parallel task that can be executed in parallel by simply piggybacking on the MPP architecture: one segment can build the model for hatchbacks, another the model for sedans.
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
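The one-model-per-body-style idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the model from the talk: it uses a single feature (horsepower) with the closed-form least-squares solution rather than the five-feature regression, the rows are synthetic rather than taken from the UCI data, and `fit_line` and `fit_per_group` are hypothetical names.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_group(rows):
    """Fit one independent model per body style, as each segment would."""
    groups = defaultdict(list)
    for body_style, horsepower, highway_mpg in rows:
        groups[body_style].append((horsepower, highway_mpg))
    models = {}
    for body_style, points in groups.items():
        xs, ys = zip(*points)
        models[body_style] = fit_line(xs, ys)
    return models


# Synthetic rows: (body_style, horsepower, highway_mpg).
rows = [
    ("hatchback", 60, 44.0), ("hatchback", 80, 40.0), ("hatchback", 100, 36.0),
    ("sedan", 100, 35.0), ("sedan", 150, 30.0), ("sedan", 200, 25.0),
]
models = fit_per_group(rows)  # {body_style: (intercept, slope)}
```

Because no group's fit depends on any other group's rows, the loop over groups is exactly the part an MPP database can hand to different segments.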
Ridge Regression with scikit-learn on PL/Python
[Code screenshot: a User Defined Type, a User Defined Function (Python body inside a SQL wrapper) and a User Defined Aggregate]
PL/Python + scikit-learn : Model Coefficients
Physical machine on the cluster in which the regression model was built
Invoke UDF
Build Feature
Vector
Choose Features
One model
per body style
Model Parallelism
• Data-parallel computation via PL/Python only allows us to run ‘n’ models in parallel.
• This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to build a single model on all the available data.
• For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
MADlib : Scalable, in-database Machine Learning
http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf
Supported Platforms
PHD
HDP
Other ODPi distros
GPDB PostgreSQL
@MADlib_analytic
Functions
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
Time Series
• ARIMA
Aug 2015
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Predictive Analytics Library
Architecture
C API
(Greenplum, PostgreSQL, HAWQ)
Low-level Abstraction Layer
(array operations,
C++ to DB type-bridge, …)
RDBMS
Built-in
Functions
User Interface
High-level Iteration Layer
(iteration controller, …)
Functions for Inner Loops
(implements ML logic)
Python
SQL
C++
Eigen
Convex optimization framework
[Figure 6: The archetypical convex function f(x) = x^2]
Application | Objective
Each step has an analytical formulation that can be performed in parallel
• WITH RECURSIVE
CREATE TEMP TABLE temp
INSERT INTO temp SELECT step(...) FROM ...
SELECT converged(...) FROM temp, ...
SELECT result(...) FROM temp
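The step / converged / result pattern on this slide can be sketched as a plain Python driver loop, using the slide's example function f(x) = x^2. This is an illustration only: in MADlib each step would be a parallel SQL aggregate over the data, and the function names, learning rate and tolerance below are assumptions chosen for the example.

```python
def step(x, learning_rate=0.25):
    """One gradient-descent update for f(x) = x**2, whose gradient is 2*x."""
    return x - learning_rate * 2 * x


def converged(previous, current, tolerance=1e-8):
    """Stop when successive iterates barely move."""
    return abs(current - previous) < tolerance


def minimize(x0, max_iterations=100):
    """Driver loop: apply step() until converged(), then return the result."""
    x = x0
    for iteration in range(1, max_iterations + 1):
        new_x = step(x)
        if converged(x, new_x):
            return new_x, iteration
        x = new_x
    return x, max_iterations


x_min, iterations = minimize(10.0)
```

Only the small state (here a single number) travels between iterations, which is why the controller can live on the master while the steps run where the data is.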
What are our customers saying about us?
k-means clustering:
• finding items that are similar within an n-
dimensional space
• Lloyd’s local-search heuristic works well
in practice
• Two fundamental steps:
1. Assign each point to its closest centroid
2. Move each centroid to the barycenter/mean of all points currently assigned to it
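The two steps of Lloyd's heuristic can be sketched in a few lines of plain Python. This is a single-machine illustration with made-up points, not MADlib's parallel implementation; `closest` and `lloyd` are hypothetical names.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def lloyd(points, centroids, max_iterations=100):
    """Alternate the two steps until the centroids stop moving."""
    for _ in range(max_iterations):
        # Step 1: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
```

Step 1 is independent per point and step 2 only needs per-cluster sums and counts, which is what makes the algorithm a good fit for parallel execution across segments.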
• innova
• leader
• design
• speed
• graphics
• improvement
• bug
• installation
• download
K-means: Parallel Computation
Segment 1 Segment 2
Iteration end
Master
Driver Functions in PL/Python
• Every PL/Python UDF has access to a module called plpy, which allows you to execute SQL queries from within the PL/Python UDF
• Gives the ability to “drive” distributed computation
[Code screenshot annotations: one statement “will run and fetch data from segment nodes”; the other two “run on the master only”]
• plpy.debug(msg), plpy.log(msg), plpy.info(msg), plpy.notice(msg), plpy.warning(msg), plpy.error(msg)
are useful utility functions for logging
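A driver function body might look like the sketch below. Since `plpy` exists only inside the database, the function takes a `plpy`-like object as an argument and a small fake stands in for it here. `plpy.execute` and `plpy.info` are real PL/Python functions; the `FakePlpy` class, the table and column names, and the canned rows are hypothetical.

```python
def count_rows_per_group(plpy, table, group_col):
    """Driver logic: runs on the master and pushes the heavy query to the segments."""
    # Identifiers are interpolated directly for brevity; a real UDF should
    # quote them (for example with plpy.quote_ident where it is available).
    sql = "SELECT {0} AS grp, count(*) AS n FROM {1} GROUP BY {0}".format(
        group_col, table)
    plpy.info("running: " + sql)
    rows = plpy.execute(sql)  # result rows behave like dicts keyed by column name
    return dict((row["grp"], row["n"]) for row in rows)


class FakePlpy(object):
    """Hypothetical stand-in for the plpy module, for running outside the database."""

    def __init__(self, canned_rows):
        self.canned_rows = canned_rows
        self.messages = []

    def info(self, msg):
        self.messages.append(msg)

    def execute(self, sql):
        return self.canned_rows


fake = FakePlpy([{"grp": "sedan", "n": 2}, {"grp": "hatchback", "n": 1}])
counts = count_rows_per_group(fake, "cars", "body_style")
```

Passing the module in as a parameter is a design choice for the sketch only; inside a real UDF, `plpy` is available as a global.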
In-database parallel grid search using XGBoost
https://github.com/vatsan/gp_xgboost_gridsearch
• XGBoost (eXtreme Gradient Boosting) is a popular library used in many prize-winning Kaggle contests.
• Implemented in C++ with Python and R bindings
• Supports multi-core
• Implemented in-database parallel grid search for XGBoost using PL/Python
In-database grid search - Approach
https://github.com/vatsan/gp_xgboost_gridsearch
Refreshed data (incoming
daily/weekly/monthly updates)
feature gen.
pipeline training dataset
(distributed table)
Model
selection
structured,
unstructured
data sources
scored results
grid search
params dict
Grid params table
(expanded)
master
segments
param-list-1 param-list-n. . .
training set(serialized) training set(serialized)
Driver function
(PL/Python)
pickle
and
distribute
mdl-1 mdl-n. . .
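The "grid search params dict" to "grid params table (expanded)" step in the diagram can be sketched as below. This is not code from the gp_xgboost_gridsearch repository: `expand_grid` is a hypothetical helper, and the search space is made up, although `max_depth`, `eta` and `subsample` are real XGBoost parameter names.

```python
import itertools
import pickle


def expand_grid(params):
    """Expand {name: [values]} into one dict per parameter combination."""
    names = sorted(params)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(params[n] for n in names))]


# Hypothetical search space.
grid = expand_grid({"max_depth": [3, 6], "eta": [0.1, 0.3], "subsample": [0.8]})

# Each combination can be serialized and shipped to a segment alongside
# the serialized training set, where one model is trained per row.
payloads = [pickle.dumps(p) for p in grid]
```

Each row of the expanded grid is an independent training job, so the search reduces to the same data-parallel pattern used for per-group models.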
Model Training and Scoring : XGBoost
https://github.com/vatsan/gp_xgboost_gridsearch
Training Scoring
Python on Cloud Foundry
Ian Huston, Ronert Obst, Alex Kagoshima
What is Cloud Foundry?
http://cloudfoundry.org
Open Source Cloud Platform
Simple App Deployment,
Scaling & Availability
No Cloud Provider Lock In
@ianhuston
How can CF help data scientists?
• Jamie is a data scientist who has just finished some analysis. They want to put up a simple internal web app with JavaScript visualisations connected to internal data stores.
• Sam is a data engineer who wants to set up a REST API to expose a production machine learning model as a service.
• Alex is a data scientist who has an existing RShiny or Python app that they want to make available with multiple instances.
Cloud Foundry is a Platform
You bring the apps, the rest
is taken care of!
Source: Albert Barron (IBM),
https://www.linkedin.com/pulse/20140730172610-9679881-pizza-as-a-service
Cloud Foundry Foundation: Industry Standard
Gold
Silver
CF for data scientists & developers
Easily deploy your web app
cf push myapp
Scale up and out quickly
cf scale myapp -i 5 -m 1G
Create and bind services
cf bind-service myapp redis
Python on Cloud Foundry
• First class language (with Go, Java, Ruby, Node.js, PHP)
• Automatic app type detection
– Looks for requirements.txt or setup.py
• Buildpack takes care of
– Detecting that a Python app is being pushed
– Installing the Python interpreter
– Installing packages in requirements.txt using pip
– Starting the web app as requested (e.g. python myapp.py)
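A `myapp.py` that the buildpack could start might look like the sketch below. It is a minimal illustration using only the Python 3 standard library, not an app from the talk: the handler, the `get_port` helper and the default port are assumptions. The one platform-specific detail is that Cloud Foundry tells the app which port to listen on through the `PORT` environment variable.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class Hello(BaseHTTPRequestHandler):
    """Answers every GET request with a short plain-text message."""

    def do_GET(self):
        body = b"Hello from Python on Cloud Foundry\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def get_port(environ=os.environ, default=8080):
    """Read the port assigned by the platform, falling back to a local default."""
    return int(environ.get("PORT", default))


def main():
    """Start the server; call this from the app's entry point."""
    HTTPServer(("0.0.0.0", get_port()), Hello).serve_forever()
```

Calling `main()` at the bottom of the file and pushing it with `cf push myapp` would be the usual next step; a `requirements.txt` (possibly empty) lets the buildpack detect the app as Python.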
Official Python Buildpack
• Great for simple pip based requirements
• Well tested and officially maintained
• Covers both Python 2 and 3
✗ Suffers from the Python Packaging Problem:
- Hard to build packages with C, C++ or Fortran extensions
- Complicated local configuration of libraries and paths needed
- Takes a long time to build main PyData packages from source
Using conda for package management
• http://conda.pydata.org
• Benefits:
– Uses precompiled binary packages
– No fiddling with Fortran or C compilers and library paths
– Known good combinations of main package versions
– Really simple environment management (better than virtualenv)
– Easy to run Python 2 and 3 side-by-side
Go try it out if you haven’t already!
How to use the conda buildpack
https://github.com/ihuston/python-conda-buildpack
• Specify as a custom buildpack when pushing the app, either in the manifest or with the -b command line option.
• Export your current environment to an environment.yml file
• Or write requirements.txt (pip) and conda_requirements.txt
• Send me feedback & pull requests!
Putting it all together : Topic and
Sentiment Analysis Demo
Srivatsan Ramanujam, Greg Cobb, Vinson Chuong, Ofri Afek, Jarrod Vawdrey, Joelle Gernez
Data Science + Agile = Quick Wins
• The Team
– 1 Data Scientist
– 2 Agile Developers
– 1 Designer (part-time)
– 1 Project Manager (part-time)
• Duration
– 3 weeks!
Text Analytics Pipeline
1. Tweet Stream
2. Stored on Data Lake
3. Loaded as external tables (PXF/gpfdist)
4. Parallel parsing of JSON and extraction of fields using PL/Python
5. Topic Analysis through MADlib pLDA
6. Sentiment Analysis through custom PL/Python functions
7. Pivotal Cloud Foundry
Volume: 55 million tweets/day
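The JSON-parsing stage of this pipeline could be sketched as the body of a PL/Python function like the one below, shown here as plain Python. It is an assumption-laden illustration, not the demo's code: `extract_fields` is a hypothetical name, the choice of fields is a guess at what topic and sentiment analysis would need, and the sample tweet is made up.

```python
import json


def extract_fields(raw_tweet):
    """Parse one tweet's JSON and keep only the fields needed downstream."""
    try:
        tweet = json.loads(raw_tweet)
    except ValueError:
        return None  # malformed record: skip it rather than fail the whole batch
    if not isinstance(tweet, dict) or "text" not in tweet:
        return None
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "screen_name": (tweet.get("user") or {}).get("screen_name"),
        "text": tweet["text"],
    }


sample = '{"id": 1, "text": "loving the new release", "user": {"screen_name": "someone"}}'
parsed = extract_fields(sample)
```

Each tweet is parsed independently of every other, so this stage parallelizes across segments without any coordination.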
Topic and Sentiment Analysis Engine (Demo)
http://www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
Appendix
Pivotal Data Science Blogs
1. Scaling native (C++) apps on Pivotal MPP
2. Predicting commodity futures through Tweets
3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4. Using data science to predict TV viewer behavior
5. Twitter NLP: Scaling part-of-speech tagging
6. Distributed deep learning on MPP and Hadoop
7. Multi-variate time series forecasting
8. Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal

Contenu connexe

Tendances

The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library EMC
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with HadoopSangchul Song
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseLukas Vlcek
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...MLconf
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines Jim Dowling
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Quick Understanding of NoSQL
Quick Understanding of NoSQLQuick Understanding of NoSQL
Quick Understanding of NoSQLEdward Yoon
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApachePivotalOpenSourceHub
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkEvan Casey
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLDatabricks
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018TigerGraph
 

Tendances (20)

The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
 
Machine Learning with Hadoop
Machine Learning with HadoopMachine Learning with Hadoop
Machine Learning with Hadoop
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Quick Understanding of NoSQL
Quick Understanding of NoSQLQuick Understanding of NoSQL
Quick Understanding of NoSQL
 
Apache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to ApacheApache HAWQ and Apache MADlib: Journey to Apache
Apache HAWQ and Apache MADlib: Journey to Apache
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 

All thingspython@pivotal

  • 1. © Copyright 2013 Pivotal. All rights reserved. All things Python @ Pivotal (Data Science). Oct 15, 2015, POSH meetup. Srivatsan Ramanujam, Principal Data Scientist, Pivotal Labs, @being_bayesian. https://xkcd.com/353/ Joint work with Pivotal Data Science & MADlib team.
  • 2. About Me. Graduate School Software Engineer Analytics Natural Language Scientist Research Intern Principal Data Scientist, Data Science R&D Lead Machine Learning Engineer (Drug Discovery) https://www.linkedin.com/pub/srivatsan-ramanujam/7/91b/888
  • 3. Agenda
      • Pivotal Data Science – Introduction
      • Technology Stack
      • Python on the client
      • Python on our Big Data Platform (BDS)
          – Data Parallelism
          – Model Parallelism
      • Python on our Cloud Platform (PCF)
      • Putting it all together – demo!
  • 4. Pivotal Data Science – Introduction
  • 5. Pivotal Data Science
      Our Charter: Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).
      Our Goals: Expedite customer time-to-value and ROI, by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies. Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
      (Figure labels: Data Science, Data Engineering, App Dev)
  • 6. Pivotal Data Science Knowledge Development
  • 7. PIVOTAL DATA SCIENCE TEAM
      • Annika Jimenez – Global head of Data Science Services (Sr. Director, Audience and Advertising Analytics at Yahoo!, M.I.A. in International Management, UCSD)
      • Kaushik Das – Mathematical Modeling in Energy, Retail and Telco (Director of Analytics at M-Factor, M.S. in Mineral Engineering, UC Berkeley)
      • Michael Brand – Text, Speech and Video Research for Retail, Finance and Gaming (Chief Scientist at Verint Systems, M.S. in Applied Mathematics, Weizmann Institute)
      • Woo Jung – Bayesian Inference and Demand Analysis (Sr. Statistician at M-Factor, M.S. in Statistics, Stanford)
      • Noelle Sio – Digital Media Analytics and Mathematical Modeling (Sr. Analyst at eHarmony, Fox Interactive Media (Myspace), M.S. in Applied Mathematics, Cal Poly Pomona)
      • Rashmi Raghu – Computational Methods and Analysis (Ph.D. in Mechanical Engineering, Stanford)
      • Jarrod Vawdrey – Marketing Analytics & SAS (Analytics Consultant at Aspen Marketing, B.S. in Mathematics, Kennesaw State University)
      • Sarah Aerni – Genomics and Machine Learning (Ph.D. in Biomedical Informatics, Stanford)
      • Srivatsan Ramanujam – NLP and Text Mining (Natural Language Scientist at Sony, Salesforce.com, M.S. in Computer Sciences, UT Austin)
      • Niels Kasch – Text Analytics and NLP (Ph.D. in Computer Science, UMBC)
      • Regunathan Radhakrishnan – Machine Learning, Signal Processing, Multimedia Content Analysis, Fingerprinting & Watermarking (Research Staff at Dolby Laboratories, MERL, Ph.D. in Electrical Engineering, NYU-Poly, Brooklyn)
      • Cao Yi – Optimization and Statistical Data Mining (Sr. Marketing Analyst at Energy Market Company Singapore, Ph.D. in Operations Research, National University of Singapore)
      • Ian Huston – Numerical Modeling, Simulation, and Analysis (Ph.D. in Theoretical Cosmology, Queen Mary, University of London)
      • Michael Natusch – Director EMEA Data Science (Chief Analyst at Cumulus Analytics, Ph.D. in Theoretical Condensed Matter Physics, University of Cambridge)
      • Greg Whalen – Director APJ Data Science (VP, Global Development Center at Experian, M.S. in Computer Science, Columbia University)
      • Hulya Farinas – Optimization, Resource Allocation in Healthcare (Modeler at M-Factor, IBM, Ph.D. in Operations Research, University of Florida)
      • Derek Lin – Network Security, Fraud Detection, Speech and Language Processing (Principal Scientist at RSA, M.S. in Signal Processing, USC)
      • Kee Siong Ng – Statistical Modeling in Energy, Retail and Healthcare (Consulting Lead Data Scientist at Reliance, Ph.D. in Computer Science, Australian National University)
      • Jin Yu – Stochastic Optimization, Robust Statistics in Machine Learning, Computer Vision (Research Associate at U of Adelaide, Ph.D. in Machine Learning, Australian National University)
      • Gautam Muralidhar – PhD Biomed UT Austin, Image Processing, Signal Processing
      • Ailey Crow – PhD Bio-physics, UC Berkeley, Image Processing, Bio Med
      • Hong Ooi – Insurance and Finance Risk Modeling (Statistician at ANZ, Ph.D. in Statistics, Australian National University)
      • Mariann Micsinai – Next Generation Sequencing (Market Risk Management Associate at Lehman Brothers, Ph.D. in Computational Biology, NYU / Yale)
      • Victor Fang – Imaging and Graph Analytics, Machine Learning (Sr. Scientist at Riverain Medical, Ph.D. in Computer Sciences, University of Cincinnati)
      • Anirudh Kondaveeti – Trajectory Data Mining and Machine Learning (Ph.D. in Computing & Dec. Systems Eng, Arizona State University)
      • Alexander Kagoshima – Time Series, Statistics and Machine Learning (M.S. in Economics/Computer Science, TU Berlin)
      • Ronert Obst – Machine Learning, Bayesian Inference, Time Series (M.S. in Statistics, LMU Munich)
  • 8. Technology and Tools
  • 9. Data Science Toolkit (figure labels: Key Languages, Key Tools, Modeling Tools, Visualization Tools, Platform, MLlib, PL/X)
  • 10. Pipeline of a Data Science Driven App (figure labels: Data Feeds; Ingest, Filter, Enrich, Sink; SpringXD; Greenplum; Data Lake; MLlib, PL/X; Model Building, Model Tuning, Continuous Model Improvement; Business Levers; Apps)
  • 11. Python on the client
  • 12. Data Science Lab – Sample Timeline (figure: weeks 2, 4, 6, 8, 10, 12; labels: Data Review, Feature Creation, Optimization & Validation, Code QA & Scoring, Insights Presentation, Model and Code Handoff, Feature Review, Data Review, Knowledge Transfer, Model Development, Model Review, Phase 2, Phase 3, Phase 4, Model Building, Phase 5, Model Enablement)
  • 13. Data Science Storytelling
      • We primarily use Python on the client (laptop) for data exploration, visualization and data science storytelling.
      • Complex statistical models and data wrangling are run in the backend on our Big Data Suite (MPP databases like Greenplum and HAWQ).
      • We typically use a connector like psycopg2 to talk to the backend database and use a Jupyter notebook to document our analysis on a laptop.
  • 14. Python Distribution
      • We love Anaconda: Python with “batteries included”
          – Contains all the great libraries in the PyData stack that we often use for data science (numpy, scipy, sklearn, statsmodels, seaborn, matplotlib, nltk etc.)
      • The Conda package manager takes the pain out of Python package management (remember the dreaded “pip install numpy scipy matplotlib”?)
  • 15. Notebooks
      • Open source, interactive data science and scientific computing across over 40 programming languages.
      • Great for data science storytelling.
      • A living document: models and insights “don’t die in PowerPoint slides”.
      https://jupyter.org/ Data science lab templates
  • 16. Seaborn
      • Based on Matplotlib with the aesthetics of ggplot2 (thank you Michael Waskom!)
      • Intuitive interface, tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.
      http://stanford.edu/~mwaskom/software/seaborn/index.html
  • 17. What about machine learning? Source: the interwebs
  • 18. Machine Learning in Python: Scikit Learn http://scikit-learn.org/stable/
  • 19. Scikit Learn Cheat Sheet http://scikit-learn.org/stable/tutorial/machine_learning_map/ ‘Cheat’ with care
  • 20. Numerous other libraries: topic modeling for humans, PyMC
  • 21. Python in-database
  • 22. Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
      • For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++.
      • The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment.
      • plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python).
      (Figure labels: SQL, Master Host, Standby Master, Interconnect, Segment Hosts each holding several Segments)
  • 23. What exactly does PL/Python do?
      Input conversion (PostgreSQL type -> Python type):
          boolean -> bool
          smallint, int -> int
          bigint -> long (Python 2.x), int (Python 3.x)
          real, double -> float
          numeric -> decimal
          bytea -> str (Python 2.x), bytes (Python 3.x)
          array -> list
          record -> Python mapping (dict)
          NULL -> None
      Output conversion (PostgreSQL type <- Python return value):
          boolean: 0 and ‘’ are false
          bytea: retval -> str -> bytea
          record: retval can be a list, tuple or dict, but not a set
          everything else: retval is converted to a Python str and the constructor for the corresponding Postgres datatype is invoked
  • 24. User Defined Functions (UDFs) in PL/Python
      • Procedural languages need to be installed on each database used.
      • The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, it is like a SQL User Defined Function with Python inside.
          CREATE FUNCTION pymax (a integer, b integer)    -- SQL wrapper
            RETURNS integer
          AS $$
          if a > b:                                       # normal Python
              return a
          return b
          $$ LANGUAGE plpythonu;                          -- SQL wrapper
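Because the body of the UDF is ordinary Python, one way to sanity-check the logic before deploying it is to run the same body as a local function. The sketch below assumes nothing about the database: the local `pymax` function and the `PYMAX_SQL` string are illustrative names, and no connection is made.

```python
# Local stand-in for the body of the pymax UDF from the slide.
def pymax(a, b):
    if a > b:
        return a
    return b

# The same body wrapped in the SQL definition, kept as a string so it could
# later be sent to the database by whatever client is in use.
PYMAX_SQL = """
CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
if a > b:
    return a
return b
$$ LANGUAGE plpythonu;
"""

if __name__ == "__main__":
    print(pymax(3, 7))
```

Keeping the Python body and the SQL wrapper side by side makes it easier to notice when the two drift apart.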
  • 25. Returning Results
      • Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.)
      • Composite types can be returned by creating a composite type in the database:
          CREATE TYPE named_value AS (
            name  text,
            value integer
          );
      • Then you can return a list, tuple or dict (not a set) whose structure matches that composite type:
          CREATE FUNCTION make_pair (name text, value integer)
            RETURNS named_value
          AS $$
          return [ name, value ]
          # or alternatively, as tuple: return ( name, value )
          # or as dict: return { "name": name, "value": value }
          # or as an object with attributes .name and .value
          $$ LANGUAGE plpythonu;
      • For functions which return multiple rows, prefix “setof” before the return type.
      http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston
  • 26. Returning more results
      You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:
      Sequence:
          CREATE FUNCTION make_pair (name text)
            RETURNS SETOF named_value
          AS $$
          return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
          $$ LANGUAGE plpythonu;
      Generator:
          CREATE FUNCTION make_pair (name text)
            RETURNS SETOF named_value
          AS $$
          for i in range(3):
              yield (name, i)
          $$ LANGUAGE plpythonu;
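The two function bodies on this slide can be compared outside the database as plain Python. This is only a local sketch: the function names `make_pair_sequence` and `make_pair_generator` are made up for illustration, and in PL/Python each returned item would become one row of `named_value`.

```python
def make_pair_sequence(name):
    # Sequence form: every row is built before anything is returned.
    return ([name, 1], [name, 2], [name, 3])

def make_pair_generator(name):
    # Generator form: rows are produced lazily, one per iteration.
    for i in range(3):
        yield (name, i)

# Materialize both so that the rows can be inspected side by side.
rows_seq = [tuple(row) for row in make_pair_sequence("a")]
rows_gen = list(make_pair_generator("a"))
```

Note that the two bodies, as written on the slide, do not produce identical values: the sequence counts from 1 to 3, while `range(3)` counts from 0 to 2.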
  • 27. Accessing Packages
      • On Greenplum DB: packages must be installed on the individual segment nodes.
          – Can use the “parallel ssh” tool gpssh to install
          – Currently Greenplum DB ships with Python 2.6 (!)
      • Then just import as usual inside the UDF:
          CREATE FUNCTION make_pair (name text)
            RETURNS SETOF named_value
          AS $$
          import numpy as np
          return ((name, i) for i in np.arange(3))
          $$ LANGUAGE plpythonu;
      Anaconda PL/Python coming in GPDB 5.0
  • 28. UCI Auto MPG Dataset – A toy problem (Sample Data)
      • Sample Task: Aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
      • Solution: Build a Linear Regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm, with highway_mpg as the target label.
      • This is a data parallel task which can be executed in parallel by simply piggybacking on the MPP architecture. One segment can build a model for hatchbacks, another for sedans.
      http://archive.ics.uci.edu/ml/datasets/Auto+MPG
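The "one model per body style" idea can be sketched on a laptop before moving it in-database. The example below is an assumption-laden simplification: the rows are made up, only a single feature (horsepower) is used, and the fit is plain least squares rather than the ridge regression shown on the next slide.

```python
from collections import defaultdict

# Made-up rows for illustration only: (body_style, horsepower, highway_mpg).
rows = [
    ("hatchback", 70.0, 38.0),
    ("hatchback", 90.0, 34.0),
    ("hatchback", 110.0, 30.0),
    ("sedan", 100.0, 33.0),
    ("sedan", 120.0, 30.0),
    ("sedan", 160.0, 24.0),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Group rows by body style; each group is independent of the others,
# which is what makes the task data parallel.
groups = defaultdict(list)
for body_style, horsepower, mpg in rows:
    groups[body_style].append((horsepower, mpg))

# Fit one model per group and keep (intercept, slope) for each.
models = {}
for body_style, points in groups.items():
    xs, ys = zip(*points)
    models[body_style] = fit_line(xs, ys)
```

In the MPP setting, the grouping would be done by the database and each group's fit would run inside a PL/Python UDF on whichever segment holds that group.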
  • 29. Ridge Regression with scikit-learn on PL/Python (figure labels: Python, SQL wrapper, User Defined Function, User Defined Type, User Defined Aggregate)
  • 30. PL/Python + scikit-learn: Model Coefficients (figure labels: Physical machine on the cluster in which the regression model was built; Invoke UDF; Build Feature Vector; Choose Features; One model per body style)
  • 31. Model Parallelism
      • Data parallel computation via PL/Python only allows us to run ‘n’ models in parallel.
      • This works great when we are building one model for each value of the group-by column, but we need parallelized algorithms to be able to build a single model on all the available data.
      • For this, we use MADlib, an open source library of parallel in-database machine learning algorithms.
  • 32. MADlib: Scalable, in-database Machine Learning http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf
  • 33. Supported Platforms: PHD, HDP, other ODPi distros, GPDB, PostgreSQL. @MADlib_analytic
  • 34. Functions (Predictive Analytics Library, Aug 2015) @MADlib_analytic
      Supervised Learning
          Regression Models: Cox Proportional Hazards Regression, Elastic Net Regularization, Generalized Linear Models, Linear Regression, Logistic Regression, Marginal Effects, Multinomial Regression, Ordinal Regression, Robust Variance, Clustered Variance, Support Vector Machines
          Tree Methods: Decision Tree, Random Forest
          Other Methods: Conditional Random Field, Naïve Bayes
      Unsupervised Learning: Association Rules (Apriori), Clustering (K-means), Topic Modeling (LDA)
      Statistics
          Descriptive: Cardinality Estimators, Correlation, Summary
          Inferential: Hypothesis Tests
          Other Statistics: Probability Functions
      Other Modules: Conjugate Gradient, Linear Solvers, PMML Export, Random Sampling, Term Frequency for Text
      Time Series: ARIMA
      Data Types and Transformations: Array Operations, Dimensionality Reduction (PCA), Encoding Categorical Variables, Matrix Operations, Matrix Factorization (SVD, Low Rank), Norms and Distance Functions, Sparse Vectors
      Model Evaluation: Cross Validation
  • 35. Architecture @MADlib_analytic
      Layers: User Interface; High-level Iteration Layer (iteration controller, …); Functions for Inner Loops (implements ML logic); Low-level Abstraction Layer (array operations, C++ to DB type-bridge, …); RDBMS Built-in Functions; C API (Greenplum, PostgreSQL, HAWQ)
      Languages and libraries shown: Python, SQL, C++, Eigen
  • 36. Convex optimization framework @MADlib_analytic
      (Figure 6: The Archetypical Convex Function f(x) = x^2. Table columns: Application, Objective.)
      Each step has an analytical formulation that can be performed in parallel.
      Iteration can be expressed with:
      • WITH RECURSIVE
      • CREATE TEMP TABLE temp; INSERT INTO temp SELECT step(...) FROM ...; SELECT converged(...) FROM temp, ...; SELECT result(...) FROM temp
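The step / converged / result pattern on this slide can be illustrated on the archetypical convex function f(x) = x^2. The sketch below uses plain gradient descent as a stand-in and is not MADlib's implementation; the function names mirror the slide, while the learning rate, tolerance and starting point are arbitrary assumptions.

```python
def step(x, learning_rate=0.1):
    # One gradient step on f(x) = x**2, whose gradient is 2*x.
    return x - learning_rate * 2.0 * x

def converged(previous, current, tolerance=1e-8):
    # Stop when successive iterates are closer than the tolerance.
    return abs(current - previous) < tolerance

def result(x):
    # Package the final state, as the result(...) query would.
    return {"x": x, "objective": x ** 2}

def minimize(x0, max_iterations=1000):
    x = x0
    for _ in range(max_iterations):
        new_x = step(x)
        if converged(x, new_x):
            return result(new_x)
        x = new_x
    return result(x)

solution = minimize(5.0)
```

In the database, the loop lives in a driver and each call to `step` corresponds to one aggregate query over the distributed table.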
  • 37. What are our customers saying about us? @MADlib_analytic
      k-means clustering:
      • finding items that are similar within an n-dimensional space
      • Lloyd’s local-search heuristic works well in practice
      • Two fundamental steps:
          1. Assign each point to its closest centroid
          2. Move each centroid to the barycenter/mean of all points currently assigned to it
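The two steps of Lloyd's heuristic listed above can be written out as a short single-machine sketch. The data points, the initial centroids and the fixed iteration count are assumptions chosen for illustration; a real implementation would also test for convergence.

```python
def closest(point, centroids):
    # Index of the centroid with the smallest squared Euclidean distance.
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def lloyd(points, centroids, iterations=10):
    for _ in range(iterations):
        # Step 1: assign each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[closest(point, centroids)].append(point)
        # Step 2: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
```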
  • 38-50. What are our customers saying about us? @MADlib_analytic
  • 51. What are our customers saying about us? Terms shown: innova, leader, design, speed, graphics, improvement, bug, installation, download. @MADlib_analytic
  • 52. K-means: Parallel Computation (diagram labels: Segment 1, Segment 2, Master, Iteration end) @MADlib_analytic
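One plausible reading of this diagram, offered as an assumption since only its labels survive, is that each segment computes per-centroid partial sums and counts over its local rows, and the master merges them at the end of each iteration. The sketch below simulates that with plain lists standing in for segments; `segment_partials` and `master_merge` are hypothetical names.

```python
import math


def segment_partials(local_points, centroids):
    """Per segment: per-centroid coordinate sums and counts over local rows."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p in local_points:
        j = min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += p[d]
    return sums, counts


def master_merge(partials, centroids):
    """On the master, at iteration end: combine partials into new centroids."""
    dims = len(centroids[0])
    new_centroids = []
    for j, old in enumerate(centroids):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            new_centroids.append(tuple(old))  # keep an empty cluster in place
            continue
        new_centroids.append(tuple(
            sum(sums[j][d] for sums, _ in partials) / total
            for d in range(dims)))
    return new_centroids


# Two lists stand in for the rows stored on two segments.
segment_1 = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
segment_2 = [(2.0, 0.0), (2.0, 2.0), (12.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

partials = [segment_partials(seg, centroids) for seg in (segment_1, segment_2)]
new_centroids = master_merge(partials, centroids)
```

Only the small per-centroid sums and counts travel to the master, never the points themselves, which is what makes the update data-parallel.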
  • 53. Driver Functions in PL/Python
    • Every PL/Python UDF has access to a module called plpy, which allows you to execute SQL queries from within the UDF
    • Gives the ability to "drive" distributed computation
    • Annotations on the code example (image not preserved): "Will run and fetch data from segment nodes"; "Runs on the master only" (shown twice)
    • plpy.debug(msg), plpy.log(msg), plpy.info(msg), plpy.notice(msg), plpy.warning(msg) and plpy.error(msg) are useful utility functions for logging; note that plpy.error(msg) also raises an exception, which aborts the calling query if it is not caught
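The slide's code image is not preserved, so the following is only a sketch of what a driver body might look like. `plpy.execute` and `plpy.info` are real PL/Python calls; everything else is hypothetical: the function name `count_rows_driver`, the table name, and the `FakePlpy` stand-in that lets the sketch run outside a database. Inside a real UDF, `plpy` is provided by the database and is not passed as an argument.

```python
def count_rows_driver(plpy, table_name):
    """Body of a hypothetical PL/Python driver UDF.

    `plpy` is passed in explicitly only so that the sketch can run outside
    a database; inside a real UDF it is available automatically.
    """
    plpy.info("counting rows in %s" % table_name)
    # The table name is interpolated unquoted: acceptable for a sketch,
    # not for untrusted input.
    rows = plpy.execute("SELECT count(*) AS n FROM %s" % table_name)
    n = rows[0]["n"]  # results are indexed by row number, then column name
    plpy.info("found %d rows" % n)
    return n


class FakePlpy(object):
    """Stand-in for the real plpy module (an assumption for local runs)."""

    def __init__(self):
        self.messages = []
        self.queries = []

    def info(self, msg):
        self.messages.append(msg)

    def execute(self, query):
        self.queries.append(query)
        return [{"n": 3}]  # canned result instead of a real query


fake = FakePlpy()
row_count = count_rows_driver(fake, "training_data")
```

In the database, the query passed to `plpy.execute` is the part that fans out to the segments, while the surrounding Python runs on the master.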
  • 54. In-database parallel grid search: https://github.com/vatsan/gp_xgboost_gridsearch
    • XGBoost (eXtreme Gradient Boosting) is a popular library used in many prize-winning Kaggle contests.
    • Implemented in C++ with Python and R bindings
    • Supports multi-core
    • Implemented in-database parallel grid search for XGBoost using PL/Python
  • 55. In-database grid search - Approach https://github.com/vatsan/gp_xgboost_gridsearch
    Diagram labels:
    • Data pipeline: structured, unstructured data sources; refreshed data (incoming daily/weekly/monthly updates); feature gen. pipeline; training dataset (distributed table); model selection; scored results
    • Grid search: grid search params dict; grid params table (expanded); driver function (PL/Python): pickle and distribute
    • Master and segments: param-list-1 … param-list-n, each with training set (serialized), producing mdl-1 … mdl-n
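The "params dict, expanded grid, pickle and distribute" flow in this diagram can be sketched with the standard library alone. This is not the code from the linked repository: `expand_grid` and `train_one` are hypothetical helpers, the parameter names are merely illustrative, and the "training" step is a placeholder that fits no model.

```python
import itertools
import pickle


def expand_grid(params):
    """Expand {name: [values]} into one dict per combination."""
    names = sorted(params)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(params[n] for n in names))]


def train_one(param_set, serialized_training_set):
    """Placeholder for per-segment work: a real version would fit a model."""
    data = pickle.loads(serialized_training_set)
    return {"params": param_set, "n_rows": len(data["labels"])}


# Illustrative parameter grid (names and values are assumptions).
grid = {"max_depth": [3, 5], "learning_rate": [0.1, 0.3], "n_estimators": [100]}
expanded = expand_grid(grid)

# Serialize the training set once so it can travel with each parameter list.
training_set = {"features": [[0.0, 1.0], [1.0, 0.0]], "labels": [0, 1]}
payload = pickle.dumps(training_set)

# In the database each call would run on a different segment; here they
# simply run one after another.
models = [train_one(p, payload) for p in expanded]
```

Each expanded parameter set is independent of the others, so the search parallelizes over models rather than over rows of data.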
  • 56. Model Training and Scoring: XGBoost https://github.com/vatsan/gp_xgboost_gridsearch (panels: Training, Scoring)
  • 57. Python on Cloud Foundry. Ian Huston, Ronert Obst, Alex Kagoshima
  • 58. What is Cloud Foundry? http://cloudfoundry.org
    • Open Source Cloud Platform
    • Simple App Deployment, Scaling & Availability
    • No Cloud Provider Lock In
    @ianhuston
  • 59. How can CF help data scientists?
    • Jamie is a data scientist who has just finished some analysis. They want to put up a simple internal web app with JavaScript visualisations connected to internal data stores.
    • Sam is a data engineer who wants to set up a REST API to expose a production machine learning model as a service.
    • Alex is a data scientist who has an existing RShiny or Python app that they want to make available with multiple instances.
    @ianhuston
  • 60. Cloud Foundry is a Platform. You bring the apps, the rest is taken care of! Source: Albert Barron (IBM), https://www.linkedin.com/pulse/20140730172610-9679881-pizza-as-a-service @ianhuston
  • 61. Cloud Foundry Foundation: Industry Standard (member tiers shown: Gold, Silver) @ianhuston
  • 62. CF for data scientists & developers
    • Easily deploy your web app: cf push myapp
    • Scale up and out quickly: cf scale myapp -i 5 -m 1G
    • Create and bind services: cf bind-service myapp redis
    @ianhuston
  • 63. Python on Cloud Foundry
    • First class language (with Go, Java, Ruby, Node.js, PHP)
    • Automatic app type detection
      – Looks for requirements.txt or setup.py
    • Buildpack takes care of
      – Detecting that a Python app is being pushed
      – Installing Python interpreter
      – Installing packages in requirements.txt using pip
      – Starting web app as requested (e.g. python myapp.py)
    @ianhuston
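As a hedged illustration of the kind of file that `python myapp.py` might start, here is a minimal standard-library web app. It is not taken from the talk: the names `app`, `get_port` and `main` are hypothetical, and the one platform assumption is that Cloud Foundry tells the app which port to listen on through the `PORT` environment variable.

```python
import os
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Tiny WSGI application that returns a plain-text greeting."""
    body = b"Hello from Python on Cloud Foundry\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def get_port(default=8080):
    """Read the listening port from the environment, with a local fallback."""
    return int(os.environ.get("PORT", default))


def main():
    """Serve forever on all interfaces at the platform-assigned port."""
    make_server("0.0.0.0", get_port(), app).serve_forever()


# A deployed myapp.py would end with a call to main(); it is left uncalled
# here so that the sketch can be imported or tested without blocking.
```

Because this sketch uses only the standard library, its requirements.txt could be empty; the buildpack would still look for that file (or setup.py) to detect a Python app.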
  • 64. Official Python Buildpack
    • Great for simple pip-based requirements
    • Well tested and officially maintained
    • Covers both Python 2 and 3
    ✗ Suffers from the Python Packaging Problem:
      - Hard to build packages with C, C++ or Fortran extensions
      - Complicated local configuration of libraries and paths needed
      - Takes a long time to build main PyData packages from source
    @ianhuston
  • 65. Using conda for package management
    • http://conda.pydata.org
    • Benefits:
      – Uses precompiled binary packages
      – No fiddling with Fortran or C compilers and library paths
      – Known good combinations of main package versions
      – Really simple environment management (better than virtualenv)
      – Easy to run Python 2 and 3 side-by-side
    Go try it out if you haven't already!
    @ianhuston
  • 66. How to use the conda buildpack https://github.com/ihuston/python-conda-buildpack
    • Specify it as a custom buildpack when pushing the app, either in the manifest or with the -b command line option.
    • Export your current environment to an environment.yml file
    • Or write requirements.txt (pip) and conda_requirements.txt
    • Send me feedback & pull requests!
  • 67. Putting it all together: Topic and Sentiment Analysis Demo. Srivatsan Ramanujam, Greg Cobb, Vinson Chuong, Ofri Afek, Jarrod Vawdrey, Joelle Gernez
  • 68. Data Science + Agile = Quick Wins
    • The Team
      – 1 Data Scientist
      – 2 Agile Developers
      – 1 Designer (part-time)
      – 1 Project Manager (part-time)
    • Duration
      – 3 weeks!
  • 69. Text Analytics Pipeline (diagram labels)
    • Tweet Stream: 55 million tweets/day
    • Stored on Data Lake
    • (PXF/gpfdist) Loaded as external tables
    • Parallel parsing of JSON and extraction of fields using PL/Python
    • Topic Analysis through MADlib pLDA
    • Sentiment Analysis through custom PL/Python functions
    • Pivotal Cloud Foundry
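The JSON parsing stage could look roughly like the function below if it were used as the body of a PL/Python UDF applied to each row. This is a sketch, not the demo's code: `parse_tweet` is a hypothetical name, and the extracted fields (`id_str`, `created_at`, `user.screen_name`, `text`) are assumed to be the ones of interest.

```python
import json


def parse_tweet(raw_json):
    """Extract a few fields from one tweet; return None for malformed rows."""
    try:
        tweet = json.loads(raw_json)
    except ValueError:  # raised for text that is not valid JSON
        return None
    if not isinstance(tweet, dict):
        return None
    user = tweet.get("user")
    if not isinstance(user, dict):
        user = {}
    return {
        "id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "screen_name": user.get("screen_name"),
        "text": tweet.get("text"),
    }


sample = json.dumps({
    "id_str": "1",
    "created_at": "Thu Oct 15 18:00:00 +0000 2015",
    "user": {"screen_name": "example_user"},
    "text": "Trying out PL/Python",
})
parsed = parse_tweet(sample)
```

Each row is parsed independently of every other row, so the database can apply such a function on all segments at once.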
  • 70. Topic and Sentiment Analysis Engine (Demo) http://www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
  • 71. Appendix
  • 72. Pivotal Data Science Blogs
    1. Scaling native (C++) apps on Pivotal MPP
    2. Predicting commodity futures through Tweets
    3. A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
    4. Using data science to predict TV viewer behavior
    5. Twitter NLP: Scaling part-of-speech tagging
    6. Distributed deep learning on MPP and Hadoop
    7. Multi-variate time series forecasting
    8. Pivotal for good – Crisis Textline
    http://blog.pivotal.io/data-science-pivotal