Dataiku - 6/4/2013
Hi!
Florian Douetteau
Current Life:
CEO, Dataiku
Past Life:
Criteo
IsCool Entertainment
Exalead
Tweet about this: @dataiku @club_dsi_gun
Available on SlideShare:
http://www.slideshare.net/Dataiku
Goals Today:
• Concrete feedback on data analytics projects
• Data team in practice and key technologies
• Motivate you to start a data science project
Allergic to slide decks? Check:
https://github.com/dataiku
Dataiku
“Dataiku: an open source platform to help you build your data lab”
Collocation
Big Apple
Big Mama
Big Data
A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
“Big” Data in 1999
struct Element {
    Key key;
    void* stat_data;
};
….
C
Optimized Data structures
Perfect Hashing
HP-UX Servers – 4 GB RAM
100 GB data
Web Crawler – Socket reuse
HTTP 0.9
1 Month
• Hadoop
• Java / Pig / Hive / Scala / Clojure / …
• A dozen NoSQL data stores
• MPP databases
• Real-time
Big Data in 2013
1 Hour
Data Analytics: The Stakes
[Figure: data volume and business value of analytics projects by sector and year: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013). Volumes shown range from 1 TB to 1000 TB; values range from 10M $ to 1B $, with some marked “? $”.]
Meet Hal Alowne
Big Guys
• 10B$+ Revenue
• 100M+ customers
• 100+ Data Scientists
Hal Alowne
BI Manager
Dim’s Private Showroom
“Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!”
European E-commerce Web site
• 100M$ Revenue
• 1 Million customers
• 1 Data Analyst (Hal Himself)
Dim Sum
CEO & Founder
Dim’s Private Showroom
Big Data
Copy Cat
Project
Technology is complex
Hadoop
Ceph
Sphere
Cassandra
Spark
Scikit-Learn
Mahout
WEKA
MLBase
RapidMiner
Pandas
D3
Crossfilter
InfiniDB
LucidDB
Impala
Elastic Search
SOLR
MongoDB
Riak
Membase
Pig
Hive
Cascading
Talend
Machine Learning Mystery Land
Scalability Central
NoSQL-Slavia
SQL Columnar Republic
Visualization County
Data Clean Wasteland
Statistician Old House
R
Statistics and Machine Learning are complex!
• Try to understand it myself
(Some books you might want to read)
Plumbing is not complex
(but difficult)
Implicit User Data
(Views, Searches…)
Content Data
(Title, Categories, Price, …)
Explicit User Data
(Click, Buy, …)
User Information
(Location, Graph…)
500TB
50TB
1TB
200GB
Transformation
Matrix
Transformation
Predictor
Per User Stats
Per Content Stats
User Similarity
Rank Predictor
Content Similarity
MERIT = TIME + ROI
Targeted Newsletter
Recommender Systems
Adapted Product / Promotions
TIME: 6 MONTHS / ROI: APPS
• Build a lab in 6 months (rather than 18 months)
Find the right people (6 months?)
Choose the technology (6 months?)
Make it work (6 months?)
Build the lab (6 months)
• Deploy apps that actually deliver value
2013 2014
2013
• Train People
• Reuse working patterns
The Problem
It’s utterly complex and unreasonable
Our Goal:
Change his perspective on data science projects
(sorry, we couldn’t find a picture of Hal smiling)
• Why and for what?
◦ Business Theory
◦ Concrete Projects
• How: people and projects?
◦ How to start
◦ Dedicated team?
• What technologies?
◦ Machine Learning
◦ Architecture
Agenda
Embodiment of Knowledge
• Product success driven by quality!
• Margin / Customer Value / Traffic / Acquisition
Example: Launching an App on the App Store
• Margin for new customers might decline …
• Margin for new features might decline …
• Is your business really scalable?
you continue growing …
• Existing customer profiles
• Existing product assets
• Existing specific business model
• And your KNOWLEDGE of it
Where is your core business advantage?
Data Driven Business
What is your value?
Number of Customers
Customer Knowledge
Increases over time with:
- Time spent in your app
- User relationships (network effect)
- Partner / other apps interactions
Your Value
Data Impact
Not all businesses are equal
Online
Advertising
Telecommunication
Insurance
Ability
to Acquire
Margin
New
Services
Overall
Subscription
Market
Infrastructure
Driver
Selling Data
Risk / Price
Optimization
Subscription
Market
Subscription
Market
From Theory To Practice
• What should be free in the application?
• How to optimize conversion?
• How to plan and create a business model?
Main Pain Point:
How to plan and optimize pricing in the application?
Freemium Application
Example (Freemium Application)
Freemium Model Optimization
Business
Model
User
Cluster
Simulation
• Optimized pricing: margin +23%
• Business planning capability: 1 month → 9 months
• R + Python + InfiniDB
On-premise
1 TB dataset
5-week project
• Business Intelligence stack has scalability and maintenance issues
• Back office implements business rules that are challenged
• Existing infrastructure cannot cope with per-user information
Main Pain Point:
23 hours 52 minutes to compute Business Intelligence aggregates for one day.
Large E-Retailer
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people in machine learning and segmentation
• 1h12 to compute the aggregates, available every morning
• New home page personalization deployed in a few weeks
• Hadoop cluster (24 cores)
Google Compute Engine
Python + R + Vertica
12 TB dataset
6-week project
Large E-Retailer: The Datalab
• BI performed directly on production databases
• New reports required the CTO’s direct work for design and implementation
• Each photo tag manually validated and completed
Large Photo Bank
Main pain point:
No visibility on new users’ behaviours
• Implementing a cloud-based data lab to:
◦ centralize all available data, previously scattered between SQL databases and file systems,
◦ improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation,
◦ create content-based recommendation engines with keyword clustering and association.
Large Photo Bank: The Datalab
• R + Vertica + Hadoop
Amazon Web Services
8-week project
• Automated content filtering and recommendation
• Large set of manually crafted linguistic resources for interpreting user queries
• New brands, rare terms … hard to maintain
Large Online Directory
Main Pain Point:
Ability to maintain a very large ontological knowledge set, with more than 100k concepts
• Analyze clicks, rephrasing, and navigation to detect queries that require specific processing
• Gather web and external data to enrich the existing index
• Train the team in Hadoop and machine learning
• Continuous relevance monitoring
• Automated enrichment → 2x more productivity
• Hadoop (48 cores)
Python
On-premise
10-week project
Large Online Directory: The Data Lab
• Launch a marketing campaign
• After a few days, PREDICT based on behaviours:
◦ → Total ARPU for users after 3 months
◦ → Efficiency of a campaign
◦ Continue or not?
Example (E-Application)
Marketing Campaign Prediction
A very large community
Some mid-size communities
Lots of small clusters (mostly 2 players)
• Correlation
◦ between community size and engagement / virality
• Meaningful patterns
◦ 2 players / Family / Group
• What is the minimum number of friends to have in the application to get additional engagement?
Example (Social Gaming)
Social Gaming Communities
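One way to get at the community-size picture described on this slide is to treat friendships as an undirected graph and count connected components. The sketch below is only an illustration under that assumption; the function name `community_sizes` and the player names are hypothetical, and the deck does not say how the original analysis was done.

```python
from collections import defaultdict


def community_sizes(friendships):
    """Return the size of each connected component, largest first."""
    graph = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal of one community
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical friendship edges: one group of four and two pairs
edges = [
    ("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
    ("eve", "fay"),
    ("gus", "hal"),
]
sizes = community_sizes(edges)  # [4, 2, 2]
```

Joining these sizes with per-player engagement metrics would then allow the correlation between community size and engagement to be measured.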
• What do others do?
◦ Concrete Projects
• How: people and projects?
◦ How to start
◦ Dedicated team?
• What technologies?
◦ Machine Learning
◦ Architecture
Agenda
• A/B testing (or the equivalent for your business) is the first step to get into a “data-driven” mindset
• No advanced analytics required; some existing tools can help
• Changing a button color: +21%
(1) Be Data Driven
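A lift such as the +21% quoted for the button color is only meaningful once it is checked against random noise. The following is a minimal sketch of a two-proportion z-test, one common way to do that; the visitor and conversion counts are hypothetical, chosen only so that the lift matches the slide, and `ab_test` is an illustrative name.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing variant B against control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the hypothesis that A and B are identical
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"lift": p_b / p_a - 1, "z": z, "p_value": p_value}


# Hypothetical counts: 1000 of 10000 visitors convert on A, 1210 of 10000 on B
result = ab_test(1000, 10_000, 1210, 10_000)
# result["lift"] is about 0.21 and result["z"] is about 4.7
```

With smaller samples the same +21% lift can easily be noise, which is why the sample size matters as much as the lift itself.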
• People → Microsoft Excel
(2) Use Excel
• Data Team → Data Tools
(3) Build a team
The Business Expert who knows maths
The Analyst who reveals patterns
The Coding Guy who is enthusiastic
• data lab (n. m.): a small group with all the expertise, including business-minded people, machine learning knowledge, and the right technology
• A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)
TEAM + TOOLS = LAB
Organization
Targeted campaigns
Price optimization
Personalized
experience
Quality Assurance
Workload and yield
management
User Feedback (A/B Test)
Continuous improvement
Data
Product
Designer
Business
&
Marketing
Engineers
User
Voice
 | Short Term Focus | Long Term Drive
Business People | Optimize margin, … | Create new business revenue streams
Marketing People | Optimize click ratio | Brand awareness and impact
IT People | Make IT work | Clean and efficient architecture
Data People | Get stats right, make predictions | Create data-driven features
It’s just a new team
Super Intern
What is your ability to integrate a new smart guy and give him any data and any computing power he would need to enhance your product?
• What do others do?
◦ Concrete Projects
• How: people and projects?
◦ How to start
◦ Dedicated team?
• What technologies?
◦ Machine Learning
◦ Architecture
Agenda
An oversimplified view of big data architecture
Database Business Layer Application
(What it really looks like)
What kind of scale?
Database Business Layer Application
Or
Data Science App
Or ?
What kind of interaction ?
Database Business Layer Application
Data Science App
?
Classic Columnar Architecture
Some data
Some Place To Pour It In
Some Tool To Do Some Maths And Graphs
Classic Columnar Architecture
Lots of data
Some Place To Pour It In
Some Tool To Do Some Maths And Graphs
Web Tracking Logs
Raw Server Logs
Order / Product / Customer
Facebook Info
Open Data (Weather, Currency …)
The Corinthian Architecture
Lots of data
Some Place To Perform Rapid Calculations
Some Tools To Do Some Maths And Charts
Some Place To Pour It In And Clean / Prepare It
Data Storage And Preparation
Large Scale:
Hadoop Cluster
Cassandra
MPP SQL Columnar
Medium/Large Scale:
CouchBase
MongoDB
….
Selection Drivers
Volume
Scalability
Calculations
Classic Database
• PostgreSQL
• MySQL
• …
MPP SQL Database
• Vertica, Vectorwise, InfiniDB, GreenplumHD…
Hadoop New Databases
• Impala
…
Selection Drivers:
Speed ( Interactivity )
Expressivity
The Corinthian Architecture
Lots of data
Some Place To Perform Rapid Calculations
Some Tools To Do Some Maths And Charts
Some Place To Pour It In And Clean / Prepare It
Statistics
Cohorts
Regressions
Bar Charts For Marketing
Nice Infographics For Your Company Board
The Corinthian Architecture
Lots of data
Some Database To Perform Rapid Calculations
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
Statistical Tools
Open Source:
• IPython
• RStudio
Commercial
• RapidMiner
• SAS
• RevolutionR
Selection Drivers
Existing Knowhow
Scalability
What is a statistical tool?
• Interact with and explore data
• Some stats capabilities
• Some graph capabilities
Visualization Tools
Commercial:
• SpotFire
• Tableau
• QlikView
SAAS
• BIME
• ChartIO
• RevolutionR
HTML5 / AdHoc
• D3
• GraphViz
Selection Drivers
How Many Contributors /
Readers ?
Scalability
The “one database won’t do it all” problem
Lots of data
Some Database To Perform Rapid Calculations
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
JOIN / Aggregate
Rapid GROUP BY computations
Direct access to the computed results, to production, etc.
The Roman Social Forum
Lots of data
Some Database To Perform Rapid Calculations
And Some Database For Graphs
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
Graph
Databases
• Neo4J
• Titan
• OrientDB
• InfiniteGraph
Analytic / Visualization
• Gephi
Selection Drivers
Scalability
What Algorithms ?
Licensing Constraints
The Key Value Store
Lots of data
Some Database To Perform Rapid Calculations
And Some Database For Graphs
And Some Distributed Key Value Store
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
NoSQL
Search
• SOLR
• ElasticSearch
Document
• MongoDB
• CouchDB
KeyValue
• Redis
• HBase
…
Selection Drivers
Durability / Availability …
Performance
Ease of use and API
Indexing
Action requires Prediction
Lots of data
Some Database To Perform Rapid Calculations
And Some Database For Graphs
And Some Distributed Key Value Store
Some Tools To Do Some Maths
Some Other To Do Some Charts
Some Place To Pour It In And Clean / Prepare It
Draw a line → for the future
What are my real user groups?
Should I launch a discount offering or not?
To everybody or to specific users only?
The Medieval Fairy Land
Lots of data
Some Tools To Do Some Maths
Some Other To Do Some Charts and some MACHINE LEARNING
Some Place To Pour It In And Clean / Prepare It
Some Database To Perform Rapid Calculations
And Some Database For Graphs
And Some Distributed Key Value Store
Predictions
Java
• Mahout (Hadoop)
• WEKA
Python
• Scikit-Learn
• PyML
R
Commercial
• Kxen
• SAS
• SPSS…
Selection Drivers
Scalability
Black Box / White Box ?
Data Management Integration
Can be fun
• Exploratory Data Analysis
◦ Identifying and visualizing key patterns and correlations within the dataset
• Unsupervised Learning
◦ Creating groups of similar observations sharing the same patterns (aka clustering, segmentation)
• Supervised Learning
◦ Modeling a target variable using independent features (aka scoring, predictive modeling, classification)
• Time Series Forecasting
◦ Predicting a time-dependent variable using its own history, and sometimes other covariates
• Graph Analysis
◦ Analyzing relationships between a set of “nodes” linked by “edges”
• Association / Sequence Mining
◦ Identifying frequently associated items within transaction or event databases, sometimes ordered over time
• And many more…
Classes of Machine Learning Problems
Mapping ML to Business Questions
Class | Sample Business Questions
Exploratory Data Analysis | What does my dataset look like? What are the key correlations in my data?
Unsupervised Learning | Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
Supervised Learning | Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
Time Series Forecasting | What is my revenue forecast for next month? Given the weather forecast, can I also forecast my sales? Product sales forecast (for overbooking)
Graph Analysis | Can I identify influencers in my user community? Can I recommend new friends to my users?
Association & Sequence Mining | Which products are frequently bought together? What is the typical navigation path on my website?
Machine Learning Methods Detailed
Analytical Task | ML Task | Sample Algorithms | Shape of Dataset
Exploratory Data Analysis | Univariate Analysis | Distribution, frequencies, histogram, boxplots, fit tests... | N obs. (1 row per obs.) * P features
 | Bivariate Analysis | Scatterplots, correlations (Pearson, Spearman), GLM, Chi Square... | N obs. (1 row per obs.) * P features
 | Multivariate Analysis | Principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… | N obs. (1 row per obs.) * P features
“Oriented” Data Analysis | Unsupervised Learning | K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... | N obs. (1 row per obs.) * P features
 | Supervised Learning | Linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… | N obs. (1 row per obs.) * P features
 | Time Series Forecasting | ARMA, VARMAX, ARIMA… | Time series (rows: time period, columns: measures)
 | Graph Analysis | Centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… | Node and edge lists (+ attributes)
 | Associations & Sequences | Frequent itemsets, Apriori, market basket… | (Timestamped) events or transactions
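To make the “which products are frequently bought together?” question concrete, here is a small sketch that counts co-occurring item pairs across baskets and keeps those above a support threshold. It is a brute-force pair count rather than the Apriori algorithm itself, and the baskets, the threshold, and the name `frequent_pairs` are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs present in at least
    min_support (a fraction between 0 and 1) of the baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so that each pair has one canonical form
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical transactions
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
# {("bread", "butter"): 0.5, ("bread", "jam"): 0.5}
```

On real transaction volumes, Apriori-style pruning or a dedicated library would be needed, since the number of candidate pairs grows quadratically with basket size.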
• Cluster a dataset into K buckets by assigning each point to its “closest” cluster center
Unsupervised Method
K-Means
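The K-means idea can be sketched in a few lines of plain Python using Lloyd's algorithm: assign each point to its nearest center, move each center to the mean of its points, and repeat. This is an illustrative sketch, not a production implementation; taking the first k points as initial centers is a simplifying assumption made here so that the result is reproducible, and the sample points are invented.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; initial centers are simply the first k points."""
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest center
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, labels = kmeans(points, k=2)
# labels == [0, 1, 0, 0, 1, 1]
```

Real implementations usually pick initial centers more carefully (for example with random restarts), because the result depends on the starting point.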
• Predict the color of a point depending on the colors of its K closest neighbours
Supervised
K-Nearest-Neighbours
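As a rough illustration of the rule above, the sketch below predicts the color of a new point by majority vote among its K closest training points, using Euclidean distance. The training points, colors, and the name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points: (coordinates, color)
training = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((8.0, 8.0), "blue"), ((9.0, 8.5), "blue"), ((8.5, 9.5), "blue"),
]
color_near_origin = knn_predict(training, (2.0, 2.0))  # "red"
color_far_corner = knn_predict(training, (8.0, 9.0))   # "blue"
```

Sorting every training point for each query is fine for a toy dataset; larger datasets typically rely on spatial indexes or approximate search.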
• Find the most “significant” input variable and split value
• Split the dataset recursively
Supervised
Decision Tree
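The two steps above can be sketched as follows. “Most significant” is interpreted here as the split that most reduces Gini impurity, which is one common criterion but an assumption on our part; the data and all function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Split recursively; a leaf is the majority class."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical data: (age, visits per week) -> did the user buy?
rows = [(25, 1), (30, 2), (35, 6), (40, 7), (22, 5), (50, 1)]
labels = ["no", "no", "yes", "yes", "yes", "no"]
tree = build_tree(rows, labels)
# tree splits on feature 1 (visits per week) at threshold 2
```

On this toy data a single split on visits per week separates buyers from non-buyers, so both children become leaves immediately.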
Several Paths to Machine Learning
[Flowchart, starting from the analytical dataset and branching on yes/no questions:
• Exploratory Data Analysis (“I just want to explore”; “I’m looking variable by variable, or pairs”): correlation analysis, GLM, parametric and non-parametric statistical tests, data visualization; the questions “All my variables are numeric” and “I have a distance matrix” lead to PCA, CA, MDS…
• Unsupervised Learning (“I’m looking for clusters”): questions “I know how many groups to look for”, “Small dataset (<<1K)”, “Medium dataset (<<100K)”, “I can sample”; methods shown: HCA, partitioning (K-means…), GMM, DP GMM, K-means + Gap | Silhouette, 2-step clustering, Affinity Propagation, Mean Shift…
• Supervised Learning* (“I want to predict a variable”): question “I value interpretability”; methods shown: Generalized Linear Model, simple decision tree, Support Vector Machines, neural networks, K-Nearest Neighbors, ensembles (Random Forest, Gradient Boosted Trees), MARS, Generalized Additive Model]
* Methods generally working for both classification & regression
Questions?
• Take Away
◦ There are new ways to perform data analytics that are within your reach and can bring business value
• Some Additional Resources
◦ Open Source Projects
- Dataiku Cloud Transport Client: http://dctc.io
- Dataiku Web Tracker: https://github.com/dataiku/wt1
◦ Our Technical Blog
- http://www.dataiku.com/blog
Contenu connexe

Tendances

Tableau 7.0 prsentation
Tableau 7.0 prsentationTableau 7.0 prsentation
Tableau 7.0 prsentation
inam_slides
 

Tendances (20)

Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
Google Vertex AI
Google Vertex AIGoogle Vertex AI
Google Vertex AI
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
 
Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?Emerging Trends in Data Architecture – What’s the Next Big Thing?
Emerging Trends in Data Architecture – What’s the Next Big Thing?
 
Power BI On AIR - Melissa Coates: "What You Need to Know to Administer Power BI"
Power BI On AIR - Melissa Coates: "What You Need to Know to Administer Power BI"Power BI On AIR - Melissa Coates: "What You Need to Know to Administer Power BI"
Power BI On AIR - Melissa Coates: "What You Need to Know to Administer Power BI"
 
Tableau 7.0 prsentation
Tableau 7.0 prsentationTableau 7.0 prsentation
Tableau 7.0 prsentation
 
Big Query Basics
Big Query BasicsBig Query Basics
Big Query Basics
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Power BI Overview, Deployment and Governance
Power BI Overview, Deployment and GovernancePower BI Overview, Deployment and Governance
Power BI Overview, Deployment and Governance
 
Oracle Cloud Reference Architecture
Oracle Cloud Reference ArchitectureOracle Cloud Reference Architecture
Oracle Cloud Reference Architecture
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
8 Steps to Creating a Data Strategy
8 Steps to Creating a Data Strategy8 Steps to Creating a Data Strategy
8 Steps to Creating a Data Strategy
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
The Evolving Role of the Data Architect – What Does It Mean for Your Career?
The Evolving Role of the Data Architect – What Does It Mean for Your Career?The Evolving Role of the Data Architect – What Does It Mean for Your Career?
The Evolving Role of the Data Architect – What Does It Mean for Your Career?
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
 

Similaire à Dataiku - From Big Data To Machine Learning

Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for Fun
Dataiku
 

Similaire à Dataiku - From Big Data To Machine Learning (20)

Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
Conférence Laboratoire des Mondes Virtuels_Dataiku_Choix technologiques pour ...
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 
Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for Fun
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...
Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...
Data Architecture Strategies: Artificial Intelligence - Real-World Applicatio...
 
Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...
Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...
Data Architecture Strategies Webinar: Emerging Trends in Data Architecture – ...
 
Data analytics course archtype
Data analytics course archtypeData analytics course archtype
Data analytics course archtype
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
MVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost LabsMVP (Minimum Viable Product) Readiness | Boost Labs
MVP (Minimum Viable Product) Readiness | Boost Labs
 
Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?Lunch and Learn: You have the data, now what?
Lunch and Learn: You have the data, now what?
 
Building successful data science teams
Building successful data science teamsBuilding successful data science teams
Building successful data science teams
 
Success Through an Actionable Data Science Stack
Success Through an Actionable Data Science StackSuccess Through an Actionable Data Science Stack
Success Through an Actionable Data Science Stack
 
Big Data LA 2016: Backstage to a Data Driven Culture
Big Data LA 2016: Backstage to a Data Driven CultureBig Data LA 2016: Backstage to a Data Driven Culture
Big Data LA 2016: Backstage to a Data Driven Culture
 
Drinking from the Digital Data Fire Hose
Drinking from the Digital Data Fire HoseDrinking from the Digital Data Fire Hose
Drinking from the Digital Data Fire Hose
 
Data is not the new snake oil
Data is not the new snake oilData is not the new snake oil
Data is not the new snake oil
 
AI Orange Belt - Session 3
AI Orange Belt - Session 3AI Orange Belt - Session 3
AI Orange Belt - Session 3
 
First Steps on Big Data
First Steps on Big DataFirst Steps on Big Data
First Steps on Big Data
 
BIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantage
BIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantageBIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantage
BIG DATA IN BUSINESS Implement and use Big Data to your organization’s advantage
 
Your Company Cares About Open Source Sustainability, But Are You Measuring an...
Your Company Cares About Open Source Sustainability, But Are You Measuring an...Your Company Cares About Open Source Sustainability, But Are You Measuring an...
Your Company Cares About Open Source Sustainability, But Are You Measuring an...
 

Plus de Dataiku

Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
Dataiku
 

Plus de Dataiku (20)

Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...
 
Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
Applied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML modelApplied Data Science Course Part 1: Concepts & your first ML model
Applied Data Science Course Part 1: Concepts & your first ML model
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku


Dataiku - From Big Data To Machine Learning

  • 2. 6/4/2013Dataiku 2 Hi ! Current Life: CEO, Dataiku Tweet about this: @dataiku @club_dsi_gun Past Life: Criteo IsCool Entertainment Exalead Florian Douetteau Available on Slide Share http://www.slideshare.net/Dataiku Goals Today: • Concrete Feedback on Data Analytics Projects • Data Team in practice and Key technologies • Motivate you to start a data science project Slide deck allergic ? Check: https://github.com/dataiku
  • 3. Dataiku: “An open source platform to help you build your data lab”
  • 5. Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association. Examples: Big Apple, Big Mama, Big Data.
  • 6. “Big” Data in 1999. C: struct Element { Key key; void* stat_data; } …. Optimized data structures, perfect hashing. HP-UNIX servers, 4 GB RAM. 100 GB data. Web crawler, socket reuse, HTTP 0.9. 1 month.
  • 7. Big Data in 2013. Hadoop; Java / Pig / Hive / Scala / Clojure / …; a dozen NoSQL data stores; MPP databases; real-time. 1 hour.
  • 8. Data Analytics: The Stakes (chart of data volume and value by sector and year). Sectors: Web Search 1999, Logistics 2004, Banking CRM 2008, Web Search 2010, Social Gaming 2011, Online Advertising 2012, E-Commerce 2013. Data points (volume / value): 1 TB / ? $; 1 TB / 100M $; 1 TB / 1B $; 100 TB / ? $; 10 TB / 10M $; 1000 TB / 500M $; 50 TB / 1B $.
  • 9. Meet Hal Alowne. Hal Alowne, BI Manager, Dim’s Private Showroom. Dim Sum, CEO & Founder, Dim’s Private Showroom: “Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!” Big Guys: • 10B$+ revenue • 100M+ customers • 100+ data scientists. European e-commerce web site: • 100M$ revenue • 1 million customers • 1 data analyst (Hal himself). Big Data Copy Cat Project.
  • 10. Technology is complex. Tools: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, RapidMiner, Pandas, D3, Crossfilter, InfiniDB, LucidDB, Impala, ElasticSearch, SOLR, MongoDB, Riak, Membase, Pig, Hive, Cascading, Talend, R. Regions of the map: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House.
  • 11. Statistics and Machine Learning are complex! (Try to understand myself)
  • 12. (Some books you might want to read)
  • 13. Plumbing is not complex (but difficult). Inputs: Implicit User Data (Views, Searches…), Content Data (Title, Categories, Price, …), Explicit User Data (Click, Buy, …), User Information (Location, Graph…). Volumes: 500 TB, 50 TB, 1 TB, 200 GB. Processing steps: Transformation Matrix, Transformation Predictor, Per User Stats, Per Content Stats, User Similarity, Rank Predictor, Content Similarity.
  • 14. MERIT = TIME + ROI. TIME: 6 MONTHS. Build a lab in 6 months (rather than 18 months): find the right people (6 months?), choose the technology (6 months?), make it work (6 months?) versus build the lab (6 months). • Train people • Reuse working patterns. ROI: APPS. Deploy apps that actually deliver value (2013, 2014): targeted newsletter, recommender systems, adapted product / promotions.
  • 15. The Problem: It’s utterly complex and unreasonable
  • 16. Our Goal: Change his perspective on data science projects (sorry, we couldn’t find a picture of Hal smiling)
  • 17. Agenda. Why and for what? ◦ Business theory ◦ Concrete projects. How to organize people and projects? ◦ How to start ◦ Dedicated team? What technologies? ◦ Machine Learning ◦ Architecture
  • 19. Example: Launching an App on the App Store. Product success is driven by quality! Margin / Customer Value / Traffic / Acquisition.
  • 20. You continue growing… Margin for new customers might decline… Margin for new features might decline… Is your business really scalable?
  • 21. Where is your core business advantage? Existing customer profiles; existing product assets; existing specific business model; and your KNOWLEDGE of it.
  • 22. Data Driven Business: What is your value? Your value: Number of Customers, Customer Knowledge. Increases over time with: time spent in your app; user relationships (network effect); partner / other apps interactions.
  • 23. Data Impact: not all businesses are equal. Online Advertising, Telecommunication, Insurance. Ability to Acquire, Margin, New Services, Overall. Subscription Market, Infrastructure Driver, Selling Data, Risk / Price Optimization, Subscription Market, Subscription Market.
  • 24. From Theory To Practice
  • 25. Freemium Application. What should be free in the application? How to optimize conversion? How to plan and create a business model? Main pain point: how to plan and optimize pricing in the application?
  • 26. Example (Freemium Application): Freemium Model Optimization. Approach: business model, user clusters, simulation. Results: optimized pricing (margin +23%); business planning capability 1 month → 9 months. Setup: R + Python + InfiniDB, on-premise, 1 TB dataset, 5-week project.
  • 27. Large E-Retailer. The Business Intelligence stack has scalability and maintenance issues. The back office implements business rules that are challenged. Existing infrastructure cannot cope with per-user information. Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
  • 28. Large E-Retailer: The Datalab. Goals: • Relieve their current DWH and accelerate production of some aggregates/KPIs • Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc. • Train existing people in machine learning and segmentation. Results: 1h12 to compute the aggregates, available every morning; new home page personalization deployed in a few weeks. Setup: Hadoop cluster (24 cores), Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project.
  • 29. Large Photo Bank. BI performed directly on production databases. New reports required direct work by the CTO for design and implementation. Each photo tag manually validated and completed. Main pain point: no visibility on new users’ behaviours.
  • 30. Large Photo Bank: The Datalab. Implementing a cloud-based data lab to: • centralize all available data, previously scattered between SQL databases and file systems, • improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation, • create content-based recommendation engines with keyword clustering and association. Result: automated content filtering and recommendation. Setup: R + Vertica + Hadoop, Amazon Web Services, 8-week project.
  • 31. Large Online Directory. Large set of manually crafted linguistic resources for interpreting user queries. New brands, rare terms… hard to maintain. Main pain point: ability to maintain very large ontological knowledge sets, with more than 100k concepts.
  • 32. Large Online Directory: The Data Lab. Analyze clicks, rephrasing and navigation to detect queries that require specific processing. Gather web and external data to enrich the existing index. Train the team on Hadoop and Machine Learning. Results: continuous relevance monitoring; automated enrichment; 2x more productivity. Setup: Hadoop (48 cores), Python, on-premise, 10-week project.
  • 33. Example (E-Application): Marketing Campaign Prediction. Launch a marketing campaign. After a few days, PREDICT based on behaviours: ◦ total ARPU for users after 3 months ◦ efficiency of a campaign ◦ continue or not?
  • 34. Example (Social Gaming): Social Gaming Communities. Observed structure: a very large community, some mid-size communities, lots of small clusters (mostly 2 players). Correlation between community size and engagement / virality. Meaningful patterns: 2 players / Family / Group. What is the minimum number of friends to have in the application to get additional engagement?
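As a minimal sketch of how such communities could be measured (not the method used in the project), friendships can be treated as an undirected graph and community sizes read off its connected components. The `friendships` pairs and the player names are made up for illustration.

```python
from collections import Counter, defaultdict


def community_sizes(friendships):
    """Return connected-component sizes of an undirected friendship graph, largest first."""
    neighbours = defaultdict(set)
    for a, b in friendships:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    sizes = []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collects every player reachable from `start`.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            player = stack.pop()
            size += 1
            for friend in neighbours[player]:
                if friend not in seen:
                    seen.add(friend)
                    stack.append(friend)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical friendship pairs: one group of four and two pairs of players.
friendships = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
               ("eve", "fay"), ("gus", "hal")]
sizes = community_sizes(friendships)
print(sizes)           # community sizes, largest first
print(Counter(sizes))  # how many communities exist of each size
```

Joining these sizes with a per-player engagement metric would be one way to look at the correlation the slide mentions.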
  • 35. Agenda. What do others do? ◦ Concrete projects. How to organize people and projects? ◦ How to start ◦ Dedicated team? What technologies? ◦ Machine Learning ◦ Architecture
  • 37. (1) Be Data Driven. An A/B test (or the equivalent for your business) is the first step towards a “data-driven” mindset. No advanced analytics required; some existing tools can help. Changing the color of a button: +21%.
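A lift such as +21% only means something if it is unlikely to be noise. The sketch below shows one common check, a two-proportion z-test, using only the standard library. The visitor and conversion counts are hypothetical, chosen only so that the relative lift comes out at 21%; they are not taken from the deck.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two_sided_p_value) for variant B against control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value


# Hypothetical counts: 10,000 visitors per variant, 200 versus 242 conversions.
lift, z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=242, n_b=10_000)
print(f"lift={lift:+.1%}  z={z:.2f}  p={p_value:.4f}")
```

With much smaller samples the same 21% lift would not be distinguishable from chance, which is why the sample size matters as much as the lift itself.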
  • 38. (2) Use Excel. People; Microsoft Excel.
  • 39. (3) Build a team. Data team; data tools. The Business Expert who knows maths. The Analyst who reveals patterns. The Coding Guy who is enthusiastic.
  • 40. TEAM + TOOLS = LAB. data lab (n.): a small group with all the expertise, including business-minded people, machine learning knowledge and the right technology. A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…).
  • 41. Organization. Targeted campaigns, Price optimization, Personalized experience, Quality Assurance, Workload and yield management, User Feedback (A/B Test), Continuous improvement. Data Product Designer Business & Marketing Engineers User Voice
  • 42. It’s just a new team (short-term focus / long-term drive). Business People: optimize margin, … / create new business revenue streams. Marketing People: optimize click ratio / brand awareness and impact. IT People: make IT work / clean and efficient architecture. Data People: get stats right, make predictions / create data-driven features.
  • 43. Super Intern: What is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product?
  • 44. Agenda. What do others do? ◦ Concrete projects. How to organize people and projects? ◦ How to start ◦ Dedicated team? What technologies? ◦ Machine Learning ◦ Architecture
  • 45. An oversimplified view of big data architecture
  • 47. (What it really looks like)
  • 48. What kind of scale? Database, Business Layer, Application, or Data Science App, or ?
  • 49. What kind of interaction? Database, Business Layer, Application, Data Science App: ? ? ? ? ? ?
  • 50. Classic Columnar Architecture. Some data; some place to pour it in; some tool to do some maths and graphs.
  • 51. Classic Columnar Architecture. Lots of data: web tracking logs, raw server logs, order / product / customer, Facebook info, open data (weather, currency…); some place to pour it in; some tool to do some maths and graphs.
  • 52. The Corinthian Architecture. Lots of data; some place to pour it in and clean / prepare it; some place to perform rapid calculations; some tools to do some maths and charts.
  • 53. Data Storage And Preparation. Large scale: Hadoop cluster, Cassandra, MPP SQL columnar. Medium/large scale: CouchBase, MongoDB, …. Selection drivers: volume, scalability.
  • 54. Calculations. Classic Database • PostgreSQL • MySQL • …. MPP SQL Database • Vertica, Vectorwise, InfiniDB, GreenplumHD…. Hadoop New Databases • Impala … Selection drivers: speed (interactivity), expressivity.
  • 55. The Corinthian Architecture. Lots of data; some place to pour it in and clean / prepare it; some place to perform rapid calculations; some tools to do some maths and charts: statistics, cohorts, regressions, bar charts for marketing, nice infographics for your company board.
  • 56. The Corinthian Architecture. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations; some tools to do some maths; some others to do some charts.
  • 57. Statistical Tools. Open Source: • IPython • RStudio. Commercial: • RapidMiner • SAS • RevolutionR. Selection drivers: existing know-how, scalability.
  • 58. What is a statistical tool? Interact with and explore data; some stats capabilities; some graph capabilities.
  • 59. Visualization Tools. Commercial: • Spotfire • Tableau • QlikView. SaaS: • BIME • ChartIO • RevolutionR. HTML5 / ad hoc: • D3 • GraphViz. Selection drivers: how many contributors / readers? Scalability.
  • 60. The “one database won’t do it all” problem. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations; some tools to do some maths; some others to do some charts. JOIN / aggregate; rapid GROUP BY computations; direct access to the computed results from production; etc.
  • 61. The Roman Social Forum. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, and some database for graphs; some tools to do some maths; some others to do some charts.
  • 62. Graph. Databases: • Neo4J • Titan • OrientDB • InfiniteGraph. Analytics / visualization: • Gephi. Selection drivers: scalability, what algorithms?, licensing constraints.
  • 63. The Key Value Store. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, some database for graphs, and some distributed key-value store; some tools to do some maths; some others to do some charts.
  • 64. NoSQL. Search: • SOLR • ElasticSearch. Document: • MongoDB • CouchDB. Key-value: • Redis • HBase …. Selection drivers: durability / availability…, performance, ease of use and API, indexing.
  • 65. Action requires Prediction. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, some database for graphs, and some distributed key-value store; some tools to do some maths; some others to do some charts. Draw a line for the future: What are my real user groups? Should I launch a discount offering or not? To everybody or to specific users only?
  • 66. The Medieval Fairy Land. Lots of data; some place to pour it in and clean / prepare it; some database to perform rapid calculations, some database for graphs, and some distributed key-value store; some tools to do some maths; some others to do some charts; and some MACHINE LEARNING.
  • 67. Predictions. Java: • Mahout (Hadoop) • WEKA. Python: • Scikit-Learn • PyML. R. Commercial: • Kxen • SAS • SPSS…. Selection drivers: scalability, black box / white box?, data management integration.
  • 69. Classes of Machine Learning Problems. Exploratory Data Analysis ◦ Identifying and visualizing key patterns and correlations within the dataset. Unsupervised Learning ◦ Creating groups of similar observations sharing the same patterns (aka Clustering, Segmentation). Supervised Learning ◦ Modeling a variable using independent features (aka Scoring, Predictive Modeling, Classification). Time Series Forecasting ◦ Predicting a time-dependent variable using its own history, and sometimes other covariates (variables). Graph Analysis ◦ Analyzing relationships between a set of “nodes”, linked by “edges”. Associations / Sequences Mining ◦ Identifying frequently associated items within transaction / event databases, sometimes ordered over time. And many more…
  • 70. Mapping ML to Business Questions (class: sample business questions)
      ◦ Exploratory Data Analysis: What does my dataset look like? What are the key correlations in my data?
      ◦ Unsupervised Learning: Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
      ◦ Supervised Learning: Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
      ◦ Time Series Forecasting: What is my revenue forecast for next month? Given the weather forecast, can I also forecast my sales? Product sales forecast (for overbooking).
      ◦ Graph Analysis: Can I identify influencers in my user community? Can I recommend new friends to my users?
      ◦ Association & Sequences Mining: Which products are frequently bought together? What is the typical navigation path on my website?
  • 71. Machine Learning Methods Detailed (analytical task, ML task: sample algorithms; shape of dataset)
      ◦ Exploratory Data Analysis, Univariate Analysis: distribution, frequencies, histogram, boxplots, fit tests...; N obs. (1 row per obs.) * P features
      ◦ Exploratory Data Analysis, Bivariate Analysis: scatterplots, correlations (Pearson, Spearman), GLM, Chi Square...; N obs. (1 row per obs.) * P features
      ◦ Exploratory Data Analysis, Multivariate Analysis: principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis…; N obs. (1 row per obs.) * P features
      ◦ “Oriented” Data Analysis, Unsupervised Learning: K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering...; N obs. (1 row per obs.) * P features
      ◦ “Oriented” Data Analysis, Supervised Learning: linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests…; N obs. (1 row per obs.) * P features
      ◦ “Oriented” Data Analysis, Time Series Forecasting: ARMA, VARMAX, ARIMA…; time series (rows: time period, columns: measures)
      ◦ “Oriented” Data Analysis, Graph Analysis: centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)…; node and edge lists (+ attributes)
      ◦ “Oriented” Data Analysis, Associations & Sequences: frequent itemsets, Apriori, market basket…; (timestamped) events or transactions
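The “Associations & Sequences” row can be illustrated with a brute-force count of item pairs over transactions. This is only a sketch: it counts pairs directly rather than applying Apriori-style pruning, and the baskets and the `min_support` threshold are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs present in at least min_support of the baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket, in a stable order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


# Hypothetical transactions.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

The input has the shape given in the table: one transaction per row, with no fixed set of columns.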
  • 72. Unsupervised Method: K-Means. Cluster a dataset into K buckets by assigning each point to its “closest” cluster centre.
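A minimal pure-Python sketch of the K-means loop (Lloyd's algorithm) on 2-D points. The sample points are made up, and seeding the centroids with the first `k` points is a simplifying assumption; libraries such as Scikit-Learn use smarter initialisation.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm. Returns (centroids, labels) after a fixed number of iterations."""
    # Naive initialisation (an assumption for this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical data: two visibly separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the initial centroids, which is why practical implementations run several random restarts and keep the best one.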
  • 73. Supervised: K-Nearest-Neighbours. Predict the color of a point depending on the colors of its K closest neighbours.
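The same idea as a short sketch: a majority vote among the K closest training points. The coloured points and the default `k=3` are illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of ((x, y), colour) pairs. Returns the majority colour of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(colour for _, colour in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points.
train = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((8, 8), "blue"), ((9, 8), "blue"), ((8, 9), "blue"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 7)))
```

There is no training phase: every prediction scans the whole training set, which is simple but costly on large datasets.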
  • 74. Supervised: Decision Tree. Find the most “significant” input variable and split value. Split the dataset recursively.
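One level of that recursion can be sketched as a search for the best single split. Gini impurity is used here as the measure of “significance”, which is one common choice and an assumption of this sketch; the toy rows (age, newsletter flag) and labels are invented.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(value) / n) ** 2 for value in set(labels))


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must separate the data into two non-empty groups
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical rows: (age, subscribed_to_newsletter), and whether the user bought.
rows = [(25, 1), (32, 0), (47, 1), (51, 0), (62, 1), (23, 0)]
labels = ["no", "no", "yes", "yes", "yes", "no"]
print(best_split(rows, labels))
```

A full tree would call `best_split` again on the left and right subsets until the groups are pure or a depth limit is reached.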
  • 75. Several Paths to Machine Learning (decision flowchart starting from an analytical dataset; branches are Yes / No questions). Exploratory Data Analysis (“I just want to explore”; “I’m looking variable by variable, or pairs”; “All my variables are numeric”; “I have a distance matrix”): data viz, correlation analysis, GLM, parametric and non-parametric statistical tests, PCA, CA, MDS. Unsupervised Learning (“I’m looking for clusters”; “I know how many groups to look for”; “Small dataset (<<1K)”; “Medium dataset (<<100K)”; “I can sample”): HCA, partitioning (K-means…), GMM, DP GMM, K-means + Gap | Silhouette | …, 2-steps clustering, Affinity Propagation, Mean Shift…. Supervised Learning* (“I want to predict a variable”; “I value interpretability”: Yes / Not only): Generalized Linear Model, Simple Decision Tree, Support Vector Machines, Neural Networks, K-Nearest Neighbors, Ensembles (Random Forest, Gradient Boosted Trees), MARS, Generalized Additive Model. * Methods generally working for both classification & regression.
  • 76. Questions? Take away ◦ There are new ways to perform data analytics that are within your reach and can bring business value. Some additional resources ◦ Open source projects: Dataiku Cloud Transport Client http://dctc.io, Dataiku Web Tracker https://github.com/dataiku/wt1 ◦ Our technical blog: http://www.dataiku.com/blog