Vishal Patel
January 2018
Exploring The Data Science Process
Richmond Data Science Community
• Vishal Patel
• Founder of DERIVE, LLC
• Data Science services
• Automated advanced analytics products
• MS in Computer Science, and MS in Decision Sciences
www.derive.io
Data
Science
Machine Learning
Deep Learning
Data Science
Gartner's Hype Cycle
[Figure: Expectations (vertical axis) vs. Time (horizontal axis). Phases: Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
BUILD
A MACHINE LEARNING MODEL
IN JUST THREE
QUICK AND EASY STEPS
USING […]!!!
– Most tutorials
1
How to Become a Data Scientist?
How to Draw
An Owl
50% of analytic projects fail.
– Gartner, 2015
2
https://xkcd.com/1838/
Data + Machine Learning = Profit
On September 21, 2009, the grand prize of US$1,000,000 was given to the BellKor’s
Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%.
“[T]he additional accuracy gains that we measured did not seem
to justify the engineering effort needed to bring them into a
production environment.”
Analytic projects fail because…
…they aren’t completed within budget or on schedule,
or because they fail to deliver the features and benefits
that are optimistically agreed on at their outset.
How to Avoid Failure?
1 Build with Organizational Buy-in
2 Build with End In Mind
3 Build with a Structured Approach
Data
Data
Science
Value
Business Question
Data
Science
The Blind Men and the Elephant
It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.
And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right
And all were in the wrong!
– John Godfrey Saxe
Data Science
[Figure: overlap of STATISTICS, BUSINESS, and COMPUTER SCIENCE]
The Data Science Process
1 Business Understanding
2 Data Preparation
3 Data Munging
4 Model Training
5 Model Evaluation
6 Model Deployment
7 Model Tracking
Business
Understanding
Far better
an approximate answer to the right question
than
an exact answer to the wrong question.
– John Tukey
Business
Understanding
1
2
3
DETERMINE
UNDERSTAND
MAP
Business Understanding: 1 DETERMINE
What does the client want to achieve?
Primary Objective
o Reduce attrition
o Customized targeting
o Plan future media spend
o Prevent fraud
o Recommend Products
Business Understanding: 2 UNDERSTAND
o Understand success criteria.
o Specific, measurable, time-bound
o List assumptions, constraints, and important factors.
o Identify secondary or competing objectives.
o Study existing solutions (if any).
Business Understanding: 3 MAP
o State the project objective(s) in technical terms.
o Describe how the data science project will help solve
the business problem.
o Explore successful scenarios.
Business Objective → Technical Objective
Business Understanding
OBJECTIVE | TECHNIQUE | EXAMPLES
Predict values | Regression | Linear regression, Bayesian regression, Decision Trees
Predict categories | Classification | Logistic regression, SVM, Decision Trees
Predict preference | Recommender System | Collaborative / Content-based filtering
Discover groups | Clustering | k-means, Hierarchical clustering
Identify unusual data points | Anomaly Detection | k-NN, One-class SVM
… | … | …
Business
Understanding
If all you have is a hammer
then everything looks like a nail.
Business
Understanding
o Primary Objective: Prevent attrition → Increase subscription renewals
o Competing Objective: High value customers are also targeted for up-sell
o Constraints: Avoid targeting customers too close to their contract expiration
o Success Criteria: Current renewal rate = 65% → Improve by 8% within the next quarter
o Existing Solution: Business-rule-based targeting
o Data Science Objective: Build a binary classification model to identify customers who are
not likely to renew their subscriptions at least three months in advance of their contract
expiration.
o Success Scenario: The model correctly identifies 80% of the future attritors; the
promotional campaign targets all likely attritors, and successfully converts 20% of them
into non-attritors.
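A rough back-of-the-envelope check of the success scenario above. This is only a sketch: it assumes (the slides do not say so) that every customer flagged by the model is a true attritor who would otherwise not renew, and that false positives add no renewals. The variable names are illustrative.

```python
# Figures taken from the example above.
current_renewal_rate = 0.65   # success criteria: current renewal rate
recall_of_attritors = 0.80    # model identifies 80% of future attritors
campaign_conversion = 0.20    # campaign converts 20% of targeted attritors

attrition_rate = 1.0 - current_renewal_rate

# Share of all customers moved from "attritor" to "renewed".
saved = attrition_rate * recall_of_attritors * campaign_conversion

new_renewal_rate = current_renewal_rate + saved
relative_improvement = saved / current_renewal_rate

print(f"Renewal rate: {current_renewal_rate:.1%} -> {new_renewal_rate:.1%}")
print(f"Gain: {saved * 100:.1f} points ({relative_improvement:.1%} relative)")
```

Under these assumptions the scenario adds about 5.6 percentage points, which clears an 8% target only if that target is read as relative to the current 65%. The slide does not say which reading is intended, so this is worth confirming with the client.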
Titanic at Southampton docks, prior to departure
Business
Understanding
o Duration
o Inventory of resources
o Tools and techniques
o Risks and contingencies
o Costs and benefits
o Milestones
The thought that disaster is impossible
often leads to an unthinkable disaster.
– Gerald Weinberg
Project Plan
Data
Preparation
1 IDENTIFY
2 COLLECT
3 ASSESS
4 VECTORIZE
Data Preparation: 1 IDENTIFY
o Data sources, formats
o Databases, streaming APIs, logs, Excel files, websites, etc.
o Entity Relationship Diagram (ERD)
o Identify additional data sources.
o Demographics data appends,
o Geographical data,
o Census data, etc.
o Identify relevant data.
o Record unavailable data.
o How much history is available, and how much of it should be used?
Data Preparation: 2 COLLECT
o Access or acquire all relevant data in a central location
o Quality control checks and tests
o File formats, delimiters
o Number of records, columns
o Primary keys
Data Preparation: 3 ASSESS
First look at the data
o Get familiar with the data.
o Study seasonality.
o Monthly/weekly/daily patterns
o Unexplained gaps or spikes
o Detect mistakes.
o Extreme or outlier values
o Unusual values
o Special missing values
o Check assumptions.
o Review distributions.
Trust, but verify.
Happy families are all alike;
every unhappy family is unhappy in its own way.
– Leo Tolstoy
Tidy datasets are all alike;
every messy dataset is messy in its own way.
– Hadley Wickham
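Data Preparation: 4 VECTORIZE
GOAL: Create the Analysis Dataset
$$
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix},
\qquad
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1j} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2j} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nj}
\end{bmatrix}
$$
y: Outcome, Target / Labels, Dependent Variable
X: Inputs, Features / Attributes, Independent Variables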
Target Definition
o Churn = 90 days of consecutive inactivity (for a pre-paid telecom customer)
o What’s inactivity?
o Incoming and outgoing calls
o Data usage
o Incoming text
o Promotional texts
o Voicemail usage
o Call forwarding
o Etc.
o Customers may change their device or phone number.
o Churn at the individual (person) level, or at the device (phone) level?
o Customers may return (become active again) after 90 days of inactivity?
o Prediction window
o Predict 90 days of consecutive inactivity?
o Would 10 days of consecutive inactivity suffice?
o How many customers return after x days of inactivity?
o Fraud, Involuntary churn
o Etc.
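The churn definition above can be sketched as a small labelling function. This is a minimal illustration, not the author's implementation: which events count as activity (calls, data, texts, and so on) is exactly the business decision the questions above raise, so the function simply takes a list of dates already judged to be qualifying activity. The function names and the 90-day default are assumptions for the example.

```python
from datetime import date, timedelta

def longest_inactive_gap(activity_dates, window_start, window_end):
    """Longest run of consecutive days without activity inside the window."""
    days = sorted(d for d in set(activity_dates)
                  if window_start <= d <= window_end)
    if not days:
        # No activity at all: the whole window is one inactive run.
        return (window_end - window_start).days + 1
    # Inactive days before the first activity in the window.
    longest = (days[0] - window_start).days
    # Inactive days strictly between consecutive activity days.
    for previous, current in zip(days, days[1:]):
        longest = max(longest, (current - previous).days - 1)
    # Inactive days after the last activity.
    return max(longest, (window_end - days[-1]).days)

def is_churned(activity_dates, window_start, window_end, threshold_days=90):
    """True if the customer shows at least threshold_days of consecutive inactivity."""
    gap = longest_inactive_gap(activity_dates, window_start, window_end)
    return gap >= threshold_days

window = (date(2017, 1, 1), date(2017, 6, 30))
lapsed = [date(2017, 1, 5), date(2017, 1, 20)]
regular = [date(2017, 1, 1) + timedelta(days=30 * k) for k in range(7)]
print(is_churned(lapsed, *window), is_churned(regular, *window))
```

Lowering `threshold_days` (for example to 10) is one way to explore the "would 10 days suffice?" question: compare how many customers flagged at the shorter threshold later return.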
Accurate but not Precise
Modeling Sample
o Historical trends and seasonality
o Are there certain timeframes that should be discarded?
o The model should be generalizable.
o Eligible, relevant population
o Must align with the business goals
o Eligible, relevant markets
o Must align with the business goals
o E.g., within a certain drive-time distance
o Outdated products or events
Selection Bias
Abraham Wald's Work on Aircraft Survivability
Journal of the American Statistical Association Vol. 79, No. 386 (Jun., 1984)
Information Leakage
… May 2017 | Jun | Jul | Aug | Sep || Oct | Nov | Dec
OBSERVATION WINDOW (up to the end of Sep): use all available information (“leading indicators”) as of the end of Sep…
PREDICTION WINDOW (after Sep): …to predict whether customers will [do something]
o The leading indicators must be calculated from the timeframe leading up to the event; they must not overlap with the prediction window.
o Beware of proxy events, e.g., future bookings.
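One way to keep the two windows apart is to split each customer's events at a single cutoff date before computing anything. The sketch below assumes a cutoff at the end of September 2017 (from the timeline) and a prediction window ending on 31 December 2017 (an assumption; the timeline only shows it running through Dec). The event type `"purchase"` stands in for the slide's "[do something]" and is hypothetical.

```python
from datetime import date

CUTOFF = date(2017, 9, 30)            # end of the observation window
PREDICTION_END = date(2017, 12, 31)   # assumed end of the prediction window

def build_row(events):
    """events: list of (event_date, event_type) tuples for one customer."""
    # Features may only use events on or before the cutoff.
    observed = [e for e in events if e[0] <= CUTOFF]
    # The label may only use events after the cutoff.
    future = [e for e in events if CUTOFF < e[0] <= PREDICTION_END]

    last_seen = max((d for d, _ in observed), default=None)
    features = {
        "n_events": len(observed),
        "days_since_last_event":
            (CUTOFF - last_seen).days if last_seen else None,
    }
    label = int(any(kind == "purchase" for _, kind in future))
    return features, label

events = [
    (date(2017, 6, 1), "purchase"),
    (date(2017, 9, 20), "visit"),
    (date(2017, 11, 5), "purchase"),
]
print(build_row(events))
```

Proxy events such as future bookings need the same treatment: if a record is created before the cutoff but describes something inside the prediction window, it should be excluded from the features.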
Data Aggregation
o Attribute creation
o Derived attributes: Household income / Number of adults = Income per adult
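o Brainstorm with team members (both technical and non-technical)
$$
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1j} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2j} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nj}
\end{bmatrix}
$$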
Data Aggregation
Transaction log (one row per purchase):
CUSTOMER ID | PURCHASE DATE
1001 | 02-12-2015:05:20:39
1001 | 05-13-2015:12:18:09
1001 | 12-20-2016:00:15:59
1002 | 01-19-2014:04:28:54
1003 | 01-12-2015:09:20:36
1003 | 05-31-2015:10:10:02
… | …
Attributes derived per customer:
1. Number of transactions (Frequency)
2. Days since the last transaction (Recency)
3. Days since the earliest transaction (Tenure)
4. Avg. days between transactions
5. # of transactions during weekends
6. % of transactions during weekends
7. # of transactions by day-part (breakfast, lunch, etc.)
8. % of transactions by day-part
9. Days since last transaction / Avg. days between transactions
10. …
Aggregated dataset (one row per customer):
CUSTOMER ID | $x_1$ | $x_2$ | … | $x_j$
1001 | … | … | … | …
1002 | … | … | … | …
1003 | … | … | … | …
… | … | … | … | …
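The aggregation above can be sketched with the standard library. The rows are the ones shown in the transaction log; the reference date `AS_OF` is a hypothetical choice (the slides do not give one), and the timestamp format is read as MM-DD-YYYY:HH:MM:SS based on the sample values. Only a few of the listed attributes are computed.

```python
from collections import defaultdict
from datetime import date, datetime

RAW = [
    ("1001", "02-12-2015:05:20:39"),
    ("1001", "05-13-2015:12:18:09"),
    ("1001", "12-20-2016:00:15:59"),
    ("1002", "01-19-2014:04:28:54"),
    ("1003", "01-12-2015:09:20:36"),
    ("1003", "05-31-2015:10:10:02"),
]
AS_OF = date(2017, 1, 1)  # hypothetical "today" for recency and tenure

def aggregate(rows, as_of):
    """Collapse a transaction log into one feature row per customer."""
    by_customer = defaultdict(list)
    for customer_id, stamp in rows:
        parsed = datetime.strptime(stamp, "%m-%d-%Y:%H:%M:%S")
        by_customer[customer_id].append(parsed.date())

    features = {}
    for customer_id, days in by_customer.items():
        days.sort()
        n = len(days)
        recency = (as_of - days[-1]).days
        # Average gap is undefined for a single transaction.
        avg_gap = (days[-1] - days[0]).days / (n - 1) if n > 1 else None
        weekend = sum(1 for d in days if d.weekday() >= 5)  # Sat=5, Sun=6
        features[customer_id] = {
            "frequency": n,
            "recency_days": recency,
            "tenure_days": (as_of - days[0]).days,
            "avg_days_between": avg_gap,
            "weekend_count": weekend,
            "weekend_share": weekend / n,
            "recency_ratio": recency / avg_gap if avg_gap else None,
        }
    return features

features = aggregate(RAW, AS_OF)
for customer_id, row in features.items():
    print(customer_id, row)
```

In a leakage-safe setup, `AS_OF` would be the end of the observation window, and only transactions on or before it would be passed in.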
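Data Preparation
OUTPUT: The Analysis Dataset
$$
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix},
\qquad
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1j} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2j} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3j} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nj}
\end{bmatrix}
$$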
Time Spent: Data Munging 80% | Model Building 20%
Give me six hours to chop down a tree
and I will spend the first four sharpening the axe.
– Anonymous
Data
Munging
o Descriptive statistics
o Review with the client
o Correlation analysis
o Review with the client
o Watch out for data leakage
o Impute missing values
o Trim extreme values
o Process categorical attributes
o Transformations (square, log, etc.)
o Binning / variable smoothing
o Multicollinearity
o Reduce redundancy
o Create additional features
o Interactions
o Normalization (scaling)
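A few of the steps in this list (imputation, trimming, transformation, scaling) can be sketched on a single numeric attribute. The values, the cap of 200,000, and the choice of median imputation and min-max scaling are all assumptions made for illustration; the slides do not prescribe specific methods.

```python
import math
from statistics import median

def impute_median(values):
    """Replace missing values (None) with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def trim(values, lower, upper):
    """Clip extreme values to the [lower, upper] range."""
    return [min(max(v, lower), upper) for v in values]

def min_max_scale(values):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical household incomes with one missing and one extreme value.
income = [42_000, None, 58_000, 61_000, 1_250_000, 39_000]

imputed = impute_median(income)
trimmed = trim(imputed, 0, 200_000)
logged = [math.log1p(v) for v in trimmed]   # log transformation
scaled = min_max_scale(logged)

print(imputed)
print(trimmed)
print([round(v, 3) for v in scaled])
```

To stay consistent with the data leakage warning above, the fill value, the caps, and the scaling range would be computed on the training data only and then reused unchanged when scoring new data.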
Machine learning experts display
cleaned data samples
in preparation for modeling.
Annibale Carracci, c.1600
via @vboykis
Data Munging
Non-graphical, univariate
o Categorical: Tabulated frequencies
o Quantitative:
o Central tendency: mean, median, mode
o Spread: Standard deviation, interquartile range
o Skewness and kurtosis
Non-graphical, multivariate
o Cross-tabulation
o Univariate statistics by category
o Correlation matrices
Graphical, univariate
o Histograms
o Box plots, stem-and-leaf plots
o Quantile-normal plots
Graphical, multivariate
o Univariate graphs by category (e.g., side-by-side box plots)
o Scatterplots
o Correlation matrix plots
Data
Munging
• Feature Reduction: The process of selecting a subset of features for use in model construction
• Useful for both supervised and unsupervised learning problems
Art is the elimination of the unnecessary.
– Pablo Picasso
Data Munging
Feature Reduction: Why
• True dimensionality <<< Observed dimensionality
• The abundance of redundant and irrelevant features
• Curse of dimensionality
• With a fixed number of training samples, the predictive power reduces as the dimensionality increases. [Hughes phenomenon]
• With $d$ binary variables, the number of possible combinations is $O(2^d)$.
• Goal of the Analysis
• Descriptive → Diagnostic → Predictive → Prescriptive
• Hindsight → Insight → Foresight
• Law of Parsimony [Occam’s Razor]
• Other things being equal, simpler explanations are generally better than complex ones.
• Overfitting
• Execution time (Algorithm and data)
Data
Munging
1. Percent missing values
2. Amount of variation
3. Pairwise correlation
4. Multicollinearity
5. Principal Component Analysis (PCA)
6. Cluster analysis
7. Correlation (with the target)
8. Forward selection
9. Backward elimination
10. Stepwise selection
11. LASSO
12. Tree-based selection
Feature Reduction Techniques
A practical guide to dimensionality reduction techniques – Vishal Patel
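The first three techniques in the list (percent missing, amount of variation, pairwise correlation) are simple filters and can be sketched without any libraries. The thresholds and the example columns below are hypothetical, and keeping the first of two highly correlated features is just one possible tie-breaking rule.

```python
from statistics import mean, pvariance

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5

def reduce_features(columns, max_missing=0.5, min_variance=1e-8,
                    max_abs_corr=0.95):
    """Return the names of the columns that survive all three filters."""
    selected = []
    for name, values in columns.items():
        known = [v for v in values if v is not None]
        # 1. Percent missing values.
        if 1 - len(known) / len(values) > max_missing:
            continue
        # 2. Amount of variation.
        if len(known) < 2 or pvariance(known) <= min_variance:
            continue
        # 3. Pairwise correlation with features already kept.
        redundant = False
        for other in selected:
            pairs = [(x, y) for x, y in zip(values, columns[other])
                     if x is not None and y is not None]
            xs = [p[0] for p in pairs]
            ys = [p[1] for p in pairs]
            if len(pairs) >= 2 and abs(pearson(xs, ys)) > max_abs_corr:
                redundant = True
                break
        if not redundant:
            selected.append(name)
    return selected

columns = {
    "tenure_days": [100, 200, 300, 400, 500],
    "tenure_months": [3.3, 6.6, 9.9, 13.2, 16.5],  # duplicates tenure_days
    "fax_number": [None, None, None, None, 1],     # 80% missing
    "country": [1, 1, 1, 1, 1],                    # no variation
    "n_orders": [5, 1, 4, 2, 3],
}
print(reduce_features(columns))
```

None of these filters looks at the target, so they apply to both supervised and unsupervised problems; techniques 7 to 12 in the list do use the target and need to be run inside the training partition only.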
Model
Training
• Try more than one machine learning technique.
• Fine-tune parameters.
• Assess model performance.
• Avoid overfitting.
Assess Model Performance
• New Age: Area Under the ROC Curve (AUC), Confusion Matrix, Precision, Recall, Log-loss, etc.
• Old School: Model Lift, Model Gains, Kolmogorov-Smirnov (KS), etc.
When a measure becomes a target,
it ceases to be a good measure.
Goodhart's law
Pic Courtesy: @auxesis
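Some of the measures named under Assess Model Performance are easy to compute by hand, which helps when checking what a library reports. The sketch below covers the confusion matrix, precision, recall, and AUC (computed as the share of positive/negative pairs ranked correctly, with ties counted as half). The labels and scores are made up for illustration.

```python
def confusion(y_true, scores, threshold=0.5):
    """Return (tp, fp, tn, fn) for scores cut at the given threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

tp, fp, tn, fn = confusion(y_true, scores)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, tn, fn, precision, recall, auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of rows, so it is meant for understanding and spot checks rather than large datasets.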
Tri-fold Partition
Dataset: Train 60% | Test 20% | Validation 20%
k-fold cross-validation
o Fine-tune and select the best model based
on Train + Test sets.
o Evaluate the chosen algorithm on the
Validation set (i.e., completely unseen data).
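A minimal sketch of the partitioning described above, assuming a random (not temporal) split. Note the slides' naming: the final untouched set is called Validation, whereas many texts call that set the test set. The function names and the fixed seed are choices made for this example.

```python
import random

def tri_fold_split(rows, seed=0, test_frac=0.2, validation_frac=0.2):
    """Shuffle rows and split them into train, test, and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_frac)
    n_test = int(len(shuffled) * test_frac)
    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    train = shuffled[n_validation + n_test:]
    return train, test, validation

def k_fold_indices(n, k):
    """Yield (fit_indices, holdout_indices) for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        holdout = list(range(start, start + size))
        fit = list(range(0, start)) + list(range(start + size, n))
        yield fit, holdout
        start += size

rows = list(range(100))
train, test, validation = tri_fold_split(rows)
print(len(train), len(test), len(validation))

# Cross-validate on Train + Test only; Validation stays untouched.
tuning_rows = train + test
folds = list(k_fold_indices(len(tuning_rows), k=5))
print([len(holdout) for _, holdout in folds])
```

The Validation rows are set aside before any tuning, so they can serve as the completely unseen data the slide calls for.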
Bias-Variance Tradeoff
With four parameters I can fit an elephant,
and with five I can make him wiggle his trunk.
– John von Neumann
Model
Evaluation
1 MODEL SELECTION
2 ASSESSMENT
3 PRESENTATION
Model Evaluation: 1 MODEL SELECTION
• Law of Parsimony (Occam’s Razor)
• Model execution time
• Deployment complexity
Build the simplest solution that can
adequately answer the question.
Model Evaluation: 2 ASSESSMENT
Dataset: Validation 20% (Temporal or Random)
Model Evaluation: 3 PRESENTATION
• AUC, etc.
• Cumulative Gains Chart / Lift Chart
• Compare against existing business rules/model
• Predictor Importance
• Each predictor’s relationship with the target
• Reason-coding
• Model usage recommendations
• Decile reports
• Personify
• Model peer-review (Quality Control)
Interpret results as they relate to the business application.
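The decile report and cumulative gains chart mentioned above share one calculation: rank customers by score, cut them into equal-sized groups, and compare each group's response rate with the overall rate. The sketch below assumes binary outcomes, at least one positive, and at least as many rows as bins; the data is invented.

```python
def decile_report(y_true, scores, n_bins=10):
    """Response rate, lift, and cumulative gain per score-ranked bin."""
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    total_positives = sum(y for _, y in ranked)
    overall_rate = total_positives / n

    report = []
    cumulative_positives = 0
    for b in range(n_bins):
        rows = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        positives = sum(y for _, y in rows)
        cumulative_positives += positives
        rate = positives / len(rows)
        report.append({
            "decile": b + 1,
            "response_rate": rate,
            "lift": rate / overall_rate,
            "cumulative_gain": cumulative_positives / total_positives,
        })
    return report

# 20 hypothetical customers, already listed from highest to lowest score.
y_true = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [(20 - i) / 20 for i in range(20)]

report = decile_report(y_true, scores)
for row in report[:3]:
    print(row)
```

Read in business terms: in this made-up data the top decile responds at 2.5 times the overall rate, and the top two deciles capture half of all responders, which is usually the form in which model usage recommendations are presented.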
Model
Deployment
• Model production cycle
• Scoring code, or publish model as a web service
• Hand-off
• Model Documentation (Technical Specifications)
• Data preparation, transformations, imputations, parameter settings, etc.
• Reproducibility
• Docker containers
• Model Persistence vs. Model Transience
Model Persistence vs. Model Transience
Model Persistence
[Timeline, Aug to Apr: Model Build → Scoring Algorithm with Model Decay Tracking → Model Rebuild → Scoring Algorithm with Model Decay Tracking]
• Traditional approach
• Provides stability
• Less resource intensive
Model Transience
[Timeline, Aug to Dec: a fresh Model Build in every period]
• Modern approach
• Able to capture recent trends
• Resource intensive
Model
Tracking
1 MONITOR
2 MAINTAIN
3 TEST
Model Tracking: 1 MONITOR
• Model decay tracking (monitoring) plan
• Model performance over time
• Predictor distribution
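Monitoring the predictor distribution needs some measure of how far current data has drifted from the data the model was built on. The slides do not name one; the population stability index (PSI) used below is a common choice and is swapped in here as an assumption. The bin edges and sample values are hypothetical.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples of one predictor."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached.
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

build_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]     # at model build
stable_sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # same distribution
shifted_sample = [6, 7, 8, 9, 10, 6, 7, 8, 9, 10]  # drifted upward
edges = [3.5, 6.5]

print(psi(build_sample, stable_sample, edges))
print(psi(build_sample, shifted_sample, edges))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions, not guarantees; a large value is a prompt to investigate, and possibly to rebuild.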
Model Tracking: 2 MAINTAIN
• Model maintenance plan
• Adding new data sources
• Version control
Model Tracking: 3 TEST
• Campaign Set-up and Execution
• Experimental Design (A/B tests, Fractional Factorial)
Experimental Design
 | Marketing Treatment | No Treatment
Selection Based on Model | A: Test | B: Selection Hold-out
No Selection (Random) | C: Control | D: Random Hold-out
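The four cells of the design make it possible to separate the effect of the marketing treatment from the effect of the model's selection. The sketch below shows one way to read them; the counts are invented, and the comparisons chosen are an interpretation of the grid rather than something the slides spell out.

```python
# Hypothetical campaign results for the four cells of the design.
cells = {
    "A": {"n": 10_000, "responders": 900},  # model-selected, treated (Test)
    "B": {"n": 2_000, "responders": 100},   # model-selected, untreated
    "C": {"n": 10_000, "responders": 400},  # random, treated (Control)
    "D": {"n": 2_000, "responders": 60},    # random, untreated
}

rate = {name: c["responders"] / c["n"] for name, c in cells.items()}

# Effect of the treatment among customers the model selected.
treatment_effect_model = rate["A"] - rate["B"]
# Effect of the treatment among randomly selected customers.
treatment_effect_random = rate["C"] - rate["D"]
# Effect of model selection among treated customers.
selection_effect = rate["A"] - rate["C"]
# Extra treatment effect gained by targeting with the model.
incremental_from_targeting = treatment_effect_model - treatment_effect_random

for name, value in [
    ("treatment effect (model-selected)", treatment_effect_model),
    ("treatment effect (random)", treatment_effect_random),
    ("selection effect (treated)", selection_effect),
    ("incremental from targeting", incremental_from_targeting),
]:
    print(f"{name}: {value * 100:+.1f} points")
```

Without cells B and D, a high response rate in cell A could not be attributed to the campaign, because model-selected customers might have responded anyway. A real analysis would also test whether the differences are statistically significant.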
Measure
Improve
Test
Data Science Process: Recap
DISCUSS | Business Understanding | Determine, Understand, Map
COLLATE | Data Preparation | Identify, Collect, Assess, Vectorize
WRANGLE | Data Munging | Impute, Transform, Reduce
PERFORM | Model Training | Train, Assess, Select
COMMUNICATE | Model Evaluation | Evaluate, Peer Review, Present
EXECUTE | Model Deployment | Deploy, Document
TRACK | Model Tracking | Monitor, Maintain, Test
Process as Proxy
“Good process serves you so you can serve customers.
But if you’re not watchful, the process can become the proxy for the result you want.
You stop looking at outcomes and just make sure you’re doing the process right.
Gulp.
It’s always worth asking, do we own the process or does the process own us?”
– Jeff Bezos
THANK YOU!
vishal@derive.io
www.linkedIn.com/in/VishalJP

 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 

The Data Science Process

  • 1. Vishal Patel January 2018 Exploring The Data Science Process Richmond Data Science Community
  • 2. Vishal Patel. Founder of DERIVE, LLC. Data Science services. Automated advanced analytics products. MS in Computer Science, and MS in Decision Sciences. www.derive.io
  • 4. Gartner's Hype Cycle [figure: Expectations vs. Time; phases: Innovation Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity; technologies plotted: Machine Learning, Deep Learning, Data Science]
  • 5. BUILD A MACHINE LEARNING MODEL IN JUST THREE QUICK AND EASY STEPS USING […]!!! – Most tutorials 1
  • 6. How to Become a Data Scientist? How to Draw An Owl
  • 7. 50% of analytic projects fail. – Gartner, 2015 2
  • 9. On September 21, 2009, the grand prize of US$1,000,000 was given to the BellKor’s Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%.
  • 10. “[T]he additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
  • 11. Analytic projects fail because… …they aren’t completed within budget or on schedule, or because they fail to deliver the features and benefits that are optimistically agreed on at their outset.
  • 12. How to Avoid Failure? 1 Build with Organizational Buy-in 2 Build with End In Mind 3 Build with a Structured Approach
  • 16. The Blind Men and the Elephant It was six men of Indostan To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind. And so these men of Indostan Disputed loud and long, Each in his own opinion Exceeding stiff and strong, Though each was partly in the right And all were in the wrong! – John Godfrey Saxe
  • 18. Data Science Process [figure: seven steps spanning BUSINESS, STATISTICS, and COMPUTER SCIENCE]: 1 Business Understanding, 2 Data Preparation, 3 Data Munging, 4 Model Training, 5 Model Evaluation, 6 Model Deployment, 7 Model Tracking
  • 19. The Data Science Process [figure: the same steps spanning BUSINESS, STATISTICS, and COMPUTER SCIENCE]: Business Understanding, Data Preparation, Data Munging, Model Training, Model Evaluation, Model Deployment, Model Tracking
  • 21. Business Understanding Far better an approximate answer to the right question than an exact answer to the wrong question. – John Tukey
  • 23. Business Understanding (steps: 1 DETERMINE, 2 UNDERSTAND, 3 MAP)
      What does the client want to achieve? Primary Objective:
      o Reduce attrition
      o Customized targeting
      o Plan future media spend
      o Prevent fraud
      o Recommend products
  • 24. Business Understanding (steps: 1 DETERMINE, 2 UNDERSTAND, 3 MAP)
      o Understand success criteria.
          o Specific, measurable, time-bound
      o List assumptions, constraints, and important factors.
      o Identify secondary or competing objectives.
      o Study existing solutions (if any).
  • 25. Business Understanding (steps: 1 DETERMINE, 2 UNDERSTAND, 3 MAP)
      Business Objective → Technical Objective
      o State the project objective(s) in technical terms.
      o Describe how the data science project will help solve the business problem.
      o Explore successful scenarios.
  • 26. Business Understanding (steps: 1 DETERMINE, 2 UNDERSTAND, 3 MAP)
      OBJECTIVE | TECHNIQUE | EXAMPLES
      Predict values | Regression | Linear regression, Bayesian regression, Decision Trees
      Predict categories | Classification | Logistic regression, SVM, Decision Trees
      Predict preference | Recommender System | Collaborative / Content-based filtering
      Discover groups | Clustering | k-means, Hierarchical clustering
      Identify unusual data points | Anomaly Detection | k-NN, One-class SVM
      …
  • 28. If all you have is a hammer then everything looks like a nail.
  • 29. Business Understanding
      o Primary Objective: Prevent attrition → Increase subscription renewals
      o Competing Objective: High value customers are also targeted for up-sell
      o Constraints: Avoid targeting customers too close to their contract expiration
      o Success Criteria: Current renewal rate = 65% → Improve by 8% within the next quarter
      o Existing Solution: Business-rule-based targeting
      o Data Science Objective: Build a binary classification model to identify customers who are not likely to renew their subscriptions at least three months in advance of their contract expiration.
      o Success Scenario: The model correctly identifies 80% of the future attritors; the promotional campaign targets all likely attritors, and successfully converts 20% of them into non-attritors.
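The success scenario on this slide can be checked against the success criterion with a little arithmetic. The sketch below is only an illustration under stated assumptions: every non-renewing customer counts as an attritor, "improve by 8%" is read as a relative improvement over the 65% baseline, and the variable names are made up for this example.

```python
# Figures taken from the slide.
renewal_rate = 0.65   # current renewal rate
recall = 0.80         # share of future attritors the model identifies
conversion = 0.20     # share of targeted attritors the campaign converts

# Assumption: everyone who does not renew is an attritor.
attrition_rate = 1 - renewal_rate

# Customers saved = attritors * identified * converted.
saved = attrition_rate * recall * conversion
new_renewal_rate = renewal_rate + saved
relative_improvement = saved / renewal_rate

print(f"new renewal rate: {new_renewal_rate:.3f}")
print(f"relative improvement: {relative_improvement:.3f}")
```

Under these assumptions the scenario adds 5.6 points of renewal rate, which is a little over 8% relative to the baseline. If the criterion were meant as 8 percentage points, the same scenario would fall short, which is why the criterion needs to be stated precisely.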
  • 30. Business Understanding: Project Plan [photo: Titanic at Southampton docks, prior to departure]
      o Duration
      o Inventory of resources
      o Tools and techniques
      o Risks and contingencies
      o Costs and benefits
      o Milestones
      The thought that disaster is impossible often leads to an unthinkable disaster. – Gerald Weinberg
  • 32. Data Preparation (steps: 1 IDENTIFY, 2 COLLECT, 3 ASSESS, 4 VECTORIZE)
      o Data sources, formats
          o Database, Streaming APIs, Logs, Excel files, Websites, etc.
          o Entity Relationship Diagram (ERD)
      o Identify additional data sources.
          o Demographics data appends
          o Geographical data
          o Census data, etc.
      o Identify relevant data.
      o Record unavailable data.
      o How long a history is available, and how much of it should be used?
  • 33. Data Preparation (steps: 1 IDENTIFY, 2 COLLECT, 3 ASSESS, 4 VECTORIZE)
      o Access or acquire all relevant data in a central location
      o Quality control checks and tests
          o File formats, delimiters
          o Number of records, columns
          o Primary keys
  • 34. Data Preparation (steps: 1 IDENTIFY, 2 COLLECT, 3 ASSESS, 4 VECTORIZE)
      First look at the data. Trust, but verify.
      o Get familiar with the data.
      o Study seasonality.
          o Monthly/weekly/daily patterns
          o Unexplained gaps or spikes
      o Detect mistakes.
          o Extreme or outlier values
          o Unusual values
          o Special missing values
      o Check assumptions.
      o Review distributions.
  • 35. Tidy datasets are all alike; every messy dataset is messy in its own way. – Hadley Wickham [shown as an edit of Tolstoy's "Happy families are all alike; every unhappy family is unhappy in its own way."]
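  • 36. Data Preparation (steps: 1 IDENTIFY, 2 COLLECT, 3 ASSESS, 4 VECTORIZE)
      GOAL: Create the Analysis Dataset
      \[
      y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix},
      \qquad
      X = \begin{bmatrix}
      x_{11} & x_{12} & x_{13} & \dots & x_{1j} \\
      x_{21} & x_{22} & x_{23} & \dots & x_{2j} \\
      x_{31} & x_{32} & x_{33} & \dots & x_{3j} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      x_{n1} & x_{n2} & x_{n3} & \dots & x_{nj}
      \end{bmatrix}
      \]
      o y: Outcome, Target / Labels, Dependent Variable
      o X: Inputs, Features / Attributes, Independent Variables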
  • 37. Target Definition
      o Churn = 90 days of consecutive inactivity (for a pre-paid telecom customer)
      o What's inactivity?
          o Incoming and outgoing calls
          o Data usage
          o Incoming text
          o Promotional texts
          o Voicemail usage
          o Call forwarding
          o Etc.
      o Customers may change their device or phone number.
          o Churn at the individual (person) level, or at the device (phone) level?
      o Customers may return (become active again) after 90 days of inactivity.
      o Prediction window
          o Predict 90 days of consecutive inactivity?
          o Would 10 days of consecutive inactivity suffice?
          o How many customers return after x days of inactivity?
      o Fraud, involuntary churn
      o Etc.
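How much the target depends on these choices is easy to see in code. The following is a minimal sketch, not the presenter's implementation: `is_churned`, the customer records, and the as-of date are all hypothetical, and deciding which events count as "qualifying activity" is exactly the question the slide raises.

```python
from datetime import date, timedelta

def is_churned(activity_dates, as_of, inactivity_days=90):
    """True if there was no qualifying activity in the last
    `inactivity_days` days up to and including `as_of`."""
    window_start = as_of - timedelta(days=inactivity_days)
    return not any(window_start < d <= as_of for d in activity_dates)

# Hypothetical customers: dates of qualifying activity only.
customers = {
    "A": [date(2017, 1, 5), date(2017, 9, 20)],
    "B": [date(2017, 1, 5), date(2017, 3, 1)],
}
as_of = date(2017, 9, 30)

labels = {cid: is_churned(dates, as_of) for cid, dates in customers.items()}
print(labels)
```

Re-running with `inactivity_days=10` flips customer A to churned, which shows how sensitive the label is to the window length.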
  • 38. Accurate but not Precise
  • 39. Modeling Sample
      o Historical trends and seasonality
          o Are there certain timeframes that should be discarded?
          o The model should be generalizable.
      o Eligible, relevant population
          o Must align with the business goals
      o Eligible, relevant markets
          o Must align with the business goals
          o E.g., within a certain drive-time distance
      o Outdated products or events
  • 40. Selection Bias Abraham Wald's Work on Aircraft Survivability Journal of the American Statistical Association Vol. 79, No. 386 (Jun., 1984)
  • 41. Information Leakage
      o Timeline (… May 2017 to Dec 2017): the OBSERVATION WINDOW runs through the end of Sep; the PREDICTION WINDOW follows it.
      o Use all available information ("Leading Indicators") as of the end of Sep to predict whether customers will [do something].
      o The leading indicators must be calculated from the timeframe leading up to the event; that timeframe must not overlap with the prediction window.
      o Beware of proxy events, e.g., future bookings.
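One way to keep the two windows apart is to fix their boundaries once and compute features and target from disjoint date ranges. This is a sketch under assumptions: the window dates follow the slide's timeline (observation through the end of September 2017, prediction from October to December), while the event records, `build_row`, and the "purchase" target are invented for illustration.

```python
from datetime import date

# Window boundaries; the two ranges must not overlap.
OBS_START, OBS_END = date(2017, 5, 1), date(2017, 9, 30)
PRED_START, PRED_END = date(2017, 10, 1), date(2017, 12, 31)
assert OBS_END < PRED_START

# Hypothetical event log: (customer_id, event_date).
events = [
    (1, date(2017, 6, 3)),
    (1, date(2017, 11, 15)),
    (2, date(2017, 8, 21)),
    (2, date(2017, 4, 2)),
]

def build_row(customer_id):
    dates = [d for cid, d in events if cid == customer_id]
    # Leading indicator: uses observation-window events only.
    n_obs = sum(1 for d in dates if OBS_START <= d <= OBS_END)
    # Target: uses prediction-window events only.
    target = int(any(PRED_START <= d <= PRED_END for d in dates))
    return {"customer_id": customer_id, "n_events_obs": n_obs, "target": target}

rows = [build_row(c) for c in (1, 2)]
print(rows)
```

A feature such as "total events to date" computed without the `OBS_END` filter would silently include prediction-window activity, which is the leakage the slide warns about.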
  • 42. Data Aggregation
      o Attribute creation
      o Derived attributes: Household income / Number of adults = Income per adult
      o Brainstorm with team members (both technical and non-technical)
      Output: the attribute matrix \(X = [x_{ik}]\), with rows \(i = 1, \dots, n\) and columns \(k = 1, \dots, j\).
  • 43. Data Aggregation
      Input (one row per transaction):
      CUSTOMER ID | PURCHASE DATE
      1001 | 02-12-2015:05:20:39
      1001 | 05-13-2015:12:18:09
      1001 | 12-20-2016:00:15:59
      1002 | 01-19-2014:04:28:54
      1003 | 01-12-2015:09:20:36
      1003 | 05-31-2015:10:10:02
      … | …
      Derived attributes:
      1. Number of transactions (Frequency)
      2. Days since the last transaction (Recency)
      3. Days since the earliest transaction (Tenure)
      4. Avg. days between transactions
      5. # of transactions during weekends
      6. % of transactions during weekends
      7. # of transactions by day-part (breakfast, lunch, etc.)
      8. % of transactions by day-part
      9. Days since last transaction / Avg. days between transactions
      10. …
      Output (one row per customer):
      CUSTOMER ID | x1 | x2 | … | xj
      1001 | … | … | … | …
      1002 | … | … | … | …
      1003 | … | … | … | …
      … | … | … | … | …
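The roll-up from transaction rows to one row per customer could look like the sketch below, which covers a handful of the listed attributes. The six transactions are the ones shown on the slide; the as-of date, the `aggregate` helper, and the output field names are assumptions made for this example.

```python
from collections import defaultdict
from datetime import date, datetime

raw = [  # rows from the slide: (customer id, purchase timestamp)
    (1001, "02-12-2015:05:20:39"),
    (1001, "05-13-2015:12:18:09"),
    (1001, "12-20-2016:00:15:59"),
    (1002, "01-19-2014:04:28:54"),
    (1003, "01-12-2015:09:20:36"),
    (1003, "05-31-2015:10:10:02"),
]
AS_OF = date(2017, 1, 1)  # assumed reference date for recency and tenure

by_customer = defaultdict(list)
for cid, ts in raw:
    by_customer[cid].append(datetime.strptime(ts, "%m-%d-%Y:%H:%M:%S"))

def aggregate(times):
    days = sorted(t.date() for t in times)
    n = len(days)
    # Average gap is undefined for a single transaction.
    avg_gap = (days[-1] - days[0]).days / (n - 1) if n > 1 else None
    recency = (AS_OF - days[-1]).days
    weekend = sum(1 for d in days if d.weekday() >= 5)  # Sat=5, Sun=6
    return {
        "frequency": n,
        "recency_days": recency,
        "tenure_days": (AS_OF - days[0]).days,
        "avg_days_between": avg_gap,
        "pct_weekend": weekend / n,
        "recency_over_avg_gap": recency / avg_gap if avg_gap else None,
    }

features = {cid: aggregate(ts) for cid, ts in by_customer.items()}
for cid, row in features.items():
    print(cid, row)
```

Attributes that are undefined for some customers (for example, the average gap for a one-time buyer) are left as `None` here, so the choice of how to impute them stays explicit in the later munging step.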
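  • 44. Data Preparation (steps: 1 IDENTIFY, 2 COLLECT, 3 ASSESS, 4 VECTORIZE)
      OUTPUT: The Analysis Dataset
      \[
      y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix},
      \qquad
      X = \begin{bmatrix}
      x_{11} & x_{12} & x_{13} & \dots & x_{1j} \\
      x_{21} & x_{22} & x_{23} & \dots & x_{2j} \\
      x_{31} & x_{32} & x_{33} & \dots & x_{3j} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      x_{n1} & x_{n2} & x_{n3} & \dots & x_{nj}
      \end{bmatrix}
      \]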
  • 46. Give me six hours to chop down a tree and I will spend the first four sharpening the axe. – Anonymous
  • 47. Data Munging
      o Descriptive statistics
          o Review with the client
      o Correlation analysis
          o Review with the client
          o Watch out for data leakage
      o Impute missing values
      o Trim extreme values
      o Process categorical attributes
      o Transformations (square, log, etc.)
      o Binning / variable smoothing
      o Multicollinearity
          o Reduce redundancy
      o Create additional features
          o Interactions
      o Normalization (scaling)
      [image caption] Machine learning experts display cleaned data samples in preparation for modeling. Annibale Carracci, c.1600, via @vboykis
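Three of these steps (imputation, trimming, and scaling) are simple enough to sketch with the standard library. The choices below are illustrative assumptions, not the presenter's recipe: median imputation, analyst-supplied caps for trimming, min-max scaling, and a made-up `income` column.

```python
from statistics import median

def impute_median(values):
    """Replace missing values (None) with the median of the observed ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def trim(values, lower, upper):
    """Cap values at the given bounds (bounds are chosen by the analyst)."""
    return [min(max(v, lower), upper) for v in values]

def min_max_scale(values):
    """Rescale to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with one missing and one extreme value.
income = [42.0, None, 55.0, 61.0, 1000.0, 48.0]

imputed = impute_median(income)
trimmed = trim(imputed, lower=40.0, upper=100.0)
scaled = min_max_scale(trimmed)
print(imputed, trimmed, scaled, sep="\n")
```

The order matters: scaling before trimming would let the single extreme value compress every other observation toward zero.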
  • 48. Data Munging
      o Univariate, Non-Graphical
          o Categorical: Tabulated frequencies
          o Quantitative:
              o Central tendency: mean, median, mode
              o Spread: Standard deviation, inter-quartile range
              o Skewness and kurtosis
      o Univariate, Graphical
          o Histograms
          o Box plots, stem-and-leaf plots
          o Quantile-normal plots
      o Multivariate, Non-Graphical
          o Cross-tabulation
          o Univariate statistics by category
          o Correlation matrices
      o Multivariate, Graphical
          o Univariate graphs by category (e.g., side-by-side box-plots)
          o Scatterplots
          o Correlation matrix plots
  • 49. Data Munging
      o Feature Reduction: The process of selecting a subset of features for use in model construction
      o Useful for both supervised and unsupervised learning problems
      Art is the elimination of the unnecessary. – Pablo Picasso
  • 50. Data Munging: Feature Reduction: Why
      o True dimensionality <<< Observed dimensionality
          o The abundance of redundant and irrelevant features
      o Curse of dimensionality
          o With a fixed number of training samples, the predictive power reduces as the dimensionality increases. [Hughes phenomenon]
          o With \(d\) binary variables, the number of possible combinations is \(O(2^d)\).
      o Goal of the Analysis: Descriptive, Diagnostic, Predictive, Prescriptive (a spectrum from Hindsight through Insight to Foresight)
      o Law of Parsimony [Occam's Razor]
          o Other things being equal, simpler explanations are generally better than complex ones.
      o Overfitting
      o Execution time (Algorithm and data)
  • 51. Data Munging: Feature Reduction Techniques
      1. Percent missing values
      2. Amount of variation
      3. Pairwise correlation
      4. Multicollinearity
      5. Principal Component Analysis (PCA)
      6. Cluster analysis
      7. Correlation (with the target)
      8. Forward selection
      9. Backward elimination
      10. Stepwise selection
      11. LASSO
      12. Tree-based selection
      See: A practical guide to dimensionality reduction techniques – Vishal Patel
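The first three techniques in this list are simple filters and can be sketched without any libraries. What follows is an illustration only: the thresholds, the `reduce_features` helper, and the toy columns are assumptions, and the greedy "keep the first of a correlated pair" rule is one of several reasonable choices.

```python
from statistics import mean, pvariance

def pct_missing(col):
    return sum(v is None for v in col) / len(col)

def pearson(a, b):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def reduce_features(data, max_missing=0.5, min_variance=1e-8, max_corr=0.95):
    # Filters 1 and 2: too many missing values, or (almost) no variation.
    kept = []
    for name, col in data.items():
        if pct_missing(col) > max_missing:
            continue
        if pvariance([v for v in col if v is not None]) <= min_variance:
            continue
        kept.append(name)
    # Filter 3: drop a feature if it nearly duplicates one already selected.
    selected = []
    for name in kept:
        if any(abs(pearson(data[name], data[s])) > max_corr for s in selected):
            continue
        selected.append(name)
    return selected

data = {  # hypothetical attributes
    "mostly_missing": [None, None, None, 1.0, None],
    "constant": [3.0, 3.0, 3.0, 3.0, 3.0],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = reduce_features(data)
print(selected)
```

None of these filters look at the target, so they can be applied before any train/test split without leaking label information; the supervised techniques later in the list cannot.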
  • 52. Model Training
      o Try more than one machine learning technique.
      o Fine-tune parameters.
      o Assess model performance.
      o Avoid over-fitting.
  • 53. Assess Model Performance
      o New Age: Area Under the ROC Curve (AUC), Confusion Matrix, Precision, Recall, Log-loss, etc.
      o Old School: Model Lift, Model Gains, Kolmogorov-Smirnov (KS), etc.
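To make one metric from each camp concrete, here is a small sketch that computes AUC and the KS statistic from scratch for a binary target. The labels and scores are invented, and the quadratic-time AUC is written for readability rather than speed.

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, scores):
    """Largest gap between the score CDFs of positives and negatives."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

# Hypothetical labels and model scores.
y = [1, 0, 1, 1, 0, 0, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print("AUC:", auc(y, s))
print("KS: ", ks_statistic(y, s))
```

Both numbers depend only on how the scores rank the records, not on their calibration, which is one reason the two families of metrics often agree on which model is better.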
  • 54. When a measure becomes a target, it ceases to be a good measure. Goodhart's law Pic Courtesy: @auxesis
  • 55. Tri-fold Partition
      o Dataset split: Train 60%, Test 20%, Validation 20%, with k-fold cross-validation over the Train/Test portion
      o Fine-tune and select the best model based on Train + Test sets.
      o Evaluate the chosen algorithm on the Validation set (i.e., completely unseen data).
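A 60/20/20 partition with cross-validation confined to the non-validation records might be set up as below. This is a sketch on row indices only; the function names, the seed, and the choice of five folds are assumptions, and a random split is used even though a temporal one may suit some problems better.

```python
import random

def tri_fold_partition(n, seed=0):
    """Shuffle row indices and split them 60% / 20% / 20%."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 60 // 100
    n_test = n * 20 // 100
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    validation = idx[n_train + n_test:]
    return train, test, validation

def k_fold(indices, k=5):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, folds[i]

train, test, validation = tri_fold_partition(100)

# Tune and select models with cross-validation on Train + Test only;
# the validation indices are never passed to k_fold.
folds = list(k_fold(train + test, k=5))
print(len(train), len(test), len(validation), len(folds))
```

Keeping the validation indices out of every tuning loop is what makes the final evaluation a fair estimate on unseen data.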
  • 57. With four parameters I can fit an elephant, and with five I can make him wiggle his trunk. – John von Neumann
  • 58. Model Evaluation 1 MODEL SELECTION 2 ASSESSMENT 3 PRESENTATION
  • 59. Model Evaluation (steps: 1 MODEL SELECTION, 2 ASSESSMENT, 3 PRESENTATION)
      o Law of Parsimony (Occam's Razor)
      o Model execution time
      o Deployment complexity
      Build the simplest solution that can adequately answer the question.
  • 60. Model Evaluation (steps: 1 MODEL SELECTION, 2 ASSESSMENT, 3 PRESENTATION)
      o Validation set (20% of the dataset): Temporal or Random
  • 61. Model Evaluation (steps: 1 MODEL SELECTION, 2 ASSESSMENT, 3 PRESENTATION)
      o AUC, etc.
      o Cumulative Gains Chart / Lift Chart
      o Compare against existing business rules/model
      o Predictor Importance
      o Each predictor's relationship with the target
      o Reason-coding
      o Model usage recommendations
      o Decile reports
      o Personify
      o Model peer-review (Quality Control)
      Interpret results as they relate to the business application.
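The numbers behind a cumulative gains chart, lift chart, or decile report all come from the same ranked table. A minimal sketch follows, assuming a binary target; `gains_table`, the toy data, and the use of five bins instead of ten (to keep the example small) are illustrative choices.

```python
def gains_table(y_true, scores, n_bins=10):
    """Rank records by score (highest first), cut into equal-sized bins,
    and report cumulative gain and lift per bin."""
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    total_pos = sum(y for _, y in ranked)
    rows, cum_pos = [], 0
    for b in range(n_bins):
        start, end = b * n // n_bins, (b + 1) * n // n_bins
        cum_pos += sum(y for _, y in ranked[start:end])
        pct_population = end / n
        cum_gain = cum_pos / total_pos          # share of all positives captured
        rows.append({
            "bin": b + 1,
            "cum_pct_population": pct_population,
            "cum_gain": cum_gain,
            "cum_lift": cum_gain / pct_population,  # vs. random targeting
        })
    return rows

# Hypothetical validation-set outcomes and scores.
y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
s = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

table = gains_table(y, s, n_bins=5)
for row in table:
    print(row)
```

Read in business terms, the first row says that targeting the top 20% of customers by score reaches half of all positives, 2.5 times what random targeting would reach.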
  • 62. Model Deployment
      o Model production cycle
      o Scoring code, or publish model as a web service
      o Hand-off
      o Model Documentation (Technical Specifications)
          o Data preparation, transformations, imputations, parameter settings, etc.
      o Reproducibility
          o Docker containers
      o Model Persistence vs. Model Transience
  • 63. Model Persistence vs. Model Transience
      o Model Persistence (timeline Aug to Apr): Model Build → Model Decay Tracking (Scoring Algorithm) → Model Rebuild → Model Decay Tracking (Scoring Algorithm)
          o Traditional approach
          o Provides stability
          o Less resource intensive
      o Model Transience (timeline Aug to Dec): Model Build repeated in every period
          o Modern approach
          o Able to capture recent trends
          o Resource intensive
  • 65. Model Tracking (steps: 1 MONITOR, 2 MAINTAIN, 3 TEST)
      o Model decay tracking (monitoring) plan
      o Model performance over time
      o Predictor distribution
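The slide does not say how predictor distributions should be compared over time. One commonly used measure is the Population Stability Index (PSI), sketched here as an assumption rather than as the presenter's method; the bin counts are invented and the `eps` floor is only there to avoid taking the log of zero.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.

    expected_counts: counts per bin at model build time
    actual_counts:   counts per bin in the current scoring data
    """
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value

# Hypothetical counts of one predictor across four bins.
at_build = [25, 25, 25, 25]
this_month = [10, 20, 30, 40]

print("no shift:", psi(at_build, at_build))
print("shifted: ", psi(at_build, this_month))
```

Tracking this value per predictor each scoring cycle gives an early warning that is available before outcomes (and therefore performance metrics) can be observed.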
  • 66. Model Tracking (steps: 1 MONITOR, 2 MAINTAIN, 3 TEST)
      o Model maintenance plan
      o Adding new data sources
      o Version control
  • 67. Model Tracking (steps: 1 MONITOR, 2 MAINTAIN, 3 TEST)
      o Campaign Set-up and Execution
      o Experimental Design (A/B tests, Fractional Factorial)
  • 68. Experimental Design
      | Marketing Treatment | No Treatment
      Selection Based on Model | A: Test | B: Selection Hold-out
      No Selection (Random) | C: Control | D: Random Hold-out
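Assigning customers to the four cells can be sketched as follows. The grid reading used here (A and B drawn from model-selected customers, C and D drawn at random, with B and D withheld from the marketing treatment) is an interpretation of the slide, and `assign_cells`, the cell sizes, and the scores are hypothetical.

```python
import random

def assign_cells(scores, n_select, n_holdout, seed=0):
    """scores: dict of customer_id -> model score.
    Returns a dict mapping each cell name to a list of customer ids."""
    rng = random.Random(seed)
    ids = sorted(scores)
    rng.shuffle(ids)

    # Randomly split the population into a model arm and a random arm.
    half = len(ids) // 2
    model_arm, random_arm = ids[:half], ids[half:]

    # Model arm: take the highest-scored customers.
    by_model = sorted(model_arm, key=lambda i: scores[i], reverse=True)[:n_select]
    # Random arm: take the same number of customers at random.
    by_chance = rng.sample(random_arm, n_select)

    # Shuffle so that the hold-out is a random subset of the selection.
    rng.shuffle(by_model)
    return {
        "A_test": by_model[n_holdout:],               # selected, treated
        "B_selection_holdout": by_model[:n_holdout],  # selected, not treated
        "C_control": by_chance[n_holdout:],           # random, treated
        "D_random_holdout": by_chance[:n_holdout],    # random, not treated
    }

scores = {cid: cid / 100 for cid in range(100)}  # hypothetical scores
cells = assign_cells(scores, n_select=20, n_holdout=5)
print({name: len(members) for name, members in cells.items()})
```

Comparing A with B estimates the effect of the treatment on model-selected customers, while comparing A with C estimates what the model's selection adds over random targeting.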
  • 71. Process as Proxy “Good process serves you so you can serve customers. But if you’re not watchful, the process can become the proxy for the result you want. You stop looking at outcomes and just make sure you’re doing the process right. Gulp. It’s always worth asking, do we own the process or does the process own us?” – Jeff Bezos