Machine learning in action at Pipedrive
1. Machine Learning in Action
Andres Kull
Product Analyst @ Pipedrive
Machine Learning Estonia meetup, October 11, 2016
2. About me
• Pipedrive:
• product analyst, from Feb 2016
• Funderbeam:
• data scientist, 2013-2015
• Elvior:
• CEO, 1992 - 2012
• software development, test automation
• model based testing (PhD, 2009)
3. Topics
• About Pipedrive
• Predictive analytics in Pipedrive
• A closer look at one predictive model
• Thought process
• Tools and methods used
• Deployment
• Monitoring
5. Some facts about Pipedrive
• > 30k paying customers
• from 155 countries
• biggest markets are the US and Brazil
• > 220 employees
• ~ 180 in Estonia (Tallinn, Tartu)
• ~ 40 in New York
• 20 different nationalities
6. My tasks @Pipedrive
• Serve product teams with data insight
• Predictive analytics / towards sales AI
8. CRM AI opportunities
• Predictive lead scoring
• Predict deal outcomes
• likelihood to close
• estimated close date
• Recommend user actions
• type of next action
• email content
• next action date/time
• Teach users how to improve
9. Predictive analytics solutions at Pipedrive
• For users:
• Deals success prediction
• predicting open deals pipeline value
• provides means to adjust the selling process to meet the sales goals
• For marketing, sales and support:
• Churn prediction
• identifies customers who are about to churn
• provides a subscription health score
• Trial conversion prediction
• identifies inactive companies in trial
10. My R toolbox
• Storage access: RPostgreSQL, aws.s3
• Dataframe operations: dplyr, tidyr (Hadley Wickham)
• Machine learning: Boruta, randomForest, caret, ROCR, AUC
• Visual data exploration: ggplot2
• IDE: RStudio
11. R references
• All you need to do is follow Hadley Wickham:
• http://hadley.nz/
• @hadleywickham
• #rstats
… and you are in good hands
13. Business goal
• increase the trial users' conversion rate
• identify trial companies that need engagement triggers
14. Initial questions from business development
• what do converting trial users do differently from non-converting ones?
• which actions are mandatory during the trial period in order to convert?
• actions:
• add deal
• add activity/reminder to deals
• invite other users
• …
15. Actions of successful trial companies
Percentages of successful trial companies who have performed particular actions by the
7th, 14th, and 30th day
16. Actions split by successful and unsuccessful trials
• the percentage of companies that have
performed a particular action by day 7,
split into converting and
non-converting companies
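A chart like this can be sketched with dplyr and ggplot2 from the toolbox above; the data frame and column names here are hypothetical placeholders, not the actual Pipedrive schema.

```r
library(dplyr)
library(ggplot2)

# Hypothetical input: one row per company/action pair, with flags for
# whether the action was done by day 7 and whether the trial converted
actions <- data.frame(
  company_id   = 1:6,
  action       = rep(c("add deal", "add activity", "invite user"), 2),
  done_by_day7 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  converted    = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Share of companies that performed each action by day 7,
# split by converting vs non-converting
shares <- actions %>%
  group_by(action, converted) %>%
  summarise(pct_done = mean(done_by_day7) * 100)

ggplot(shares, aes(x = action, y = pct_done, fill = converted)) +
  geom_col(position = "dodge") +
  labs(x = "action performed by day 7", y = "% of companies")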
18. Decision tree model
resulting model:
IF activities_add < 5.5 THEN
  IF user_joined < 0.5 THEN success = FALSE
  ELSE IF user_joined < 1.5 THEN success = FALSE
  ELSE success = TRUE
ELSE
  IF user_joined < 0.5 THEN
    IF activities_add < 13.5 THEN success = FALSE
    ELSE IF activities_add >= 179 THEN success = FALSE
    ELSE success = TRUE
  ELSE success = TRUE
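A tree producing rules of this shape can be fit with the rpart package (rpart is an assumption here; the deck does not name the tree implementation, and the toy data below is invented for illustration):

```r
library(rpart)

# Hypothetical training frame: one row per trial company
train <- data.frame(
  activities_add = c(0, 2, 20, 50, 200, 8, 15, 3),
  user_joined    = c(0, 2, 1,  3,  0,   0, 2,  1),
  success        = factor(c(FALSE, TRUE, TRUE, TRUE,
                            FALSE, FALSE, TRUE, FALSE))
)

# Fit a classification tree; minsplit lowered only because the toy
# data set is tiny
tree <- rpart(success ~ activities_add + user_joined, data = train,
              method = "class",
              control = rpart.control(minsplit = 2))

print(tree)  # prints the learned IF/ELSE split rules
```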
19. ROC curve of decision tree model
• Area Under the ROC Curve (AUC) = 0.7
• (chart: true positive rate vs false positive rate)
20. Can we do any better?
• Sure!
• Better feature selection
• Better ML algorithm (random forest)
• Better model evaluation with cross-validation
21. Let’s revisit model prediction goals
• act before most users are done
• predict trial success using actions
from the first 5 days
22. Model development workflow
• Feature selection → model training → model evaluation → good enough? (yes: model deployment; no: iterate)
• Iteratively:
• remove less important features
• add some new features
• evaluate whether the added or removed features
increased or decreased model accuracy
• continue until satisfied
23. Feature selection
• Select all relevant features
• Let the ML algorithm do the work:
• filter out irrelevant features
• order features by importance
• (diagram: all relevant features I can imagine → selected features)
24. Filter out irrelevant features
• The R Boruta package was used:
• bor <- Boruta(y ~ ., data = train)
• bor <- TentativeRoughFix(bor) # resolves Tentative features
• bor$finalDecision # contains Confirmed / Rejected for each feature
• Only confirmed features are passed to the model training phase
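Putting the Boruta calls above together, a minimal end-to-end sketch (the training frame here is a synthetic placeholder: `y` depends on `x1` only, and `x2` is pure noise):

```r
library(Boruta)

set.seed(42)
# Placeholder training data: x1 is informative, x2 is irrelevant
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$y <- factor(train$x1 > 0)

bor <- Boruta(y ~ ., data = train)
bor <- TentativeRoughFix(bor)   # resolve any Tentative features

# Keep only the Confirmed features for model training
confirmed <- names(bor$finalDecision)[bor$finalDecision == "Confirmed"]
print(confirmed)
```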
26. Order features by importance
• A trained randomForest model object includes feature importances
• First train the model:
• rf_model <- randomForest(y ~ ., data = train)
• … and then access the relative feature importances:
• imp_var <- varImp(rf_model)$importance # varImp() comes from the caret package
28. Split data into training and test sets
• inTrain <- createDataPartition(y = all_ev[y][,1], p = 0.7, list = FALSE)
• train <- all_ev[inTrain, ]
• test <- all_ev[-inTrain, ]
• training set: 70% of companies
• hold-out test set: 30% of companies
• the R caret package's createDataPartition() function was used to split the data
29. Model training using 5-fold cross validation
rf_model <- train(y ~ ., data = train,
method = "rf",
trControl = trainControl(
method = "cv",
number = 5,
classProbs = TRUE, # required by caret for the ROC metric
summaryFunction = twoClassSummary
),
metric = "ROC",
tuneGrid = expand.grid(mtry = c(2, 4, 6, 8))
)
• R caret package train() and trainControl() functions do the job
30. Model evaluation
mtry <- rf_model$bestTune$mtry
train_auc <- rf_model$results$ROC[as.numeric(rownames(rf_model$bestTune))]
• model AUC on training data
• model AUC on test data
score <- predict(rf_model, newdata = test, type = "prob")
pred <- prediction(score[, 2], test[y])
test_auc <- performance(pred, "auc")@y.values[[1]] # extract the AUC value
• AUC on training data 0.82..0.88
• AUC on test data 0.83..0.88
• Benchmark (decision tree) AUC = 0.7
31. Daily training and prediction
• The model is retrained daily on a moving window of training data: companies that
started their 30-day trials between 2 months and 1 month before the training date
• Predictions are calculated daily for all companies in trial
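The moving training window can be sketched with dplyr; the table and column names below are assumptions, not the actual schema.

```r
library(dplyr)

# Hypothetical companies table with trial start dates
companies <- data.frame(
  company_id  = 1:4,
  trial_start = as.Date(c("2016-08-05", "2016-08-20",
                          "2016-09-15", "2016-10-01"))
)

training_date <- as.Date("2016-10-10")

# Companies whose 30-day trial started between 2 months and 1 month
# before the training date, i.e. trials whose outcome is already known
train_window <- companies %>%
  filter(trial_start >= training_date - 60,
         trial_start <  training_date - 30)
```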
32. Model and predictions traceability
• Use cases
• Monitoring processes
• Explaining prediction results
• Relevant data has to be saved for traceability
33. Model training traceability
• All model instances are saved in S3
• The following training metadata is saved in the DB:
• training timestamp
• model location
• mtry
• train_auc
• test_auc
• n_test
• n_train
• feature importances
• model training duration
34. Predictions traceability
• The following data is saved
• prediction timestamp
• model id
• company id
• predicted trial success likelihood
• feature values used in prediction