Featurizing log data
before XGBoost
Xavier Conort
Thursday, August 20, 2015
The competition host
● XuetangX, a Chinese MOOC learning platform initiated
by Tsinghua University
● launched online on Oct 10th, 2013
● more than 100 Chinese courses and over 260
international courses
● high dropout rate
Problem to solve
● challenge: predict whether a user will drop a course
within the next 10 days based on his or her prior activities.
● data:
○ enrollment_train (120K rows) / enrollment_test (80K rows):
■ Columns: enrollment_id, username, course_id
○ log_train / log_test
■ Columns: enrollment_id, time, source, event, object
○ object
■ Columns: course_id, module_id, category, children, start
○ truth_train
■ Columns: enrollment_id, dropped_out
Log data
5890 objects
Team
Chief Product Officer Chief Data Scientist
Data Scientist Data Scientist
(O. Zhang)
How we worked as a Team
● worked separately on feature engineering. 90% of
our time was spent here.
● delegated the modeling part to DataRobot to:
○ find the best algorithm (with XGBoost as the winner!)
○ model text features
○ tune hyperparameters
○ experiment with different feature sets and blend 8 XGBoost
models built on different sets
○ communicate results
Feature engineering techniques used
● counts
● time statistics (min, mean, max, diff)
● entropy
● sequences treated as text on which we ran
○ SVD on 3grams
○ DataRobot text mining solution
● first 20 components of SVD on user x object
NB: we removed duplicated log entries and used the training + test
sets together to build most features
How to build efficient features in R
Key course features
● course_id
● first log time
● enrollment counts
● unique log counts
● mean time interval
Key enrollment count features
● log counts
● unique log counts
● ratio of unique log counts to log counts
● unique log counts by event (nagivate, access,
problem, video, page_close, discussion, wiki)
● unique log counts before end of course (5 days, 10
days and 30 days before)
● sequence number of enrollment in that course
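The count features above can be sketched in a few lines. This is only an illustration in Python (the talk built its features in R); the toy rows, the function name `enrollment_count_features` and the feature names are assumptions, and "unique" is taken here to mean "after dropping exact duplicate log lines".

```python
from collections import Counter, defaultdict

# Toy log rows: (enrollment_id, time, source, event, object). Values are made up.
logs = [
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),
    (1, "2014-06-01T10:00:00", "browser", "access", "obj_a"),  # exact duplicate
    (1, "2014-06-01T10:05:00", "browser", "video", "obj_b"),
    (1, "2014-06-02T09:00:00", "server", "problem", "obj_c"),
    (2, "2014-06-03T12:00:00", "server", "nagivate", "obj_a"),
]


def enrollment_count_features(rows):
    """Return {enrollment_id: features} with raw, unique and per-event counts."""
    raw = Counter(r[0] for r in rows)
    unique_rows = set(rows)  # assumption: "unique" = exact duplicates removed
    unique = Counter(r[0] for r in unique_rows)
    by_event = defaultdict(Counter)
    for r in unique_rows:
        by_event[r[0]][r[3]] += 1

    feats = {}
    for eid in raw:
        f = {
            "log_count": raw[eid],
            "unique_log_count": unique[eid],
            "unique_ratio": unique[eid] / raw[eid],
        }
        # One column per event type seen for this enrollment.
        for event, n in by_event[eid].items():
            f["unique_" + event + "_count"] = n
        feats[eid] = f
    return feats


features = enrollment_count_features(logs)
```

Event types that never occur for an enrollment are simply absent from its dictionary here; a real feature matrix would fill them with 0.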
Key enrollment time stats
● log time stats (min, mean, max)
● gap between first and last log of enrollment
● gap between enrollment first log and course first log
● gap between enrollment last log and course last log
● difference between mean log time and mid point
between first and last log
● log interval stats (mean, 90th, 99th and 100th percentiles)
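A rough Python sketch of some of these time statistics for a single enrollment follows (the talk used R). The function name `time_stats`, the feature names, the use of seconds as the unit and the nearest-rank percentile rule are all assumptions made for illustration.

```python
import math
from datetime import datetime


def nearest_rank(sorted_values, q):
    """Nearest-rank quantile of an already sorted list (0 < q <= 1)."""
    if not sorted_values:
        return 0.0
    k = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[k - 1]


def time_stats(timestamps):
    """Time features for one enrollment, in seconds relative to its first log."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    offsets = [(t - times[0]).total_seconds() for t in times]
    gap = offsets[-1]                       # first log to last log
    mean_offset = sum(offsets) / len(offsets)
    intervals = sorted(b - a for a, b in zip(offsets, offsets[1:]))
    return {
        "gap_first_last": gap,
        # Negative when activity is concentrated early in the enrollment.
        "mean_minus_midpoint": mean_offset - gap / 2,
        "interval_mean": sum(intervals) / len(intervals) if intervals else 0.0,
        "interval_p90": nearest_rank(intervals, 0.90),
        "interval_p99": nearest_rank(intervals, 0.99),
        "interval_max": intervals[-1] if intervals else 0.0,
    }


stats = time_stats([
    "2014-06-01T10:00:00",
    "2014-06-01T10:10:00",
    "2014-06-01T11:00:00",
])
```

The gaps to the course's first and last logs can be computed the same way once per-course minimum and maximum log times are available.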
Enrollment entropy features
enrollment entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
● hours of the day for the last 1/3/7 days before the last
log
● object (when event == problem)
● chapter ids
Example of entropy feature
for each weekday of an enrollment:
- log(weekday_log_count / enrollment_log_count) *
(weekday_log_count / enrollment_log_count)
Sum over weekdays => weekday_entropy[enrollment_id==1] = 1.589988
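The formula above is the Shannon entropy of the distribution of an enrollment's logs over weekdays. A minimal Python sketch is given below (the talk used R); the natural logarithm is an assumption, since the slide does not state the base, and the toy weekday labels are made up rather than taken from the competition data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Weekday of each log line for one hypothetical enrollment.
weekday_entropy = entropy(["Mon", "Mon", "Tue", "Wed"])
```

The value is 0 when every log falls on the same weekday and reaches log(7) when logs are spread evenly over all seven days, so it measures how regular a learner's schedule is. The same function applies to days, hours, objects or chapter ids.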
Enrollment sequence features
● for each enrollment_id, built sequences of
○ weekdays
○ objects
■ all objects / 'problem' and 'video' objects only
○ events
● treated sequences as 4 text variables. Ran for each
○ svd on 3 grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
& Nystroem SVM on (tuned) n-grams
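To illustrate the "SVD on 3-grams" step, here is a small Python sketch using NumPy (the talk used R, and the DataRobot text models are not reproduced). The function names, the toy event sequences and the choice of raw 3-gram counts without any weighting are assumptions.

```python
from collections import Counter

import numpy as np


def ngram_counts(tokens, n=3):
    """Count the n-grams (as tuples) in one token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def svd_components(sequences, n=3, k=10):
    """Project each sequence's n-gram count vector onto the first k SVD components."""
    counts = [ngram_counts(seq, n) for seq in sequences]
    vocab = sorted(set().union(*counts))
    index = {gram: j for j, gram in enumerate(vocab)}
    X = np.zeros((len(sequences), len(vocab)))
    for i, c in enumerate(counts):
        for gram, v in c.items():
            X[i, index[gram]] = v
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(s))
    return U[:, :k] * s[:k]  # one row of k features per sequence


# One made-up event sequence per enrollment.
event_sequences = [
    ["access", "video", "problem", "access", "video"],
    ["access", "video", "problem", "access", "video"],
    ["nagivate", "access", "page_close", "nagivate", "access"],
]
components = svd_components(event_sequences, n=3, k=2)
```

A dense SVD is fine for a toy matrix; with a real n-gram vocabulary a sparse, truncated SVD would be the more practical choice.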
Extract of enrollment object sequences
1/2-grams from Object sequences
DataRobot on Object 1-2 grams
Key user count features and time stats
● enrollment count
● binary indicator whether user signed up for each of
the 38 courses
● unique log count
● mean log time interval
● sequence number of enrollment for that user
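Some of the user-level features above can be sketched as follows in Python (the talk used R). The toy enrollments, the function name `user_features` and the feature names are assumptions; in particular, ordering a user's enrollments by first log time to get the sequence number is a guess, since the slide does not say how the order was defined.

```python
from collections import defaultdict

# Toy rows: (enrollment_id, username, course_id), plus first log time per enrollment.
enrollments = [(1, "u1", "c1"), (2, "u1", "c2"), (3, "u2", "c1")]
first_log = {
    1: "2014-06-02T00:00:00",
    2: "2014-06-01T00:00:00",
    3: "2014-06-05T00:00:00",
}


def user_features(enrollments, first_log):
    """Attach user-level counts and course indicators to each enrollment."""
    courses = sorted({course for _, _, course in enrollments})
    by_user = defaultdict(list)
    for eid, user, course in enrollments:
        by_user[user].append((first_log[eid], eid, course))

    feats = {}
    for items in by_user.values():
        items.sort()  # ISO timestamps sort chronologically as strings
        signed_up = {course for _, _, course in items}
        indicators = {"signed_up_" + c: int(c in signed_up) for c in courses}
        for rank, (_, eid, _) in enumerate(items, start=1):
            feats[eid] = {
                "user_enrollment_count": len(items),
                "user_enrollment_seq": rank,  # assumption: ordered by first log
                **indicators,
            }
    return feats


user_feats = user_features(enrollments, first_log)
```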
User entropy features
user entropy over
● days
● weekdays
● fraction (4) of weekdays
● hours of the day
User sequence features
● for each user, built sequences of
○ weekdays
○ chapter_ids
○ events
● treated them as 3 text variables. Ran
○ SVD on 3 grams => first 10 components
○ DataRobot stacked predictions from logistic regr.
+ Nystroem SVM on (tuned) n-grams
How we got to the TOP3
● entropy features mentioned before
● exploited info in
○ log count in the 5 / 10 / 20 days after end of course
○ log counts by event, sign_up counts and day entropy in the next
10 days after end of course
○ time to sign up for new course
○ time until the next log for same user
added ~0.001 to AUC (vs less powerful features)
added ~0.002 to AUC
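As an illustration of the "time until the next log for same user" feature, here is a hedged Python sketch (the talk used R). The input layout, the function name `seconds_until_next_user_log` and the choice to measure from each enrollment's last log are assumptions; the slide does not spell out the exact definition.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def seconds_until_next_user_log(logs):
    """For each enrollment, seconds from its last log to the same user's next log
    in any course, or None if the user has no later log."""
    last, user_of = {}, {}
    times = defaultdict(list)
    for eid, user, t in logs:
        t = datetime.fromisoformat(t)
        last[eid] = max(last.get(eid, t), t)
        user_of[eid] = user
        times[user].append(t)
    for ts in times.values():
        ts.sort()

    out = {}
    for eid, t_last in last.items():
        ts = times[user_of[eid]]
        i = bisect_right(ts, t_last)  # first log strictly after t_last
        out[eid] = (ts[i] - t_last).total_seconds() if i < len(ts) else None
    return out


# Toy rows: (enrollment_id, username, time). Values are made up.
gaps = seconds_until_next_user_log([
    (1, "u1", "2014-06-01T10:00:00"),
    (1, "u1", "2014-06-02T10:00:00"),
    (2, "u1", "2014-06-03T10:00:00"),
    (3, "u2", "2014-06-01T00:00:00"),
])
```

Features like this rely on a user's activity in other courses after the period being predicted, which is available in a competition dataset but usually not when scoring live users.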
XGBoost
Thank you!

2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

Featurizing log data before XGBoost

  • 1. Featurizing log data before XGBoost Xavier Conort Thursday, August 20, 2015 @
  • 2. The competition host
    ● XuetangX, a Chinese MOOC learning platform initiated by Tsinghua University
    ● launched online on Oct 10th, 2013
    ● more than 100 Chinese courses and over 260 international courses
    ● high dropout rate
  • 3. Problem to solve
    ● challenge: predict whether a user will drop a course within the next 10 days, based on his or her prior activities
    ● data:
      ○ enrollment_train (120K rows) / enrollment_test (80K rows)
        ■ Columns: enrollment_id, username, course_id
      ○ log_train / log_test
        ■ Columns: enrollment_id, time, source, event, object
      ○ object
        ■ Columns: course_id, module_id, category, children, start
      ○ truth_train
        ■ Columns: enrollment_id, dropped_out
  • 5. Team Chief Product Officer Chief Data Scientist Data Scientist Data Scientist (O. Zhang)
  • 6. How we worked as a Team
    ● worked separately on feature engineering; 90% of our time was spent here
    ● delegated the modeling part to DataRobot to:
      ○ find the best algorithm (with XGBoost as the winner!)
      ○ model text features
      ○ tune hyperparameters
      ○ experiment with different feature sets and blend 8 XGBoost models built on different sets
      ○ communicate results
  • 7. Feature engineering techniques used
    ● counts
    ● time statistics (min, mean, max, diff)
    ● entropy
    ● sequences treated as text on which we ran
      ○ SVD on 3-grams
      ○ DataRobot text mining solution
    ● first 20 components of SVD on user x object
    NB: removed duplicated log info and used training + test sets to build most features
  • 8. How to build efficient features in R
  • 9. Key course features
    ● course_id
    ● first log time
    ● enrollment counts
    ● unique log counts
    ● mean time interval
  • 10. Key enrollment count features
    ● log counts
    ● unique log counts
    ● ratio of unique log counts to log counts
    ● unique log counts by event (navigate, access, problem, video, page_close, discussion, wiki)
    ● unique log counts before end of course (5 days, 10 days and 30 days before)
    ● sequence number of enrollment in that course
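The count features on slide 10 can be sketched as follows. This is a minimal illustration in plain Python, not the team's R code. The sample rows are made up, and treating a "unique log" as a distinct (enrollment_id, time, source, event, object) row is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (enrollment_id, time, source, event, object)
logs = [
    (1, "2014-06-01T10:00:00", "browser", "access", "a"),
    (1, "2014-06-01T10:00:00", "browser", "access", "a"),  # exact duplicate row
    (1, "2014-06-01T10:05:00", "browser", "video", "b"),
    (2, "2014-06-02T09:00:00", "server", "problem", "c"),
]

def count_features(rows):
    """Per enrollment: log count, unique log count, their ratio, unique counts by event."""
    rows_by_enrollment = defaultdict(list)
    for row in rows:
        rows_by_enrollment[row[0]].append(row)

    features = {}
    for enrollment_id, enrollment_rows in rows_by_enrollment.items():
        # Assumption: "unique" means distinct full rows.
        unique_rows = set(enrollment_rows)
        features[enrollment_id] = {
            "log_count": len(enrollment_rows),
            "unique_log_count": len(unique_rows),
            "unique_ratio": len(unique_rows) / len(enrollment_rows),
            "unique_by_event": dict(Counter(r[3] for r in unique_rows)),
        }
    return features

features = count_features(logs)
```

The "before end of course" variants would apply the same function to rows filtered by a time window.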
  • 11. Key enrollment time stats
    ● log time stats (min, mean, max)
    ● gap between first and last log of enrollment
    ● gap between enrollment first log and course first log
    ● gap between enrollment last log and course last log
    ● difference between mean log time and midpoint between first and last log
    ● log interval stats (mean, 90th, 99th and 100th percentiles)
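A rough sketch of the per-enrollment time statistics on slide 11, again in plain Python rather than R. The ISO timestamp format and the linear-interpolation percentile are assumptions; the slides do not say how percentiles were computed.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def quantile(sorted_values, q):
    """Linear-interpolation quantile of a non-empty sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def time_stats(timestamps):
    """Time features for one enrollment, all expressed in seconds."""
    times = sorted((datetime.fromisoformat(t) - EPOCH).total_seconds() for t in timestamps)
    first, last = times[0], times[-1]
    mean_time = sum(times) / len(times)
    stats = {
        "first": first,
        "last": last,
        "mean": mean_time,
        "span": last - first,  # gap between first and last log
        "mean_minus_midpoint": mean_time - (first + last) / 2,
    }
    # Gaps between consecutive logs; empty when there is a single log.
    intervals = sorted(b - a for a, b in zip(times, times[1:]))
    if intervals:
        stats["interval_mean"] = sum(intervals) / len(intervals)
        for pct in (90, 99, 100):
            stats[f"interval_p{pct}"] = quantile(intervals, pct / 100)
    return stats

stats = time_stats([
    "2014-06-01T00:00:00",
    "2014-06-01T00:10:00",
    "2014-06-01T01:00:00",
])
```

The gaps relative to the course's first and last log follow by subtracting the course-level minimum and maximum from `first` and `last`.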
  • 12. Enrollment entropy features
    enrollment entropy over
    ● days
    ● weekdays
    ● fraction (4) of weekdays
    ● hours of the day
    ● hours of the day for the last 1/3/7 days before last logs
    ● object (when event == problem)
    ● chapter ids
  • 13. Example of entropy feature
    Per weekday: - log(weekday_log_count / enrollment_log_count) * weekday_log_count / enrollment_log_count
    Sum over weekdays => weekday_entropy[enrollment_id==1] = 1.589988
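The weekday entropy on slide 13 is the Shannon entropy of an enrollment's log counts across weekdays. A minimal sketch, assuming the natural logarithm (the slide does not state the base) and ISO-formatted timestamps:

```python
import math
from collections import Counter
from datetime import datetime

def entropy(values):
    """Shannon entropy (natural log) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def weekday_entropy(timestamps):
    """Entropy of one enrollment's logs over weekdays (0 = Monday ... 6 = Sunday)."""
    return entropy(datetime.fromisoformat(t).weekday() for t in timestamps)

# Hypothetical enrollment with one log on each of four different weekdays.
example = weekday_entropy([
    "2014-06-02T10:00:00",
    "2014-06-03T10:00:00",
    "2014-06-04T10:00:00",
    "2014-06-05T10:00:00",
])
```

An enrollment whose logs all fall on one weekday gets entropy 0, while logs spread evenly over k weekdays give log(k). The same `entropy` helper applies to days, hours, objects or chapter ids by changing the values passed in.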
  • 14. Enrollment sequence features
    ● for each enrollment_id, built sequences of
      ○ weekdays
      ○ objects
        ■ all objects / 'problem' and 'video' objects only
      ○ events
    ● treated sequences as 4 text variables. Ran for each
      ○ SVD on 3-grams => first 10 components
      ○ DataRobot stacked predictions from logistic regression & Nystroem SVM on (tuned) n-grams
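Treating a sequence as text starts with counting n-grams. The sketch below builds 3-gram counts per enrollment and a dense count matrix in plain Python. The event sequences are invented, and the SVD step itself is not shown; it would be applied to this matrix with a linear-algebra library.

```python
from collections import Counter

def ngram_counts(sequence, n=3):
    """Count contiguous n-grams in a sequence of tokens."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))

def count_matrix(sequences, n=3):
    """Return (vocabulary, matrix) with one row per sequence and one column per n-gram."""
    per_sequence = [ngram_counts(seq, n) for seq in sequences]
    vocabulary = sorted(set().union(*per_sequence))
    matrix = [[counts[gram] for gram in vocabulary] for counts in per_sequence]
    return vocabulary, matrix

# Hypothetical event sequences for two enrollments.
event_sequences = [
    ["access", "video", "access", "video", "problem"],
    ["access", "video", "access"],
]
vocabulary, matrix = count_matrix(event_sequences)
```

Sequences shorter than n simply produce an all-zero row.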
  • 15. Extract of enrollment object sequences
  • 17. DataRobot on Object 1-2 grams
  • 18. Key user count features and time stats
    ● enrollment count
    ● binary indicator whether user signed up for each of the 38 courses
    ● unique log count
    ● mean log time interval
    ● sequence number of enrollment for that user
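A small sketch of three of the user-level features on slide 18: enrollment count, per-course binary indicators, and the sequence number of each enrollment for its user. The rows are made up, and ordering a user's enrollments by enrollment_id is an assumption; the slides do not say which ordering was used.

```python
from collections import defaultdict

# Hypothetical enrollment rows: (enrollment_id, username, course_id)
enrollments = [
    (1, "u1", "c1"),
    (2, "u1", "c2"),
    (3, "u2", "c1"),
]

def user_features(rows, all_courses):
    """Per user: enrollment count, course sign-up flags, rank of each enrollment."""
    by_user = defaultdict(list)
    for enrollment_id, username, course_id in rows:
        by_user[username].append((enrollment_id, course_id))

    features = {}
    for username, items in by_user.items():
        items.sort()  # assumption: order enrollments by enrollment_id
        courses = {course_id for _, course_id in items}
        features[username] = {
            "enrollment_count": len(items),
            "course_flags": [int(c in courses) for c in all_courses],
            "enrollment_rank": {eid: rank for rank, (eid, _) in enumerate(items, start=1)},
        }
    return features

users = user_features(enrollments, all_courses=["c1", "c2"])
```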
  • 19. User entropy features
    user entropy over
    ● days
    ● weekdays
    ● fraction (4) of weekdays
    ● hours of the day
  • 20. User sequence features
    ● for each user, built sequences of
      ○ weekdays
      ○ chapter_ids
      ○ events
    ● treated them as 3 text variables. Ran
      ○ SVD on 3-grams => first 10 components
      ○ DataRobot stacked predictions from logistic regression + Nystroem SVM on (tuned) n-grams
  • 21. How we got to the TOP3
    ● entropy features mentioned before
    ● exploited info in
      ○ log count in the 5 / 10 / 20 days after end of course
      ○ log counts by event, sign_up counts and day entropy in the next 10 days after end of course
      ○ time to sign up for new course
      ○ time until the next log for same user
    Slide annotations: added ~0.001 to AUC (vs less powerful features); added ~0.002 to AUC
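Two of the features on slide 21 can be sketched with plain Python on numeric timestamps (seconds). This is one possible reading: "time until the next log for same user" is taken here as the gap between an enrollment's last log and the user's next log in any other enrollment, and the function names are hypothetical.

```python
from bisect import bisect_right

DAY = 86400  # seconds

def logs_after(course_end, user_log_times, days):
    """Number of the user's logs in the `days` days following the end of the course."""
    return sum(1 for t in user_log_times if course_end < t <= course_end + days * DAY)

def time_to_next_user_log(enrollment_last_log, other_log_times):
    """Seconds until the user's next log elsewhere, or None if there is none."""
    times = sorted(other_log_times)
    i = bisect_right(times, enrollment_last_log)
    return times[i] - enrollment_last_log if i < len(times) else None

# Hypothetical user: the course ends at t=0 and the user logs elsewhere afterwards.
other_logs = [-3 * DAY, 2 * DAY, 7 * DAY, 15 * DAY]
counts = {d: logs_after(0, other_logs, d) for d in (5, 10, 20)}
gap = time_to_next_user_log(-1 * DAY, other_logs)
```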