Larry will discuss what data science means in general, and more specifically at Udemy. He will describe some key data science frameworks and what it means for them to be agile. He will also discuss what it would ideally mean to be a data scientist at Udemy.
Data Science at Udemy
1. Ankara Big Data Meetup - Bilkent Cyberpark, August 5, 2015
Larry Wai
Principal Data Scientist @Udemy
Overview of talk
● What is data science?
● Udemy in a nutshell
● Data science projects at Udemy
● Data science work cycle
● What does it mean to be a data scientist?
What is data science?
data science in consumer internet = application of the scientific method using big data computational methods to
ascertain, predict, and utilize user behavior for business purposes
Inherits from three historical schools of thought
1. Research of natural phenomena using the scientific method
○ e.g. physics, astronomy
○ data science arises from substituting the study of natural phenomena with study of user behavior
2. Research of computational methods
○ e.g. mathematics, computer science
○ data science arises from pushing the limits of existing methods to compute that which could not be
computed before
3. Research of human behavior
○ e.g. economics, psychology
○ data science arises from applying big data to the study of microscopic human behavior, i.e. millions of
users x thousands of items = billions of user-item calculations
Other definitions (too general IMO):
● data science > statistics (only); stats does not require engineering skills
● data science > computer science (only); engineering does not require training in the scientific method
● data science > business analytics (only); analytics does not require engineering skills nor training in the scientific
method
Udemy in a nutshell
● consumer online education marketplace
● instructors get 50% of enrollment fee
● no certification requirements
● typical enrollment price point (paid) is $20-$40
● get to critical mass (instructors and students)
in each language through marketing
● above critical mass, leverage marketplace
(organic) driven growth
● Udemy currently has ~7 million students, ~30
thousand courses
● relevance of search and recommendations is
key to fostering growth
● learning goal data science is key to fostering
long term growth
[Chart: Google search trends for selected online education companies]
● Udemy (blue). Exponential marketplace growth.
● Coursera (yellow), Udacity (red), Lynda (green). Incremental growth.
● note: this chart convinced me to join Udemy :)
Udemy web site
Data science projects at Udemy
search & recommendation
● real time recommendation (web, mobile)
● real time search
● batch e-mail recommendation
learning goals
● course learning process optimization
● learning goal paths
● career learning goals
+ more projects
Search and recommendation (in experiment)
Feature classes
● course historical averages
● personal historical behavior
● search term matching
Overall ranking strategy
● compute global score per visitor per
course per day
● consider modules as filters on the total
available inventory
● the module score will be the sum of the
global course scores for the top N
courses in the module
● individual courses are ranked within
each module according to the global
course score
[Figure: courses 1-12 arranged in a grid and grouped into modules A, B, and C]
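The ranking strategy above can be sketched in a few lines. This is only an illustration under stated assumptions, not Udemy's implementation: the function name `rank_modules`, the score values, and the module contents are hypothetical, and the global score is assumed to be already computed per visitor per course per day.

```python
from typing import Dict, List, Tuple


def rank_modules(
    global_scores: Dict[str, float],
    modules: Dict[str, List[str]],
    top_n: int = 3,
) -> List[Tuple[str, float, List[str]]]:
    """Rank modules by the sum of their top-N global course scores.

    Each module is treated as a filter on the course inventory; courses
    inside a module are ordered by the same global score.
    """
    ranked = []
    for name, course_ids in modules.items():
        # Order the module's courses by global score, best first.
        ordered = sorted(course_ids, key=lambda c: global_scores[c], reverse=True)
        # Module score = sum of the global scores of its top N courses.
        module_score = sum(global_scores[c] for c in ordered[:top_n])
        ranked.append((name, module_score, ordered))
    # Highest-scoring module first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical global scores for one visitor on one day.
scores = {"course 1": 0.9, "course 2": 0.2, "course 3": 0.5,
          "course 4": 0.7, "course 5": 0.1}
# Hypothetical modules; a course may appear in more than one module.
modules = {
    "module A": ["course 1", "course 2", "course 5"],
    "module B": ["course 3", "course 4", "course 1"],
}
result = rank_modules(scores, modules, top_n=2)
for name, module_score, ordered in result:
    print(name, round(module_score, 2), ordered)
```

Because every module reuses the same global score, adding a new module only requires defining its filter; no module-specific ranking logic is needed.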
Learning goals (conceptual stage)
Course learning goal clustering
● goals are hierarchical
● goals are linked
● goals are dynamic
Overall learning goal strategy
● continuously update learning goal
clustering
● quantify and evaluate student progress
towards learning goals
● identify learning goal paths according
to desired careers or hobbies
[Figure: goals 1-6 linked to course A and course B]
Data science work cycle
[Figure: data science work cycle]
● stages: exploratory analysis, model building, model deployment, experiment setup, data collection
● ideal cycling time is ~days to ~weeks
Exploratory analysis
● data to be explored can in general be defined
as a multi-dimensional cube, a.k.a.
“hypercube”, where each side of the
hypercube is an exploratory “dimension” and
the “measures” of the user behavior are
aggregates in each cell
● the hypercube is the minimal representation
required for the exploratory analysis; e.g. we
minimize cardinality for continuous variables
● the human mind is unable to easily
comprehend more than 3 dimensions,
therefore exploratory analysis must be broken
down into actions which project the entire
hypercube onto different dimensions in
sequence
● goal for the analyst is to understand the multi-
dimensional user behavior, which may take
many projections in sequence (~100)
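A rough sketch of the hypercube idea, assuming an additive measure (a count or a sum) so that projecting onto fewer dimensions is just summing out the others. The dimension names (`country`, `device`, `category`), the measure `enrollments`, and the data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[str, ...]


def build_hypercube(rows: List[dict], dimensions: Sequence[str], measure: str) -> Dict[Cell, int]:
    """Aggregate granular rows into cells keyed by one value per dimension."""
    cube: Dict[Cell, int] = defaultdict(int)
    for row in rows:
        cube[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(cube)


def project(cube: Dict[Cell, int], dimensions: Sequence[str], keep: Sequence[str]) -> Dict[Cell, int]:
    """Project the cube onto a subset of dimensions by summing out the rest."""
    positions = [list(dimensions).index(d) for d in keep]
    projected: Dict[Cell, int] = defaultdict(int)
    for cell, value in cube.items():
        projected[tuple(cell[i] for i in positions)] += value
    return dict(projected)


DIMENSIONS = ("country", "device", "category")
rows = [
    {"country": "TR", "device": "web", "category": "programming", "enrollments": 3},
    {"country": "TR", "device": "mobile", "category": "programming", "enrollments": 1},
    {"country": "US", "device": "web", "category": "design", "enrollments": 2},
    {"country": "US", "device": "web", "category": "programming", "enrollments": 4},
    {"country": "TR", "device": "web", "category": "design", "enrollments": 5},
]

cube = build_hypercube(rows, DIMENSIONS, "enrollments")
# Two of the many projections an analyst might look at in sequence.
by_country = project(cube, DIMENSIONS, ["country"])
by_country_device = project(cube, DIMENSIONS, ["country", "device"])
print(by_country)
print(by_country_device)
```

Non-additive measures such as conversion rates would need their numerator and denominator stored as separate additive measures and divided only after projection.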
Model building
● platforms such as R allow us to leverage open
source modeling packages and compare
models with relatively low overhead
● most user behavior features are non-linear
and correlated; thus, the simplest “black box”
non-linear models which handle correlations
are practical to use, e.g. decision trees
● use residuals on a holdout set to validate the model
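Holdout validation with residuals can be illustrated with a toy example. The slides mention R and decision trees; the sketch below instead uses plain Python and a one-split regression tree (a "stump") as a stand-in, on synthetic data with a deterministic noise term. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, mean below, mean at or above)


def fit_stump(xs: List[float], ys: List[float]) -> Stump:
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct x except the smallest,
    # so both sides of the split are always non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict(model: Stump, x: float) -> float:
    threshold, left_mean, right_mean = model
    return left_mean if x < threshold else right_mean


# Synthetic non-linear data: a step at x = 10 plus small deterministic noise.
xs = list(range(20))
ys = [(1.0 if x < 10 else 5.0) + 0.1 * ((x % 3) - 1) for x in xs]

# Even indices train the model; odd indices are held out for validation.
train_x, train_y = xs[0::2], ys[0::2]
hold_x, hold_y = xs[1::2], ys[1::2]

model = fit_stump(train_x, train_y)
residuals = [y - predict(model, x) for x, y in zip(hold_x, hold_y)]
rmse = sqrt(mean(r ** 2 for r in residuals))
print("threshold:", model[0])
print("holdout RMSE:", round(rmse, 3))
```

Residuals that are small and show no pattern against the features on the holdout set suggest the model generalizes; a large gap between training and holdout error would point to overfitting.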
Model deployment
● the standardized Predictive Model Markup Language (PMML) allows abstraction of models in deployment
● “plug-in” model deployment is agile because no new production code is needed for model updates
● shifts focus of algo development from production code development to data mining methods
● this approach allows a single person to build and deploy models quickly
● this approach is cutting edge and is being tested now at Udemy
[Figure: model pipeline]
● model building: create training dataset; create predictive model, e.g. decision trees, random forest; offline analysis (residuals, feature importance)
● model storage: predictive model store (PMML format)
● model deployment: in-memory model; load on initialization; periodic updates
● model scoring: loop through courses, compute feature vector per course; compute score per course; sort by score
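The "plug-in" idea can be sketched as follows. To stay self-contained, this example does not parse real PMML: a JSON-encoded decision tree stands in for the PMML document, and the class `Scorer`, the feature names, and the course data are hypothetical. The point it illustrates is that the scoring code stays fixed while the stored model is swapped.

```python
import json
from typing import Dict, List, Set, Tuple

# Hypothetical serialized decision tree; JSON stands in for a PMML document.
MODEL_STORE = json.dumps({
    "feature": "avg_rating", "threshold": 4.0,
    "left": {"score": 0.1},
    "right": {"feature": "visitor_viewed_topic", "threshold": 0.5,
              "left": {"score": 0.4}, "right": {"score": 0.9}},
})


class Scorer:
    """Holds the model in memory; production code never hard-codes the model."""

    def __init__(self, store: str) -> None:
        self.reload(store)  # load on initialization

    def reload(self, store: str) -> None:
        self.tree = json.loads(store)  # periodic updates call this again

    def score(self, features: Dict[str, float]) -> float:
        node = self.tree
        # Walk the tree until a leaf (a node carrying a score) is reached.
        while "score" not in node:
            goes_left = features[node["feature"]] < node["threshold"]
            node = node["left"] if goes_left else node["right"]
        return node["score"]


def rank_courses(scorer: Scorer, courses: List[dict], visitor_topics: Set[str]) -> List[Tuple[str, float]]:
    scored = []
    for course in courses:  # loop through courses
        features = {        # compute feature vector per course
            "avg_rating": course["avg_rating"],
            "visitor_viewed_topic": 1.0 if course["topic"] in visitor_topics else 0.0,
        }
        scored.append((course["id"], scorer.score(features)))  # score per course
    return sorted(scored, key=lambda pair: pair[1], reverse=True)  # sort by score


courses = [
    {"id": "A", "avg_rating": 4.5, "topic": "python"},
    {"id": "B", "avg_rating": 3.0, "topic": "python"},
    {"id": "C", "avg_rating": 4.8, "topic": "design"},
]
scorer = Scorer(MODEL_STORE)
ranked = rank_courses(scorer, courses, {"python"})

# A model update is a data change only: swap the stored document and reload.
scorer.reload(json.dumps({
    "feature": "avg_rating", "threshold": 4.6,
    "left": {"score": 0.2}, "right": {"score": 0.8},
}))
ranked_after_update = rank_courses(scorer, courses, {"python"})
print(ranked)
print(ranked_after_update)
```

In a real deployment the JSON parsing and tree walk would be replaced by a PMML evaluator, but the shape of the scoring loop would stay the same.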
Experiment setup
Practical requirements for experiments, a.k.a. A/B tests
● need enough users to measure an interesting
effect
● conversely, if an effect is not large enough to
measure, then it is not interesting, at least from a
data science point of view, and potentially from a
business point of view
● e.g. an interesting effect from a business point of
view would be +5% relative lift of conversion rate
● to detect a +5% relative lift at a 95% confidence level
(on, say, a typical 1 conversion per 10 sessions),
need about 30,000 sessions in each of the A and B
samples, i.e. >60,000 sessions
● ideally, would like to measure lift within ~days; so
need >60,000 sessions per day
● Udemy currently has >200,000 sessions per day
(but 2 years ago it was more like 20,000 sessions
per day, so 10x slower to run experiments)
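The figure of about 30,000 sessions per sample is consistent with a standard two-sample normal approximation for proportions. The sketch below assumes, as one reading not stated on the slide, a two-sided test at 95% confidence with 50% power (the sample size at which a true +5% lift reaches significance about half the time). The function name and the `power` parameter are hypothetical.

```python
import math
from statistics import NormalDist


def sessions_per_arm(base_rate: float, relative_lift: float,
                     confidence: float = 0.95, power: float = 0.5) -> int:
    """Approximate sessions needed in each of A and B.

    Uses the normal approximation for the difference of two proportions,
    with the baseline variance p * (1 - p) applied to both samples.
    """
    delta = base_rate * relative_lift                 # absolute lift to detect
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)              # 0 at 50% power
    variance = 2 * base_rate * (1 - base_rate)        # variance of A minus B
    return math.ceil(variance * (z_alpha + z_beta) ** 2 / delta ** 2)


# 1 conversion per 10 sessions, +5% relative lift (10.0% -> 10.5%).
n_half_power = sessions_per_arm(0.10, 0.05)
n_80_power = sessions_per_arm(0.10, 0.05, power=0.8)
print(n_half_power, n_80_power)
```

Under these assumptions the first call gives roughly 27,700 sessions per sample, in line with the rounded 30,000 above. Requiring the more conventional 80% power roughly doubles that to about 56,500 per sample.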
1. smoke test (~few days)
○ 1% for test variant(s)
○ verify that nothing is broken
○ 40% CONTROL_1, 40% CONTROL_2
○ validate that control is set up correctly
2. initial ramp (~1 week)
○ 5-10% for test variant(s)
○ sizing depends upon whether we’ve tested
something like this before, and any
revenue concerns
3. intermediate ramp (~few weeks)
○ 25%-50% for test variant
○ 40%-50% for CONTROL_1
4. final ramp / launch
○ 90% for test variant
○ 10% for CONTROL_1 (optional); turn off
after a few weeks of monitoring
○ rename “test” as new baseline
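One common way to implement traffic splits like the ramp stages above is deterministic hashing of a visitor ID, so that a visitor always sees the same variant. The slides do not say how assignment is done at Udemy; the hashing scheme, the function `assign_variant`, the experiment name, and the `UNASSIGNED` bucket for leftover traffic are all assumptions made for illustration.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

# Smoke-test allocation from the list above; the remaining 19% of traffic
# is left outside the experiment (an assumption, the slide does not say).
SMOKE_TEST: List[Tuple[str, float]] = [
    ("TEST", 0.01), ("CONTROL_1", 0.40), ("CONTROL_2", 0.40),
]


def assign_variant(visitor_id: str, experiment: str,
                   allocation: List[Tuple[str, float]]) -> str:
    """Deterministically map a visitor to a variant by hashing."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    # Approximately uniform position in [0, 1) from the first 8 hex digits.
    position = int(digest[:8], 16) / 16 ** 8
    upper = 0.0
    for name, share in allocation:
        upper += share
        if position < upper:
            return name
    return "UNASSIGNED"


counts = Counter(
    assign_variant(f"visitor-{i}", "new_ranker", SMOKE_TEST)
    for i in range(20000)
)
print(counts)
```

With two identically treated controls, comparing CONTROL_1 against CONTROL_2 (an A/A comparison) should show no significant difference; a significant one suggests the experiment plumbing is broken.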
Data collection
● data should be collected at the most granular
level, e.g. typically per visitor per item per day
● data should be pre-arranged in a way which
facilitates fast hypercube production, i.e. star
schema
● most granular data is located at the star core
● experiment variants can be incorporated as
an additional dimension in one of the star
limbs
[Figure: “star schema” (with intermediate mapping)]
● core table with grouping fields A, B, C
● limb tables with grouping fields A; A, B; B; B, C; A, B, D
● mapping table with grouping field C and other field D
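Reading the diagram as a granular core table from which coarser limb tables are rolled up, the role of the mapping table can be sketched as below. The concrete fields (A = `day`, B = `visitor`, C = `course`, D = `category`), the measure `views`, and the helper `limb_table` are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def limb_table(core_rows: List[dict], group_by: Sequence[str], measure: str,
               mapping: Optional[Tuple[str, str, Dict[str, str]]] = None) -> Dict[tuple, int]:
    """Roll the core table up to the given grouping fields.

    `mapping` is (source field, derived field, lookup), e.g. course -> category,
    and lets a limb group by a field that the core table does not store.
    """
    table: Dict[tuple, int] = defaultdict(int)
    for row in core_rows:
        enriched = dict(row)
        if mapping is not None:
            source, derived, lookup = mapping
            enriched[derived] = lookup[row[source]]
        table[tuple(enriched[f] for f in group_by)] += row[measure]
    return dict(table)


# Core table: most granular level, per day (A) per visitor (B) per course (C).
core = [
    {"day": "d1", "visitor": "v1", "course": "c1", "views": 2},
    {"day": "d1", "visitor": "v1", "course": "c2", "views": 1},
    {"day": "d1", "visitor": "v2", "course": "c1", "views": 4},
    {"day": "d2", "visitor": "v1", "course": "c3", "views": 3},
]
# Mapping table: grouping field C (course) to other field D (category).
course_to_category = {"c1": "programming", "c2": "programming", "c3": "design"}

limb_a = limb_table(core, ["day"], "views")
limb_abd = limb_table(core, ["day", "visitor", "category"], "views",
                      mapping=("course", "category", course_to_category))
print(limb_a)
print(limb_abd)
```

An experiment variant could be carried the same way, as one more grouping field on a limb, so that every exploratory projection can be split by variant.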
What does it mean to be a data scientist?
A successful data scientist is somebody who can independently execute the entire data
science work cycle on the time scale of days to weeks.
Important personal factors
● technical chops in math, computational methods, and the scientific method
● a genuine research interest in the underlying user behavior
● good intuition for how the business works
Important environmental factors
● top-down knowledge of and commitment to data science
● an excellent data architect
● best-practice data science infrastructure
Udemy is hiring!
https://about.udemy.com/careers/