Data Science at Udemy

  1. Ankara Big Data Meetup - Bilkent Cyberpark, August 5, 2015
     Data Science at Udemy
     Larry Wai, Principal Data Scientist @Udemy
  2. Overview of talk
     ● What is data science?
     ● Udemy in a nutshell
     ● Data science projects at Udemy
     ● Data science work cycle
     ● What does it mean to be a data scientist?
  3. What is data science?
     Data science in consumer internet = application of the scientific method, using big data computational methods, to ascertain, predict, and utilize user behavior for business purposes.
     Inherits from three historical schools of thought:
     1. Research of natural phenomena using the scientific method
        ○ e.g. physics, astronomy
        ○ data science arises from replacing the study of natural phenomena with the study of user behavior
     2. Research of computational methods
        ○ e.g. mathematics, computer science
        ○ data science arises from pushing the limits of existing methods to compute that which could not be computed before
     3. Research of human behavior
        ○ e.g. economics, psychology
        ○ data science arises from applying big data to the study of microscopic human behavior, i.e. millions of users x thousands of items = billions of user-item calculations
     Other definitions (too general IMO):
     ● data science > statistics (only); stats does not require engineering skills
     ● data science > computer science (only); engineering does not require training in the scientific method
     ● data science > business analytics (only); analytics requires neither engineering skills nor training in the scientific method
  4. Udemy in a nutshell
     ● consumer online education marketplace
     ● instructors get 50% of enrollment fee
     ● no certification requirements
     ● typical enrollment price point (paid) is $20-$40
     ● get to critical mass (instructors and students) in each language through marketing
     ● above critical mass, leverage marketplace (organic) driven growth
     ● Udemy currently has ~7 million students, ~30 thousand courses
     ● relevance of search and recommendations is key to fostering growth
     ● learning goal data science is key to fostering long term growth
     [Chart: Google search trends for selected online education companies]
     ● Udemy (blue). Exponential marketplace growth.
     ● Coursera (yellow), Udacity (red), Lynda (green). Incremental growth.
     ● note: this chart convinced me to join Udemy :)
  5. Udemy web site
  6. Data science projects at Udemy
     search & recommendation
     ● real time recommendation (web, mobile)
     ● real time search
     ● batch e-mail recommendation
     learning goals
     ● course learning process optimization
     ● learning goal paths
     ● career learning goals
     + more projects
  7. Search and recommendation (in experiment)
     Feature classes
     ● course historical averages
     ● personal historical behavior
     ● search term matching
     Overall ranking strategy
     ● compute global score per visitor per course per day
     ● consider modules as filters on the total available inventory
     ● the module score will be the sum of the global course scores for the top N courses in the module
     ● individual courses are ranked within each module according to the global course score
     [Diagram: courses 1 to 12 grouped into modules A, B, and C]
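
The ranking strategy on this slide can be sketched in a few lines. This is a minimal illustration, not Udemy's implementation: the function name `rank_modules`, the `top_n` parameter, and the sample scores are all assumptions made for the example, and the global scores are taken as already computed for one visitor on one day.

```python
from typing import Dict, List, Tuple


def rank_modules(
    global_scores: Dict[str, float],
    modules: Dict[str, List[str]],
    top_n: int = 3,
) -> List[Tuple[str, float, List[str]]]:
    """Rank modules, and the courses inside each, from one set of global scores."""
    ranked = []
    for name, course_ids in modules.items():
        # A module acts as a filter on the inventory: keep only scored courses.
        candidates = [c for c in course_ids if c in global_scores]
        # Within a module, courses are ordered by their global score.
        ordered = sorted(candidates, key=lambda c: global_scores[c], reverse=True)
        # Module score = sum of the global scores of its top N courses.
        module_score = sum(global_scores[c] for c in ordered[:top_n])
        ranked.append((name, module_score, ordered))
    # Modules themselves are shown in descending order of module score.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical scores for a single visitor on a single day.
scores = {"course 1": 0.9, "course 2": 0.4, "course 3": 0.7,
          "course 4": 0.2, "course 5": 0.5}
modules = {"module A": ["course 1", "course 2", "course 4"],
           "module B": ["course 3", "course 5"]}
result = rank_modules(scores, modules, top_n=2)
for name, module_score, courses in result:
    print(name, round(module_score, 2), courses)
```

Because every module reuses the same global score, a course can appear in several modules and is always ordered consistently within each.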
  8. Learning goals (conceptual stage)
     Course learning goal clustering
     ● goals are hierarchical
     ● goals are linked
     ● goals are dynamic
     Overall learning goal strategy
     ● continuously update learning goal clustering
     ● quantify and evaluate student progress towards learning goals
     ● identify learning goal paths according to desired careers or hobbies
     [Diagram: goals 1 to 6 and courses A and B]
  9. Data science work cycle
     [Cycle diagram with five stages: exploratory analysis, model building, model deployment, experiment setup, data collection]
     ideal cycling time is ~days to ~weeks
  10. Exploratory analysis
     ● data to be explored can in general be defined as a multi-dimensional cube, a.k.a. “hypercube”, where each side of the hypercube is an exploratory “dimension” and the “measures” of the user behavior are aggregates in each cell
     ● the hypercube is the minimal representation required for the exploratory analysis; e.g. we minimize cardinality for continuous variables
     ● the human mind is unable to easily comprehend more than 3 dimensions, therefore exploratory analysis must be broken down into actions which project the entire hypercube onto different dimensions in sequence
     ● goal for the analyst is to understand the multi-dimensional user behavior, which may take many projections in sequence (~100)
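
A projection of the hypercube onto a subset of its dimensions can be sketched as a sum over the dropped dimensions. The dimension names (`country`, `device`, `category`), the cell values, and the `project` helper below are hypothetical, and the sketch assumes an additive measure such as a count.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Cell = Tuple[str, ...]


def project(cube: Dict[Cell, float], dims: Sequence[str],
            keep: Iterable[str]) -> Dict[Cell, float]:
    """Sum an additive measure over every dimension not listed in `keep`."""
    keep_idx = [dims.index(d) for d in keep]
    out: Dict[Cell, float] = defaultdict(float)
    for cell, measure in cube.items():
        # Collapse the cell onto the kept dimensions and accumulate.
        out[tuple(cell[i] for i in keep_idx)] += measure
    return dict(out)


# Hypothetical 3-dimensional cube; each cell holds an aggregate count.
dims = ("country", "device", "category")
cube = {
    ("TR", "web", "programming"): 120,
    ("TR", "mobile", "programming"): 80,
    ("TR", "web", "design"): 40,
    ("US", "web", "programming"): 300,
    ("US", "mobile", "design"): 60,
}

# Two projections in sequence: first one dimension, then a pair.
by_country = project(cube, dims, ["country"])
by_device_category = project(cube, dims, ["device", "category"])
print(by_country)
print(by_device_category)
```

Each call is one "action" in the sense of the slide; an analysis of ~100 projections is a sequence of such calls with different `keep` lists.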
  11. Model building
     ● platforms such as R allow us to leverage open source modeling packages and compare models with relatively low overhead
     ● most user behavior features are non-linear and correlated; thus, the simplest “black box” non-linear models which handle correlations are practical to use, e.g. decision trees
     ● use residuals on holdout to validate model
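
Holdout validation with residuals can be illustrated on synthetic data. The sketch below is an assumption-laden toy: it is written in Python rather than R, the data is a made-up noisy step function, and `fit_stump` is a hypothetical depth-1 regression tree standing in for a real decision tree package.

```python
import random
from math import sqrt
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    # Skipping the smallest x guarantees both sides of the split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        l_mean, r_mean = mean(left), mean(right)
        sse = (sum((y - l_mean) ** 2 for y in left)
               + sum((y - r_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    _, t, l_mean, r_mean = best
    return lambda x: l_mean if x < t else r_mean


# Synthetic non-linear behavior: a step at x = 5 plus Gaussian noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [(1.0 if x > 5 else 0.0) + random.gauss(0, 0.1) for x in xs]

# Train on the first 150 points, hold out the last 50.
train_x, train_y = xs[:150], ys[:150]
hold_x, hold_y = xs[150:], ys[150:]

model = fit_stump(train_x, train_y)

# Residuals on the holdout: observed minus predicted.
residuals = [y - model(x) for x, y in zip(hold_x, hold_y)]
mean_residual = mean(residuals)
holdout_rmse = sqrt(mean(r * r for r in residuals))

# Baseline for comparison: always predict the training mean.
baseline = mean(train_y)
baseline_rmse = sqrt(mean((y - baseline) ** 2 for y in hold_y))
print(mean_residual, holdout_rmse, baseline_rmse)
```

A mean residual near zero and a holdout error well below the baseline are the kind of evidence the slide refers to; a systematic pattern in the residuals would point to a missing feature or an overfit model.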
  12. Model deployment
     ● standardized predictive model markup language (PMML) allows abstraction of models in deployment
     ● “plug-in” model deployment is agile because no new production code is needed for model updates
     ● shifts focus of algo development from production code development to data mining methods
     ● this approach allows a single person to build and deploy models quickly
     ● this approach is cutting edge and is being tested now at Udemy
     [Flow diagram with four stages]
     ● model building: create training dataset; create predictive model, e.g. decision trees, random forest; offline analysis (residuals, feature importance)
     ● model storage: predictive model store (PMML format)
     ● model deployment: in-memory model; load on initialization; periodic updates
     ● model scoring: loop through courses, compute feature vector per course; compute score per course; sort by score
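
The "plug-in" idea can be sketched without PMML itself. In the toy below a plain Python callable stands in for a model parsed from a PMML file (parsing PMML would need a third-party library), and `ModelStore`, `rank_courses`, and the feature names are hypothetical. The point it illustrates is that the scoring loop does not change when the model does.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Model = Callable[[Features], float]


class ModelStore:
    """Holds the current in-memory model; swap() stands in for a periodic update."""

    def __init__(self, model: Model) -> None:
        self._model = model  # loaded once on initialization

    def swap(self, model: Model) -> None:
        self._model = model  # model update, no change to scoring code

    def score(self, features: Features) -> float:
        return self._model(features)


def rank_courses(store: ModelStore,
                 courses: Dict[str, Features]) -> List[Tuple[str, float]]:
    """Loop through courses, score each feature vector, sort by score."""
    scored = [(course_id, store.score(f)) for course_id, f in courses.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature vectors, one per course.
courses = {
    "course A": {"avg_rating": 4.5, "conversion": 0.02},
    "course B": {"avg_rating": 4.0, "conversion": 0.08},
}

store = ModelStore(lambda f: f["avg_rating"])   # first model version
before = rank_courses(store, courses)

store.swap(lambda f: f["conversion"])           # updated model plugged in
after = rank_courses(store, courses)
print(before, after)
```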
  13. Experiment setup
     Practical requirements for experiments, a.k.a. A/B tests
     ● need enough users to measure an interesting effect
     ● conversely, if an effect is not large enough to measure, then it is not interesting, at least from a data science point of view, and potentially from a business point of view
     ● e.g. an interesting effect from a business point of view would be +5% relative lift of conversion rate
     ● to measure a +5% relative lift at the 95% confidence level (on, say, a typical 1 conversion per 10 sessions), need 30,000 sessions in each of the A and B samples, i.e. >60,000 sessions in total
     ● ideally, would like to measure lift within ~days; so need >60,000 sessions per day
     ● Udemy currently has >200,000 sessions per day (but 2 years ago it was more like 20,000 sessions per day, so 10x slower to run experiments)
     Ramp stages
     1. smoke test (~few days)
        ○ 1% for test variant(s)
        ○ verify that nothing is broken
        ○ 40% CONTROL_1, 40% CONTROL_2
        ○ validate that control is set up correctly
     2. initial ramp (~1 week)
        ○ 5-10% for test variant(s)
        ○ sizing depends upon whether we’ve tested something like this before, and any revenue concerns
     3. intermediate ramp (~few weeks)
        ○ 25%-50% for test variant
        ○ 40%-50% for CONTROL_1
     4. final ramp / launch
        ○ 90% for test variant
        ○ 10% for CONTROL_1 (optional); turn off after a few weeks of monitoring
        ○ rename “test” as new baseline
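
The 30,000-sessions figure can be checked with a back-of-the-envelope calculation. The slide does not state its formula, so the sketch below makes an assumption: a two-sided two-proportion z-test in which the lift must equal the critical threshold, with no extra term for statistical power. The function name `sessions_per_arm` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_arm(base_rate: float, relative_lift: float,
                     confidence: float = 0.95) -> int:
    """Sessions per arm at which the lift equals the two-sided z threshold."""
    p_a = base_rate
    p_b = base_rate * (1 + relative_lift)
    delta = p_b - p_a
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Variance of the difference of two independent proportions, times n.
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    # Solve z * sqrt(variance / n) = delta for n.
    return ceil(z ** 2 * variance / delta ** 2)


# 1 conversion per 10 sessions, +5% relative lift, 95% confidence.
n = sessions_per_arm(0.10, 0.05)
print(n, 2 * n)
```

Under these assumptions the result is roughly 28,000 sessions per arm, in line with the slide's rounded 30,000. Requiring a conventional 80% power on top of the significance threshold would raise the number substantially.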
  14. Data collection
     ● data should be collected at the most granular level, e.g. typically per visitor per item per day
     ● data should be pre-arranged in a way which facilitates fast hypercube production, i.e. star schema
     ● most granular data is located at the star core
     ● experiment variants can be incorporated as an additional dimension in one of the star limbs
     [Diagram: “star schema” (with intermediate mapping)]
     ● core table with grouping fields A, B, C
     ● limb table with grouping field A
     ● limb table with grouping fields A, B
     ● limb table with grouping field B
     ● limb table with grouping fields B, C
     ● limb table with grouping fields A, B, D
     ● mapping table with grouping field C and other field D
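
How limb tables derive from the core table, including the intermediate mapping from field C to field D, can be sketched with in-memory rows. The field letters follow the diagram; the concrete values (dates, visitors, courses, categories), the `views` measure, and the `limb` helper are hypothetical.

```python
from collections import Counter

# Core table: most granular rows, grouping fields A, B, C plus a measure.
core = [
    {"A": "2015-08-01", "B": "visitor 1", "C": "course 1", "views": 3},
    {"A": "2015-08-01", "B": "visitor 1", "C": "course 2", "views": 2},
    {"A": "2015-08-01", "B": "visitor 2", "C": "course 3", "views": 1},
    {"A": "2015-08-02", "B": "visitor 1", "C": "course 3", "views": 4},
]

# Mapping table: grouping field C mapped to the other field D.
c_to_d = {"course 1": "programming", "course 2": "programming",
          "course 3": "design"}


def limb(rows, fields, mapping=None):
    """Aggregate the core table onto `fields`, adding D via the mapping if given."""
    totals = Counter()
    for row in rows:
        enriched = dict(row)
        if mapping is not None:
            enriched["D"] = mapping[row["C"]]
        totals[tuple(enriched[f] for f in fields)] += row["views"]
    return dict(totals)


limb_a = limb(core, ["A"])                       # limb table with field A
limb_abd = limb(core, ["A", "B", "D"], c_to_d)   # limb table with A, B, D
print(limb_a)
print(limb_abd)
```

Precomputing limbs like these is what makes hypercube production fast: an exploratory query reads the smallest limb that contains the dimensions it needs instead of scanning the core.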
  15. What does it mean to be a data scientist?
     A successful data scientist is somebody who can independently execute the entire data science work cycle on the time scale of days to weeks.
     Important personal factors
     ● technical chops in math, computational methods, and the scientific method
     ● a genuine research interest in the underlying user behavior
     ● good intuition for how the business works
     Important environmental factors
     ● top-down knowledgeability and commitment to data science
     ● excellent data architect
     ● best practices data science infrastructure
  16. Udemy is hiring! https://about.udemy.com/careers/