Data Engineering at Udemy
Keeyong Han
Principal Data Architect @Udemy
Ankara Data Meetup - Bilkent Cyberpark,
August 5, 2015
About Me
• 20+ years of experience across 9 different companies
• Currently manages the Data team at Udemy
• Prior to joining Udemy
– Manager of the data/search team at Polyvore
– Director of Engineering at Yahoo Search
– Started career at Samsung Electronics in Korea
Agenda
• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned
TYPICAL EVOLUTION OF
DATA PROCESSING
From a small start-up
In the beginning
• You don’t have any data
• So no data infrastructure or data science
– The most important thing is to survive and to keep iterating
After a struggle you have some data
• You have survived, and now you have some data to work with
– Data analysts are hired
– They want to analyze the data
Then …
• You don’t know exactly where the data is
• You find your data, but
– It is not clean and is missing key information
– It is likely not in the format you want
• You store it in non-optimal storage
– MySQL is likely used to store all kinds of data
• But MySQL doesn’t scale
– You ask analysts to query MySQL
• They will kill the web site a few times
Now what to do? (I)
• You have to find scalable storage for data analysis, separate from production
– This is called a Data Warehouse (or Data Analytics backend)
– It will be the central store for your important data
– Udemy uses AWS Redshift
• Migrate some data out of MySQL
– Key/Value data to a NoSQL solution (Cassandra/HBase, MongoDB, …)
– Log-type data (use Nginx logs, for example)
– MySQL should keep only the key data needed by the Web service
Now what to do? (II)
• The goal is to put all of your data into a single store
– This is the most important and very first step toward becoming a “true” data organization
– This store should be separate from runtime storage (MySQL, for example)
– This store should be scalable
– Being consistent is more important than being correct in the beginning
Now You Add More Data
• Different ways of collecting data
– This is called ETL (Extract, Transform and Load)
– Different aspects to consider
• Size: 1KB to 20GB
• Frequency: hourly, daily, weekly, monthly
• How to collect the data:
– FTP, API, Webhook, S3, HTTP, the mysql command line
• You will have multiple data collection workflows
– Use cron (or some other scheduler) to manage them
– Udemy uses Pinball (open-sourced by Pinterest)
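The multi-workflow setup above can be sketched as a minimal extract/transform/load job that a scheduler would invoke; the function names, field names, and sample payload below are hypothetical illustrations, not Udemy's actual pipeline code.

```python
# Minimal extract-transform-load skeleton for one scheduled workflow.
# All names (fields, table name, payload) are hypothetical illustrations.
import csv
import io

def extract(raw_text):
    """Extract: parse a CSV payload fetched from some source (FTP/API/S3, ...)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: fix types and drop rows missing the key field."""
    out = []
    for r in rows:
        if not r.get("user_id"):
            continue  # skip records missing the join key
        out.append({"user_id": int(r["user_id"]), "amount": float(r["amount"])})
    return out

def load(rows, warehouse):
    """Load: append the cleaned rows into a warehouse table (a dict here)."""
    warehouse.setdefault("daily_sales", []).extend(rows)

def run_daily_job(raw_text, warehouse):
    """The unit a scheduler (cron, Pinball, ...) would run once per period."""
    load(transform(extract(raw_text)), warehouse)

warehouse = {}
run_daily_job("user_id,amount\n1,9.99\n,5.00\n2,19.99\n", warehouse)
```

Each workflow is just this shape with a different extract source and frequency; the scheduler only decides when to call it.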
How It Will Look Like
Your
Cool
Web
Service
Log Files
MySQL
Key/Value
Data
Warehouse
External
Data SourcesETL
ETL
Simple Data Import
• Just use a scripting language
– Many data sources are small and simple enough to handle with a script
• Udemy uses Python for this purpose
– Implemented a set of Python classes to handle different types of data import
– Plan to open-source this in the 1st half of 2016
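A sketch of what a "set of Python classes for different types of data import" can look like: a shared base class plus one subclass per source type. The class and method names are my own illustration; the actual (planned open-source) Udemy code may look quite different.

```python
# Hypothetical importer hierarchy: shared load() logic, per-source fetch().
class DataImporter:
    """Base class: subclasses implement fetch(); load() is shared."""
    def __init__(self, table):
        self.table = table

    def fetch(self):
        raise NotImplementedError

    def load(self, warehouse):
        rows = self.fetch()
        warehouse.setdefault(self.table, []).extend(rows)
        return len(rows)


class ApiImporter(DataImporter):
    """One concrete source type, e.g. a JSON API (stubbed with a list here)."""
    def __init__(self, table, records):
        super().__init__(table)
        self.records = records  # stand-in for a parsed HTTP response

    def fetch(self):
        return self.records


warehouse = {}
n = ApiImporter("email_stats", [{"opens": 10}, {"opens": 3}]).load(warehouse)
```

New source types (FTP, S3, webhook, ...) then only need a new `fetch()`, which is what makes small scripted imports cheap to add.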
Large Data Batch Import
• Large data import and processing require a more scalable solution
• Hadoop can be used for this purpose
– SQL on Hadoop: Hive, Tajo, Presto, and so on
– Pig, Java MapReduce
• Spark is getting a lot of attention and we plan to evaluate it
Realtime Data Import
• Some data is better imported as it happens
• This requires a different type of technology
– Realtime data message queue: Kafka, Kinesis
– Realtime data consumer: Storm, Samza, Spark Streaming
What’s Next? (I)
• Build summary tables
– Raw data tables are good to have, but they can be too detailed and carry too much information
– Build these tables in your Data Warehouse
• Track the performance of key metrics
– These should come from the summary tables above
– You need a dashboard tool (build one or use a 3rd-party solution – Birst, ChartIO, Tableau, and so on)
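The summary-table idea above in miniature: roll raw per-event rows up into one row per day so dashboards query the small table instead of the raw log. A hedged Python sketch; the column names and sample data are illustrative, not Udemy's schema.

```python
# Roll a raw event log up into a per-day summary table (illustrative schema).
from collections import defaultdict

raw_events = [
    {"day": "2015-08-01", "course_id": 1, "revenue": 10.0},
    {"day": "2015-08-01", "course_id": 2, "revenue": 20.0},
    {"day": "2015-08-02", "course_id": 1, "revenue": 15.0},
]

def build_daily_summary(events):
    """One output row per day: event count and total revenue."""
    summary = defaultdict(lambda: {"enrollments": 0, "revenue": 0.0})
    for e in events:
        row = summary[e["day"]]
        row["enrollments"] += 1
        row["revenue"] += e["revenue"]
    return dict(summary)

daily = build_daily_summary(raw_events)
```

In the warehouse this would be a scheduled GROUP BY query; the point is that metric tracking reads the summary, never the raw table.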
What’s Next? (II)
• Provide this data to the Data Science team
– Draw insights and create a feedback loop
– Build machine-learned models for recommendation, search ranking, and so on
– The topic for the next session (Thanks Larry!)
• Supporting Data Science from the infrastructure side
– This requires scalable infrastructure
– Example: scoring every user/course pair at Udemy
• 7M users × 30K courses = 210B pairs to compute
– You need a scalable serving layer (Cassandra, HBase, …)
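The slide's scale arithmetic checks out, and it shows why the cross product has to be scored in chunks rather than materialized. A small sketch; `score()` is a toy stand-in for a real recommendation model, not Udemy's.

```python
# Sanity-check the slide's arithmetic, then score pairs a chunk at a time
# so the full cross product never sits in memory. score() is a toy model.
users, courses = 7_000_000, 30_000
assert users * courses == 210_000_000_000  # 210B pairs, as the slide says

def score(user_id, course_id):
    """Toy stand-in for a learned user/course relevance model."""
    return (user_id * 31 + course_id) % 100

def score_in_chunks(user_ids, course_ids, chunk=2):
    """Yield (user, course, score) lazily, a chunk of users at a time."""
    for i in range(0, len(user_ids), chunk):
        for u in user_ids[i:i + chunk]:
            for c in course_ids:
                yield u, c, score(u, c)

results = list(score_in_chunks([1, 2, 3], [10, 11]))
```

At 210B pairs this inner loop is exactly what gets farmed out to a Hadoop/Spark cluster, with results written to the serving layer.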
DATA ENGINEERING AT UDEMY
Data Warehouse/Analytics
• We use AWS Redshift as our data warehouse (or data analytics backend)
• What is AWS Redshift?
– Scalable PostgreSQL-based engine, up to 1.6PB of data
– Roughly 600 USD per TB per month
– Mainly for offline batch processing
– Supports bulk loads (through AWS S3)
– Two types of node options: compute-optimized vs. storage-optimized
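The "bulk loads through AWS S3" point refers to Redshift's `COPY` command: you stage files in S3 and load them in one statement instead of row-by-row INSERTs. The sketch below just builds the SQL string; the bucket, table, and IAM role are placeholders, and actually running it needs a Redshift connection (e.g. via psycopg2).

```python
# Build a Redshift COPY statement for a bulk load from S3.
# Table, bucket path, and IAM role below are placeholder values.
def build_copy_statement(table, s3_path, iam_role):
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    "web_access_log",
    "s3://example-bucket/logs/2015-08-05/",
    "arn:aws:iam::123456789012:role/RedshiftLoad",
)
```

The same pattern is why most ETL pipelines end by writing files to S3: the warehouse pulls them in bulk rather than being fed row by row.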
Kinds of Data Stored in Redshift
• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email Marketing Data
• Ads Campaign Performance Data
• SEO data from Google
• Data from Web access log
• Support Ticket Data
• A/B Test Data (Mobile, Web)
• Human curated data from Google Spreadsheets
Details on ETL Pipelines
• All data pipelines are scheduled through Pinball
– Every 5 minutes, hourly, daily, weekly, and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing
Pinball Screenshot
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
– These run ETL on massive data
– They also run the Data Science team’s large model/scoring pipelines
• Plan to evaluate Spark as a potential alternative
Realtime Processing
• Applications
– The first application is processing the web access log
– Eventually we plan to use this to generate personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while Kinesis is a managed AWS service
• Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
– Samza, Storm, Spark Streaming
What is Kinesis (vs. Kafka)?
• A realtime data processing service in AWS
– A publisher-subscriber message broker
– Very similar to Kafka
• It has two components
– One is the message queue where the stream of data is stored
• 24-hour retention period
• Priced hourly by the read/write rate of the queue
– The other is the KCL (Kinesis Client Library)
• Used to build Data Producer or Data Consumer applications
• Can be combined with Storm, Spark Streaming, …
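A toy, in-process analogue of the two-component model above: a bounded stream (the queue, with a retention window) and a consumer reading from it. Real Kinesis code would go through the AWS SDK and KCL; everything here is a simplified illustration.

```python
# Toy pub-sub stream: bounded retention (like Kinesis's retention window,
# expressed here as a record count) plus a polling consumer.
from collections import deque

class ToyStream:
    def __init__(self, retention=5):
        self.records = deque(maxlen=retention)  # oldest records expire

    def put_record(self, data, partition_key):
        """Producer side: append a (key, payload) record to the stream."""
        self.records.append((partition_key, data))

class ToyConsumer:
    def __init__(self, stream):
        self.stream = stream
        self.seen = []

    def poll(self):
        """Consumer side: drain whatever is currently retained."""
        while self.stream.records:
            self.seen.append(self.stream.records.popleft())

stream = ToyStream(retention=5)
for i in range(7):  # 7 writes into a 5-record window: the first 2 expire
    stream.put_record({"hit": i}, partition_key=str(i % 2))
consumer = ToyConsumer(stream)
consumer.poll()
```

The expiry behavior is the operational point: a consumer that falls behind the retention window loses data, which is why realtime consumers must keep up.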
Data Serving Layer
• Redshift isn’t a good fit for reading data out in a realtime fashion, so you need something else
• We are using (or plan to use) the following:
– Cassandra
– Redis
– ElasticSearch
– MySQL
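Why a serving layer exists: batch jobs precompute results (e.g. recommendations) and write them to a fast key/value store, and the website reads them with a simple get at request time. The dict-backed class below mimics that get/set pattern; in production the backing store would be Redis or Cassandra, not a Python dict.

```python
# Dict-backed stand-in for a key/value serving layer (Redis/Cassandra-like).
class ServingLayer:
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value  # written by the nightly batch job

    def get(self, key, default=None):
        return self._store.get(key, default)  # read at web request time

serving = ServingLayer()
serving.set("recs:user:42", [101, 205, 309])  # precomputed course ids
recs = serving.get("recs:user:42", default=[])
```

The key design point is the split: heavy computation happens offline in the warehouse/cluster, and the runtime path is a single key lookup.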
How It Looks at Udemy
[Diagram] Udemy’s web service writes to Log Files (Nginx), MySQL, and Key/Value storage (Cassandra); ETL jobs load these, plus External Data Sources, into the Data Warehouse (Redshift); Data Science pipelines run between Redshift and the Key/Value serving store.
LESSONS LEARNED
• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (a data warehouse)
• Start with batch processing, then move to realtime
• Consider the type of data you store
– Log vs. Key/Value vs. transactional records
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
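The log-form lesson in code: keep the immutable change history and derive current state by replaying it, so you can rebuild state as of any point and debug what happened. A minimal sketch; the operation and key names are illustrative.

```python
# Derive state from an append-only change log by replaying it.
def replay(events):
    """Fold a change log into current state (op/key names illustrative)."""
    state = {}
    for e in events:
        if e["op"] == "set":
            state[e["key"]] = e["value"]
        elif e["op"] == "delete":
            state.pop(e["key"], None)
    return state

log = [
    {"op": "set", "key": "course:1:price", "value": 20},
    {"op": "set", "key": "course:1:price", "value": 15},  # change is kept
    {"op": "delete", "key": "course:1:price"},
    {"op": "set", "key": "course:1:price", "value": 10},
]
state = replay(log)           # current state
state_then = replay(log[:2])  # state as of the second event
```

Storing only the latest value would lose `state_then` forever; storing the log makes every past state recoverable.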
• Cloud is good unless you have really massive
data
Q & A
Udemy is Hiring
Contenu connexe

Tendances

Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
Open Analytics
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
Open Analytics
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
Databricks
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
Romeo Kienzler
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
sfbiganalytics
 

Tendances (20)

Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Machine Learning with PyCaret
Machine Learning with PyCaretMachine Learning with PyCaret
Machine Learning with PyCaret
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
 
Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012
 
Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
 
Pm.ais ummit 180917 final
Pm.ais ummit 180917 finalPm.ais ummit 180917 final
Pm.ais ummit 180917 final
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
 

Similaire à Data Engineering at Udemy

advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
TeddyIswahyudi1
 

Similaire à Data Engineering at Udemy (20)

Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Realtime Data Analytics
Realtime Data AnalyticsRealtime Data Analytics
Realtime Data Analytics
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
 
AzureSynapse.pptx
AzureSynapse.pptxAzureSynapse.pptx
AzureSynapse.pptx
 
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinMeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analytics
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
 
Holistic Approach To Monitoring
Holistic Approach To MonitoringHolistic Approach To Monitoring
Holistic Approach To Monitoring
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 

Dernier

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Dernier (20)

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Data Engineering at Udemy

  • 1. Data Engineering at Udemy Keeyong Han Principal Data Architect @Udemy Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 2. About Me • 20+ years of experience from 9 different companies • Currently manages Data team at Udemy • Prior to joining Udemy – Manager of data/search team at Polyvore – Director of Engineering at Yahoo Search – Started career from Samsung Electronics in Korea Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 3. Agenda • Typical Evolution of Data Processing • Data Engineering at Udemy • Lessons Learned Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 4. TYPICAL EVOLUTION OF DATA PROCESSING From a small start-up Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 5. In the beginning • You don’t have any data  • So no data infrastructure or data science – The most important thing is to survive and to keep iterating Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 6. After a struggle you have some data • Now you survived and now you have some data to work with – Data analysts are hired – They want to analyze the data Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 7. Then … • You don’t know where the data is exactly • You find your data but – It is not clean and is missing key information – Data is likely not in the format you want • You store them in non-optimal storage – MySQL is likely used to store all kinds of data • But MySQL doesn’t scale – You ask analysts to query MySQL • They will kill the web site a few times  Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 8. Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 9. Now what to do? (I) • You have to find a scalable and separate storage for data analysis – This is called Data Warehouse or Data Analytics – This will be the central storage for your important data – Udemy uses AWS Redshift • Migrate some data from MySQL – Key/Value data to NoSQL solution (Cassandra/Hbase, MongoDB, …) – Log type of data (use Nginx log for example) – MySQL should only have key data which is needed from Web service Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 10. Now what to do? (II) • The goal is to put every data into a single storage – This is the most important and the very first step toward becoming “true” data organization – This storage should be separated from runtime storage (MySQL for example) – This storage should be scalable – Being consistent is more important than being correct in the beginning Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 11. Now You Add More Data • Different Ways of Collecting Data – This is called ETL (Extract, Transform and Load) – Different Aspects to Consider • Size: 1KB to 20GB • Frequency: Hourly, Daily, Weekly, Monthly • How to collect data: – FTP, API, Webhook, S3, HTTP, mysql commandline • You will have multiple data collection workflows – Use cronjob (or some scheduler) to manage – Udemy uses Pinball (Open Source from Pinterest) Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 12. How It Will Look Like Your Cool Web Service Log Files MySQL Key/Value Data Warehouse External Data SourcesETL ETL Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 13. Simple Data import • Just use some script language – Many data sources are small and simple enough to use a script language • Udemy uses Python for this purpose – Implemented a set of Python classes to handle different types of data import – Plan to open source this in 1st half of 2016 Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 14. Large Data Batch Import • Large data import and processing will require more scalable solution • Hadoop can be used for this purpose – SQL on Hadoop: Hive, Tajo, Presto and so on – Pig, Java MapReduce • Spark is getting a lot of attention and we plan to evaluate Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 15. Realtime Data import • Some of data better be imported as it happens • This requires different type of technology – Realtime Data Message Queue: Kafka, Kinesis – Realtime Data Consumer: Storm, Samza, Spark Streaming Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 16. What’s Next? (I) • Build Summary Tables – Having raw data tables is good but it can be too detailed and too much information – Build these tables in your Data Warehouse • Track the performance of key metrics – This should be from summary tables above – You need dashboard tool (build one or use 3rd party solution – Birst, chartIO, Tableau and so on) Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 17. What’s Next? (II) • Provide this data to Data Science team – Draw insight and create feedback loop – Build machine learned models for recommendation, search ranking and so on – The topic for the next session (Thanks Larry!) • Supporting Data Science from Infrastructure – This will require scalable infrastructure – Example: Scoring every pairs of user/course in Udemy • 7M users X 30K courses = 210B pairs of computation – You need scalable Serving Layer (Cassandra, Hbase, …) Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 18. DATA ENGINEERING AT UDEMY Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 19. Data Warehouse/Analytics • We use AWS Redshift as our data warehouse (or data analytics backend) • What is AWS Redshift? – Scalable Postgresql Engine up to 1.6PB of data – Roughly it is 600 USD per TB per month – Mainly for offline batch processing – Supports bulk update (through AWS S3) – Two type of options: Compute vs. Storage Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 20. Kind of Data Stored in Redshift • 800+ tables with 2.4TB of data • Key tables from MySQL • Email Marketing Data • Ads Campaign Performance Data • SEO data from Google • Data from Web access log • Support Ticket Data • A/B Test Data (Mobile, Web) • Human curated data from Google Spreadsheets Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
Details on ETL Pipelines
• All data pipelines are scheduled through Pinball
– Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing
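Schedulers like Pinball model a pipeline as a DAG of jobs and run each job only after its dependencies finish. A minimal sketch of that idea, not Pinball's actual API, with made-up job names:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Toy pipeline DAG: each job maps to the jobs it depends on.
dag = {
    "extract_mysql": [],
    "extract_logs": [],
    "load_redshift": ["extract_mysql", "extract_logs"],
    "build_summaries": ["load_redshift"],
}

def run(dag: dict) -> list:
    """Execute jobs in an order that respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())
    for job in order:
        print("running", job)  # a real scheduler would launch, retry, alert
    return order

order = run(dag)
```

A production scheduler adds the parts this sketch omits: per-job retries, failure alerting, and the cron-style triggers (every 5 minutes, hourly, daily) mentioned above.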
Pinball Screenshot
[Screenshot of the Pinball workflow UI]
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
– This is used to do ETL on massive data
– This is also used to run massive model/scoring pipelines from the Data Science team
• Plan to evaluate Spark as a potential alternative
Realtime Processing
• Applications
– The first application is to process the web access log
– Eventually we plan to use this to generate personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while Kinesis is a managed service from AWS
• Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
– Samza, Storm, Spark Streaming
What is Kinesis (Kafka)?
• A realtime data processing service in AWS
– Publisher/subscriber message broker
– Very similar to Kafka
• It has two components
– One is the message queue, where the stream of data is stored
• 24-hour retention period
• Pay hourly by the read/write rate of the queue
– The other is the KCL (Kinesis Client Library)
• Use this to build a data producer or data consumer application
• It can be combined with Storm, Spark Streaming, …
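Both Kinesis and Kafka route each record to a shard (partition) by hashing a partition key, so all records with the same key stay in order on one shard. The sketch below illustrates the idea only; the real services use their own hash ranges, not this exact formula, and the shard count is an invented example:

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash the partition key to pick a shard, as Kinesis/Kafka do.
    (Sketch: not the services' actual hashing scheme.)"""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Records with the same key always land on the same shard,
# which preserves per-key ordering for consumers.
print("user-123 ->", shard_for("user-123"))
print("user-456 ->", shard_for("user-456"))
```

Choosing the partition key well (e.g. a user ID for per-user ordering) matters because a hot key concentrates all its traffic on a single shard.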
Data Serving Layer
• Redshift isn’t a good fit for reading data out in a realtime fashion, so you need something else
• We are using (or plan to use) the following:
– Cassandra
– Redis
– ElasticSearch
– MySQL
How It Looks
[Architecture diagram: Udemy log files (Nginx), MySQL, and external data sources flow through ETL into the Data Warehouse (Redshift); Data Science pipelines read from Redshift and feed the Key/Value serving store (Cassandra)]
LESSONS LEARNED
• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (the data warehouse)
• Start with batch processing, then realtime
• Consider the type of data you store
– Log vs. key/value vs. transactional record
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
• Cloud is good unless you have really massive data
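The "store change history, not snapshots" lesson is easiest to see in code: if you keep the raw event log, you can rebuild state as of any point in time. Event names and fields here are illustrative only:

```python
# An append-only change log instead of a mutable snapshot table.
events = [
    {"ts": 1, "type": "enroll",   "user": 1, "course": 42},
    {"ts": 2, "type": "enroll",   "user": 1, "course": 7},
    {"ts": 3, "type": "unenroll", "user": 1, "course": 42},
]

def replay(events: list, up_to_ts: int = None) -> set:
    """Rebuild a user's enrollments by replaying the log up to a timestamp."""
    enrolled = set()
    for e in events:
        if up_to_ts is not None and e["ts"] > up_to_ts:
            break
        if e["type"] == "enroll":
            enrolled.add(e["course"])
        elif e["type"] == "unenroll":
            enrolled.discard(e["course"])
    return enrolled

print(replay(events))              # current state: {7}
print(replay(events, up_to_ts=2))  # state as of ts=2: {42, 7}
```

A snapshot table would only tell you the current state; the log lets you debug ("when did course 42 disappear?") and replay after a pipeline bug.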
Q & A
Udemy is Hiring

Editor’s notes

  1. Logging format. Don’t try to take a snapshot and do aggregation
  2. Add diagram MySQL was likely to be used to store all data and used by data analysts
  3. Add diagram MySQL was likely to be used to store all data and used by data analysts What happens when you don’t have this – everyone does their own analysis and derive their own conclusion – waste of resource from a lot of one-off efforts
  5. Realtime recommendation