Data Engineering at Udemy
Keeyong Han
Principal Data Architect @Udemy
Ankara Data Meetup - Bilkent Cyberpark,
August 5, 2015
About Me
• 20+ years of experience across 9 different companies
• Currently manages the Data team at Udemy
• Prior to joining Udemy
– Manager of the data/search team at Polyvore
– Director of Engineering at Yahoo Search
– Started career at Samsung Electronics in Korea
Agenda
• Typical Evolution of Data Processing
• Data Engineering at Udemy
• Lessons Learned
TYPICAL EVOLUTION OF
DATA PROCESSING
From a small start-up
In the beginning
• You don’t have any data
• So no data infrastructure or data science
– The most important thing is to survive and to keep iterating
After a struggle you have some data
• You have survived, and now you have some data to work with
– Data analysts are hired
– They want to analyze the data
Then …
• You don’t know exactly where the data is
• You find your data, but
– It is not clean and is missing key information
– It is likely not in the format you want
• You store it in non-optimal storage
– MySQL is likely used to store all kinds of data
• But MySQL doesn’t scale
– You ask analysts to query MySQL
• They will kill the web site a few times
Now what to do? (I)
• You have to find scalable storage for data analysis, separate from production
– This is called a Data Warehouse (or Data Analytics backend)
– It will be the central store for your important data
– Udemy uses AWS Redshift
• Migrate some data out of MySQL
– Key/Value data to a NoSQL solution (Cassandra/HBase, MongoDB, …)
– Log-type data (use Nginx logs, for example)
– MySQL should keep only the key data needed by the Web service
Now what to do? (II)
• The goal is to put all of your data into a single store
– This is the most important and very first step toward becoming a “true” data organization
– This store should be separate from runtime storage (MySQL, for example)
– This store should be scalable
– Being consistent is more important than being correct in the beginning
Now You Add More Data
• Different ways of collecting data
– This is called ETL (Extract, Transform and Load)
– Different aspects to consider
• Size: 1KB to 20GB
• Frequency: hourly, daily, weekly, monthly
• How to collect the data:
– FTP, API, Webhook, S3, HTTP, the mysql command line
• You will have multiple data collection workflows
– Use cron (or some other scheduler) to manage them
– Udemy uses Pinball (open-sourced by Pinterest)
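The multi-workflow setup above can be sketched as a minimal extract/transform/load job that a scheduler would invoke; the function names, field names, and sample payload below are hypothetical illustrations, not Udemy's actual pipeline code.

```python
# Minimal extract-transform-load skeleton for one scheduled workflow.
# All names (fields, table name, payload) are hypothetical illustrations.
import csv
import io

def extract(raw_text):
    """Extract: parse a CSV payload fetched from some source (FTP/API/S3, ...)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: fix types and drop rows missing the key field."""
    out = []
    for r in rows:
        if not r.get("user_id"):
            continue  # skip records missing the join key
        out.append({"user_id": int(r["user_id"]), "amount": float(r["amount"])})
    return out

def load(rows, warehouse):
    """Load: append the cleaned rows into a warehouse table (a dict here)."""
    warehouse.setdefault("daily_sales", []).extend(rows)

def run_daily_job(raw_text, warehouse):
    """The unit a scheduler (cron, Pinball, ...) would run once per period."""
    load(transform(extract(raw_text)), warehouse)

warehouse = {}
run_daily_job("user_id,amount\n1,9.99\n,5.00\n2,19.99\n", warehouse)
```

Each workflow is just this shape with a different extract source and frequency; the scheduler only decides when to call it.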
How It Will Look Like
Your
Cool
Web
Service
Log Files
MySQL
Key/Value
Data
Warehouse
External
Data SourcesETL
ETL
Simple Data Import
• Just use a scripting language
– Many data sources are small and simple enough to handle with a script
• Udemy uses Python for this purpose
– Implemented a set of Python classes to handle different types of data import
– Plan to open-source this in the 1st half of 2016
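A sketch of what a "set of Python classes for different types of data import" can look like: a shared base class plus one subclass per source type. The class and method names are my own illustration; the actual (planned open-source) Udemy code may look quite different.

```python
# Hypothetical importer hierarchy: shared load() logic, per-source fetch().
class DataImporter:
    """Base class: subclasses implement fetch(); load() is shared."""
    def __init__(self, table):
        self.table = table

    def fetch(self):
        raise NotImplementedError

    def load(self, warehouse):
        rows = self.fetch()
        warehouse.setdefault(self.table, []).extend(rows)
        return len(rows)


class ApiImporter(DataImporter):
    """One concrete source type, e.g. a JSON API (stubbed with a list here)."""
    def __init__(self, table, records):
        super().__init__(table)
        self.records = records  # stand-in for a parsed HTTP response

    def fetch(self):
        return self.records


warehouse = {}
n = ApiImporter("email_stats", [{"opens": 10}, {"opens": 3}]).load(warehouse)
```

New source types (FTP, S3, webhook, ...) then only need a new `fetch()`, which is what makes small scripted imports cheap to add.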
Large Data Batch Import
• Large data import and processing require a more scalable solution
• Hadoop can be used for this purpose
– SQL on Hadoop: Hive, Tajo, Presto, and so on
– Pig, Java MapReduce
• Spark is getting a lot of attention and we plan to evaluate it
Realtime Data Import
• Some data is better imported as it happens
• This requires a different type of technology
– Realtime data message queue: Kafka, Kinesis
– Realtime data consumer: Storm, Samza, Spark Streaming
What’s Next? (I)
• Build summary tables
– Raw data tables are good to have, but they can be too detailed and carry too much information
– Build these tables in your Data Warehouse
• Track the performance of key metrics
– These should come from the summary tables above
– You need a dashboard tool (build one or use a 3rd-party solution – Birst, ChartIO, Tableau, and so on)
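The summary-table idea above in miniature: roll raw per-event rows up into one row per day so dashboards query the small table instead of the raw log. A hedged Python sketch; the column names and sample data are illustrative, not Udemy's schema.

```python
# Roll a raw event log up into a per-day summary table (illustrative schema).
from collections import defaultdict

raw_events = [
    {"day": "2015-08-01", "course_id": 1, "revenue": 10.0},
    {"day": "2015-08-01", "course_id": 2, "revenue": 20.0},
    {"day": "2015-08-02", "course_id": 1, "revenue": 15.0},
]

def build_daily_summary(events):
    """One output row per day: event count and total revenue."""
    summary = defaultdict(lambda: {"enrollments": 0, "revenue": 0.0})
    for e in events:
        row = summary[e["day"]]
        row["enrollments"] += 1
        row["revenue"] += e["revenue"]
    return dict(summary)

daily = build_daily_summary(raw_events)
```

In the warehouse this would be a scheduled GROUP BY query; the point is that metric tracking reads the summary, never the raw table.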
What’s Next? (II)
• Provide this data to the Data Science team
– Draw insights and create a feedback loop
– Build machine-learned models for recommendation, search ranking, and so on
– The topic for the next session (Thanks Larry!)
• Supporting Data Science from the infrastructure side
– This requires scalable infrastructure
– Example: scoring every user/course pair at Udemy
• 7M users × 30K courses = 210B pairs to compute
– You need a scalable serving layer (Cassandra, HBase, …)
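The slide's scale arithmetic checks out, and it shows why the cross product has to be scored in chunks rather than materialized. A small sketch; `score()` is a toy stand-in for a real recommendation model, not Udemy's.

```python
# Sanity-check the slide's arithmetic, then score pairs a chunk at a time
# so the full cross product never sits in memory. score() is a toy model.
users, courses = 7_000_000, 30_000
assert users * courses == 210_000_000_000  # 210B pairs, as the slide says

def score(user_id, course_id):
    """Toy stand-in for a learned user/course relevance model."""
    return (user_id * 31 + course_id) % 100

def score_in_chunks(user_ids, course_ids, chunk=2):
    """Yield (user, course, score) lazily, a chunk of users at a time."""
    for i in range(0, len(user_ids), chunk):
        for u in user_ids[i:i + chunk]:
            for c in course_ids:
                yield u, c, score(u, c)

results = list(score_in_chunks([1, 2, 3], [10, 11]))
```

At 210B pairs this inner loop is exactly what gets farmed out to a Hadoop/Spark cluster, with results written to the serving layer.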
DATA ENGINEERING AT UDEMY
Data Warehouse/Analytics
• We use AWS Redshift as our data warehouse (or data analytics backend)
• What is AWS Redshift?
– Scalable PostgreSQL-based engine, up to 1.6PB of data
– Roughly 600 USD per TB per month
– Mainly for offline batch processing
– Supports bulk loads (through AWS S3)
– Two types of node options: compute-optimized vs. storage-optimized
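The "bulk loads through AWS S3" point refers to Redshift's `COPY` command: you stage files in S3 and load them in one statement instead of row-by-row INSERTs. The sketch below just builds the SQL string; the bucket, table, and IAM role are placeholders, and actually running it needs a Redshift connection (e.g. via psycopg2).

```python
# Build a Redshift COPY statement for a bulk load from S3.
# Table, bucket path, and IAM role below are placeholder values.
def build_copy_statement(table, s3_path, iam_role):
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )

sql = build_copy_statement(
    "web_access_log",
    "s3://example-bucket/logs/2015-08-05/",
    "arn:aws:iam::123456789012:role/RedshiftLoad",
)
```

The same pattern is why most ETL pipelines end by writing files to S3: the warehouse pulls them in bulk rather than being fed row by row.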
Kinds of Data Stored in Redshift
• 800+ tables with 2.4TB of data
• Key tables from MySQL
• Email Marketing Data
• Ads Campaign Performance Data
• SEO data from Google
• Data from Web access log
• Support Ticket Data
• A/B Test Data (Mobile, Web)
• Human curated data from Google Spreadsheets
Details on ETL Pipelines
• All data pipelines are scheduled through Pinball
– Every 5 minutes, hourly, daily, weekly, and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing
Pinball Screenshot
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
– These run ETL on massive data
– They also run the Data Science team’s large model/scoring pipelines
• Plan to evaluate Spark as a potential alternative
Realtime Processing
• Applications
– The first application is processing the web access log
– Eventually we plan to use this to generate personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while Kinesis is a managed AWS service
• Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
– Samza, Storm, Spark Streaming
What is Kinesis (vs. Kafka)?
• A realtime data processing service in AWS
– A publisher-subscriber message broker
– Very similar to Kafka
• It has two components
– One is the message queue where the stream of data is stored
• 24-hour retention period
• Priced hourly by the read/write rate of the queue
– The other is the KCL (Kinesis Client Library)
• Used to build Data Producer or Data Consumer applications
• Can be combined with Storm, Spark Streaming, …
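A toy, in-process analogue of the two-component model above: a bounded stream (the queue, with a retention window) and a consumer reading from it. Real Kinesis code would go through the AWS SDK and KCL; everything here is a simplified illustration.

```python
# Toy pub-sub stream: bounded retention (like Kinesis's retention window,
# expressed here as a record count) plus a polling consumer.
from collections import deque

class ToyStream:
    def __init__(self, retention=5):
        self.records = deque(maxlen=retention)  # oldest records expire

    def put_record(self, data, partition_key):
        """Producer side: append a (key, payload) record to the stream."""
        self.records.append((partition_key, data))

class ToyConsumer:
    def __init__(self, stream):
        self.stream = stream
        self.seen = []

    def poll(self):
        """Consumer side: drain whatever is currently retained."""
        while self.stream.records:
            self.seen.append(self.stream.records.popleft())

stream = ToyStream(retention=5)
for i in range(7):  # 7 writes into a 5-record window: the first 2 expire
    stream.put_record({"hit": i}, partition_key=str(i % 2))
consumer = ToyConsumer(stream)
consumer.poll()
```

The expiry behavior is the operational point: a consumer that falls behind the retention window loses data, which is why realtime consumers must keep up.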
Data Serving Layer
• Redshift isn’t a good fit for reading data out in a realtime fashion, so you need something else
• We are using (or plan to use) the following:
– Cassandra
– Redis
– ElasticSearch
– MySQL
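Why a serving layer exists: batch jobs precompute results (e.g. recommendations) and write them to a fast key/value store, and the website reads them with a simple get at request time. The dict-backed class below mimics that get/set pattern; in production the backing store would be Redis or Cassandra, not a Python dict.

```python
# Dict-backed stand-in for a key/value serving layer (Redis/Cassandra-like).
class ServingLayer:
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value  # written by the nightly batch job

    def get(self, key, default=None):
        return self._store.get(key, default)  # read at web request time

serving = ServingLayer()
serving.set("recs:user:42", [101, 205, 309])  # precomputed course ids
recs = serving.get("recs:user:42", default=[])
```

The key design point is the split: heavy computation happens offline in the warehouse/cluster, and the runtime path is a single key lookup.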
How It Looks at Udemy
[Diagram] Udemy’s web service writes to Log Files (Nginx), MySQL, and Key/Value storage (Cassandra); ETL jobs load these, plus External Data Sources, into the Data Warehouse (Redshift); Data Science pipelines run between Redshift and the Key/Value serving store.
LESSONS LEARNED
• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (a data warehouse)
• Start with batch processing, then move to realtime
• Consider the type of data you store
– Log vs. Key/Value vs. transactional records
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
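The log-form lesson in code: keep the immutable change history and derive current state by replaying it, so you can rebuild state as of any point and debug what happened. A minimal sketch; the operation and key names are illustrative.

```python
# Derive state from an append-only change log by replaying it.
def replay(events):
    """Fold a change log into current state (op/key names illustrative)."""
    state = {}
    for e in events:
        if e["op"] == "set":
            state[e["key"]] = e["value"]
        elif e["op"] == "delete":
            state.pop(e["key"], None)
    return state

log = [
    {"op": "set", "key": "course:1:price", "value": 20},
    {"op": "set", "key": "course:1:price", "value": 15},  # change is kept
    {"op": "delete", "key": "course:1:price"},
    {"op": "set", "key": "course:1:price", "value": 10},
]
state = replay(log)           # current state
state_then = replay(log[:2])  # state as of the second event
```

Storing only the latest value would lose `state_then` forever; storing the log makes every past state recoverable.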
• Cloud is good unless you have really massive
data
Q & A
Udemy is Hiring
Contenu connexe

Tendances

Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
Open Analytics
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
Open Analytics
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
Databricks
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
Romeo Kienzler
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
sfbiganalytics
 

Tendances (20)

Big data bi-mature-oanyc summit
Big data bi-mature-oanyc summitBig data bi-mature-oanyc summit
Big data bi-mature-oanyc summit
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
Model Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and VertaModel Monitoring at Scale with Apache Spark and Verta
Model Monitoring at Scale with Apache Spark and Verta
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Machine Learning with PyCaret
Machine Learning with PyCaretMachine Learning with PyCaret
Machine Learning with PyCaret
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
Data Science Salon: Kaggle 1st Place in 30 minutes: Putting AutoML to Work wi...
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
 
Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012Metrics for Web Applications - Netcamp 2012
Metrics for Web Applications - Netcamp 2012
 
Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022Rakshit (Rocky) Bhatt Resume - 2022
Rakshit (Rocky) Bhatt Resume - 2022
 
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian BharadwajH2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
H2O World - Solving Customer Churn with Machine Learning - Julian Bharadwaj
 
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
 
Pm.ais ummit 180917 final
Pm.ais ummit 180917 finalPm.ais ummit 180917 final
Pm.ais ummit 180917 final
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at NetflixDistributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation at Netflix
 

Similaire à Data Engineering at Udemy

advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
TeddyIswahyudi1
 

Similaire à Data Engineering at Udemy (20)

Big data.ppt
Big data.pptBig data.ppt
Big data.ppt
 
Lecture1
Lecture1Lecture1
Lecture1
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Realtime Data Analytics
Realtime Data AnalyticsRealtime Data Analytics
Realtime Data Analytics
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
 
Data-Driven Development Era and Its Technologies
Data-Driven Development Era and Its TechnologiesData-Driven Development Era and Its Technologies
Data-Driven Development Era and Its Technologies
 
AzureSynapse.pptx
AzureSynapse.pptxAzureSynapse.pptx
AzureSynapse.pptx
 
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from LynchpinMeasureCamp 7   Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
EPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster AnalyticsEPUG UKI - Lancaster Analytics
EPUG UKI - Lancaster Analytics
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuri...
 
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriarAdf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
Adf and ala design c sharp corner toronto chapter feb 2019 meetup nik shahriar
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
advance computing and big adata analytic.pptx
advance computing and big adata analytic.pptxadvance computing and big adata analytic.pptx
advance computing and big adata analytic.pptx
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
 
Holistic Approach To Monitoring
Holistic Approach To MonitoringHolistic Approach To Monitoring
Holistic Approach To Monitoring
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 

Dernier

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Dernier (20)

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Data Engineering at Udemy

  • 1. Data Engineering at Udemy Keeyong Han Principal Data Architect @Udemy Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 2. About Me • 20+ years of experience from 9 different companies • Currently manages Data team at Udemy • Prior to joining Udemy – Manager of data/search team at Polyvore – Director of Engineering at Yahoo Search – Started career from Samsung Electronics in Korea Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 3. Agenda • Typical Evolution of Data Processing • Data Engineering at Udemy • Lessons Learned Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 4. TYPICAL EVOLUTION OF DATA PROCESSING From a small start-up Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 5. In the beginning • You don’t have any data  • So no data infrastructure or data science – The most important thing is to survive and to keep iterating Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 6. After a struggle you have some data • Now you survived and now you have some data to work with – Data analysts are hired – They want to analyze the data Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 7. Then … • You don’t know where the data is exactly • You find your data but – It is not clean and is missing key information – Data is likely not in the format you want • You store them in non-optimal storage – MySQL is likely used to store all kinds of data • But MySQL doesn’t scale – You ask analysts to query MySQL • They will kill the web site a few times  Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 8. Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 9. Now what to do? (I) • You have to find a scalable and separate storage for data analysis – This is called Data Warehouse or Data Analytics – This will be the central storage for your important data – Udemy uses AWS Redshift • Migrate some data from MySQL – Key/Value data to NoSQL solution (Cassandra/Hbase, MongoDB, …) – Log type of data (use Nginx log for example) – MySQL should only have key data which is needed from Web service Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 10. Now what to do? (II) • The goal is to put every data into a single storage – This is the most important and the very first step toward becoming “true” data organization – This storage should be separated from runtime storage (MySQL for example) – This storage should be scalable – Being consistent is more important than being correct in the beginning Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 11. Now You Add More Data • Different Ways of Collecting Data – This is called ETL (Extract, Transform and Load) – Different Aspects to Consider • Size: 1KB to 20GB • Frequency: Hourly, Daily, Weekly, Monthly • How to collect data: – FTP, API, Webhook, S3, HTTP, mysql commandline • You will have multiple data collection workflows – Use cronjob (or some scheduler) to manage – Udemy uses Pinball (Open Source from Pinterest) Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 12. How It Will Look Like Your Cool Web Service Log Files MySQL Key/Value Data Warehouse External Data SourcesETL ETL Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 13. Simple Data import • Just use some script language – Many data sources are small and simple enough to use a script language • Udemy uses Python for this purpose – Implemented a set of Python classes to handle different types of data import – Plan to open source this in 1st half of 2016 Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 14. Large Data Batch Import • Large data import and processing will require more scalable solution • Hadoop can be used for this purpose – SQL on Hadoop: Hive, Tajo, Presto and so on – Pig, Java MapReduce • Spark is getting a lot of attention and we plan to evaluate Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 15. Realtime Data import • Some of data better be imported as it happens • This requires different type of technology – Realtime Data Message Queue: Kafka, Kinesis – Realtime Data Consumer: Storm, Samza, Spark Streaming Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 16. What’s Next? (I) • Build Summary Tables – Having raw data tables is good but it can be too detailed and too much information – Build these tables in your Data Warehouse • Track the performance of key metrics – This should be from summary tables above – You need dashboard tool (build one or use 3rd party solution – Birst, chartIO, Tableau and so on) Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 17. What’s Next? (II) • Provide this data to Data Science team – Draw insight and create feedback loop – Build machine learned models for recommendation, search ranking and so on – The topic for the next session (Thanks Larry!) • Supporting Data Science from Infrastructure – This will require scalable infrastructure – Example: Scoring every pairs of user/course in Udemy • 7M users X 30K courses = 210B pairs of computation – You need scalable Serving Layer (Cassandra, Hbase, …) Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 18. DATA ENGINEERING AT UDEMY Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 19. Data Warehouse/Analytics • We use AWS Redshift as our data warehouse (or data analytics backend) • What is AWS Redshift? – Scalable Postgresql Engine up to 1.6PB of data – Roughly it is 600 USD per TB per month – Mainly for offline batch processing – Supports bulk update (through AWS S3) – Two type of options: Compute vs. Storage Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
  • 20. Kind of Data Stored in Redshift • 800+ tables with 2.4TB of data • Key tables from MySQL • Email Marketing Data • Ads Campaign Performance Data • SEO data from Google • Data from Web access log • Support Ticket Data • A/B Test Data (Mobile, Web) • Human curated data from Google Spreadsheets Ankara Data Meetup - Bilkent Cyberpark, August 5, 2015
Details on ETL Pipelines
• All data pipelines are scheduled through Pinball
– Every 5 minutes, hourly, daily, weekly and monthly
• Most pipelines are purely in Python
• Some use Hadoop/Hive and Hadoop/Pig for batch processing
• Starting to use Kinesis for realtime processing
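Schedulers like Pinball model a pipeline as a DAG of jobs and run each job only after its dependencies finish. A minimal sketch of that idea, not Pinball's actual API, with made-up job names:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Toy pipeline DAG: each job maps to the jobs it depends on.
dag = {
    "extract_mysql": [],
    "extract_logs": [],
    "load_redshift": ["extract_mysql", "extract_logs"],
    "build_summaries": ["load_redshift"],
}

def run(dag: dict) -> list:
    """Execute jobs in an order that respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())
    for job in order:
        print("running", job)  # a real scheduler would launch, retry, alert
    return order

order = run(dag)
```

A production scheduler adds the parts this sketch omits: per-job retries, failure alerting, and the cron-style triggers (every 5 minutes, hourly, daily) mentioned above.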
Pinball Screenshot
[Screenshot of the Pinball workflow UI]
Batch Processing Infrastructure
• We use Hadoop 2.6 with Hive and Pig
– CDH 5.4 (community version)
• We use our own Hadoop cluster and AWS EMR (Elastic MapReduce) at the same time
– This is used to do ETL on massive data
– This is also used to run massive model/scoring pipelines from the Data Science team
• Plan to evaluate Spark as a potential alternative
Realtime Processing
• Applications
– The first application is to process the web access log
– Eventually we plan to use this to generate personalized recommendations on the fly
• Plan to use AWS Kinesis
– Evaluated Apache Kafka and AWS Kinesis
• They are very similar, but Kafka is open source while Kinesis is a managed service from AWS
• Decided to use AWS Kinesis
• Plan to evaluate realtime consumers
– Samza, Storm, Spark Streaming
What is Kinesis (Kafka)?
• A realtime data processing service in AWS
– Publisher/subscriber message broker
– Very similar to Kafka
• It has two components
– One is the message queue, where the stream of data is stored
• 24-hour retention period
• Pay hourly by the read/write rate of the queue
– The other is the KCL (Kinesis Client Library)
• Use this to build a data producer or data consumer application
• It can be combined with Storm, Spark Streaming, …
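Both Kinesis and Kafka route each record to a shard (partition) by hashing a partition key, so all records with the same key stay in order on one shard. The sketch below illustrates the idea only; the real services use their own hash ranges, not this exact formula, and the shard count is an invented example:

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash the partition key to pick a shard, as Kinesis/Kafka do.
    (Sketch: not the services' actual hashing scheme.)"""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Records with the same key always land on the same shard,
# which preserves per-key ordering for consumers.
print("user-123 ->", shard_for("user-123"))
print("user-456 ->", shard_for("user-456"))
```

Choosing the partition key well (e.g. a user ID for per-user ordering) matters because a hot key concentrates all its traffic on a single shard.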
Data Serving Layer
• Redshift isn’t a good fit for reading data out in a realtime fashion, so you need something else
• We are using (or plan to use) the following:
– Cassandra
– Redis
– ElasticSearch
– MySQL
How It Looks
[Architecture diagram: Udemy log files (Nginx), MySQL, and external data sources flow through ETL into the Data Warehouse (Redshift); Data Science pipelines read from Redshift and feed the Key/Value serving store (Cassandra)]
LESSONS LEARNED
• As a small start-up, survive first and then work on data
• The starting point is to store all data in a single location (the data warehouse)
• Start with batch processing, then realtime
• Consider the type of data you store
– Log vs. key/value vs. transactional record
• Store data in the form of a log (change history)
– So that you can always go back and debug/replay
• Cloud is good unless you have really massive data
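The "store change history, not snapshots" lesson is easiest to see in code: if you keep the raw event log, you can rebuild state as of any point in time. Event names and fields here are illustrative only:

```python
# An append-only change log instead of a mutable snapshot table.
events = [
    {"ts": 1, "type": "enroll",   "user": 1, "course": 42},
    {"ts": 2, "type": "enroll",   "user": 1, "course": 7},
    {"ts": 3, "type": "unenroll", "user": 1, "course": 42},
]

def replay(events: list, up_to_ts: int = None) -> set:
    """Rebuild a user's enrollments by replaying the log up to a timestamp."""
    enrolled = set()
    for e in events:
        if up_to_ts is not None and e["ts"] > up_to_ts:
            break
        if e["type"] == "enroll":
            enrolled.add(e["course"])
        elif e["type"] == "unenroll":
            enrolled.discard(e["course"])
    return enrolled

print(replay(events))              # current state: {7}
print(replay(events, up_to_ts=2))  # state as of ts=2: {42, 7}
```

A snapshot table would only tell you the current state; the log lets you debug ("when did course 42 disappear?") and replay after a pipeline bug.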
Q & A
Udemy is Hiring

Editor’s notes

  1. Logging format. Don’t try to take a snapshot and do aggregation
  2. Add diagram MySQL was likely to be used to store all data and used by data analysts
  3. Add diagram MySQL was likely to be used to store all data and used by data analysts What happens when you don’t have this – everyone does their own analysis and derive their own conclusion – waste of resource from a lot of one-off efforts
  5. Realtime recommendation