4. YouTube - A Typical Web Application
• Daily/weekly registered users, by platform and by country?
• How many video uploads do we have every day?
7. Behavioral Data? (vs Transactional Data)
• Transactional Data
Mission-critical data (e.g. user accounts, bookings, payments)
Often fixed schema
Lower volume
Transaction control
• Behavioral Data
Logging data (e.g. page view, video start, ad impression)
Often semi-structured (JSON)
Huge volume
No transaction control
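To make the contrast concrete, here is a minimal sketch (ours, not from the deck; table and column names are hypothetical) of how the two kinds of data might look as Postgres DDL:

-- Transactional: fixed schema, integrity constraints, written inside transactions.
CREATE TABLE users (
    id         bigserial PRIMARY KEY,
    email      text NOT NULL UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Behavioral: one semi-structured JSON blob per event, no constraints;
-- properties vary across app versions and the volume is huge.
CREATE TABLE events (
    time bigint,  -- unix timestamp of the event
    v    json     -- event properties, e.g. {"event":"video_play", ...}
);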
15. 2. Centralizing & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
17. Getting All Data To 1 Place
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
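Conceptually, the incremental copy boils down to something like the following sketch (ours, not how thor is implemented; it assumes the source table is visible locally as a postgres_fdw foreign table src.video_plays and that rows carry a monotonically increasing id):

-- Sketch: append only the rows the destination has not seen yet.
INSERT INTO reporting.video_plays
SELECT *
FROM src.video_plays s
WHERE s.id > ( SELECT COALESCE( MAX( id ), 0 ) FROM reporting.video_plays );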
18. b) Click-stream Data (Hadoop → Analytics DB):
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a”,
"event":”video_play","timed_comment":"off”, "stream_quality":"variable”,
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", ”subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”,
"country":"ca", "city_name":"Toronto”, "region_name":"ON"}
…
date        source   partner  event       video_id  country  cnt
2013-09-29  ios      viki     video_play  1008912v  ca       2
2013-09-29  android  viki     video_play  1008912v  us       18
…
[Diagram: events in Hadoop → Aggregation (Hive) → export output via Sqoop → PostgreSQL]
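On the Postgres side the export lands in a table shaped like the sample rows above; a hedged sketch of the target DDL (ours; the column types are assumptions):

-- Hypothetical export target, columns taken from the aggregate output above.
CREATE TABLE reporting.video_plays_with_video_id (
    date     date,
    source   text,
    partner  text,
    event    text,
    video_id text,
    country  text,
    cnt      bigint
);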
19. SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
20. But…
The Data Is Not Clean!
Event properties and names change as we develop:
Old Version: {"user_id": "152", "country_code": "sg"}
New Version: {"user_id": "152u", "country": "sg"}
21. SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] ) END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
Hadoop
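One way to keep that cleanup logic from being repeated in every GROUP BY (our sketch, not something shown in the deck) is to compute the cleaned columns once in a subquery and aggregate over the aliases; abbreviated here to the country column only:

-- Sketch: clean once in a subquery, GROUP BY the alias instead of the CASE.
SELECT date_d, country, COUNT( 1 ) AS cnt
FROM (
    SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS date_d,
           LOWER( CASE WHEN LENGTH( TRIM( COALESCE( v['country'] ,v['country_code'] ) ) ) = 2
                       THEN TRIM( COALESCE( v['country'] ,v['country_code'] ) )
                       ELSE NULL END ) AS country
    FROM events
    WHERE time >= 1380326400 AND time <= 1380412799
      AND v['event'] = 'video_play'
) e
GROUP BY date_d, country;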
22. UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
…post-import cleanup
PostgreSQL
Cleaning Up Data Takes Lots of Time
23. Transforming Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
25. a) Reducing Table Size By Dropping Dimension (Aggregation)
video_plays_with_video_id (20M records):
date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…
video_plays (4M records):
date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
PostgreSQL
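The before/after above corresponds to an aggregation along these lines (a sketch using the table and column names from the slide):

-- Sketch: drop the video_id dimension by summing counts over the remaining keys.
INSERT INTO video_plays ( date, source, partner, event, country, cnt )
SELECT date, source, partner, event, country, SUM( cnt )
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;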
26. b) Injecting Extra Fields For Analysis
shows 1:n videos
shows (before):
id  title
1c  Game of Thrones
2c  How I Met Your Mother
…
shows (after):
id  title                  num_videos
1c  Game of Thrones        30
2c  How I Met Your Mother  16
…
PostgreSQL
27. Injecting Extra Fields For Analysis
containers 1:n videos
containers (before):
id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…
containers (after):
id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
PostgreSQL
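The extra column can be injected with SQL along these lines (our sketch; videos.container_id is an assumed foreign key, not shown in the deck):

-- Sketch: denormalize the 1:n containers/videos relationship into a count column.
ALTER TABLE containers ADD COLUMN video_count integer;

UPDATE containers c
SET video_count = ( SELECT COUNT(*) FROM videos v WHERE v.container_id = c.id );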
28. Chunk Tables By Month
…
video_plays_2013_06
video_plays_2013_07
video_plays_2013_08
video_plays_2013_09
video_plays (parent table)
ALTER TABLE video_plays_2013_09 INHERIT video_plays;
ALTER TABLE video_plays_2013_09
ADD CONSTRAINT video_plays_2013_09_date_check
CHECK ( date >= '2013-09-01' AND date < '2013-10-01' );
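With those CHECK constraints in place, Postgres can use constraint exclusion to prune child tables at query time, so a query against the parent only scans the matching month:

-- Only video_plays_2013_09 is scanned for this predicate.
SET constraint_exclusion = partition;
SELECT source, SUM( cnt )
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01'
GROUP BY source;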
29. Managing Job Dependencies
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Theme of the talk:
We have all this data at Viki: how do we collect it, make it usable, and then derive business value from it?
Other cool technologies that we use.
What are the predominant countries of origin for our content (Korea)?
How many Registered Users do we have and what is the growth trajectory?
What videos are translated into what languages and what is the percent completion rate?
What are our top performing shows (video starts) in the last 3 months? By geographic region?
What are the sources of our Registered Users (how do they find us, embed referral traffic, Google search, etc.)?
What videos are translated into what languages and what are the corresponding consumption patterns?
Website database tables: a user registration web flow persists the information provided by the user in a PostgreSQL table.
Accounting data: highly transactional with strict integrity requirements, e.g. Enterprise Resource Planning (ERP) and General Ledger (GL) accounting systems.
Transactional data is data whose structure is predefined and fixed, and whose integrity can be reasonably relied upon.
Behavioral data, by contrast, arrives as huge volumes of semi-structured logs, which can be distributed across a Hadoop cluster.
Add example of an event JSON
Simple?
Client libraries
We collect the events and put them in a queue
Structured and unstructured data
Centralizing All Data Sources
Data Cleanliness
Data Transformation
Managing Job Dependencies
To effectively run queries on our data, we need to bring all of it into the same database. We chose Postgres, since all our databases were already Postgres.
Does anyone here know Postgres? It's like MySQL, but better.
We've built command-line tools to copy tables from database to database. The first command above copies all tables in the public schema of the prod1 database to our analytics database and gives them their own schema; in PG, a schema is essentially a namespace for tables.
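For example (our illustration, not a command from the deck), each copied source gets its own schema in the analytics database:

-- Each source database gets its own namespace in the analytics DB;
-- copied tables then live at prod1.users, prod1.videos, ... (hypothetical
-- names) instead of colliding with same-named tables in public.
CREATE SCHEMA prod1;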
Take a look at one sample event stored in Hadoop in semi-structured JSON form: a video play event for that video id, running on an iPad, triggered by the auto-play feature, from Toronto, Ontario, Canada. That is a lot of dimensions. We want to aggregate and select a subset of dimensions to port into PG.
The Hadoop provider we use (Treasure Data) has a feature that lets you specify a destination data store (in this case Postgres): it executes the Hadoop job and writes the results into the selected database. It's the equivalent of using Sqoop to bulk-export data into Postgres.
As we develop, our data changes: we make mistakes, we forget to set a variable somewhere, we change our data structures. The new data gets mixed up with the old, and to make it meaningful, the simple query becomes not so simple.
Centralizing All Data Sources
Data Cleanliness
Data Transformation
Managing Job Dependencies
Once all our data is in Postgres, we perform transformations and aggregations on it for various purposes.
For example, to reduce the size of a table so it can be served through a web UI front-end, we aggregate the data further. In this example we drop the video_id dimension, grouping the two records into a single new record whose cnt field totals 20.
That reduces the table size.
We also chunk our data tables by month, so that when a new month comes we don't touch the old months' data.
This also reduces the index size and makes it easier to archive old data.
When we first implemented this, we didn't know how to query across months, so we had to write complicated queries (with UNIONs); sometimes we even had to load the data into memory and process it there.
But then we found out about an awesome feature in Postgres called table inheritance. It lets you define a parent table with a bunch of children; you just query the parent table, and depending on your query, it works out the correct child tables to hit.
Centralizing All Data Sources
Data Cleanliness
Data Transformation
Managing Job Dependencies
Can anyone tell me what this means? OK, no one can. That's exactly my point.
At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
There are too many reports! I want to see the high-level metrics all in one place.
Enabling the product and business folks to "write" their own queries.
How do you process 1k events a second?
Scalable and distributed
Guaranteed Message Passing
Fault Tolerant
We try to detect peaks and valleys in our real-time data, and send out alerts every hour if there are any.
We are always exploring new technologies and finding the best tool for the job. The real reason is we get bored ;)
I wasn't hired as a Rails developer, I was hired as a developer.