4. YouTube - A Typical Web Application
• Daily/weekly registered users, by platform and by country?
• How many video uploads do we have every day?
7. Behavioral Data? (vs Transactional Data)
• Transactional Data
Mission-critical data (e.g. user accounts, bookings, payments)
Often fixed schema
Lower volume
Transaction control
• Behavioral Data
Logging data (e.g. page view, video start, ad impression)
Often semi-structured (JSON)
Huge volume
No transaction control
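To make the contrast concrete, here is a minimal sketch (ours, not from the deck; table and column names are hypothetical) of how the two kinds of data might look as Postgres DDL:

-- Transactional: fixed schema, integrity constraints, written inside transactions.
CREATE TABLE users (
    id         bigserial PRIMARY KEY,
    email      text NOT NULL UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Behavioral: one semi-structured JSON blob per event, no constraints;
-- properties vary across app versions and the volume is huge.
CREATE TABLE events (
    time bigint,  -- unix timestamp of the event
    v    json     -- event properties, e.g. {"event":"video_play", ...}
);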
15. 2. Centralizing & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
17. Getting All Data To 1 Place
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
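Conceptually, the incremental copy boils down to something like the following sketch (ours, not how thor is implemented; it assumes the source table is visible locally as a postgres_fdw foreign table src.video_plays and that rows carry a monotonically increasing id):

-- Sketch: append only the rows the destination has not seen yet.
INSERT INTO reporting.video_plays
SELECT *
FROM src.video_plays s
WHERE s.id > ( SELECT COALESCE( MAX( id ), 0 ) FROM reporting.video_plays );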
18. b) Click-stream Data (Hadoop → Analytics DB):
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a”,
"event":”video_play","timed_comment":"off”, "stream_quality":"variable”,
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", ”subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”,
"country":"ca", "city_name":"Toronto”, "region_name":"ON"}
…
date        source   partner  event       video_id  country  cnt
2013-09-29  ios      viki     video_play  1008912v  ca       2
2013-09-29  android  viki     video_play  1008912v  us       18
…
[Diagram: events in Hadoop → Aggregation (Hive) → export output via Sqoop → PostgreSQL]
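On the Postgres side the export lands in a table shaped like the sample rows above; a hedged sketch of the target DDL (ours; the column types are assumptions):

-- Hypothetical export target, columns taken from the aggregate output above.
CREATE TABLE reporting.video_plays_with_video_id (
    date     date,
    source   text,
    partner  text,
    event    text,
    video_id text,
    country  text,
    cnt      bigint
);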
19. SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
20. But…
The Data Is Not Clean!
Event properties and names change as we develop:
Old Version: {"user_id": "152", "country_code": "sg"}
New Version: {"user_id": "152u", "country": "sg"}
21. SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] ) END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
Hadoop
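One way to keep that cleanup logic from being repeated in every GROUP BY (our sketch, not something shown in the deck) is to compute the cleaned columns once in a subquery and aggregate over the aliases; abbreviated here to the country column only:

-- Sketch: clean once in a subquery, GROUP BY the alias instead of the CASE.
SELECT date_d, country, COUNT( 1 ) AS cnt
FROM (
    SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS date_d,
           LOWER( CASE WHEN LENGTH( TRIM( COALESCE( v['country'] ,v['country_code'] ) ) ) = 2
                       THEN TRIM( COALESCE( v['country'] ,v['country_code'] ) )
                       ELSE NULL END ) AS country
    FROM events
    WHERE time >= 1380326400 AND time <= 1380412799
      AND v['event'] = 'video_play'
) e
GROUP BY date_d, country;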
22. UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
…post-import cleanup
PostgreSQL
Cleaning Up Data Takes Lots of Time
23. Transforming Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
25. a) Reducing Table Size By Dropping Dimension (Aggregation)
video_plays_with_video_id (20M records):
date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…
video_plays (4M records):
date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
PostgreSQL
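The before/after above corresponds to an aggregation along these lines (a sketch using the table and column names from the slide):

-- Sketch: drop the video_id dimension by summing counts over the remaining keys.
INSERT INTO video_plays ( date, source, partner, event, country, cnt )
SELECT date, source, partner, event, country, SUM( cnt )
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;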
26. b) Injecting Extra Fields For Analysis
shows 1:n videos
shows (before):
id  title
1c  Game of Thrones
2c  How I Met Your Mother
…
shows (after):
id  title                  num_videos
1c  Game of Thrones        30
2c  How I Met Your Mother  16
…
PostgreSQL
27. Injecting Extra Fields For Analysis
containers 1:n videos
containers (before):
id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…
containers (after):
id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
PostgreSQL
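The extra column can be injected with SQL along these lines (our sketch; videos.container_id is an assumed foreign key, not shown in the deck):

-- Sketch: denormalize the 1:n containers/videos relationship into a count column.
ALTER TABLE containers ADD COLUMN video_count integer;

UPDATE containers c
SET video_count = ( SELECT COUNT(*) FROM videos v WHERE v.container_id = c.id );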
28. Chunk Tables By Month
…
video_plays_2013_06
video_plays_2013_07
video_plays_2013_08
video_plays_2013_09
video_plays (parent table)
ALTER TABLE video_plays_2013_09 INHERIT video_plays;
ALTER TABLE video_plays_2013_09
ADD CONSTRAINT video_plays_2013_09_date_check
CHECK ( date >= '2013-09-01' AND date < '2013-10-01' );
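With those CHECK constraints in place, Postgres can use constraint exclusion to prune child tables at query time, so a query against the parent only scans the matching month:

-- Only video_plays_2013_09 is scanned for this predicate.
SET constraint_exclusion = partition;
SELECT source, SUM( cnt )
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01'
GROUP BY source;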
29. Managing Job Dependencies
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Theme of the talk:
We have all this data at Viki: how do we collect it, make it usable, and then derive business value from it?
Other cool technologies that we use.
What are the predominant countries of origin for our content (Korea)?
How many Registered Users do we have and what is the growth trajectory?
What videos are translated into what languages and what is the percent completion rate?
What are our top performing shows (video starts) in the last 3 months? By geographic region?
What are the sources of our Registered Users (how do they find us, embed referral traffic, Google search, etc.)?
What videos are translated into what languages and what are the corresponding consumption patterns?
Website database tables: a user registration web flow persists the information provided by the user in a PostgreSQL table.
Accounting data: highly transactional with strict integrity requirements, e.g. Enterprise Resource Planning (ERP) and General Ledger (GL) accounting systems.
Transactional data is data whose structure is predefined and fixed, and whose integrity can be reasonably relied upon.
Behavioral data, by contrast, arrives as huge volumes of semi-structured logs, which can be distributed across a Hadoop cluster.
Add example of an event JSON
Simple?
Client libraries
We collect the events and put them in a queue
Structured and unstructured data
Centralizing All Data Sources
Data Cleanliness
Data Transformation
Managing Job Dependencies
To effectively run queries on our data, we need to bring all of it into the same database. We chose Postgres, since all our databases were already Postgres.
Does anyone here know Postgres? It's like MySQL, but better.
We've built command-line tools to copy tables from database to database. The first command above copies all tables in the public schema of the prod1 database to our analytics database and gives them their own schema; in PG, a schema is essentially a namespace for tables.
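For example (our illustration, not a command from the deck), each copied source gets its own schema in the analytics database:

-- Each source database gets its own namespace in the analytics DB;
-- copied tables then live at prod1.users, prod1.videos, ... (hypothetical
-- names) instead of colliding with same-named tables in public.
CREATE SCHEMA prod1;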
Take a look at one sample event stored in Hadoop in semi-structured JSON form: a video play event for that video id, running on an iPad, triggered by the auto-play feature, from Toronto, Ontario, Canada. That is a lot of dimensions. We want to aggregate and select a subset of dimensions to port into PG.
The Hadoop provider we use (Treasure Data) has a feature that lets you specify a destination data store (in this case Postgres): it executes the Hadoop job and writes the results into the selected database. It's the equivalent of using Sqoop to bulk-export data into Postgres.
As we develop, our data changes: we make mistakes, we forget to set a variable somewhere, we change our data structures. The new data gets mixed up with the old, and to make it meaningful, the simple query becomes not so simple.
Centralizing All Data Sources
Data Cleanliness
Data Transformation
Managing Job Dependencies
Once all our data is in Postgres, we perform transformations and aggregations on it for various purposes.
For example, to reduce the size of a table so it can be served through a web UI front-end, we aggregate the data further. In this example we drop the video_id dimension, grouping the two records into a single new record whose cnt field totals 20.
That reduces the table size.
We also chunk our data tables by month, so that when a new month comes we don't touch the old months' data.
This also reduces the index size and makes it easier to archive old data.
When we first implemented this, we didn't know how to query across months, so we had to write complicated queries (with UNIONs); sometimes we even had to load the data into memory and process it there.
But then we found out about an awesome feature in Postgres called table inheritance. It lets you define a parent table with a bunch of children; you just query the parent table, and depending on your query, it works out the correct child tables to hit.
Centralizing All Data Sources
Data Cleanliness
Data Transformation
Managing Job Dependencies
Can anyone tell me what this means? OK, no one can. That's exactly my point.
At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
There are too many reports! I want to see the high-level metrics all in one place.
Enabling the product and business folks to "write" their own queries.
How do you process 1k events a second?
Scalable and distributed
Guaranteed Message Passing
Fault Tolerant
We try to detect peaks and valleys in our real-time data, and send out alerts every hour if there are any.
We are always exploring new technologies and finding the best tool for the job. The real reason is we get bored ;)
I wasn't hired as a Rails developer, I was hired as a developer.