The World of Twitter as Seen Through Apache Drill

Tokyo Apache Drill Meetup
November 12, 2015
Masaru Watanabe

About me
Name: Masaru Watanabe (渡部優)
Affiliation: CyberAgent, Inc. / AMoAd
Role: Server-side engineer

Day-to-day work
AMoAd
Ad network service for smartphones
Development of RTB delivery servers, aggregation batch jobs, etc.
Users x Media x Advertisers
Languages
Scala, Java, Python (, Ruby)
Analytics
Hadoop, Spark, BigQuery
Notebooks: Zeppelin, Jupyter
Drill: 1 week ← New!

1. Practically everyone knows it
2. The data is reasonably large
3. I wanted to try Drill in distributed mode

Data analyzed in this talk
User profiles available through the Twitter REST API
* This is not tweet data
Collected: mid-October 2013
Accounts: about 800 million
Format: line-delimited JSON
Data size: about 150 GB (gzip-compressed)
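
For readers who want to inspect such a file without a cluster, here is a minimal Python sketch of reading gzip-compressed, line-delimited JSON. The helper name and the two sample records are illustrative assumptions, not part of the talk; `id` and `name` are fields of the Twitter user object.

```python
import gzip
import io
import json


def iter_profiles(stream):
    """Yield one dict per non-empty line of a line-delimited JSON text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample, compressed in memory instead of read from disk.
sample = '{"id": 6253282, "name": "Twitter API"}\n{"id": 1, "name": "example"}\n'
buf = io.BytesIO(gzip.compress(sample.encode("utf-8")))

# gzip.open accepts a path or a file object; "rt" decodes the bytes to text lines.
with gzip.open(buf, "rt", encoding="utf-8") as f:
    profiles = list(iter_profiles(f))

print(len(profiles), profiles[0]["name"])
```

Reading line by line keeps memory use flat, which matters at this data size; passing a file path to `gzip.open` instead of `buf` works the same way.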

User profiles?
Twitter REST API 1.1 GET users/lookup
Account settings, tweet count, follower count, and so on
{
"name": "Twitter API",
"profile_sidebar_fill_color": "DDEEF6",
"profile_background_tile": false,
"profile_sidebar_border_color": "C0DEED",
"profile_image_url": "http://a0.twimg.com/profile_images/2284174872/7df3h38
"location": "San Francisco, CA",
"created_at": "Wed May 23 06:01:13 +0000 2007",
"follow_request_sent": false,
"id_str": "6253282",
"profile_link_color": "0084B4",
"is_translator": false,
"default_profile": true,
"favourites_count": 24,
"contributors_enabled": true,
"url": "http://dev.twitter.com",
"profile_image_url_https": "https://si0.twimg.com/profile_images/2284174872

Drill cluster
Apache Drill version 1.2
Cores: 120 (12 x 10 nodes)
Memory: 40 GB / node
Storage: MapR-FS

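Note
For this demo, the data was narrowed down to only the columns needed
Format: line-delimited JSON
After conversion: Parquet
Data size: about 150 GB (gzip-compressed)
After conversion: about 43 GB (Snappy-compressed)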

Query for generating the demo data
Change the output format:
ALTER SYSTEM SET `store.format` = 'parquet'

A single query then converts the JSON to Parquet:
CREATE TABLE
dfs.tmp.`twitter/user_2013_demo`
AS (
SELECT
name,
location,
created_at,
favourites_count,
url,
utc_offset,
id,
listed_count,
lang,
followers_count,
protected,
geo_enabled,
time_zone,

Demo: in distributed mode, queries are submitted from the Web UI
Apache Drill Cluster

Are there really 800 million accounts?
SELECT
COUNT( id ) AS ids
FROM
dfs.`tmp/twitter/user_2013_demo`

Are there really 800 million accounts?
Result
ids
857,645,612

Ranking by tweet count
SELECT
RANK() OVER ( ORDER BY statuses_count DESC ) AS ranking,
screen_name,
statuses_count
FROM
WHERE
--statuses_count > 1E4 -- 90 sec
--statuses_count > 1E5 -- 60 sec
statuses_count > 1E6 -- 40 sec
ORDER BY
ranking
LIMIT 1000
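
The timings in the commented-out thresholds suggest that a higher cutoff shortens the run, presumably because the WHERE filter is applied before the window function ranks the remaining rows. Below is a rough Python sketch of that filter-then-rank order, including how RANK() treats ties. The function name and the sample rows are hypothetical.

```python
def rank_desc(rows, key, min_value, limit=1000):
    """Mimic WHERE key > min_value, RANK() OVER (ORDER BY key DESC), LIMIT."""
    # Filter first: only the surviving rows need to be sorted and ranked.
    kept = sorted(
        (r for r in rows if r[key] > min_value),
        key=lambda r: r[key],
        reverse=True,
    )
    ranked = []
    for position, row in enumerate(kept, start=1):
        if ranked and ranked[-1][1][key] == row[key]:
            rank = ranked[-1][0]  # ties share a rank
        else:
            rank = position  # RANK() leaves gaps after ties
        ranked.append((rank, row))
    return ranked[:limit]


# Hypothetical rows, not taken from the dataset.
rows = [
    {"screen_name": "a", "statuses_count": 3_000_000},
    {"screen_name": "b", "statuses_count": 2_000_000},
    {"screen_name": "c", "statuses_count": 2_000_000},
    {"screen_name": "d", "statuses_count": 1_500_000},
    {"screen_name": "e", "statuses_count": 500},
]
result = [(rank, r["screen_name"]) for rank, r in rank_desc(rows, "statuses_count", 1e6)]
print(result)
```

Row `e` never reaches the sort, and `b` and `c` share rank 2 so that `d` lands on rank 4.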

Ranking by tweet count
rank  screen_name      tweets
1     YOUGAKUDAN_00    37,300,610
2     RS_ian           10,675,618
3     NotFoundProfile   8,156,645
4     ___Scc____        7,083,202
5     ____S__A____      7,099,741

Ranking by tweet rate
SELECT
RANK() OVER ( ORDER BY tweets_per_hour DESC ) AS ranking,
*
FROM (
SELECT
screen_name,
statuses_count / (
UNIX_TIMESTAMP( '2013-10-25 00:00:00' ) -
UNIX_TIMESTAMP( created_at, 'EEE MMM dd HH:mm:ss Z yyyy' )
) * 60 * 60 AS tweets_per_hour,
statuses_count,
created_at
FROM
WHERE
statuses_count > 2E6 -- 60 sec
)
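
The rate in the inner query is the tweet count divided by the account age in seconds, scaled to hours. A small Python sketch of the same arithmetic follows. It assumes the snapshot time `2013-10-25 00:00:00` is meant as UTC (the query does not say), and it uses floating-point division throughout. The function name and sample values are hypothetical.

```python
from datetime import datetime, timezone

# Python equivalent of the 'EEE MMM dd HH:mm:ss Z yyyy' pattern in the query.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Assumption: the snapshot timestamp in the query is interpreted as UTC.
SNAPSHOT = datetime(2013, 10, 25, tzinfo=timezone.utc)


def tweets_per_hour(statuses_count, created_at, snapshot=SNAPSHOT):
    """Average tweets per hour between account creation and the snapshot."""
    created = datetime.strptime(created_at, CREATED_AT_FORMAT)
    elapsed_seconds = (snapshot - created).total_seconds()
    return statuses_count / elapsed_seconds * 60 * 60


# Hypothetical account created exactly 100 hours before the snapshot.
rate = tweets_per_hour(1000, "Sun Oct 20 20:00:00 +0000 2013")
print(round(rate, 1))
```

If the tweet count and both timestamps are integers in the engine, it may be worth checking whether the division truncates before the multiplication by 3600 is applied.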

Ranking by tweet rate
rank  screen_name      tweets (/1h)  tweets
1     YOUGAKUDAN_00    1008.8        37,300,610
2     RS_ian            645.7        10,675,618
3     NotFoundProfile   544.6         8,156,645
4     ___Scc____        444.2         7,083,202
5     ____S__A____      374.3         7,099,741

Ranking of regions by number of users
Computed from time_zone instead of location (free-form text)
SELECT
RANK() OVER ( ORDER BY users DESC ) AS ranking,
*
FROM (
SELECT
time_zone, -- not use 'location'
COUNT(1) AS users
FROM
WHERE
NOT (time_zone = '') -- 30 sec
GROUP BY
time_zone )
ORDER BY
ranking
LIMIT 1000
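
The same group-and-count step can be sketched in Python for a small sample. Note that in SQL the predicate `NOT (time_zone = '')` also drops rows where `time_zone` is NULL, because the comparison evaluates to unknown; the sketch mirrors that by skipping both empty and missing values. The function name and the sample profiles are hypothetical.

```python
from collections import Counter


def rank_time_zones(profiles, limit=1000):
    """Count users per time_zone, skipping empty and missing values."""
    counts = Counter(p["time_zone"] for p in profiles if p.get("time_zone"))
    # most_common sorts by count, descending, like ORDER BY users DESC.
    return counts.most_common(limit)


# Hypothetical sample profiles.
profiles = [
    {"time_zone": "Tokyo"},
    {"time_zone": "Hawaii"},
    {"time_zone": "Tokyo"},
    {"time_zone": ""},
    {"time_zone": None},
]
top = rank_time_zones(profiles)
print(top)
```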

rank time_zone users
1 Central Time (US & Canada) 15,198,713
2 Eastern Time (US & Canada) 14,932,129
3 Pacific Time (US & Canada) 13,234,999
4 Tokyo 11,972,200
5 Hawaii 9,652,707
6 Brasilia 9,350,127
7 Quito 8,553,937
8 Santiago 6,684,055
9 Greenland 6,330,616
10 Bangkok 6,251,371

Taking the account creation date into account as well...

Account creation date x user growth by region
Reference: 2009: Hudson River plane crash
http://jp.techcrunch.com/2009/01/16/20090115plane-crashes-in-hudson-first-pictures-on-flickr-tumblr-twitpic

Ranking by follower count
SELECT
RANK() OVER ( ORDER BY followers_count DESC ) AS ranking,
screen_name,
followers_count
FROM
WHERE
followers_count > 1E5 -- 40 sec
ORDER BY
ranking
LIMIT 1000

rank screen_name followers
1 justinbieber 44,608,861
2 katyperry 42,862,721
3 ladygaga 40,127,107
4 BarackObama 36,678,153
5 taylorswift13 34,121,718
6 YouTube 33,581,290
7 britneyspears 31,940,642
8 rihanna 31,666,558
9 instagram 26,664,278
10 jtimberlake 25,837,905

Ranking by following count
SELECT
RANK() OVER ( ORDER BY friends_count DESC ) AS ranking,
screen_name,
friends_count
FROM
WHERE
friends_count > 1E5 -- 40 sec
ORDER BY
ranking
LIMIT 1000

rank screen_name following
1 ArabicBest 2,122,665
2 TheArabHash 2,035,893
3 Asr3Follow 1,896,414
4 ArabicScience 1,436,979
5 STOP_Q80 1,428,028
6 VenGalL 1,395,800
7 hootsuite 1,354,751
8 7madms 1,327,781
9 3LCHBOSAHAM 1,252,964
10 Do3av 1,165,899

Measuring the influence of celebrities
Working assumption: an account with more than 10,000 followers counts as a celebrity
SELECT
SUM( followers_count ) AS total_followers,
SUM( CASE WHEN followers_count > 10000 THEN followers_count ELSE 0 END ) AS celb_followers,
SUM( CASE WHEN followers_count <= 10000 THEN followers_count ELSE 0 END ) AS normal_followers,
COUNT( 1 ) AS total_users,
SUM( CASE WHEN followers_count > 10000 THEN 1 END ) AS celb_users,
SUM( CASE WHEN followers_count <= 10000 THEN 1 ELSE 0 END ) AS normal_users
FROM
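
The SUM( CASE WHEN ... ) pattern computes several conditional totals in a single pass over the table. A minimal Python analogue is sketched below; the key names mirror the column names in the talk's result table, while the function names and the sample follower counts are hypothetical.

```python
def summarize(follower_counts, threshold=10000):
    """Python analogue of the SUM(CASE WHEN ...) columns in the query."""
    celeb = [n for n in follower_counts if n > threshold]
    normal = [n for n in follower_counts if n <= threshold]
    return {
        "total_followers": sum(follower_counts),
        "celb_followers": sum(celeb),
        "normal_followers": sum(normal),
        "total_users": len(follower_counts),
        "celb_users": len(celeb),
        "normal_users": len(normal),
    }


def share(part, total):
    """Percentage of total, rounded to two decimals."""
    return round(100.0 * part / total, 2)


# Hypothetical follower counts for six accounts.
counts = [50_000, 12_000, 900, 80, 20, 0]
summary = summarize(counts)
print(summary)
print(share(summary["celb_followers"], summary["total_followers"]))
print(share(summary["celb_users"], summary["total_users"]))
```

Dividing each conditional total by the overall total gives the percentage columns of the result slide.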

Measuring the influence of celebrities
                  count           percent
total_followers   52,002,443,928  100.0 %
celb_followers    21,410,243,998   41.2 %
normal_followers  30,592,199,930   58.8 %
                  count           percent
total_users       857,645,612     100.0 %
celb_users        381,567          0.04 %
normal_users      857,264,045     99.96 %
"The rich get richer and the poor get poorer"

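Takeaways from surveying Twitter with Apache Drill
1. Even a Drill beginner could easily run queries in a distributed environment
2. Ad hoc queries against a variety of data sources really are effortless
3. Queries against large data need a little care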

Thank you!
(If I get the chance, I would like to try analyzing tweets as well)
