2. eBuddy
Web-based chat (started in 2003)
● Initially no statistics, MSN only
● Started basic logging in 2004
● Today
○ 34,467,010,693 login records (about 34 × 10^9)
○ Selecting them all takes about 40 minutes.
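For a rough sense of scale, the record count and the scan time together imply a scan rate. A back-of-the-envelope sketch in Python, assuming that "about 40 minutes" is exact and that the scan runs evenly across the 18 Hadoop nodes quoted later in the deck (the slides state neither):

```python
# Figures from the slides; the even split across nodes is an assumption.
LOGIN_RECORDS = 34_467_010_693
SCAN_SECONDS = 40 * 60   # "about 40 minutes", treated as exact here
HADOOP_NODES = 18        # cluster size quoted later in the deck

records_per_second = LOGIN_RECORDS / SCAN_SECONDS
records_per_node_second = records_per_second / HADOOP_NODES

print(f"{records_per_second / 1e6:.1f} million records/s overall")
print(f"{records_per_node_second / 1e3:.0f} thousand records/s per node")
```

Under those assumptions the full select reads on the order of 14 million records per second.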
XMS (Launched May 23, 2011)
● Today
○ 1,334,794,121 records (about 1.3 × 10^9)
Website (Google Analytics)
Banners (OpenX)
3. Warehousing needs
● Product owners
○ Comparing product versions
■ average duration
■ messages sent/received
○ Churn analysis
○ Feature analysis
● Marketing
○ What countries should we focus on?
○ What people should we target?
● Sales
○ Sell banners per country/product
● Operations/Dev
○ Help solve bugs
○ Find out where we are blocked (countries/providers)
6. Interesting to know
● Developers are Java centric
● Hosting in the US but BI people in Amsterdam
● 18 Hadoop nodes, each with
○ 16 cores
○ 24 GB RAM
○ 4 × 400 GB hard disks
● We make money with banners
○ So don't expect deep pockets
7. Warehouse timeline
● Traditional RDBMS (2004)
● Custom MapReduce code (2008)
○ Joining two files (merge join/map join?)
○ Repeating code
○ Consider abstraction
○ Changing data means changing code?
● Pig scripts (2008/2009)
○ Much simpler to read but domain-specific
● Hive (2009)
○ Generic SQL but with some limitations
○ Existing tools can be used
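The "merge join/map join?" question in the custom MapReduce phase comes down to two ways of joining two files. A minimal Python sketch of both, using in-memory lists of (key, value) pairs as stand-ins for the files; the function names and sample data are illustrative, not taken from the eBuddy code:

```python
from collections import defaultdict

def map_join(big_rows, small_rows):
    """Map-side (hash) join: hold the small input in memory, stream the big one."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    for key, value in big_rows:
        for other in lookup.get(key, []):
            yield key, value, other

def merge_join(left_sorted, right_sorted):
    """Sort-merge join: both inputs must already be sorted by key."""
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        lkey, rkey = left_sorted[i][0], right_sorted[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left_sorted) and left_sorted[i][0] == lkey:
                for _, rvalue in right_sorted[j:j_end]:
                    yield lkey, left_sorted[i][1], rvalue
                i += 1
            j = j_end

# Hypothetical sample data, sorted by key.
sessions = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
users = [(2, "nl"), (3, "us"), (4, "de")]

print(list(map_join(sessions, users)))
print(list(merge_join(sessions, users)))
```

The map join only works when one side fits in memory; the merge join needs no such limit but requires both inputs to be sorted on the join key first. Writing and rewriting this kind of logic per job is the "repeating code" that Pig and Hive abstract away.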
8. Hive
● Hey I already know this:
select *
from table1 t1
left outer join table2 t2 on (t1.id = t2.id)
where t2.id is null;
● Java programmers will like this:
○ Spring JdbcTemplates
○ Existing JDBC tools (SQuirreL)
○ Syntax highlighting
○ Code completion
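The query on this slide is plain SQL, so it can be tried outside Hive. A small sketch that runs it against an in-memory SQLite database from Python; SQLite is only a stand-in here, the column `name` and the sample rows are made up, and explicit columns replace `select *` to keep the output short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO table2 VALUES (1), (3);
""")

# Same shape as the slide's query: keep table1 rows with no match in table2.
unmatched = conn.execute("""
    SELECT t1.id, t1.name
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON (t1.id = t2.id)
    WHERE t2.id IS NULL
""").fetchall()

print(unmatched)  # [(2, 'bob')]
```

The left outer join keeps every row of table1, and the `is null` filter then drops the ones that found a partner, leaving only the unmatched rows.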
9. Present
● App servers log to MySQL
○ Brittle but it works
● Hive
○ SQL (most developers know this)
○ Partition pruning issues
○ No rollup queries
● ETL
○ Star schema
○ Fair scheduling (ETL vs BI)
■ slots reserved for the ETL pool
■ don't start reducers until 90% of the mappers are done
○ LZO compression on all jobs
● MicroStrategy (ODBC)
● SQuirreL (JDBC)
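The slide lists rollup queries as missing from Hive. A rollup adds subtotal rows for every prefix of the grouping columns, which otherwise takes one GROUP BY per level. A minimal Python sketch of what such a query computes; the column names (country, product, sessions) and the data are hypothetical:

```python
from collections import Counter

def rollup(rows, dims, measure):
    """Sum `measure` for every prefix of `dims`; None marks a rolled-up column."""
    totals = Counter()
    for row in rows:
        for depth in range(len(dims), -1, -1):
            key = tuple(row[d] for d in dims[:depth]) + (None,) * (len(dims) - depth)
            totals[key] += row[measure]
    return dict(totals)

# Hypothetical fact rows.
rows = [
    {"country": "NL", "product": "web", "sessions": 3},
    {"country": "NL", "product": "xms", "sessions": 2},
    {"country": "US", "product": "web", "sessions": 5},
]

for key, total in sorted(rollup(rows, ["country", "product"], "sessions").items(),
                         key=lambda item: str(item[0])):
    print(key, total)
```

Each input row contributes to its own detail row, to its per-country subtotal, and to the grand total, which is why the emulation in SQL needs one aggregation per level.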
10. Future
● Look at users from A to Z
○ website logs
○ banners
● Cassandra handler for Hive
○ Looking at contact lists (not just size)
● Streaming ETL
○ Flume
■ No more MySQL and scripts
■ Write directly into the correct partition
○ Avro
■ Fewer schema-related problems
○ Snappy
■ Lightweight compression
12. Hive partition pruning
● Won't prune (the partition key is only filtered through the join)
select count(*)
from chatsessions cs
inner join calendar c on (c.cldr_id = cs.login_cldr_id)
where c.iso_date = '2012-06-14';
● Will prune (look up the key first, then filter on it as a literal)
select cldr_id from calendar where iso_date = '2012-06-14';
select count(*) from chatsessions where login_cldr_id in (1234);
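A likely reading of this slide: Hive picks the partitions to scan from constant predicates on the partition column, and a date filter that only reaches chatsessions through the join on calendar is not known when the scan is planned. The Python sketch below mimics both query shapes on a toy table and counts how many partitions each one reads. It assumes chatsessions is partitioned by login_cldr_id; the partition contents and calendar rows are made up:

```python
# Hypothetical stand-in for chatsessions, partitioned by login_cldr_id:
# partition key -> rows stored in that partition.
chatsessions = {
    1233: ["s1", "s2"],
    1234: ["s3", "s4", "s5"],
    1235: ["s6"],
}
# Hypothetical calendar dimension: cldr_id -> iso_date.
calendar = {1233: "2012-06-13", 1234: "2012-06-14", 1235: "2012-06-15"}

def count_with_join(iso_date):
    """Filter through the join: every partition is read, rows are dropped afterwards."""
    partitions_read = 0
    matches = 0
    for cldr_id, rows in chatsessions.items():
        partitions_read += 1
        if calendar.get(cldr_id) == iso_date:
            matches += len(rows)
    return matches, partitions_read

def count_with_literal(iso_date):
    """Two steps: resolve the key first, then read only the partitions it names."""
    wanted = [cldr_id for cldr_id, day in calendar.items() if day == iso_date]
    partitions_read = 0
    matches = 0
    for cldr_id in wanted:
        if cldr_id in chatsessions:
            partitions_read += 1
            matches += len(chatsessions[cldr_id])
    return matches, partitions_read

print(count_with_join("2012-06-14"))     # (3, 3): all partitions read
print(count_with_literal("2012-06-14"))  # (3, 1): one partition read
```

Both shapes return the same count; only the amount of data scanned differs, which on a table with billions of rows is the difference that matters.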
14. Left outer join in Pig
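-- Load both inputs as (key, value) pairs
A = LOAD 'file1' USING PigStorage(',') AS (a1:int,a2:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (b1:int,b2:chararray);
-- Group both relations by key, keeping keys without a match on the other side
C = COGROUP A BY a1, B BY b1 OUTER;
-- Keep only the keys that never occur in B
X = FILTER C BY IsEmpty(B);
-- Emit the a2 values of the unmatched A rows
Z = FOREACH X GENERATE flatten(A.a2);
DUMP Z;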
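For readers who know neither Pig nor its bag model, the script's data flow can be traced step by step in Python. This is only an illustration of the semantics: the `cogroup` helper is a simplified stand-in for Pig's COGROUP, and the sample rows are invented:

```python
def cogroup(a_rows, b_rows):
    """Group both relations by their first field, keeping keys from either side."""
    groups = {}
    for row in a_rows:
        groups.setdefault(row[0], ([], []))[0].append(row)
    for row in b_rows:
        groups.setdefault(row[0], ([], []))[1].append(row)
    return groups

# Hypothetical contents of file1 and file2.
A = [(1, "one"), (2, "two"), (3, "three")]
B = [(1, "uno"), (3, "tres")]

C = cogroup(A, B)
# FILTER C BY IsEmpty(B): keep keys whose B bag is empty.
X = {key: bags for key, bags in C.items() if not bags[1]}
# FOREACH X GENERATE flatten(A.a2): unpack the a2 values of the A bag.
Z = [a2 for a_bag, b_bag in X.values() for a1, a2 in a_bag]

print(Z)  # ['two']
```

The result matches the Hive query with `where t2.id is null` shown earlier: only the rows of the first input without a partner in the second survive.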