3. Today's agenda
• Big data at Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
4. What happens at Miaozhen
• 3 billion ad impressions per day
• 20 TB of data scanned for report generation every morning
• A 24-server cluster
• Besides this:
– TV Monitor
– Mobile Monitor
– Site Monitor
– …
5. Before Hadoop
• Scrat
– PostgreSQL 9.1 cluster
– We wrote a simple proxy
– <2 s for a 2 TB data scan
• Mobile Monitor
– A Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– We wrote a Map-Reduce in C++
– Handles 30 million to 500 million ad impressions
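The C++ Map-Reduce mentioned above can be sketched in miniature as a map/shuffle/reduce over log lines. This is an illustrative sketch only: the log field layout, `map_line`, `reduce`, and `count_impressions` are all assumed names, not Miaozhen's actual code.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Map phase: each log line is mapped to a (campaign, 1) pair.
// We assume (hypothetically) the campaign id is the first comma-separated field.
std::pair<std::string, int> map_line(const std::string& line) {
    return {line.substr(0, line.find(',')), 1};
}

// Shuffle + reduce phase: group the pairs by key and sum the counts.
std::map<std::string, int> reduce(
    const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;
    return counts;
}

// Driver: run map over every line, then reduce.
std::map<std::string, int> count_impressions(
    const std::vector<std::string>& lines) {
    std::vector<std::pair<std::string, int>> pairs;
    for (const auto& line : lines) pairs.push_back(map_line(line));
    return reduce(pairs);
}
```

In the real system the map and reduce halves ran on different servers, with RabbitMQ carrying the intermediate pairs.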
6. Problem & Chance
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
– Most data is relational
– SQL interface
8. What's this
• A kind of MPP engine
• In-memory processing
• Small-to-big joins
– Broadcast join
• Small result sizes
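A broadcast join of this kind can be sketched as follows: hash the small table once, ship a copy to every node, and let each node probe with its local fragment of the big table. `broadcast_join`, `Row`, and the fragment loop (standing in for per-node execution) are illustrative assumptions, not Impala's implementation.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<int, std::string>;  // (join key, payload)

// Broadcast join sketch: build a hash table on the small side once;
// each big-table fragment (one per node) probes its own copy of it.
std::vector<std::pair<std::string, std::string>> broadcast_join(
    const std::vector<Row>& small_table,
    const std::vector<std::vector<Row>>& big_table_fragments) {
    // Build side: hash the small table (this is what gets broadcast).
    std::unordered_map<int, std::string> hash;
    for (const auto& r : small_table) hash[r.first] = r.second;

    // Probe side: each fragment stands in for one node's local scan.
    std::vector<std::pair<std::string, std::string>> result;
    for (const auto& fragment : big_table_fragments) {
        for (const auto& r : fragment) {
            auto it = hash.find(r.first);
            if (it != hash.end()) result.push_back({r.second, it->second});
        }
    }
    return result;
}
```

The broadcast avoids repartitioning the big table: only the small side moves over the network.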
9. Why Cloudera Impala
• The team moves fast
– UDFs coming out
– Better join strategies on the way
• Good code base
– Modular
– Easy to add subclasses
• Really fast
– LLVM code generation
• 80 s vs. 95 s on our UV test
– Distributed aggregation tree
– In-situ data processing (inside storage)
10. Typical Arch.
[Diagram: an SQL interface and a shared Meta Store feed identical per-node stacks of Query Planner → Coordinator → Exec Engine, one stack per node]
11. Our target
• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Fast
• A mixed-data-source MPP query engine
– Join two tables from different sources
– In fact…
12. Hacking… from where
• Add, don't change
– Scan node types
– DB meta info
• Put changes in configuration
– Thrift protocol updates
• TDBHostInfo
• TDBScanNode
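A sketch of what the two added structs might look like in Thrift IDL. Only the struct names `TDBHostInfo` and `TDBScanNode` come from the slides; every field below is a hypothetical illustration of the kind of information the plan would need to carry, not the actual protocol change.

```thrift
// Hypothetical sketch; field names and numbering are illustrative.
struct TDBHostInfo {
  1: required string host      // PG server address
  2: required i32 port
  3: optional string db_name
}

struct TDBScanNode {
  1: required string table_name
  2: required list<TDBHostInfo> hosts   // where the table's shards live
  3: optional string sql_predicate      // pushed-down filter, if any
}
```

Extending the Thrift plan tree this way is what lets the new scan node travel from the front end to the back-end executors without touching existing node types.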
13. Front end
• Meta store updates
– Link external data to a table name
– Table location management
• Front end
– Computes table locations
14. Back end
• Coordinator
– PG host info
• New scan node type
– DB scan node
• PG scan node
• PostgreSQL client library, using a cursor
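The cursor-based scan pattern can be sketched like this: rather than pulling the whole result set at once, the scan node repeatedly FETCHes a bounded batch, keeping memory use flat. `DbScanNode` here is a self-contained stand-in (an in-memory vector replaces the live PostgreSQL connection), not Impala's actual class.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simulates: DECLARE cur CURSOR FOR SELECT ...; then repeated
// FETCH <fetch_size> FROM cur; until the cursor is exhausted.
class DbScanNode {
 public:
  DbScanNode(std::vector<std::string> rows, std::size_t fetch_size)
      : rows_(std::move(rows)), fetch_size_(fetch_size) {}

  // Returns the next batch of at most fetch_size rows;
  // an empty batch signals end-of-scan, as with FETCH.
  std::vector<std::string> FetchBatch() {
    std::vector<std::string> batch;
    while (batch.size() < fetch_size_ && pos_ < rows_.size())
      batch.push_back(rows_[pos_++]);
    return batch;
  }

 private:
  std::vector<std::string> rows_;
  std::size_t fetch_size_;
  std::size_t pos_ = 0;
};
```

The batching is the point: a plain `SELECT` over libpq would buffer the full result set client-side, which a 2 TB scan cannot afford.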
15. SQL Plan
• select count(distinct id) from table
– An MR-like process:
HDFS/PG scan
→ Aggr.: group by id
→ Exchange node
→ Aggr.: group by id
→ Aggr.: count(id)
→ Exchange node
→ Aggr.: sum(count(id))
16. Env.
• Ad impression logs
– 150 million lines, 100 KB/line
• 3 servers, each with:
– 24 cores
– 32 GB memory
– 2 TB × 12 HDDs
– 100 Mbps LAN
• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'
17. Performance
• Group-by speed per core: ~20 M/s
[Chart: group-by performance of impala, hive, and pg+impala across three test runs]
19. Codegen on/off
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t
group by id
having count(case when c = '1' then 1 else null end) > 0
and count(case when c = '2' then 1 else null end) > 0
limit 10;
[Chart: runtimes with codegen enabled (en_codegen) vs. disabled (dis_codegen) on the uv_test, distinct, and duplicated queries]
21. Conclusion
• Source quality
– Readable
– Google C++ style
– Robust
• MPP solution based on PG
– Proven performance
– Easy to scale
• Mixed-engine usage
– HDFS and DB