This document contains an agenda for a presentation on using Hadoop and HBase for social media monitoring. The presentation covers why Hadoop and HBase are suitable technologies, challenges and lessons learned, and resources for getting started. It includes sections on the speaker's background, the social media monitoring process, using coprocessors in HBase, and testing performed on a test cluster.
1. CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
2. Agenda
5 June 2012
• Why Hadoop and HBase?
• Social Media Monitoring
• Prospective Search and Coprocessors
• Challenges & Lessons Learned
• Resources to get started
3. About me
Software Architect @ sentric
Co-founder and organizer of the Swiss HUG
Contact:
christian.guegi@sentric.ch
http://www.sentric.ch
@chrisgugi
4. About sentric
• Spin-off of MeMo News AG, the leading provider for Social Media Monitoring & Analytics in Switzerland
• Big Data expert, focused on Hadoop, HBase and Solr
• Objective: Transforming data into insights
6. Why Hadoop and HBase?
Social Media Monitoring Process
[Diagram: Information Gathering -> Information Processing -> Analysis & Interpretation -> Insight Presentation]
7. Why Hadoop and HBase?
Requirements
[Diagram: requirements for social media monitoring (SMM)]
• Cost effective
• Highly scalable
• Reliable
• Analytical capabilities
• Real-time (RT) alerting
9. CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf
10. Social Media Monitoring
Overview
[Diagram: Downloaded Articles and Search Agents feed a "match?" step; the output goes to Web-UI, Reports and RT Alerts]
Icons by http://dryicons.com
11. Social Media Monitoring
Solution Architecture
[Diagram components: n Crawlers, REST, HBase, RowLog, Coprocessor, Web-UI, MySQL, Solr, RT Alerts]
Icons by http://dryicons.com
12. Short Primer on Coprocessors
Overview
• Inspired by Google Bigtable coprocessors
• Introduced in HBase 0.92
• Embed code directly into server processes
• High-level call interface for clients
• Automatic scaling, load balancing, request routing
13. Short Primer on Coprocessors
Observer Classes
• Like a database trigger
• Provides event-based hooks
• Concrete implementations
  • RegionObserver
    • CRUD or DML type operations
  • MasterObserver
    • DDL or metadata operations and cluster administration
  • WALObserver
    • Write-ahead-log appending and restoration
14. Short Primer on Coprocessors
Observer Execution
[Diagram, inside the RegionServer: Client:Get() -> CP1:preGet() -> CP2:preGet() -> CP3:preGet() -> HRegion:Get() -> CP1:postGet() -> CP2:postGet() -> CP3:postGet() -> client response]
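The call order in the observer diagram can be sketched with a toy model. This is a minimal illustration in plain Java, not the HBase API: `GetObserver`, `TracingObserver` and `ToyRegion` are hypothetical names, and the real RegionObserver hooks take additional context arguments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a region observer; not the HBase interface.
interface GetObserver {
    void preGet(String rowKey, List<String> trace);
    void postGet(String rowKey, List<String> trace);
}

// Records its name on each hook so the call order becomes visible.
class TracingObserver implements GetObserver {
    private final String name;

    TracingObserver(String name) { this.name = name; }

    public void preGet(String rowKey, List<String> trace) { trace.add(name + ":preGet"); }

    public void postGet(String rowKey, List<String> trace) { trace.add(name + ":postGet"); }
}

// Toy region: runs all pre-hooks, then the read itself, then all post-hooks.
class ToyRegion {
    private final Map<String, String> rows = new HashMap<>();
    private final List<GetObserver> observers = new ArrayList<>();

    void register(GetObserver observer) { observers.add(observer); }

    void put(String rowKey, String value) { rows.put(rowKey, value); }

    String get(String rowKey, List<String> trace) {
        for (GetObserver o : observers) o.preGet(rowKey, trace);
        trace.add("HRegion:get");
        String value = rows.get(rowKey);
        for (GetObserver o : observers) o.postGet(rowKey, trace);
        return value;
    }
}

class ObserverChainDemo {
    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        region.register(new TracingObserver("CP1"));
        region.register(new TracingObserver("CP2"));
        region.register(new TracingObserver("CP3"));
        region.put("row1", "value1");

        List<String> trace = new ArrayList<>();
        region.get("row1", trace);
        System.out.println(trace); // hooks in registration order around the read
    }
}
```

Every registered observer runs on the read path of every request, so the cost of a hook is paid on each call.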
15. Short Primer on Coprocessors
Endpoint Classes
• Comparable to stored procedures
• Custom RPC protocol, used between client and region server
• Loaded in region server
• Client-side call APIs operate over a single row or a row range
• Framework translates row keys to region location
• Parallel execution
16. Short Primer on Coprocessors
Endpoint Call Routine
Client code:
    new Batch.Call<CountProtocol, Integer>() {
      public Integer call(CountProtocol p) throws IOException {
        return p.getRowCount();
      }
    }
[Diagram: HTable.coprocessorExec() sends the call to the CountProtocol endpoint of each region]
• Region Server 1: table,,12345678 and table,bbb,12345678
• Region Server 2: table,ccc,12345678 and table,ddd,12345678
Result: Map<byte[], Integer> countsByRegion
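The scatter-gather pattern behind the endpoint call can be sketched without a cluster. The following is a simplified model under stated assumptions: `ToyTable`, `CountingRegion` and the `Function`-based `coprocessorExec` are hypothetical stand-ins, region names are plain strings instead of `byte[]` keys, and only `CountProtocol` and `getRowCount()` are taken from the slide.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Endpoint contract as named on the slide.
interface CountProtocol {
    int getRowCount();
}

// Toy region: owns some rows and answers the endpoint call locally.
class CountingRegion implements CountProtocol {
    final String name;
    private final List<String> rowKeys;

    CountingRegion(String name, List<String> rowKeys) {
        this.name = name;
        this.rowKeys = rowKeys;
    }

    public int getRowCount() { return rowKeys.size(); }
}

// Toy table: runs the call on every region in parallel, one result per region.
class ToyTable {
    private final List<CountingRegion> regions;

    ToyTable(List<CountingRegion> regions) { this.regions = regions; }

    <R> Map<String, R> coprocessorExec(Function<CountProtocol, R> call) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            Map<String, Future<R>> pending = new LinkedHashMap<>();
            for (CountingRegion region : regions) {
                Callable<R> task = () -> call.apply(region);
                pending.put(region.name, pool.submit(task));
            }
            Map<String, R> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<R>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("endpoint call failed", e);
        } finally {
            pool.shutdown();
        }
    }
}

class EndpointDemo {
    public static void main(String[] args) {
        ToyTable table = new ToyTable(Arrays.asList(
            new CountingRegion("table,,12345678", Arrays.asList("aaa", "abc")),
            new CountingRegion("table,bbb,12345678", Arrays.asList("bbb")),
            new CountingRegion("table,ccc,12345678", Arrays.asList("ccc", "cde", "cfg")),
            new CountingRegion("table,ddd,12345678", Arrays.asList("ddd"))));

        Map<String, Integer> countsByRegion = table.coprocessorExec(CountProtocol::getRowCount);
        int total = 0;
        for (int count : countsByRegion.values()) total += count;
        System.out.println(countsByRegion + " total=" + total);
    }
}
```

The client receives one partial result per region and is responsible for combining them, here by summing the counts.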
17. Short Primer on Coprocessors
Use Cases
• HBase Security (Version 0.94)
• Aggregate operations avg(), sum()
  • AggregateProtocol
• HBASE-3529: Embedded search
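For the aggregate use case, a distributed avg() has to merge per-region partial results rather than average the per-region averages. The sketch below assumes a hypothetical `Partial` type and helper methods; it does not reproduce the HBase aggregation classes.

```java
import java.util.Arrays;
import java.util.List;

// Partial aggregate computed inside one region (hypothetical type).
final class Partial {
    final long sum;
    final long count;

    Partial(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    Partial merge(Partial other) {
        return new Partial(sum + other.sum, count + other.count);
    }

    double avg() {
        return count == 0 ? Double.NaN : (double) sum / count;
    }
}

class RegionAggregates {
    // What an endpoint would compute next to the data of a single region.
    static Partial aggregate(long[] regionValues) {
        long sum = 0;
        for (long v : regionValues) sum += v;
        return new Partial(sum, regionValues.length);
    }

    // Client-side step: merge the partials returned by all regions.
    static Partial combine(List<Partial> partials) {
        Partial total = new Partial(0, 0);
        for (Partial p : partials) total = total.merge(p);
        return total;
    }
}

class AggregateDemo {
    public static void main(String[] args) {
        Partial region1 = RegionAggregates.aggregate(new long[] {1, 2, 3});
        Partial region2 = RegionAggregates.aggregate(new long[] {10});
        Partial total = RegionAggregates.combine(Arrays.asList(region1, region2));
        // sum = 16, count = 4, avg = 4.0; averaging the two region averages would give 6.0
        System.out.println("sum=" + total.sum + " avg=" + total.avg());
    }
}
```

Shipping only a (sum, count) pair per region keeps the network traffic independent of the number of rows scanned.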
18. Social Media Monitoring
Prospective Search with Coprocessors
[Diagram labels: Put operations, HRegion, HRegionServer, Prospective Search, Processing, RT Alerts]
Icons by http://dryicons.com
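Prospective search inverts the usual lookup: the queries (search agents) are stored, and each incoming document is matched against them at write time. A minimal sketch of that idea, assuming a hypothetical `SearchAgent` that matches when all of its terms occur in the text and a `postPut` method standing in for the coprocessor hook:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// A stored query: matches when the document contains all required terms (hypothetical).
class SearchAgent {
    final String id;
    private final Set<String> requiredTerms = new HashSet<>();

    SearchAgent(String id, String... terms) {
        this.id = id;
        for (String term : terms) requiredTerms.add(term.toLowerCase(Locale.ROOT));
    }

    boolean matches(Set<String> documentTerms) {
        return documentTerms.containsAll(requiredTerms);
    }
}

// Stand-in for a post-put hook: every written article is checked against all agents.
class ProspectiveSearchHook {
    private final List<SearchAgent> agents = new ArrayList<>();
    private final List<String> alerts = new ArrayList<>();

    void register(SearchAgent agent) { agents.add(agent); }

    void postPut(String rowKey, String articleText) {
        // Split on anything that is not a letter or digit.
        Set<String> terms = new HashSet<>(Arrays.asList(
            articleText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        for (SearchAgent agent : agents) {
            if (agent.matches(terms)) alerts.add(agent.id + " -> " + rowKey);
        }
    }

    List<String> alerts() { return alerts; }
}

class ProspectiveSearchDemo {
    public static void main(String[] args) {
        ProspectiveSearchHook hook = new ProspectiveSearchHook();
        hook.register(new SearchAgent("agent-1", "hadoop", "hbase"));
        hook.register(new SearchAgent("agent-2", "solr"));

        hook.postPut("article-1", "Scaling HBase on Hadoop");
        hook.postPut("article-2", "Nothing relevant here");
        System.out.println(hook.alerts());
    }
}
```

In this naive form the matching work per write grows linearly with the number of registered agents, which is why write throughput as a function of agent count is a natural thing to measure.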
19. Social Media Monitoring
Testing Setup
• Standard, virtualized test cluster: 4 RegionServers/DataNodes (RS/DN), 1 HMaster (HM), 1 NameNode (NN), 3 ZooKeeper nodes (ZK)
• Test dataset created from 2 h of live index (1 GB)
• Drive load on RS/DN
20. Social Media Monitoring
Test Results
[Chart: writes/sec (axis 0 to 1800) versus number of agents (0, 10, 50, 100, 200, 400, 800); data points not recoverable from the transcript]
22. Challenges & Lessons Learned
Challenges
• Everyone is still learning
• Some issues only appear at scale
• Production cluster configuration
• Hardware issues
• Tuning cluster configuration to our workloads
• HBase stability
• Monitoring health of HBase
23. Challenges & Lessons Learned
Lessons
• Be careful with expensive operations in coprocessors
• At scale, nothing works as advertised
• Monitoring/operational tooling is most important
• Play with all the configurations and benchmark for tuning
24. Resources to get started
• https://blogs.apache.org/hbase/entry/coprocessor_introduction
• http://hbase.apache.org/apidocs/index.html
• http://www.lilyproject.org/lily/about/playground/hbaserowlog.html
• http://www.github.com/sentric/HBasePS
25. Thank you!
Questions?
Christian Gügi
christian.guegi@sentric.ch
Berlin Buzzwords 2012