Realtime Sentiment Analysis Application Using Hadoop and HBase
1. A Real Time Sentiment Analysis Application using
Hadoop and HBase in the Cloud
Jagane Sundar
Founder, AltoScale Inc.
June 14, 2012 Hadoop Summit 2012
AltoScale
2. AltoScale About me
Ø Extensive Knowledge of Hadoop, Cloud Compute and
Virtualization
Ø Co-founder of AltoScale. We developed the Workbench
Ø Worked on Hadoop Management and Performance at
Yahoo
Ø Primarily a systems and storage guy – have written TCP
stacks and NFS Clients, Livebackup for KVM
3. AltoScale My Motivation
Ø Build a cool real time big data app in order
to acquire a deep understanding of Real
Time Big Data Systems in the cloud
4. AltoScale What will you get out of this?
Ø See how easy it is to build a highly
scalable real-time Big Data application
using a variety of open source tools and
technologies
5. AltoScale Real Time Sentiment Analysis
Ø Easily accessible real time signals
v Twitter public status updates
v Blog entries
6. AltoScale Real Time Sentiment Analysis
Ø Two types of solutions to Real Time Sentiment
Analysis
v Keywords known a-priori
o Filter tweets by keyword
v Open ended sentiment analysis (no a-priori
knowledge of keywords)
o Random sample of all public tweets
• 1% of public tweets is easily available
• 10% (Twitter firehose) may be available for purchase
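The "keywords known a-priori" approach can be sketched as a simple matcher that reports which tracked keywords a tweet mentions. This is a minimal illustration, not the application's code: the class name `KeywordFilter` and the case-insensitive substring rule are assumptions, and a streaming API that filters on the server side may match differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: decides which tracked keywords a tweet mentions.
final class KeywordFilter {
    private final List<String> keywords = new ArrayList<>();

    KeywordFilter(List<String> tracked) {
        // Normalize once so matching is case-insensitive.
        for (String k : tracked) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the tracked keywords found in the tweet text
    // (assumed rule: plain substring match, ignoring case).
    List<String> matches(String tweetText) {
        String lower = tweetText.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String k : keywords) {
            if (lower.contains(k)) {
                hits.add(k);
            }
        }
        return hits;
    }
}
```

A tweet that matches no keyword yields an empty list and can be dropped before any sentiment analysis is done.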
7. AltoScale
Real Time Sentiment Analysis:
Application Architecture
[Diagram: Hadoop/HBase cluster]
Ø Service Node
v TwitterSampler: analyzes sentiment and writes a few new rows to HBase every minute
v HBase REST Gateway: scans the HTable
Ø Master: NN, HBase Master
Ø Three Hadoop Slaves, each running a DataNode and a Region Server
8. AltoScale
Real Time Sentiment Analysis:
Twitter Streaming API Overview
Twitter APIs
Ø REST APIs (request/response)
Ø Streaming APIs (persistent HTTP connection)
v Public Streams (sample of all public updates)
o filter
o sample (we use this API to collect tweets)
o firehose
v User Streams (one user’s updates)
v Site Streams (multiple users’ updates)
9. AltoScale
Real Time Sentiment Analysis:
Time Series Database
Ø Inspired by TSDB, but does not use TSDB
Ø Read Benoît “tsuna” Sigoure’s slides from
HBaseCon 2012
10. AltoScale
Real Time Sentiment Analysis:
in HBase
Row                          NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
obama:2012:06:04:13:34       1        4         0         sdac soasp few
romney:2012:06:04:13:34      2        3         1         Smsm djcn dje jdj
davebarry:2012:06:04:13:34   0        9         0         cs dsjw ausj
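The row keys in the table combine a keyword with a timestamp truncated to the minute. A minimal sketch of building such a key follows; the class and method names are hypothetical, and the use of UTC is an assumption, since the slides do not say which time zone the keys use.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that builds row keys of the form keyword:yyyy:MM:dd:HH:mm.
final class RowKeys {
    // Minute resolution; UTC is an assumption, not stated in the slides.
    private static final DateTimeFormatter MINUTE =
        DateTimeFormatter.ofPattern("yyyy:MM:dd:HH:mm").withZone(ZoneOffset.UTC);

    static String rowKey(String keyword, Instant when) {
        // Seconds are dropped, so all tweets in the same minute share one row.
        return keyword + ":" + MINUTE.format(when);
    }
}
```

Because every field is zero-padded, keys for one keyword sort in time order, which is what makes range scans over a time window possible.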
11. AltoScale
Real Time Sentiment Analysis:
Front Page
12. AltoScale
Real Time Sentiment Analysis:
Results Page
13. AltoScale
Real Time Sentiment Analysis:
Standing on the Shoulders of Giants
Ø Hadoop and HBase, of course
Ø Twitter4J library for getting the Twitter stream
Ø Sentiment Analysis
v https://code.google.com/p/twitter-sentiment-analysis/
v Weka Library
Ø Tomcat
Ø jQuery and Dojo for the JavaScript client
14. AltoScale
Real Time Sentiment Analysis:
Twitter Stream API - TsStatusListener
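import twitter4j.Status;
import twitter4j.StatusListener;

public static class TsStatusListener implements StatusListener {
    // Called by Twitter4J for every status received on the stream.
    public void onStatus(Status status) {
        // wm is the sentiment classifier, declared elsewhere in the application.
        Item item = wm.weightedClassify(status.getText());
        int polarity = 0;
        try {
            polarity = Integer.parseInt(item.getPolarity().trim());
        } catch (NumberFormatException nfe) {
            // Unparseable polarity: keep the default of 0.
        }
        updateKeywordTrackers(status, polarity);
    }
    // The remaining StatusListener callbacks are not shown on the slide.
}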
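The slide does not show the body of `updateKeywordTrackers`. One way to picture the bookkeeping is an in-memory counter per row key, holding the NEUTRAL, POSITIVE, and NEGATIVE totals that are later written to HBase. The sketch below is an illustration under assumptions: the class `SentimentCounters` is hypothetical, and so is the polarity convention (0 is neutral, positive values are positive, negative values are negative), which the classifier library may define differently.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory tally of sentiment counts per row key.
final class SentimentCounters {
    // rowKey -> {neutral, positive, negative}
    private final Map<String, long[]> counts = new TreeMap<>();

    // Assumed convention: 0 = neutral, > 0 = positive, < 0 = negative.
    void record(String rowKey, int polarity) {
        long[] c = counts.computeIfAbsent(rowKey, k -> new long[3]);
        if (polarity > 0) {
            c[1]++;
        } else if (polarity < 0) {
            c[2]++;
        } else {
            c[0]++;
        }
    }

    // Returns a copy of the counts, or zeros if the row key was never seen.
    long[] get(String rowKey) {
        long[] c = counts.get(rowKey);
        return c == null ? new long[3] : c.clone();
    }
}
```

Flushing these counters once a minute would match the "write a few new rows to HBase every minute" behavior described in the architecture slide.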
16. AltoScale
Reading from HBase
Various Options
Technologies for Writing HBase Clients
Ø Option 1: HBase Client
v Java client on the Service Node, linked to the HBase Client classes
Ø Option 2: Thrift RPC
v Thrift client on a Service Node talks the Thrift protocol to an HBase Thrift Gateway on a Service Node
Ø Option 3: REST API
v Clients use REST (HTTP or HTTPS) to reach the HBase REST Gateway on a Service Node
17. AltoScale
Reading from HBase
and presenting to the user’s browser
[Diagram: Hadoop/HBase in the cloud]
Ø Service Node
v Tomcat: serves static html and proxies REST scan requests
v HBase REST Gateway: scans the HTable
Ø Master: NN, HBase Master
Ø Three Hadoop Slaves, each running a DataNode and a Region Server
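To read all rows for one keyword through the REST gateway, a client needs a start row and a stop row that bracket the keyword's prefix. The sketch below computes the stop row and builds a scanner request body. The class `ScanRange` is hypothetical; the `Scanner` element with `batch`, `startRow`, and `endRow` attributes (row values base64-encoded) reflects my understanding of the HBase REST scanner format and should be checked against the gateway version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for prefix scans over keyword:timestamp row keys.
final class ScanRange {
    // Smallest key that sorts after every key starting with the prefix.
    // Assumes a non-empty ASCII prefix whose last character can be incremented.
    static String stopRowForPrefix(String prefix) {
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    // Builds the XML body for a scanner-creation request (format is an assumption).
    static String scannerXml(String startRow, String stopRow, int batch) {
        return "<Scanner batch=\"" + batch + "\""
            + " startRow=\"" + b64(startRow) + "\""
            + " endRow=\"" + b64(stopRow) + "\"/>";
    }

    private static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

For the prefix `obama:` the stop row is `obama;`, because `;` is the character immediately after `:` in ASCII. Narrowing the start and stop rows to specific minutes restricts the scan to a time window.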
18. AltoScale Tomcat as HTTP Proxy
Ø HBase Stargate REST Server runs on port 8081 and is
connected to HBase
Ø The REST server has the capability to scan tables
Ø A JavaScript webpage is the client
Ø Problem:
v JavaScript security restrictions (the same-origin policy) do not allow the JavaScript to
execute REST calls to any server other than the one it was
loaded from
Ø Solution:
v Tomcat is used as a proxy. It serves up:
o Static html pages with the JavaScript client, images etc.
o REST requests from the JavaScript client, which are proxied to the HBase
Stargate server running on port 8081
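The core of the proxy is a mapping from the path the browser requests on Tomcat to the corresponding URL on the HBase REST server at port 8081. The following is a minimal sketch of that mapping only, not a full servlet. The `/hbase` prefix, the `localhost` backend host, and the class name `ProxyTarget` are all assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical mapping from a proxied request path to the backend REST URL.
final class ProxyTarget {
    // Assumed values; the slides only state that the REST server uses port 8081.
    static final String PROXY_PREFIX = "/hbase";
    static final String BACKEND = "http://localhost:8081";

    static URI backendUri(String requestPath, String query) {
        if (!requestPath.startsWith(PROXY_PREFIX + "/")) {
            throw new IllegalArgumentException("not a proxied path: " + requestPath);
        }
        // Strip the proxy prefix and forward the rest of the path unchanged.
        String target = BACKEND + requestPath.substring(PROXY_PREFIX.length());
        if (query != null && !query.isEmpty()) {
            target += "?" + query;
        }
        return URI.create(target);
    }
}
```

Since the browser only ever talks to Tomcat, the page and its REST calls share one origin, and the restriction described above no longer applies.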
19. AltoScale Future Improvements
Ø Elastic HBase in the cloud
Ø At night time, use one VM to receive tweets and write them out
into SequenceFiles in S3
Ø Before business hours, start up HBase, run a MR job to
process all these SequenceFiles and write into HBase
Ø Cost effective real time HBase application in the cloud
20. AltoScale Big Data Apps in the Cloud
Ø The Cloud is suitable for Big Data apps which use Big
Data from the Internet. For example:
v Twitter Public Status Updates
v Blog entries
v Web Crawl data
Ø Big Data apps in the cloud are not useful if all your data
is generated inside your network
v Router, Storage device, Authentication device logs
v Logs from Web Servers located inside your network