Realtime Sentiment Analysis Application Using Hadoop and HBase

  1. A Real Time Sentiment Analysis Application using Hadoop and HBase in the Cloud
     Jagane Sundar, Founder, AltoScale Inc.
     June 14, 2012, Hadoop Summit 2012
  2. About me
     - Extensive knowledge of Hadoop, cloud compute and virtualization
     - Co-founder of AltoScale. We developed the Workbench
     - Worked on Hadoop management and performance at Yahoo
     - Primarily a systems and storage guy – have written TCP stacks and NFS clients, Livebackup for KVM
  3. My Motivation
     - Build a cool real time big data app in order to acquire a deep understanding of real time big data systems in the cloud
  4. What will you get out of this?
     - See how easy it is to build a highly scalable real-time Big Data application using a variety of open source tools and technologies
  5. Real Time Sentiment Analysis
     - Easily accessible real time signals
       - Twitter public status updates
       - Blog entries
  6. Real Time Sentiment Analysis
     - Two types of solutions to Real Time Sentiment Analysis
       - Keywords known a priori
         - Filter tweets by keyword
       - Open-ended sentiment analysis (no a priori knowledge of keywords)
         - Random sample of all public tweets
           - 1% of public tweets easily available
           - 10% (Twitter firehose) may be available for purchase
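The first approach on slide 6 (keywords known a priori) can be illustrated with a minimal sketch. The class name `KeywordFilter` and its methods are hypothetical and do not come from the talk; the sketch only shows client-side matching of tweet text against a fixed keyword list. Twitter's streaming filter can also do keyword matching on the server side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: finds which tracked keywords a tweet mentions. */
final class KeywordFilter {
    private final List<String> keywords = new ArrayList<>();

    KeywordFilter(List<String> trackedKeywords) {
        // Lower-case once so that matching is case-insensitive.
        for (String k : trackedKeywords) {
            keywords.add(k.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the tracked keywords that occur in the tweet text (substring match). */
    List<String> matches(String tweetText) {
        String text = tweetText.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String k : keywords) {
            if (text.contains(k)) {
                hits.add(k);
            }
        }
        return hits;
    }
}
```

Because this is a plain substring match, a tweet containing "obamacare" would also count as a hit for "obama"; a real application might want word-boundary matching instead.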
  7. Real Time Sentiment Analysis: Application Architecture
     [Diagram: Hadoop/HBase cluster]
     - Service Node
       - TwitterSampler: analyze sentiment, write a few new rows to HBase every minute
       - HBase REST Gateway: scan HTable
     - Master: NN, HBase Master
     - Three Hadoop Slaves, each running DataNode and Region Server
  8. Real Time Sentiment Analysis: Twitter Streaming API Overview
     [Diagram: Twitter APIs]
     - REST APIs (Request/Response)
     - Streaming APIs (Persistent HTTP Conn)
       - Public Streams (sample of all public updates): filter, sample, firehose
       - User Streams (one user's updates)
       - Site Streams (multiple users' updates)
     - Callout on the diagram: "We use this API to collect tweets"
  9. Real Time Sentiment Analysis: Time Series Database
     - Inspired by TSDB, but does not use TSDB
     - Read Benoît “tsuna” Sigoure’s slides from HBaseCon 2012
  10. Real Time Sentiment Analysis: in HBase

      Row                         NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
      obama:2012:06:04:13:34      1        4         0         sdac soasp few
      romney:2012:06:04:13:34     2        3         1         Smsm djcn dje jdj
      davebarry:2012:06:04:13:34  0        9         0         cs dsjw ausj
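Row keys of the form `keyword:yyyy:MM:dd:HH:mm` are zero-padded, so for a given keyword they sort lexicographically in chronological order. A time window for one keyword can therefore be read as a single contiguous scan. The sketch below is an illustration under that assumption; the class `ScanRange` is hypothetical and only computes the start and stop row strings, without touching HBase.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: start/stop rows for scanning one keyword over a time window. */
final class ScanRange {
    // Same minute-resolution layout as the row keys shown on slide 10.
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyy:MM:dd:HH:mm");

    final String startRow;  // inclusive
    final String stopRow;   // exclusive

    private ScanRange(String startRow, String stopRow) {
        this.startRow = startRow;
        this.stopRow = stopRow;
    }

    static ScanRange forKeyword(String keyword, LocalDateTime from, LocalDateTime toExclusive) {
        return new ScanRange(
                keyword + ":" + MINUTE.format(from),
                keyword + ":" + MINUTE.format(toExclusive));
    }

    /** True if rowKey falls in [startRow, stopRow) under string ordering. */
    boolean contains(String rowKey) {
        return rowKey.compareTo(startRow) >= 0 && rowKey.compareTo(stopRow) < 0;
    }
}
```

For example, `ScanRange.forKeyword("obama", LocalDateTime.of(2012, 6, 4, 13, 0), LocalDateTime.of(2012, 6, 4, 14, 0))` covers the row `obama:2012:06:04:13:34` but not the `romney` row for the same minute. String comparison matches HBase's byte ordering only for ASCII keys, which is an assumption of this sketch.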
  11. Real Time Sentiment Analysis: Front Page
  12. Real Time Sentiment Analysis: Results Page
  13. Real Time Sentiment Analysis: Standing on the Shoulders of Giants
      - Hadoop and HBase, of course
      - Twitter4j library for getting the Twitter stream
      - Sentiment Analysis
        - https://code.google.com/p/twitter-sentiment-analysis/
        - Weka Library
      - Tomcat
      - jQuery, Dojo for the JavaScript client
  14. Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener

      public static class TsStatusListener implements StatusListener {
          // Other StatusListener callbacks are not shown on the slide.
          public void onStatus(Status status) {
              // Classify the tweet text; the polarity comes back as a string.
              Item item = wm.weightedClassify(status.getText());
              int polarity = 0;
              try {
                  polarity = Integer.parseInt(item.getPolarity().trim());
              } catch (NumberFormatException nfe) {
                  // Unparseable polarity: keep the default of 0.
              }
              updateKeywordTrackers(status, polarity);
          }
      }
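The slides do not show what `updateKeywordTrackers` does with the polarity. One plausible shape, sketched here purely as an assumption, is a per-keyword tracker that accumulates counts and is drained once a minute when a row is written. The class name `SentimentTracker`, its methods, and the polarity encoding (0 = negative, 2 = neutral, 4 = positive) are all hypothetical and should be checked against the classifier actually used.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-keyword counter, drained once per reporting interval. */
final class SentimentTracker {
    // Assumed polarity encoding; verify against the classifier in use.
    static final int NEGATIVE = 0, NEUTRAL = 2, POSITIVE = 4;

    private final AtomicLong neutral = new AtomicLong();
    private final AtomicLong positive = new AtomicLong();
    private final AtomicLong negative = new AtomicLong();

    /** Records one classified tweet; unknown polarity values count as neutral. */
    void record(int polarity) {
        switch (polarity) {
            case POSITIVE: positive.incrementAndGet(); break;
            case NEGATIVE: negative.incrementAndGet(); break;
            default:       neutral.incrementAndGet();  break;
        }
    }

    /** Returns {neutral, positive, negative} and resets each counter to zero. */
    long[] snapshotAndReset() {
        return new long[] {
            neutral.getAndSet(0), positive.getAndSet(0), negative.getAndSet(0)
        };
    }
}
```

Atomic counters are used because the stream listener and the periodic writer would typically run on different threads. The three counters are reset one after another, so a tweet recorded during the reset can land in either interval, which seems acceptable for per-minute aggregates.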
  15. Real Time Sentiment Analysis: Writing to HBase

      private void writeToHBase() {
          // Row key: <keyword>:<yyyy>:<MM>:<dd>:<HH>:<mm>, one row per keyword per minute.
          Calendar cal = Calendar.getInstance();
          String calStr = String.format("%04d", cal.get(Calendar.YEAR)) + ":"
                  + String.format("%02d", cal.get(Calendar.MONTH) + 1) + ":"  // Calendar.MONTH is zero-based
                  + String.format("%02d", cal.get(Calendar.DAY_OF_MONTH)) + ":"
                  + String.format("%02d", cal.get(Calendar.HOUR_OF_DAY)) + ":"
                  + String.format("%02d", cal.get(Calendar.MINUTE));
          String rowKey = keyword + ":" + calStr;

          // One column per sentiment class in the same column family.
          Put put = new Put(rowKey.getBytes());
          put.add(COLFAM1.getBytes(), "NEUTRAL".getBytes(), tracker.getNeutralCount().getBytes());
          put.add(COLFAM1.getBytes(), "POSITIVE".getBytes(), tracker.getPositiveCount().getBytes());
          put.add(COLFAM1.getBytes(), "NEGATIVE".getBytes(), tracker.getNegativeCount().getBytes());
          try {
              table.put(put);
          } catch (Exception ex) {
              System.err.println(ex);
          }
      }
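The row-key construction on slide 15 can also be written with `java.time`, which was not available when the talk was given. The following is a sketch of an equivalent formatter, not the author's code; the class name `RowKeys` is hypothetical. It takes the timestamp as a parameter so the result does not depend on the default time zone at call time, and it encodes with an explicit charset, since `String.getBytes()` without arguments uses the platform default.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper producing keys of the form keyword:yyyy:MM:dd:HH:mm. */
final class RowKeys {
    // Zero-padded fields, matching the String.format calls on the slide.
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyy:MM:dd:HH:mm");

    static String build(String keyword, LocalDateTime when) {
        return keyword + ":" + MINUTE.format(when);
    }

    /** Encodes the key with an explicit charset instead of the platform default. */
    static byte[] toBytes(String rowKey) {
        return rowKey.getBytes(StandardCharsets.UTF_8);
    }
}
```

A call such as `RowKeys.build("obama", LocalDateTime.of(2012, 6, 4, 13, 34))` yields `obama:2012:06:04:13:34`, the same layout as the rows shown on slide 10.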
  16. Reading from HBase: Various Options
      Technologies for Writing HBase Clients
      - Option 1: HBase Client. A Java client on the service node, linked to the HBase client classes
      - Option 2: Thrift RPC. A Thrift client connects to the HBase Thrift Gateway over the Thrift protocol
      - Option 3: REST API. The HBase REST Gateway is reached over REST (HTTP or HTTPS)
  17. Reading from HBase and presenting to the user’s browser
      [Diagram: Hadoop/HBase in the cloud]
      - Service Node
        - Tomcat: serves static HTML and proxies REST scans
        - HBase REST Gateway: scans the HTable
      - Master: NN, HBase Master
      - Three Hadoop Slaves, each running DataNode and Region Server
  18. Tomcat as HTTP Proxy
      - HBase Stargate REST server runs on port 8081 and is connected to HBase
      - The REST server has the capability to scan tables
      - A JavaScript web page is the client
      - Problem:
        - JavaScript security restrictions do not allow the JavaScript to execute REST calls to any server other than the one it was loaded from
        - Tomcat is used as a proxy. It serves up:
          - Static HTML pages with the JavaScript client, images etc.
          - REST requests from the JavaScript client, which are proxied to the HBase Stargate server running on port 8081
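The proxy arrangement on slide 18 boils down to mapping a path on the web server's origin to a path on the REST server. The sketch below shows only that mapping step, under stated assumptions: the class `ProxyTarget`, the `/hbase` prefix, and the table name `sentiment` in the usage note are hypothetical, and the talk does not say how the Tomcat proxy was actually configured.

```java
import java.net.URI;

/** Hypothetical mapping from a same-origin proxy path to the backend REST URL. */
final class ProxyTarget {
    private final String prefix;  // path prefix served by the proxy, e.g. "/hbase" (assumed)
    private final URI backend;    // REST server base URL, e.g. http://localhost:8081

    ProxyTarget(String prefix, URI backend) {
        this.prefix = prefix;
        this.backend = backend;
    }

    /** Strips the proxy prefix and re-targets the request at the backend. */
    URI rewrite(String path, String query) {
        if (!path.startsWith(prefix + "/")) {
            throw new IllegalArgumentException("Not a proxied path: " + path);
        }
        String rest = path.substring(prefix.length());
        String q = (query == null || query.isEmpty()) ? "" : "?" + query;
        return URI.create(backend.toString() + rest + q);
    }
}
```

With `new ProxyTarget("/hbase", URI.create("http://localhost:8081"))`, a browser request for `/hbase/sentiment/obama:2012:06:04:13:34` would be forwarded to `http://localhost:8081/sentiment/obama:2012:06:04:13:34`, so the JavaScript client only ever talks to the server it was loaded from.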
  19. Future Improvements
      - Elastic HBase in the cloud
      - At night, use one VM to receive tweets and write them out into SequenceFiles in S3
      - Before business hours, start up HBase, run an MR job to process all these SequenceFiles and write into HBase
      - Cost-effective real time HBase application in the cloud
  20. Big Data Apps in the Cloud
      - The Cloud is suitable for Big Data apps which use Big Data from the Internet. For example:
        - Twitter public status updates
        - Blog entries
        - Web crawl data
      - Big Data apps in the cloud are not useful if all your data is generated inside your network
        - Router, storage device, authentication device logs
        - Logs from web servers located inside your network
  21. Questions, Comments, Flames?
      - Thanks!
      - Jagane Sundar
      - jagane@altoscale.com
