A Real Time Sentiment Analysis Application using
Hadoop and HBase in the Cloud




Jagane Sundar
Founder, AltoScale Inc.



June 14, 2012                      Hadoop Summit 2012


     AltoScale
About me


Ø Extensive Knowledge of Hadoop, Cloud Compute and
  Virtualization
Ø Co-founder of AltoScale. We developed the Workbench
Ø Worked on Hadoop Management and Performance at
  Yahoo
Ø Primarily a systems and storage guy – have written TCP
  stacks and NFS Clients, Livebackup for KVM




My Motivation




Ø Build a cool real time big data app in order
 to acquire a deep understanding of Real
 Time Big Data Systems in the cloud




What will you get out of this?




Ø See how easy it is to build a highly
 scalable real-time Big Data application
 using a variety of open source tools and
 technologies




Real Time Sentiment Analysis




        Ø Easily accessible real time signals
           v Twitter public status updates
           v Blog entries




Real Time Sentiment Analysis


Ø Two types of solutions to Real Time Sentiment
  Analysis
    v Keywords known a-priori
      o  Filter tweets by keyword
    v Open ended sentiment analysis (no a-priori
      knowledge of keywords)
      o  Random sample of all public tweets
          •  1 % of public tweets easily available
          •  10% (twitter firehose) may be available for purchase




Real Time Sentiment Analysis: Application Architecture

[Diagram: Hadoop/HBase cluster]
  Service Node: TwitterSampler (Analyze Sentiment), HBase REST Gateway
  TwitterSampler: writes a few new rows to HBase every minute
  HBase REST Gateway: Scan HTable
  Master: NN, HBase Master
  Hadoop Slave (x3): DataNode, Region Server

Real Time Sentiment Analysis: Twitter Streaming API Overview

Twitter APIs
  REST APIs (Request/Response)
  Streaming APIs (Persistent HTTP Conn)
    Public Streams (sample of all public updates)
      filter
      sample    <- We use this API to collect tweets
      firehose
    User Streams (one user's updates)
    Site Streams (multiple users' updates)
Real Time Sentiment Analysis: Time Series Database




    Ø Inspired by TSDB, but does not use TSDB
    Ø Read Benoît “tsuna” Sigoure’s slides from
      HBaseCon 2012




Real Time Sentiment Analysis: in HBase

Row                          NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
obama:2012:06:04:13:34       1        4         0         sdac soasp few
romney:2012:06:04:13:34      2        3         1         Smsm djcn dje jdj
davebarry:2012:06:04:13:34   0        9         0         cs dsjw ausj
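The row layout above (a keyword followed by a zero-padded, minute-resolution timestamp) means that all rows for one keyword sit next to each other and sort in time order. The following is a minimal sketch of that idea, assuming the `keyword:YYYY:MM:DD:HH:mm` layout shown in the table; the class and method names (`RowKeys`, `build`, `keywordOf`) are illustrative and do not come from the talk.

```java
import java.time.LocalDateTime;

// Illustrative helper (not from the talk) for the keyword:YYYY:MM:DD:HH:mm row-key layout.
class RowKeys {
    // Builds a row key such as "obama:2012:06:04:13:34".
    static String build(String keyword, LocalDateTime minute) {
        return String.format("%s:%04d:%02d:%02d:%02d:%02d", keyword,
                minute.getYear(), minute.getMonthValue(), minute.getDayOfMonth(),
                minute.getHour(), minute.getMinute());
    }

    // Extracts the keyword, i.e. everything before the first ':'.
    // ASSUMPTION: keywords themselves never contain ':'.
    static String keywordOf(String rowKey) {
        return rowKey.substring(0, rowKey.indexOf(':'));
    }

    public static void main(String[] args) {
        String earlier = build("obama", LocalDateTime.of(2012, 6, 4, 13, 34));
        String later = build("obama", LocalDateTime.of(2012, 6, 4, 13, 35));
        System.out.println(earlier);
        // Zero padding keeps string order equal to time order for ASCII keys.
        System.out.println(earlier.compareTo(later) < 0);
    }
}
```

Because HBase stores rows sorted by key, this ordering is what lets a reader fetch one keyword's recent history with a single contiguous scan.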
Real Time Sentiment Analysis: Front Page




Real Time Sentiment Analysis: Results Page




Real Time Sentiment Analysis: Standing on the Shoulders of Giants

Ø Hadoop and HBase, of course
Ø Twitter4j library for getting the twitter stream
Ø Sentiment Analysis
     v https://code.google.com/p/twitter-sentiment-analysis/
     v Weka Library

Ø Tomcat
Ø Jquery, dojo for javascript client




Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener


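// needs: twitter4j.Status, twitter4j.StatusListener
// Classifies each incoming tweet and updates the per-keyword counters.
public static class TsStatusListener implements StatusListener {
       @Override
       public void onStatus(Status status) {
               // wm is the sentiment classifier, created elsewhere
               Item item = wm.weightedClassify(status.getText());
               int polarity = 0;  // fallback when the polarity cannot be parsed
               try {
                   polarity = Integer.parseInt(item.getPolarity().trim());
               } catch (NumberFormatException nfe) {
                   // keep the fallback, but do not hide the problem
                   System.err.println("Unparseable polarity: " + item.getPolarity());
               }
               updateKeywordTrackers(status, polarity);
       }
       // the remaining StatusListener callbacks are omitted on the slide
}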
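The listener above hands each classified tweet to `updateKeywordTrackers`, which the slides do not show. Below is a minimal sketch of what such per-keyword counting could look like. Everything in it is hypothetical: the class name `SentimentTracker`, the substring-based keyword matching, and in particular the mapping from the integer polarity to a bucket, since the classifier's real encoding is not given in the talk.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-keyword counters; the talk does not show this class.
class SentimentTracker {
    // Per keyword: index 0 = neutral, 1 = positive, 2 = negative.
    private final Map<String, AtomicInteger[]> counts = new ConcurrentHashMap<>();

    SentimentTracker(String... keywords) {
        for (String k : keywords) {
            counts.put(k.toLowerCase(Locale.ROOT), new AtomicInteger[] {
                    new AtomicInteger(), new AtomicInteger(), new AtomicInteger()});
        }
    }

    // Increments the matching bucket for every tracked keyword found in the text.
    // ASSUMPTION: polarity < 0 is negative, 0 is neutral, > 0 is positive.
    void update(String tweetText, int polarity) {
        String text = tweetText.toLowerCase(Locale.ROOT);
        int bucket = polarity == 0 ? 0 : (polarity > 0 ? 1 : 2);
        for (Map.Entry<String, AtomicInteger[]> e : counts.entrySet()) {
            if (text.contains(e.getKey())) {
                e.getValue()[bucket].incrementAndGet();
            }
        }
    }

    // Lookups expect the keyword in lower case, as stored by the constructor.
    int neutral(String keyword)  { return counts.get(keyword)[0].get(); }
    int positive(String keyword) { return counts.get(keyword)[1].get(); }
    int negative(String keyword) { return counts.get(keyword)[2].get(); }

    public static void main(String[] args) {
        SentimentTracker t = new SentimentTracker("obama", "romney");
        t.update("Obama gave a great speech", 1);
        t.update("obama vs romney tonight", 0);
        System.out.println(t.positive("obama") + " " + t.neutral("romney"));
    }
}
```

Atomic counters are used here because a streaming listener typically runs on its own thread while another thread flushes the counts to storage; whether the original application did this is not stated.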
Real Time Sentiment Analysis: Writing to HBase

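// needs: java.io.IOException, java.util.Calendar,
//        org.apache.hadoop.hbase.client.Put, org.apache.hadoop.hbase.util.Bytes
// Writes one row per keyword per minute; the row key is keyword:YYYY:MM:DD:HH:mm.
private void writeToHBase() {
             Calendar cal = Calendar.getInstance();
             // Calendar.MONTH is zero-based, hence the + 1
             String calStr = String.format("%04d:%02d:%02d:%02d:%02d",
                           cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1,
                           cal.get(Calendar.DAY_OF_MONTH), cal.get(Calendar.HOUR_OF_DAY),
                           cal.get(Calendar.MINUTE));
             String rowKey = keyword + ":" + calStr;
             // Bytes.toBytes always encodes as UTF-8; String.getBytes() depends on the platform charset
             Put put = new Put(Bytes.toBytes(rowKey));
             byte[] family = Bytes.toBytes(COLFAM1);
             // the tracker getters return the counts as Strings, as in the original code
             put.add(family, Bytes.toBytes("NEUTRAL"), Bytes.toBytes(tracker.getNeutralCount()));
             put.add(family, Bytes.toBytes("POSITIVE"), Bytes.toBytes(tracker.getPositiveCount()));
             put.add(family, Bytes.toBytes("NEGATIVE"), Bytes.toBytes(tracker.getNegativeCount()));

             try {
                           table.put(put);
             } catch (IOException ex) {
                           System.err.println("HBase write failed for " + rowKey + ": " + ex);
             }
}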
Reading from HBase: Various Options
Technologies for Writing HBase Clients

Option 1: HBase Client
  Service Node: Java client linked to HBase Client classes

Option 2: Thrift RPC
  Service Node: HBase Thrift Gateway <- Thrift protocol -> Service Node: Thrift Client

Option 3: REST API
  Service Node: HBase REST Gateway <- REST (HTTP or HTTPS)
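Whichever of the three options is used, reading one keyword's time series comes down to a range scan over the row keys. The sketch below computes start and stop keys for such a scan. The names (`ScanRange`, `startRow`, `stopRow`) are illustrative, and the Base64 step reflects my understanding that the HBase REST gateway carries row keys Base64-encoded in its XML and JSON representations; check that against the gateway version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative helper (not from the talk): scan bounds for one keyword's rows.
class ScanRange {
    // First possible key for the keyword, e.g. "obama:".
    static String startRow(String keyword) {
        return keyword + ":";
    }

    // Exclusive stop key: ';' is the character right after ':' in ASCII,
    // so "obama;" sorts after every "obama:..." key but before "obamacare:...".
    static String stopRow(String keyword) {
        return keyword + ";";
    }

    // Base64 form of a row key (ASSUMPTION: the encoding the REST gateway expects).
    static String base64(String rowKey) {
        return Base64.getEncoder().encodeToString(rowKey.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(startRow("obama") + " .. " + stopRow("obama"));
        System.out.println(base64(startRow("obama")));
    }
}
```

Appending the separator to the start key matters: scanning from the bare prefix `obama` would also return rows for any longer keyword that begins with the same letters.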
Reading from HBase and presenting to the user’s browser

[Diagram: Hadoop/HBase in the cloud]
  Service Node: HBase REST Gateway (Scan HTable)
  Tomcat Proxy: serves static html; sends REST scan requests to the HBase REST Gateway
  Master: NN, HBase Master
  Hadoop Slave (x3): DataNode, Region Server
Tomcat as HTTP Proxy


Ø HBase Stardust REST Server runs on port 8081 and is
  connected to the HBase
Ø The REST server has the capability to scan tables
Ø A javascript webpage is the client
Ø Problem:
     v JavaScript security restrictions do now allow the JavaScript to
        execute REST calls to any server other than the one it was
        loaded from
     v Tomcat is used as a proxy. It serves up:
        o  Static html pages with the javascript client, images etc.
        o  REST requests from the javascript client are proxied to the HBase
           Stardust server running on port 8081
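At its core, the proxy described above is a URL rewrite: a request the browser sends to its own origin is mapped onto the HBase REST server on port 8081, and everything else is served as static content. The sketch below shows only that mapping step. The `/hbase/` prefix, the `localhost` host name, and the table name in the usage example are assumptions made for illustration; the talk states only the port.

```java
import java.net.URI;

// Illustrative mapping (not from the talk) from browser-facing paths to the REST server.
class ProxyPaths {
    static final String PREFIX = "/hbase/";                   // hypothetical proxy prefix
    static final String UPSTREAM = "http://localhost:8081/";  // port from the slide; host assumed

    // Returns the upstream URI for a proxied request, or null for static content.
    static URI toUpstream(String path, String query) {
        if (!path.startsWith(PREFIX)) {
            return null; // served by the web server itself (html, javascript, images)
        }
        String rest = path.substring(PREFIX.length());
        return URI.create(UPSTREAM + rest + (query == null ? "" : "?" + query));
    }

    public static void main(String[] args) {
        System.out.println(toUpstream("/hbase/sentiment/scanner", null));
        System.out.println(toUpstream("/index.html", null));
    }
}
```

Since the browser only ever talks to one origin, the JavaScript client needs no cross-origin permissions, and the REST server's port does not have to be reachable from outside.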
Future Improvements


Ø Elastic HBase in the cloud
Ø At night time, use on VM to receive tweets and write out
  into SequenceFiles in S3
Ø Before business hours, start up HBase, run a MR job to
  process all these SequenceFiles and write into HBase
Ø Cost effective real time HBase application in the cloud




Big Data Apps in the Cloud


Ø The Cloud is suitable for Big Data apps which use Big
  Data from the Internet. For example:
     v Twitter Public Status Updates
     v Blog entries
     v Web Crawl data

Ø Big Data apps in the cloud are not useful if all your data
  is generated inside your network
     v Router, Storage device, Authentication device logs
     v Logs from Web Servers located inside your network








Ø Questions, Comments, Flames?


       •  Thanks!
       •  Jagane Sundar
       •  jagane@altoscale.com




21

Contenu connexe

Tendances

Spatial data mining
Spatial data miningSpatial data mining
Spatial data miningMITS Gwalior
 
12. Random Forest
12. Random Forest12. Random Forest
12. Random ForestFAO
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
Deep Learning for Graphs
Deep Learning for GraphsDeep Learning for Graphs
Deep Learning for GraphsDeepLearningBlr
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AISaurav Shrestha
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)Amir Fahmideh
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Kazi Toufiq Wadud
 

Tendances (20)

Web usage mining
Web usage miningWeb usage mining
Web usage mining
 
Spatial data mining
Spatial data miningSpatial data mining
Spatial data mining
 
12. Random Forest
12. Random Forest12. Random Forest
12. Random Forest
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Text MIning
Text MIningText MIning
Text MIning
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Cnn
CnnCnn
Cnn
 
Deep Learning for Graphs
Deep Learning for GraphsDeep Learning for Graphs
Deep Learning for Graphs
 
PPT.pptx
PPT.pptxPPT.pptx
PPT.pptx
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Natural Language Processing in AI
Natural Language Processing in AINatural Language Processing in AI
Natural Language Processing in AI
 
Cnn
CnnCnn
Cnn
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Audio mining
Audio miningAudio mining
Audio mining
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
 

En vedette

Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment AnalysisJaganadh Gopinadhan
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in TwitterAyushi Dalmia
 
Social media & sentiment analysis splunk conf2012
Social media & sentiment analysis   splunk conf2012Social media & sentiment analysis   splunk conf2012
Social media & sentiment analysis splunk conf2012Michael Wilde
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis worksCJ Jenkins
 
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QABuilding a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QAImpetus Technologies
 
Social media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applicationsSocial media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applicationsYiannis Kompatsiaris
 
Intro to Algebra II
Intro to Algebra IIIntro to Algebra II
Intro to Algebra IIteamxxlp
 
Packet capture and network traffic analysis
Packet capture and network traffic analysisPacket capture and network traffic analysis
Packet capture and network traffic analysisCARMEN ALCIVAR
 
Top 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answersTop 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answersannababy1245
 
Virtualization In Software Testing
Virtualization In Software TestingVirtualization In Software Testing
Virtualization In Software TestingColloquium
 
Vendor quality management
Vendor quality managementVendor quality management
Vendor quality managementG2Link
 
Video Quality Measurements
Video Quality MeasurementsVideo Quality Measurements
Video Quality MeasurementsYoss Cohen
 
Digital Platform Selection Best Practices
Digital Platform Selection Best PracticesDigital Platform Selection Best Practices
Digital Platform Selection Best Practicesedynamic
 
Hands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardHands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardCA Technologies
 
Defining Workplace Safety
Defining Workplace SafetyDefining Workplace Safety
Defining Workplace SafetyBruce Lambert
 
Which test cases to automate
Which test cases to automateWhich test cases to automate
Which test cases to automatesachxn1
 

En vedette (20)

Introduction to Sentiment Analysis
Introduction to Sentiment AnalysisIntroduction to Sentiment Analysis
Introduction to Sentiment Analysis
 
Sentiment Analysis in Twitter
Sentiment Analysis in TwitterSentiment Analysis in Twitter
Sentiment Analysis in Twitter
 
Social media & sentiment analysis splunk conf2012
Social media & sentiment analysis   splunk conf2012Social media & sentiment analysis   splunk conf2012
Social media & sentiment analysis splunk conf2012
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
Chem Lab Report (1)
Chem Lab Report (1)Chem Lab Report (1)
Chem Lab Report (1)
 
How Sentiment Analysis works
How Sentiment Analysis worksHow Sentiment Analysis works
How Sentiment Analysis works
 
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QABuilding a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA
 
Social media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applicationsSocial media mining and multimedia analysis research and applications
Social media mining and multimedia analysis research and applications
 
Intro to Algebra II
Intro to Algebra IIIntro to Algebra II
Intro to Algebra II
 
Orbital Notation
Orbital NotationOrbital Notation
Orbital Notation
 
Packet capture and network traffic analysis
Packet capture and network traffic analysisPacket capture and network traffic analysis
Packet capture and network traffic analysis
 
Top 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answersTop 10 senior administrative officer interview questions and answers
Top 10 senior administrative officer interview questions and answers
 
Virtualization In Software Testing
Virtualization In Software TestingVirtualization In Software Testing
Virtualization In Software Testing
 
Vendor quality management
Vendor quality managementVendor quality management
Vendor quality management
 
Video Quality Measurements
Video Quality MeasurementsVideo Quality Measurements
Video Quality Measurements
 
Digital Platform Selection Best Practices
Digital Platform Selection Best PracticesDigital Platform Selection Best Practices
Digital Platform Selection Best Practices
 
Analysis of water pollution presentaion by m.nadeem ashraf
Analysis of water pollution presentaion by m.nadeem ashrafAnalysis of water pollution presentaion by m.nadeem ashraf
Analysis of water pollution presentaion by m.nadeem ashraf
 
Hands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM DashboardHands-On Lab: Let's Build an ITSM Dashboard
Hands-On Lab: Let's Build an ITSM Dashboard
 
Defining Workplace Safety
Defining Workplace SafetyDefining Workplace Safety
Defining Workplace Safety
 
Which test cases to automate
Which test cases to automateWhich test cases to automate
Which test cases to automate
 

Similaire à Realtime Sentiment Analysis Application Using Hadoop and HBase

Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Hadoop User Group
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizonArtem Ervits
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseYahoo Developer Network
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airshipdave_revell
 
Serverless by Examples and Case Studies
Serverless by Examples and Case StudiesServerless by Examples and Case Studies
Serverless by Examples and Case StudiesSrushith Repakula
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and SparkMichael Stack
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统yongboy
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
 

Similaire à Realtime Sentiment Analysis Application Using Hadoop and HBase (20)

Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
מיכאל
מיכאלמיכאל
מיכאל
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
Serverless by examples and case studies
Serverless by examples and case studiesServerless by examples and case studies
Serverless by examples and case studies
 
Serverless by Examples and Case Studies
Serverless by Examples and Case StudiesServerless by Examples and Case Studies
Serverless by Examples and Case Studies
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Realtime Sentiment Analysis Application Using Hadoop and HBase

  • 1. A Real Time Sentiment Analysis Application using Hadoop and HBase in the Cloud
      Jagane Sundar, Founder, AltoScale Inc.
      June 14, 2012, Hadoop Summit 2012
  • 2. About me
      - Extensive knowledge of Hadoop, Cloud Compute and Virtualization
      - Co-founder of AltoScale. We developed the Workbench
      - Worked on Hadoop Management and Performance at Yahoo
      - Primarily a systems and storage guy: have written TCP stacks and NFS clients, Livebackup for KVM
  • 3. My Motivation
      - Build a cool real time big data app in order to acquire a deep understanding of Real Time Big Data Systems in the cloud
  • 4. What will you get out of this?
      - See how easy it is to build a highly scalable real-time Big Data application using a variety of open source tools and technologies
  • 5. Real Time Sentiment Analysis
      - Easily accessible real time signals
          - Twitter public status updates
          - Blog entries
  • 6. Real Time Sentiment Analysis
      - Two types of solutions to Real Time Sentiment Analysis
          - Keywords known a priori
              - Filter tweets by keyword
          - Open-ended sentiment analysis (no a priori knowledge of keywords)
              - Random sample of all public tweets
                  - 1% of public tweets easily available
                  - 10% (twitter firehose) may be available for purchase
  • 7. Real Time Sentiment Analysis: Application Architecture
      [Diagram: Hadoop/HBase cluster]
      - Service Node
          - TwitterSampler: analyzes sentiment and writes a few new rows to HBase every minute
          - HBase REST Gateway: scans the HTable
      - Master: NN, HBase Master
      - Three Hadoop Slaves, each running a DataNode and a Region Server
  • 8. Real Time Sentiment Analysis: Twitter Streaming API Overview
      [Diagram: Twitter APIs]
      - REST APIs (Request/Response)
      - Streaming APIs (Persistent HTTP Conn)
          - Public Streams (sample of all public updates)
              - filter
              - sample
              - firehose
              - Callout next to these endpoints: "We use this API to collect tweets"
          - User Streams (one user's updates)
          - Site Streams (multiple users' updates)
  • 9. Real Time Sentiment Analysis: Time Series Database
      - Inspired by TSDB, but does not use TSDB
      - Read Benoît “tsuna” Sigoure’s slides from HBaseCon 2012
  • 10. Real Time Sentiment Analysis: in HBase
      Row                          NEUTRAL  POSITIVE  NEGATIVE  Sample Tweets
      obama:2012:06:04:13:34       1        4         0         sdac soasp few
      romney:2012:06:04:13:34      2        3         1         Smsm djcn dje jdj
      davebarry:2012:06:04:13:34   0        9         0         cs dsjw ausj
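      The row keys in the table on slide 10 follow a keyword:YYYY:MM:DD:HH:mm pattern. The following is a minimal sketch of building and parsing such keys, not the talk's actual code. The class name SentimentRowKey and its methods are hypothetical, and it assumes keywords never contain a colon.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only: not part of the original application.
final class SentimentRowKey {
    // Zero-padded, minute-resolution timestamp, as in the keys on the slide.
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("uuuu:MM:dd:HH:mm");

    // Builds "<keyword>:<YYYY:MM:DD:HH:mm>".
    static String build(String keyword, LocalDateTime when) {
        return keyword + ":" + MINUTE.format(when);
    }

    // Everything before the first ':' (assumes the keyword itself has no ':').
    static String keywordOf(String rowKey) {
        return rowKey.substring(0, rowKey.indexOf(':'));
    }

    // Everything after the first ':' parsed back into a timestamp.
    static LocalDateTime minuteOf(String rowKey) {
        return LocalDateTime.parse(rowKey.substring(rowKey.indexOf(':') + 1), MINUTE);
    }

    public static void main(String[] args) {
        String key = build("obama", LocalDateTime.of(2012, 6, 4, 13, 34));
        System.out.println(key);            // obama:2012:06:04:13:34
        System.out.println(keywordOf(key)); // obama
        System.out.println(minuteOf(key));  // 2012-06-04T13:34
    }
}
```

      Because every field is zero-padded, the keys for one keyword sort lexicographically in chronological order, which is what makes a range scan over a time window possible.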
  • 11. Real Time Sentiment Analysis: Front Page
  • 12. Real Time Sentiment Analysis: Results Page
  • 13. Real Time Sentiment Analysis: Standing on the Shoulders of Giants
      - Hadoop and HBase, of course
      - Twitter4j library for getting the twitter stream
      - Sentiment Analysis
          - https://code.google.com/p/twitter-sentiment-analysis/
          - Weka Library
      - Tomcat
      - jQuery, dojo for JavaScript client
  • 14. Real Time Sentiment Analysis: Twitter Stream API - TsStatusListener
      public static class TsStatusListener implements StatusListener {
          // Called by Twitter4j for every status received on the stream.
          public void onStatus(Status status) {
              // Classify the tweet text; the polarity comes back as a numeric string.
              Item item = wm.weightedClassify(status.getText());
              int polarity = 0;
              try {
                  polarity = Integer.parseInt(item.getPolarity().trim());
              } catch (NumberFormatException nfe) {
                  // Unparseable polarity: keep the default of 0.
              }
              updateKeywordTrackers(status, polarity);
          }
      }
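      The slides do not show updateKeywordTrackers. As a rough illustration of what per-keyword counting could look like, here is a sketch under stated assumptions: the class KeywordTrackers, the Sentiment enum, and the record/drain methods are all hypothetical, and the mapping from the classifier's integer polarity to a sentiment class is deliberately left out because the talk does not give it.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: per-keyword sentiment counters that are drained periodically.
final class KeywordTrackers {
    enum Sentiment { NEUTRAL, POSITIVE, NEGATIVE }

    private final Map<String, EnumMap<Sentiment, Integer>> counts = new LinkedHashMap<>();

    KeywordTrackers(String... keywords) {
        for (String k : keywords) {
            counts.put(k.toLowerCase(Locale.ROOT), zeroed());
        }
    }

    private static EnumMap<Sentiment, Integer> zeroed() {
        EnumMap<Sentiment, Integer> m = new EnumMap<>(Sentiment.class);
        for (Sentiment s : Sentiment.values()) {
            m.put(s, 0);
        }
        return m;
    }

    // Counts the tweet once for every tracked keyword its text mentions.
    synchronized void record(String tweetText, Sentiment sentiment) {
        String text = tweetText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, EnumMap<Sentiment, Integer>> e : counts.entrySet()) {
            if (text.contains(e.getKey())) {
                e.getValue().merge(sentiment, 1, Integer::sum);
            }
        }
    }

    // Returns the counts gathered so far and starts a fresh window.
    synchronized Map<String, EnumMap<Sentiment, Integer>> drain() {
        Map<String, EnumMap<Sentiment, Integer>> snapshot = new LinkedHashMap<>();
        for (Map.Entry<String, EnumMap<Sentiment, Integer>> e : counts.entrySet()) {
            snapshot.put(e.getKey(), e.getValue());
            e.setValue(zeroed());
        }
        return snapshot;
    }

    public static void main(String[] args) {
        KeywordTrackers trackers = new KeywordTrackers("obama", "romney");
        trackers.record("Obama speaks today", Sentiment.POSITIVE);
        trackers.record("obama vs romney tonight", Sentiment.NEUTRAL);
        System.out.println(trackers.drain());
    }
}
```

      Draining once a minute would yield one set of NEUTRAL, POSITIVE and NEGATIVE counts per keyword per minute, which matches the shape of the rows described in the deck.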
  • 15. Real Time Sentiment Analysis: Writing to HBase
      private void writeToHBase() {
          // Build a minute-resolution timestamp: YYYY:MM:DD:HH:mm
          Calendar cal = Calendar.getInstance();
          String calStr = String.format("%04d", cal.get(Calendar.YEAR)) + ":"
                  + String.format("%02d", cal.get(Calendar.MONTH) + 1) + ":"  // Calendar.MONTH is zero-based
                  + String.format("%02d", cal.get(Calendar.DAY_OF_MONTH)) + ":"
                  + String.format("%02d", cal.get(Calendar.HOUR_OF_DAY)) + ":"
                  + String.format("%02d", cal.get(Calendar.MINUTE));
          // Row key is "<keyword>:<timestamp>"
          String rowKey = keyword + ":" + calStr;
          // One column per sentiment class, all in the same column family
          Put put = new Put(rowKey.getBytes());
          put.add(COLFAM1.getBytes(), "NEUTRAL".getBytes(), tracker.getNeutralCount().getBytes());
          put.add(COLFAM1.getBytes(), "POSITIVE".getBytes(), tracker.getPositiveCount().getBytes());
          put.add(COLFAM1.getBytes(), "NEGATIVE".getBytes(), tracker.getNegativeCount().getBytes());
          try {
              table.put(put);
          } catch (Exception ex) {
              System.err.println(ex);
          }
      }
  • 16. Reading from HBase: Various Options
      Technologies for writing HBase clients:
      - Option 1: HBase Client. The Service Node runs a Java client linked to the HBase Client classes
      - Option 2: Thrift RPC. The Service Node runs a Thrift client that speaks the Thrift protocol to an HBase Thrift Gateway on a Service Node
      - Option 3: REST API. The Service Node speaks REST (HTTP or HTTPS) to an HBase REST Gateway
  • 17. Reading from HBase and presenting to the user’s browser
      [Diagram: Hadoop/HBase in the cloud]
      - Service Node
          - Tomcat: proxy, serves static html, forwards REST scans
          - HBase REST Gateway: scans the HTable
      - Master: NN, HBase Master
      - Three Hadoop Slaves, each running a DataNode and a Region Server
  • 18. Tomcat as HTTP Proxy
      - The HBase Stargate REST server runs on port 8081 and is connected to HBase
      - The REST server can scan tables
      - A JavaScript web page is the client
      - Problem:
          - JavaScript security restrictions (the browser's same-origin policy) do not allow the JavaScript to make REST calls to any server other than the one it was loaded from
          - Tomcat is used as a proxy. It serves up:
              - Static HTML pages with the JavaScript client, images, etc.
              - REST requests from the JavaScript client, which are proxied to the HBase Stargate server running on port 8081
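      A table scan for one keyword over a recent time window needs a start and a stop row key. The sketch below shows one way such a range could be derived from the keyword:YYYY:MM:DD:HH:mm key layout used in the deck. The class SentimentScanRange and its methods are hypothetical, and how the two keys are handed to the REST gateway is not covered here. The only HBase behavior it relies on is that a scan's start row is inclusive and its stop row is exclusive.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: computes the key range for "last N minutes of one keyword".
final class SentimentScanRange {
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("uuuu:MM:dd:HH:mm");

    final String startRow; // first key included in the scan
    final String stopRow;  // first key excluded from the scan

    private SentimentScanRange(String startRow, String stopRow) {
        this.startRow = startRow;
        this.stopRow = stopRow;
    }

    // Covers the current minute and the (minutes - 1) minutes before it.
    static SentimentScanRange lastMinutes(String keyword, LocalDateTime now, int minutes) {
        LocalDateTime end = now.plusMinutes(1); // exclusive upper bound
        LocalDateTime start = end.minusMinutes(minutes);
        return new SentimentScanRange(
                keyword + ":" + MINUTE.format(start),
                keyword + ":" + MINUTE.format(end));
    }

    // Mirrors scan semantics: startRow <= key < stopRow, compared lexicographically.
    boolean contains(String rowKey) {
        return rowKey.compareTo(startRow) >= 0 && rowKey.compareTo(stopRow) < 0;
    }

    public static void main(String[] args) {
        SentimentScanRange r =
                lastMinutes("obama", LocalDateTime.of(2012, 6, 4, 13, 34), 60);
        System.out.println(r.startRow + " .. " + r.stopRow);
        System.out.println(r.contains("obama:2012:06:04:13:34"));
    }
}
```

      Rows for other keywords fall outside the range because the keyword is the leading part of the key, so a single range never mixes keywords.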
  • 19. Future Improvements
      - Elastic HBase in the cloud
      - At night, use one VM to receive tweets and write them out into SequenceFiles in S3
      - Before business hours, start up HBase and run an MR job to process all these SequenceFiles and write into HBase
      - Cost-effective real time HBase application in the cloud
  • 20. Big Data Apps in the Cloud
      - The Cloud is suitable for Big Data apps which use Big Data from the Internet. For example:
          - Twitter Public Status Updates
          - Blog entries
          - Web Crawl data
      - Big Data apps in the cloud are not useful if all your data is generated inside your network
          - Router, Storage device, Authentication device logs
          - Logs from Web Servers located inside your network
  • 21. Questions, Comments, Flames?
      - Thanks!
      - Jagane Sundar
      - jagane@altoscale.com