Real Time Analytics for Big Data –
Lessons from Twitter (and beyond)

                             DeWayne Filppi
                                  @dfilppi
Big Data Predictions



         “Over the next few years we'll see the adoption of scalable
         frameworks and platforms for handling
         streaming, or near real-time, analysis and processing. In the
         same way that Hadoop has been borne out of large-scale web
         applications, these platforms will be driven by the needs of large-
         scale location-aware mobile, social and sensor use.”
         Edd Dumbill, O’REILLY




          ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data

     Velocity                                                   Volume




We’re Living in a Real Time World…
• Social
• User Tracking & Engagement
• Homeland Security
• eCommerce
• Financial Services
• Real Time Search
The Flavors of Big Data Analytics




    Counting                                Correlating               Research




Analytics @ Twitter – Counting

• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
    – Countries and cities
    – Gender
    – Age groups
    – Device types
    – …
Analytics @ Twitter – Correlating

• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research

• Sentiment analysis
    – “Obama is popular”
• Trends
    – “People like to tweet after watching American Idol”
• Spam patterns
    – How can you tell when a user spams?
It’s All about Timing




• “Real time” (< a few seconds)
• Reasonably quick (seconds to minutes)
• Batch (hours/days)
It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

[Callout: This is what we’re here to discuss]
TWITTER REAL-TIME
ANALYTICS SYSTEM




Twitter in Numbers (Jan 2013)



It takes a week for users to send 3 billion Tweets.
Source: http://blog.twitter.com/2011/03/numbers.html
Twitter in Numbers (Jan 2013)


On average, 500 million tweets get sent every day.
Source: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/l
Twitter in Numbers (Jan 2013)



The highest throughput to date is 33,388 tweets/sec.
Source: http://www.huffingtonpost.com/2013/01/02/tweets-per-second-record_n_2396915.html
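To put the daily and peak figures on the same scale, a quick back-of-the-envelope conversion helps. This is only a sketch using the numbers quoted on these slides; the variable names are illustrative.

```python
# Figures quoted on the slides.
TWEETS_PER_DAY = 500_000_000      # average daily volume
PEAK_TWEETS_PER_SEC = 33_388      # highest recorded throughput

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate implied by the daily volume.
avg_tweets_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the recorded peak sits above that average.
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_tweets_per_sec

print(f"average: {avg_tweets_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
```

The average works out to a little under 6,000 tweets/sec, so the recorded peak is close to six times the average. A system sized only for the average rate would not absorb such a burst.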


Twitter in Numbers (March 2011)



1,000,000 new accounts are created daily.
Source: http://www.mediabistro.com/alltwitter/50-twitter-fun-facts_b33589
Twitter in Numbers




5% of the users generate 75% of the content.
Source: http://www.sysomos.com/insidetwitter/
Analyze the Problem

• (Tens of) thousands of tweets per second to process
    – Assumption: Need to process in near real time
• Aggregate counters for each word
    – A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant
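The per-word counter requirement can be illustrated in a few lines. This is a single-process sketch only: the sample tweets and the `count_words` helper are assumptions made for illustration, and nothing here addresses the scaling or fault-tolerance requirements.

```python
from collections import Counter

def count_words(tweets):
    """Aggregate one counter per word across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        # Naive whitespace tokenization, lower-cased so "Storm" == "storm".
        counts.update(text.lower().split())
    return counts

# Hypothetical sample input.
sample_tweets = [
    "storm is fast",
    "hadoop is batch",
    "storm and hadoop",
]
word_counts = count_words(sample_tweets)
print(word_counts.most_common(3))
```

With tens or hundreds of thousands of distinct words the counter table itself stays small; the difficulty lies in the update rate, which is what the partitioning and event-driven designs address.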
Key Elements in Real Time Big Data Analytics
Sharding (Partitioning)

[Diagram: n parallel partitions, each made of Tokenizer i, Filterer i and Counter Updater i, for i = 1 … n]
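One common way to realize this kind of partitioning is to route each word to a partition with a stable hash, so that every counter updater owns a disjoint set of words. The sketch below assumes that approach; `NUM_PARTITIONS`, `partition_for` and the choice of `zlib.crc32` are illustrative, not taken from the slides.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count ("n" on the slide)

def partition_for(word, num_partitions=NUM_PARTITIONS):
    """Map a word to a partition index with a stable hash.

    zlib.crc32 is used instead of the built-in hash(), which is
    randomized per process for strings.
    """
    return zlib.crc32(word.encode("utf-8")) % num_partitions

# One independent counter per partition; no two partitions share a word.
partitions = [Counter() for _ in range(NUM_PARTITIONS)]

for word in "storm hadoop storm xap storm hadoop".split():
    partitions[partition_for(word)][word] += 1

# Each word's full count lives in exactly one partition.
total_storm = sum(p["storm"] for p in partitions)
print(total_storm)  # prints 3
```

Because a given word always lands on the same partition, counters never need cross-partition coordination, which is what lets the design scale by adding partitions.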
Use EDA (Event Driven Architecture)


[Diagram: Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter / Aggregator]
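The event flow on this slide can be sketched with a tiny in-process publish/subscribe bus. This is an assumption-laden illustration: the `subscribe`/`publish` helpers, the event type names and the stop-word list are all hypothetical, and dispatch here is synchronous, whereas a real deployment would decouple the stages with queues or a data grid.

```python
from collections import Counter, defaultdict

handlers = defaultdict(list)   # event type -> list of subscriber callbacks
counts = Counter()
STOP_WORDS = {"a", "the", "is", "and"}  # hypothetical filter list

def subscribe(event_type, handler):
    handlers[event_type].append(handler)

def publish(event_type, payload):
    # Deliver the event to every subscriber of this type.
    for handler in handlers[event_type]:
        handler(payload)

def tokenizer(raw_tweet):
    # Raw -> Tokenized: one event per word.
    for word in raw_tweet.lower().split():
        publish("tokenized", word)

def filterer(word):
    # Tokenized -> Filtered: drop stop words.
    if word not in STOP_WORDS:
        publish("filtered", word)

def counter(word):
    # Filtered -> counter update.
    counts[word] += 1

subscribe("raw", tokenizer)
subscribe("tokenized", filterer)
subscribe("filtered", counter)

publish("raw", "Storm is fast and Storm is scalable")
print(dict(counts))
```

Each stage knows only the event type it consumes and the one it emits, so stages can be replaced or scaled independently.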




Twitter Storm




Twitter Storm With Hadoop




Storm Overview




Storm Concepts
• Streams
    – Unbounded sequence of tuples
• Spouts
    – Source of streams (Queues)
• Bolts
    – Functions, Filters, Joins, Aggregations
• Topologies

[Diagram labels: Spouts, Bolt, Topologies]
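These four concepts can be mimicked in plain Python to show how they relate. This is a conceptual sketch, not the Storm API: the function names are hypothetical, and a finite list stands in for an unbounded queue.

```python
from collections import Counter

def tweet_spout(queue):
    """Spout: turns a source (a list standing in for a queue)
    into a stream of tuples."""
    for text in queue:
        yield (text,)

def split_bolt(stream):
    """Bolt acting as a function: one tweet tuple in, many word tuples out."""
    for (text,) in stream:
        for word in text.lower().split():
            yield (word,)

def count_bolt(stream):
    """Bolt acting as an aggregation: emits (word, running_count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def run_topology(queue):
    """Topology: the wiring of spout -> bolt -> bolt."""
    return count_bolt(split_bolt(tweet_spout(queue)))

emitted = list(run_topology(["storm storm", "hadoop storm"]))
print(emitted[-1])  # prints ('storm', 3)
```

Generators are lazy, so the same wiring would keep working if the list were replaced by a source that never ends, which is the sense in which a stream is unbounded.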

Challenge – Word Count
[Diagram: Tweets → Count → Word:Count]




                                                                • Hottest topics
                                                                • URL mentions
                                                                • etc.
Streaming word count with Storm




Supercharging Storm
• Storm doesn’t supply persistence, but provides for it.
• Storm optimizes I/O to slow persistence (e.g. databases) using batching.
• Storm processes streams. The stream provider itself needs to support persistence, batching, and reliability.
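The batching idea can be sketched independently of Storm: accumulate increments in memory and push them to slow persistence in groups. The `BatchingWriter` class, its `write_batch` callback and the batch size below are hypothetical names chosen for illustration, not part of any Storm API.

```python
from collections import Counter

class BatchingWriter:
    """Accumulate counter increments in memory and write them in batches.

    `write_batch` stands in for a slow persistence call (for example a
    database upsert); it is a hypothetical callback.
    """

    def __init__(self, write_batch, batch_size=3):
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.pending = Counter()
        self.pending_events = 0

    def add(self, word):
        self.pending[word] += 1
        self.pending_events += 1
        if self.pending_events >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # One write carries the combined increments of many events.
            self.write_batch(dict(self.pending))
            self.pending.clear()
            self.pending_events = 0

writes = []  # records each simulated persistence call
writer = BatchingWriter(writes.append, batch_size=3)
for word in ["storm", "storm", "xap", "hadoop"]:
    writer.add(word)
writer.flush()  # push out the final partial batch
print(len(writes), writes)
```

Four events result in two writes here. The trade-off is that increments still pending in memory are lost on a crash unless the stream source can replay them, which is why the stream provider's reliability matters.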


[Diagram: Tweets, events, whatever…]
XAP Real Time Analytics




Two Layer Approach
• Advantage: Minimal “impedance mismatch” between layers.
    – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests.
• Grid layer serves as a real-time computation fabric for CEP, and limited (to allocated memory) real-time distributed query capability.

[Diagram labels: Raw Event Stream (×3), Real Time Events, Reporting Engine, In Memory Compute Cluster (Raw And Derived Events), NoSQL Cluster, SCALE]
Simplified Architecture




Key Concepts

• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
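The "raw events flushed, aggregations retained" idea can be sketched as a small in-memory layer that evicts old raw events to a backing store while keeping the derived counters. Everything below is illustrative: `InMemoryLayer`, `max_raw` and the list standing in for the NoSQL store are assumptions, not XAP APIs.

```python
from collections import Counter, deque

class InMemoryLayer:
    """Keep derived counters in memory; hand raw events to a backing store.

    `archive` stands in for the NoSQL store; it is just a list here.
    """

    def __init__(self, archive, max_raw=2):
        self.archive = archive
        self.raw = deque()
        self.max_raw = max_raw
        self.counts = Counter()   # derived data, retained in memory

    def on_event(self, text):
        self.raw.append(text)
        self.counts.update(text.lower().split())
        # Evict the oldest raw events once the memory budget is exceeded.
        while len(self.raw) > self.max_raw:
            self.archive.append(self.raw.popleft())

archive = []
layer = InMemoryLayer(archive, max_raw=2)
for tweet in ["storm rocks", "xap rocks", "storm again"]:
    layer.on_event(tweet)

print(len(layer.raw), archive, layer.counts["storm"])
```

Memory use is bounded by the raw-event budget plus the size of the aggregates, while the counters keep reflecting every event ever seen.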



Keep Things In Memory

   Facebook keeps 80% of its data in memory
   (Stanford research)

   RAM is 100-1000x faster than disk (random seek)
   • Disk: 5-10 ms
   • RAM: ~0.001 ms
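The two latency figures on this slide can be compared directly. This is a minimal arithmetic sketch that uses only the numbers quoted here; the variable names are illustrative.

```python
DISK_SEEK_MS = (5.0, 10.0)   # random seek range quoted on the slide
RAM_ACCESS_MS = 0.001        # approximate RAM access time quoted on the slide

low_ratio = DISK_SEEK_MS[0] / RAM_ACCESS_MS
high_ratio = DISK_SEEK_MS[1] / RAM_ACCESS_MS
print(f"disk/RAM latency ratio: {low_ratio:,.0f}x to {high_ratio:,.0f}x")
```

Taken at face value, the two bullet figures give a ratio of roughly 5,000 to 10,000 for a random seek, which is above the 100-1000x headline range.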
Takeaways
• A data grid can serve different needs for big data analytics:
    • Supercharge a dedicated stream processing cluster like Storm.
        – Provide fast, reliable, transactional tuple streams and state
    • Provide a general purpose analytics platform
        – Roll your own
    • Simplify overall architecture while enhancing scalability
        – Ultra high performance/low latency
        – Dynamically scalable processing and in-memory storage
        – Eliminate messaging tier
        – Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/


Editor's notes

  1. ActiveInsight
  2. New Year's Day 2013, Asia region