SlideShare a Scribd company logo
1 of 31
Download to read offline
Twitter Big Data

           Colin Surprenant
           @colinsurprenant
           Lead ninja

           warning: this presentation contains a few rage faces
Twitter - Spring 2012

● 350 million   tweets/day
● 140 million   active users
● >1 million    applications using API
Daily Twitter @ Needium



50 000 000 processed tweets
     5 000 opportunities
       500 messages sent

   100GB data
Needium Geo
Anatomy of a tweet
What's in a tweet?
                                  timestamp
                 user




                        message
  avatar


Anything else?
How to get the tweets?

● Streaming API
  Subscribe to realtime feeds moving forward

● REST Search API
  Search request on past data (1 week)
Streaming API
public statuses from all users


● status/filter
    track/location/follow
     ○ 5000 follow user ids
     ○ 400 track keywords
     ○ 25 location boxes
     ○ rate limited

● status/sample
    ○   1% of all public statuses (message id mod 100)
    ○   two status/sample streams will result in same data
Streaming API
per user streams




● User Streams
    ○   all data required to update a user's display
    ○   requires user's OAuth token
    ○   statuses from followings, direct messages, mentions
    ○   cannot open large number of user streams from same host


● Site Streams
    ○   multiplexing of multiple User Streams
Streaming API
Firehose


need more/full data?           only through partners

● gnip.com
● datasift.com

●   filtering/tracking
●   partial to full Firehose


What's the catch?
Streaming API
Firehose


Base Twitter data license


$0.10 per 1000 tweets
~$1 million/month
approx for full Firehose
Streaming API
Firehose


startup?
Search API

● REST API (http request/response)
  ○   search query
  ○   geocode (lat, long, radius)
  ○   result type (mixed/recent/popular)
  ○   since id


● max 100 rpp and 1500 results
● rate limited (~1 request/sec)
Twitter Geo

NO simple way to grab
ALL tweets for a given region
Twitter Geo
Streaming API

● status/filter + location (bounding box)
   ○ only tweets with explicit coordinates
   ○ < 10% of all tweets
Twitter Geo
Streaming API

● Firehose
   ○ < 10% of all tweets contains explicit coordinates
   ○ must do reverse geocoding on user profile location
   ○ user profile location is free form
Twitter Geo
Search API

● geocode (lat, long, radius)
  ● tweets with explicit coordinates
  ● tweets reverse geocoded from user profile location
   ● location field: free form text (Montreal / Montreal,Qc / Mtl / Mourial)
   ● false positives
   ● REST API: not for frequent polling
   ● rate limited (1 req/sec/ip)
Twitter Geo
Solutions?




That's your job!




But seriously?
Twitter Geo
● search API intelligent polling farm
    ○   adjust polling interval to minimize polling in relation to traffic


● streaming API status/filter/follow reader farm?
  ○ find N relevant users from city, # stream readers = N / 5000
  ○ must do reverse geocoding
  ○ user list dynamic update

●   TOS gray zone
Storm
Distributed and fault-tolerant realtime computation

                         https://github.com/nathanmarz/storm
Storm
                     The promise

●   Guaranteed data processing
●   Horizontal scalability
●   Fault-tolerance
●   No intermediate message brokers
●   Higher level abstraction than message passing
●   Just work
RedStorm

    JRuby integration & DSL for Storm

   Simplicity of Ruby + power of Storm

    https://github.com/colinsurprenant/redstorm



                       +
Storm
Typical use cases




     Stream processing   Continuous computation
Storm
Concepts

                       Streams


  Tuple        Tuple    Tuple    Tuple   Tuple




           Unbounded sequence of tuples
Storm
Concepts

                           Spouts
                                            Tuple
                                    Tuple
                            Tuple
                   Tuple
           Tuple



           Tuple
                   Tuple
                            Tuple
                                    Tuple
                                            Tuple


                   Source of streams
Storm
Concepts

                                          Bolts

  Tuple
          Tuple
                  Tuple
                          Tuple
                                  Tuple


                                                  Tuple   Tuple   Tuple   Tuple   Tuple


                                  Tuple
                          Tuple
                  Tuple
          Tuple
  Tuple




   Processes input streams and produce new streams
Storm
Concepts

                  Topology




           Network of spouts and bolts
Storm
                   What Storm does

●   Distributes code
●   Robust process management
●   Monitors topologies and reassigns failed tasks
●   Provides reliability by tracking tuple trees
●   Routing and partitioning of stream
Tweitgeist
Live top 10 trending hashtags on Twitter




      DEMO


                  https://github.com/colinsurprenant/tweitgeist
                  Live demo: http://tweitgeist.needium.com
Tweitgeist
              TwitterStream             ExtractMessage             ExtractHashtag                   RollingCount                   Rank            Merge
              Spout                     Bolt                       Bolt                             Bolt                           Bolt            Bolt
 Twitter




                                                                                                                   field hashtag
                                                                                    field hashtag




                                                                                                                                                              UI
                                                                                                                                          global
                              shuffle




                                                         shuffle
Redis                                                                                                                                                        Redis
queue                                                                                                                                                        queue




              stream                    message hashtag                                              rolling ranking                               merging
              reader                    extract extract                                              counter


           Shuffle grouping:                    Tuples are randomly distributed across the bolt's tasks
           Fields grouping:                     The stream is partitioned by the fields specified in the grouping
           Global grouping:                     The entire stream goes to a single one of the bolt's tasks
Tweitgeist
Topology definition

More Related Content

What's hot

Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 

What's hot (20)

Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Hive
HiveHive
Hive
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Testing Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnitTesting Hadoop jobs with MRUnit
Testing Hadoop jobs with MRUnit
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Twitter Stream Processing
Twitter Stream ProcessingTwitter Stream Processing
Twitter Stream Processing
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Java rmi
Java rmiJava rmi
Java rmi
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Yarn.ppt
Yarn.pptYarn.ppt
Yarn.ppt
 
Big Data Tech Stack
Big Data Tech StackBig Data Tech Stack
Big Data Tech Stack
 

Similar to Twitter Big Data

Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
Uday Vakalapudi
 
project proposal
project proposalproject proposal
project proposal
2ugene
 
Making Sense of Location-based Microposts using Stream Reasoning
Making Sense of Location-based Microposts using Stream ReasoningMaking Sense of Location-based Microposts using Stream Reasoning
Making Sense of Location-based Microposts using Stream Reasoning
Irene Celino
 

Similar to Twitter Big Data (18)

Storm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been MissingStorm - The Real-Time Layer Your Big Data's Been Missing
Storm - The Real-Time Layer Your Big Data's Been Missing
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentation
 
Bigdata roundtable-storm
Bigdata roundtable-stormBigdata roundtable-storm
Bigdata roundtable-storm
 
Apache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integrationApache Storm and twitter Streaming API integration
Apache Storm and twitter Streaming API integration
 
project proposal
project proposalproject proposal
project proposal
 
Bigdata analytics-twitter
Bigdata analytics-twitterBigdata analytics-twitter
Bigdata analytics-twitter
 
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
Stream Processing in the Cloud - Athens Kubernetes Meetup 16.07.2019
 
Handson with Twitter Heron
Handson with Twitter HeronHandson with Twitter Heron
Handson with Twitter Heron
 
C sanlitalk lvhash_spike_fromdatatoknowledge_mons2015
C sanlitalk lvhash_spike_fromdatatoknowledge_mons2015C sanlitalk lvhash_spike_fromdatatoknowledge_mons2015
C sanlitalk lvhash_spike_fromdatatoknowledge_mons2015
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Storm
StormStorm
Storm
 
Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013
Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013
Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013
 
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, TaptuMobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
Mobile Monday Athens launch, Konstantinos Papamiltiadis, Taptu
 
Incremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation systemIncremental pattern matching in the VIATRA2 model transformation system
Incremental pattern matching in the VIATRA2 model transformation system
 
Storm
StormStorm
Storm
 
Twitter by the Numbers
Twitter by the NumbersTwitter by the Numbers
Twitter by the Numbers
 
Making Sense of Location-based Microposts using Stream Reasoning
Making Sense of Location-based Microposts using Stream ReasoningMaking Sense of Location-based Microposts using Stream Reasoning
Making Sense of Location-based Microposts using Stream Reasoning
 

Recently uploaded

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Twitter Big Data

  • 1. Twitter Big Data Colin Surprenant @colinsurprenant Lead ninja warning: this presentation contains a few rage faces
  • 2. Twitter - Spring 2012 ● 350 million tweets/day ● 140 million active users ● >1 million applications using API
  • 3. Daily Twitter @ Needium 50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data
  • 5. Anatomy of a tweet What's in a tweet? timestamp user message avatar Anything else?
  • 6.
  • 7. How to get the tweets? ● Streaming API Subscribe to realtime feeds moving forward ● REST Search API Search request on past data (1 week)
  • 8. Streaming API public statuses from all users ● status/filter track/location/follow ○ 5000 follow user ids ○ 400 track keywords ○ 25 location boxes ○ rate limited ● status/sample ○ 1% of all public statuses (message id mod 100) ○ two status/sample streams will result in same data
  • 9. Streaming API per user streams ● User Streams ○ all data required to update a user's display ○ requires user's OAuth token ○ statuses from followings, direct messages, mentions ○ cannot open large number of user streams from same host ● Site Streams ○ multiplexing of multiple User Streams
  • 10. Streaming API Firehose need more/full data? only through partners ● gnip.com ● datasift.com ● filtering/tracking ● partial to full Firehose What's the catch?
  • 11. Streaming API Firehose Base Twitter data license $0.10 per 1000 tweets ~$1 million/month approx for full Firehose
  • 13. Search API ● REST API (http request/response) ○ search query ○ geocode (lat, long, radius) ○ result type (mixed/recent/popular) ○ since id ● max 100 rpp and 1500 results ● rate limited (~1 request/sec)
  • 14. Twitter Geo NO simple way to grab ALL tweets for a given region
  • 15. Twitter Geo Streaming API ● status/filter + location (bounding box) ○ only tweets with explicit coordinates ○ < 10% of all tweets
  • 16. Twitter Geo Streaming API ● Firehose ○ < 10% of all tweets contains explicit coordinates ○ must do reverse geocoding on user profile location ○ user profile location is free form
  • 17. Twitter Geo Search API ● geocode (lat, long, radius) ● tweets with explicit coordinates ● tweets reverse geocoded from user profile location ● location field: free form text (Montreal / Montreal,Qc / Mtl / Mourial) ● false positives ● REST API: not for frequent polling ● rate limited (1 req/sec/ip)
  • 18. Twitter Geo Solutions? That's your job! But seriously?
  • 19. Twitter Geo ● search API intelligent polling farm ○ adjust polling interval to minimize polling in relation to traffic ● streaming API status/filter/follow reader farm? ○ find N relevant users from city, # stream readers = N / 5000 ○ must do reverse geocoding ○ user list dynamic update ● TOS gray zone
  • 20. Storm Distributed and fault-tolerant realtime computation https://github.com/nathanmarz/storm
  • 21. Storm The promise ● Guaranteed data processing ● Horizontal scalability ● Fault-tolerance ● No intermediate message brokers ● Higher level abstraction than message passing ● Just work
  • 22. RedStorm JRuby integration & DSL for Storm Simplicity of Ruby + power of Storm https://github.com/colinsurprenant/redstorm +
  • 23. Storm Typical use cases Stream processing Continuous computation
  • 24. Storm Concepts Streams Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples
  • 25. Storm Concepts Spouts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Source of streams
  • 26. Storm Concepts Bolts Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Tuple Processes input streams and produce new streams
  • 27. Storm Concepts Topology Network of spouts and bolts
  • 28. Storm What Storm does ● Distributes code ● Robust process management ● Monitors topologies and reassigns failed tasks ● Provides reliability by tracking tuple trees ● Routing and partitioning of stream
  • 29. Tweitgeist Live top 10 trending hashtags on Twitter DEMO https://github.com/colinsurprenant/tweitgeist Live demo: http://tweitgeist.needium.com
  • 30. Tweitgeist TwitterStream ExtractMessage ExtractHashtag RollingCount Rank Merge Spout Bolt Bolt Bolt Bolt Bolt Twitter field hashtag field hashtag UI global shuffle shuffle Redis Redis queue queue stream message hashtag rolling ranking merging reader extract extract counter Shuffle grouping: Tuples are randomly distributed across the bolt's tasks Fields grouping: The stream is partitioned by the fields specified in the grouping Global grouping: The entire stream goes to a single one of the bolt's tasks