Real Time Analytics for Twitter with a Scalable Architecture
1. Real Time Analytics for Big Data
A Twitter Case Study
Uri Cohen
Head of Product
@uri1803
2. Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
— Edd Dumbill, O'Reilly
© Copyright 2011 GigaSpaces Ltd. All Rights Reserved
3. We’re Living in a Real Time World…
Facebook Real Time Social Analytics · Real Time User Engagement · Google Real Time Web Analytics · Twitter paid tweet analytics · New Real Time Analytics Startups · Google Real Time Search
4. 3 Flavors of Analytics
Counting · Correlating · Research
5. Analytics @ Twitter – Counting
How many signups, tweets, and retweets for a topic?
What's the average latency?
Demographics:
Countries and cities
Gender
Age groups
Device types
…
6. Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
7. Analytics @ Twitter – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: how can you tell when a user spams?
8. It’s All about Timing
“Real time” (< a few seconds) · Reasonably quick (seconds to minutes) · Batch (hours/days)
9. It’s All about Timing
• Event driven / stream processing ← this is what we're here to discuss
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
10. Twitter in Numbers (March 2011)
It takes a week for users to send 1 billion Tweets.
Source: http://blog.twitter.com/2011/03/numbers.html
11. Twitter in Numbers (March 2011)
On average, 140 million tweets get sent every day.
Source: http://blog.twitter.com/2011/03/numbers.html
12. Twitter in Numbers (March 2011)
The highest throughput to date is 6,939 tweets/sec.
Source: http://blog.twitter.com/2011/03/numbers.html
13. Twitter in Numbers (March 2011)
460,000 new accounts are created daily.
Source: http://blog.twitter.com/2011/03/numbers.html
14. Twitter in Numbers
5% of the users generate 75% of the content.
Source: http://www.sysomos.com/insidetwitter/
15. Challenge 1 – Computing Reach
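The slide only names the challenge; as an illustration, here is a minimal single-process sketch of one common definition of reach: the number of *distinct* followers across everyone who tweeted a given URL. All names and data below are hypothetical, and plain dicts stand in for the follower store.

```python
# Hypothetical data: who tweeted each URL, and who follows each user.
tweeters = {
    "cool.example.com/article": ["alice", "bob"],
}
followers = {
    "alice": ["bob", "carol", "dave"],
    "bob": ["carol", "erin"],
}

def reach(url: str) -> int:
    """Union the follower sets of all tweeters of `url`, count distinct users."""
    audience = set()
    for user in tweeters.get(url, []):
        audience.update(followers.get(user, []))
    return len(audience)

print(reach("cool.example.com/article"))  # 4 (bob, carol, dave, erin)
```

The point of the challenge is that at Twitter scale each follower lookup and the distinct-set union must themselves be distributed, which is what the rest of the deck builds toward.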
17. Challenge 2 – Word Count
Tweets → Count → Word:Count
• Hottest topics
• URL mentions
• etc.
18. URL Mentions – Here’s One Use Case
19. Analyze the Problem
(Tens of) thousands of tweets per second to process
Assumption: need to process in near real time
Aggregate counters for each word
A few tens of thousands of words (or hundreds of thousands if we include URLs)
System needs to scale linearly
20. It’s Really Just an Online Map/Reduce
21. Event Driven Flow
Raw tweets → Tokenize → Filter → Count → Store counters
Raw tweets are also stored for research.
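The tokenize → filter → count flow can be sketched in a few lines. This is a toy single-process version: an assumed stop-word list stands in for the filter stage, and an in-memory `Counter` stands in for the counter store.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "at", "rt"}  # assumed filter list

def tokenize(tweet: str):
    """Tokenize stage: split a tweet into lowercase words."""
    return tweet.lower().split()

def keep(word: str) -> bool:
    """Filter stage: drop stop words and @mentions."""
    return word not in STOP_WORDS and not word.startswith("@")

counters = Counter()  # Count stage: word -> running count

def process(tweet: str):
    for word in tokenize(tweet):
        if keep(word):
            counters[word] += 1

process("the quick brown fox")
process("a quick test")
print(counters["quick"])  # 2
```

Everything after this slide is about making each of these three stages survive tens of thousands of tweets per second.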
22. It’s Not That Simple…
(Tens) of thousands of tweets to tokenize every second, even more words to filter → CPU bottleneck
Tens/hundreds of thousands of counters to update → counter contention
Tens/hundreds of thousands of counters to persist → database bottleneck
(Tens) of thousands of tweets to store every second in the database → database bottleneck
23. Implementation Challenges
Twitter is growing by the day: how can this scale?
Fault tolerance: what happens if a server fails?
Consistency: counters should be accurate!
25. Dealing with the Massive Tweet Stream
We need a system that can scale linearly
Event partitioning to the rescue!
What to partition by?
Tokenizer 1 → Filterer 1
Tokenizer 2 → Filterer 2
Tokenizer 3 → Filterer 3
…
Tokenizer n → Filterer n
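One common answer to “what to partition by” is to hash each word, so every occurrence of a given word routes to the same partition and each counter has exactly one owner. A minimal sketch, where the partition count and the choice of hash are assumptions (a stable hash like MD5 is used here because Python's built-in `hash()` is randomized per process):

```python
import hashlib

NUM_PARTITIONS = 4  # assumed cluster size

def partition_for(word: str) -> int:
    """Route every occurrence of `word` to the same partition,
    so partitions can be scaled out independently."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# The same word always lands on the same partition:
assert partition_for("obama") == partition_for("obama")
```

Raw tweets, by contrast, can be sprayed round-robin across tokenizers, since tokenizing one tweet needs no state from another.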
26. Counters Persistence & Contention
Why not keep things in memory?
Treat it as a system of record (SoR)
IMDG:
Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
27. Counters Persistence & Contention
Why not keep things in memory (data grid)?
Question: how to partition counters?
IMDG:
Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
28. Counters Persistence & Contention
Why not keep things in memory?
Question: how to partition counters?
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek):
• Disk: 5-10 ms
• RAM: ~0.001 ms
IMDG:
Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
29. Database Bottleneck – Storing Raw Tweets
RDBMS won’t cut it (unless you have $$$$$)
Hadoop / NoSQL to the rescue
Use the data grid for batching and as a reliable transaction log
Persister 1 · Persister 2 · … · Persister n
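The batching persister can be sketched as follows. This is a toy: a plain in-process queue stands in for the grid's reliable transaction log, and `write_batch` stands in for a bulk write into Hadoop/NoSQL (all names are hypothetical).

```python
import queue

class BatchPersister:
    """Drain tweets from the (grid-backed) log and write them to the
    backing store in batches instead of one row at a time."""

    def __init__(self, write_batch, batch_size=100, flush_secs=1.0):
        self.q = queue.Queue()          # stands in for the grid's log
        self.write_batch = write_batch  # e.g. a bulk insert into NoSQL
        self.batch_size = batch_size
        self.flush_secs = flush_secs

    def offer(self, tweet):
        self.q.put(tweet)

    def drain_once(self):
        """Collect up to batch_size tweets (or until the flush timeout)
        and hand them to the store as a single batch."""
        batch = []
        try:
            while len(batch) < self.batch_size:
                batch.append(self.q.get(timeout=self.flush_secs))
        except queue.Empty:
            pass
        if batch:
            self.write_batch(batch)
        return len(batch)

stored = []
p = BatchPersister(stored.extend, batch_size=3, flush_secs=0.05)
for t in ("t1", "t2", "t3"):
    p.offer(t)
print(p.drain_once())  # 3 tweets written as one batch
```

Because the log is reliable (in the grid, replicated to a backup), a persister crash loses no tweets: an unacknowledged batch is simply re-drained.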
30. Counter Consistency & Resiliency
Primary/hot backup setup, sync replication
Transactional, atomic updates
“On demand” backups
IMDG: each partition pairs a primary (P) with a hot backup (B)
P B
P B
P B
P B
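As a rough single-process analogy for “transactional, atomic updates”, here a lock stands in for the grid's transaction; a real IMDG would additionally replicate each committed update synchronously to the hot backup. All class and variable names are hypothetical.

```python
import threading
from collections import defaultdict

class AtomicCounters:
    """Toy stand-in for atomic in-memory counter updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, word, delta=1):
        # The lock plays the role of the grid transaction: no two
        # updaters can interleave a read-modify-write on a counter.
        with self._lock:
            self._counts[word] += delta
            return self._counts[word]

c = AtomicCounters()
threads = [
    threading.Thread(target=lambda: [c.increment("w") for _ in range(1000)])
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.increment("w", 0))  # 4000: no updates lost under concurrency
```

Without the lock, concurrent read-modify-write cycles could silently drop increments, which is exactly the consistency failure the slide warns against.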
31. Counter Consistency & Resiliency
Primary/hot backup setup, sync replication
Transactional, atomic updates
“On demand” backups
IMDG: when a primary fails, its hot backup takes over and a new backup is started on demand
P B
P
P B
P B
P B
32. Implementing Event Flows
Use the data grid as event bus
Stateful objects as transactional events
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter
(each partition runs as a P/B pair)
33. Putting It All Together
IMDG
P B
P B
P B
P B
34. Putting It All Together
• Use an IMDG
• Partition events and counters
• Collocate event handlers with data for low latency and simple scaling
• Use atomic updates to update counters in memory
• Persist asynchronously to a big data store
35. Reducing Counter Contention - Optimization
Store processed tweets locally in each partition
Use periodic batch writes (every 1 sec) to update counters
Process batches as a whole
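The local-batching optimization can be sketched like this: each partition first aggregates word counts locally, then applies one merged delta per interval to the shared counters, turning N contended updates into one per distinct word. `LocalBatcher` is a hypothetical name, and a `Counter` stands in for the shared counter partition.

```python
from collections import Counter

class LocalBatcher:
    """Aggregate counts locally, flush merged deltas periodically."""

    def __init__(self, shared_counters):
        self.shared = shared_counters
        self.local = Counter()

    def record(self, word):
        self.local[word] += 1  # cheap, uncontended local update

    def flush(self):
        """Apply one merged delta per distinct word, then reset."""
        for word, delta in self.local.items():
            self.shared[word] += delta
        flushed = len(self.local)
        self.local.clear()
        return flushed

shared = Counter()
b = LocalBatcher(shared)
for w in ["spam", "spam", "ham", "spam"]:
    b.record(w)
b.flush()
print(shared["spam"])  # 3: three local updates became one shared write
```

In the deck's setup, `flush()` would run on the 1-second timer shown above; the trade-off is a bounded window of counts that are visible locally but not yet in the shared counters.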
36. Keep Your Eyes Open…
37. References
Learn and fork the code on github:
https://github.com/Gigaspaces/rt-analytics
Detailed blog post
http://bit.ly/gs-bigdata-analytics
Twitter in numbers:
http://blog.twitter.com/2011/03/numbers.html
Twitter Storm:
http://bit.ly/twitter-storm
Apache S4:
http://incubator.apache.org/s4/