Rainbird:
Real-time Analytics @Twitter
Kevin Weil -- @kevinweil
Product Lead for Revenue, Twitter




Agenda
‣   Why Real-time Analytics?
‣   Rainbird and Cassandra
‣   Production Uses at Twitter
‣   Open Source
My Background
‣   Mathematics and Physics at Harvard, Physics at
    Stanford
‣   Tropos Networks (city-wide wireless): mesh
    routing algorithms, GBs of data
‣   Cooliris (web media): Hadoop and Pig for
    analytics, TBs of data
‣   Twitter: Hadoop, Pig, HBase, Cassandra, data
    viz, social graph analysis, soon to be PBs of data
‣   Now revenue products!
Why Real-time Analytics
‣   Twitter is real-time
‣   ... even in space
And My Personal Favorite
Real-time Reporting
‣   Discussion around ad-based revenue model
‣   Help shape the conversation in real-time with
    Promoted Tweets
‣   Real-time reporting ties it all together
Requirements
‣   Extremely high write volume
‣      Needs to scale to 100,000s of writes per second (WPS)

‣   High read volume
‣      Needs to scale to 10,000s of reads per second (RPS)

‣   Horizontally scalable (reads, storage, etc.)
‣      Needs to scale to 100+ TB

‣   Low latency
‣      Most reads <100 ms (esp. recent data)
Cassandra
‣   Pro: In-house expertise
‣   Pro: Open source Apache project
‣   Pro: Writes are extremely fast
‣   Pro: Horizontally scalable, low latency
‣   Pro: Other startup adoption (Digg, SimpleGeo)
‣   Con: It was really young (0.3a)
Cassandra
‣   Pro: Some dudes at Digg had already started
    working on distributed atomic counters in
    Cassandra
‣   Say hi to @kelvin
‣   And @lenn0x
‣   A dude from Sweden began helping: @skr
‣   Now all at Twitter :)
Rainbird
‣   It counts things. Really quickly.
‣   Layers on top of the distributed
    counters patch, CASSANDRA-1072


‣   Relies on ZooKeeper, Cassandra, Scribe, Thrift
‣   Written in Scala
Rainbird Design
‣   Aggregators buffer for 1m
‣   Intelligent flush to Cassandra
‣   Query servers read once written
‣   1m is configurable
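The buffering step can be pictured with a minimal Python sketch. It is an illustration only, not Rainbird's code (Rainbird is written in Scala): the `Aggregator` class, its method names, and the `write` callback are hypothetical, and the slide does not say what makes the flush "intelligent", so this version simply combines increments to the same counter and emits one write per counter once the interval has elapsed.

```python
from collections import defaultdict


class Aggregator:
    """Hypothetical in-memory buffer; not Rainbird's actual aggregator."""

    def __init__(self, flush_interval_s=60):
        self.flush_interval_s = flush_interval_s  # the configurable "1m"
        self.buffer = defaultdict(int)
        self.last_flush = None

    def add(self, now, category, key, value):
        # Combine increments to the same counter in memory.
        if self.last_flush is None:
            self.last_flush = now
        self.buffer[(category, tuple(key))] += value

    def maybe_flush(self, now, write):
        # Emit one combined increment per counter once the interval has passed.
        if self.last_flush is None or now - self.last_flush < self.flush_interval_s:
            return 0
        flushed = len(self.buffer)
        for counter, delta in self.buffer.items():
            write(counter, delta)
        self.buffer.clear()
        self.last_flush = now
        return flushed
```

Two increments to the same counter inside one interval reach the store as a single write, which is the point of buffering before hitting Cassandra.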
Rainbird Data Structures
struct Event
{
    1: i32 timestamp,                                      // Unix timestamp of event
    2: string category,                                    // Stat category name
    3: list<string> key,                                   // Stat keys (hierarchical)
    4: i64 value,                                          // Actual count (diff)
    5: optional set<Property> properties,                  // More later
    6: optional map<Property, i64> propertiesWithCounts    // More later
}
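For readers who do not use Thrift, the same record can be mirrored as a plain Python dataclass. This is a sketch for illustration, not generated Thrift code: the slides do not define `Property`, so it is modeled here as a plain string, and the field values in the example are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Event:
    timestamp: int   # Unix timestamp of event (i32 in the Thrift struct)
    category: str    # stat category name
    key: List[str]   # hierarchical stat keys
    value: int       # actual count (diff), i64 in the Thrift struct
    # Property is not defined in the slides; modeled as str (assumption).
    properties: Optional[Set[str]] = None
    properties_with_counts: Optional[Dict[str, int]] = None


# Hypothetical Promoted Tweet impression event.
impression = Event(
    timestamp=1296018451,
    category="pti",
    key=["advertiser_42", "campaign_7", "tweet_1001"],
    value=1,
)
```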
Hierarchical Aggregation
‣   Say we’re counting Promoted Tweet impressions
‣   category = pti
‣   keys = [advertiser_id, campaign_id, tweet_id]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [advertiser_id, campaign_id, tweet_id]
‣      [advertiser_id, campaign_id]
‣      [advertiser_id]
‣   Means fast queries over each level of hierarchy
‣   Configurable in rainbird.conf, or dynamically via ZK
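The prefix expansion described above can be sketched in a few lines of Python. The function names and the dictionary standing in for Cassandra counters are assumptions made for illustration; the slides do not show Rainbird's implementation.

```python
def hierarchy_prefixes(keys):
    """Return every non-empty prefix of the key list, longest first."""
    return [tuple(keys[:n]) for n in range(len(keys), 0, -1)]


def increment(counters, category, keys, count=1):
    """Add count to the counter for each level of the hierarchy."""
    for prefix in hierarchy_prefixes(keys):
        counter_id = (category, prefix)
        counters[counter_id] = counters.get(counter_id, 0) + count


# A plain dict stands in for the distributed counters (assumption).
counters = {}
increment(counters, "pti", ["adv_1", "camp_1", "tweet_1"])
increment(counters, "pti", ["adv_1", "camp_2", "tweet_9"])
```

After these two events, the advertiser-level counter holds 2 while each campaign-level and tweet-level counter holds 1, so a query at any level is a single lookup.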
Hierarchical Aggregation
‣   Another example: tracking URL shortener tweets/clicks
‣   full URL = http://music.amazon.com/some_really_long_path
‣   keys = [com, amazon, music, full URL]
‣   count = 1
‣   Rainbird automatically increments the count for
‣      [com, amazon, music, full URL]    How many people tweeted full URL?
‣      [com, amazon, music]              How many people tweeted any music.amazon.com URL?
‣      [com, amazon]                     How many people tweeted any amazon.com URL?
‣      [com]                             How many people tweeted any .com URL?
‣   Means we can count clicks on full URLs
‣   And automatically aggregate over domains and subdomains!
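One way to derive such a key list from a URL is to reverse the host name labels and append the full URL. The sketch below uses Python's standard `urllib.parse`; the function name `url_to_keys` is hypothetical, and the slides do not say how Rainbird's clients actually build their keys.

```python
from urllib.parse import urlsplit


def url_to_keys(url):
    """Build hierarchical keys: reversed host labels, then the full URL."""
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    return list(reversed(labels)) + [url]


keys = url_to_keys("http://music.amazon.com/some_really_long_path")
```

Here `keys` is `["com", "amazon", "music", "http://music.amazon.com/some_really_long_path"]`, matching the example above.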
Temporal Aggregation
‣   Rainbird also does (configurable) temporal
    aggregation
‣   Each count is kept minutely, but also
    denormalized hourly, daily, and all time
‣   Gives us quick counts at varying granularities
    with no large scans at read time
‣      Trading storage for latency
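The denormalization can be sketched as writing each event into one bucket per granularity. This Python sketch assumes buckets are aligned to UTC by truncating the Unix timestamp and uses a `defaultdict` in place of Cassandra; the names `time_buckets` and `record` are hypothetical.

```python
from collections import defaultdict

# Bucket sizes in seconds; the granularities come from the slide.
GRANULARITY_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def time_buckets(timestamp):
    """Map a Unix timestamp to the start of each bucket it falls into."""
    buckets = {
        name: timestamp - timestamp % size
        for name, size in GRANULARITY_SECONDS.items()
    }
    buckets["all_time"] = 0  # a single bucket covering all time
    return buckets


def record(counters, key, timestamp, value=1):
    """Increment the counter for key at every granularity."""
    for granularity, start in time_buckets(timestamp).items():
        counters[(granularity, start, key)] += value
```

Every write touches four counters instead of one, but an hourly or daily read becomes a single lookup rather than a scan over minute buckets, which is the storage-for-latency trade named above.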
Multiple Formulas
‣   So far we have talked about sums
‣   Could also store counts (1 for each event)
‣   ... which gives us a mean
‣   And sums of squares (count * count for each event)
‣   ... which gives us a standard deviation
‣   And min/max as well
‣   Configure this per-category in rainbird.conf
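The arithmetic behind these formulas is short enough to show. With a running count n, sum S, and sum of squares Q of the event values, the mean is S/n and the population standard deviation is sqrt(Q/n - (S/n)^2). The `RunningStats` class below is a hypothetical Python illustration of that arithmetic, not Rainbird's API.

```python
import math


class RunningStats:
    """Keeps only additive aggregates, so each update is a counter increment."""

    def __init__(self):
        self.count = 0    # number of events
        self.total = 0    # sum of values
        self.sum_sq = 0   # sum of squared values
        self.min = None
        self.max = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.sum_sq += value * value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    def mean(self):
        return self.total / self.count

    def stddev(self):
        # Population standard deviation; clamp tiny negative rounding errors.
        mean = self.mean()
        return math.sqrt(max(self.sum_sq / self.count - mean * mean, 0.0))


stats = RunningStats()
for v in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(v)
```

For these eight values the aggregates are count 8, sum 40, and sum of squares 232, giving a mean of 5 and a standard deviation of 2.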
Rainbird
‣   Write 100,000s of events per second, each with
    hierarchical structure
‣   Query with minutely granularity over any level of
    the hierarchy, get back a time series
‣   Or query all time values
‣   Or query all time means, standard deviations
‣   Latency < 100ms
Production Uses
‣   It turns out we need to count things all the time
‣   As soon as we had this service, we started
    finding all sorts of use cases for it
‣      Promoted Products
‣      Tweeted URLs, by domain/subdomain
‣      Per-user Tweet interactions (fav, RT, follow)
‣      Arbitrary terms in Tweets
‣      Clicks on t.co URLs
Production Uses
‣   Promoted Tweet Analytics
‣      Each different metric is part of the key hierarchy
‣      Uses the temporal aggregation to quickly show
       different levels of granularity
‣      Data can be historical, or from 60 seconds ago
Production Uses
‣   Internal Monitoring and Alerting
‣   We require operational reporting on all internal services
‣   Needs to be real-time, but also want longer-term
    aggregates
‣   Hierarchical, too: [stat, datacenter, service, machine]
Production Uses
‣   Tweet Button Counts
‣   Tweet Button counts are requested many, many
    times each day from across the web
‣   Uses the all time field
Open Source?
‣   Yes!   ... but not yet
‣   Relies on unreleased version of Cassandra
‣      ... but the counters patch is committed in trunk (0.8)
‣      ... also relies on some internal frameworks we need to
       open source
‣   It will happen
‣   See http://github.com/twitter for proof of how much
    Twitter loves open source
Team
‣   John Corwin (@johnxorz)
‣   Adam Samet (@damnitsamet)
‣   Johan Oskarsson (@skr)
‣   Kelvin Kakugawa (@kelvin)
‣   Chris Goffinet (@lenn0x)
‣   Steve Jiang (@sjiang)
‣   Kevin Weil (@kevinweil)
If You Only Remember One Slide...
‣   Rainbird is a distributed, high-volume counting
    service built on top of Cassandra
‣   Write 100,000s of events per second, query them by
    hierarchy and at multiple time granularities, and get
    results back in <100 ms
‣   Used by Twitter for multiple products internally,
    including our Promoted Products, operational
    monitoring, and Tweet Button
‣   Will be open sourced so the community can use and
    improve it!
Questions?
        Follow me: @kevinweil





Realtimeanalyticsattwitter strata2011-110204123031-phpapp02

  • 1. Rainbird: Real-time Analytics @Twitter Kevin Weil -- @kevinweil Product Lead for Revenue, Twitter TM
  • 2. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 3. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data
  • 4. My Background ‣ Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data ‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data ‣ Twitter: Hadoop, Pig, HBase, Cassandra, data viz, social graph analysis, soon to be PBs of data Now revenue products!
  • 5. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 6. Why Real-time Analytics ‣ Twitter is real-time
  • 7. Why Real-time Analytics ‣ Twitter is real-time ‣ ... even in space
  • 8. And My Personal Favorite
  • 9. And My Personal Favorite
  • 10. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets
  • 11. Real-time Reporting ‣ Discussion around ad-based revenue model ‣ Help shape the conversation in real-time with Promoted Tweets ‣ Realtime reporting ties it all together
  • 12. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 13. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS
  • 14. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS
  • 15. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc) ‣ Needs to scale to 100+ TB
  • 16. Requirements ‣ Extremely high write volume ‣ Needs to scale to 100,000s of WPS ‣ High read volume ‣ Needs to scale to 10,000s of RPS ‣ Horizontally scalable (reads, storage, etc.) ‣ Needs to scale to 100+ TB ‣ Low latency ‣ Most reads <100 ms (esp. recent data)
  • 18. Cassandra ‣ Pro: In-house expertise ‣ Pro: Open source Apache project ‣ Pro: Writes are extremely fast ‣ Pro: Horizontally scalable, low latency ‣ Pro: Other startup adoption (Digg, SimpleGeo) ‣ Con: It was really young (0.3a)
  • 23. Cassandra ‣ Pro: Some dudes at Digg had already started working on distributed atomic counters in Cassandra ‣ Say hi to @kelvin ‣ And @lenn0x ‣ A dude from Sweden began helping: @skr ‣ Now all at Twitter :)
  • 25. Rainbird ‣ It counts things. Really quickly. ‣ Layers on top of the distributed counters patch, CASSANDRA-1072 ‣ Relies on Zookeeper, Cassandra, Scribe, Thrift ‣ Written in Scala
  • 26. Rainbird Design ‣ Aggregators buffer for 1m ‣ Intelligent flush to Cassandra ‣ Query servers read once written ‣ 1m is configurable
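The buffering step on the Rainbird Design slide can be sketched roughly as follows. This is a minimal illustration, not Rainbird's code: `BufferingAggregator`, its `flush` callback, and the explicit `now` argument are all hypothetical names chosen for the example, and the flush here is a plain timer check rather than the "intelligent flush" the slide mentions (whose details are not given).

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

# (category, hierarchical key) identifies one counter.
CounterKey = Tuple[str, Tuple[str, ...]]


class BufferingAggregator:
    """Sums increments in memory and hands them to a sink once per interval."""

    def __init__(self, flush: Callable[[Dict[CounterKey, int]], None],
                 flush_interval_s: float = 60.0, start_time: float = 0.0) -> None:
        self._flush = flush                # hypothetical sink, e.g. a Cassandra writer
        self._interval = flush_interval_s  # the configurable "1m" from the slide
        self._buffer: Dict[CounterKey, int] = defaultdict(int)
        self._last_flush = start_time

    def record(self, category: str, key: Tuple[str, ...], value: int, now: float) -> None:
        # Many events for the same counter collapse into one buffered sum.
        self._buffer[(category, key)] += value
        if now - self._last_flush >= self._interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        if self._buffer:
            self._flush(dict(self._buffer))  # pass a copy, then reset the buffer
            self._buffer.clear()
        self._last_flush = now
```

The point of the design is that a busy counter costs one write per interval instead of one write per event, at the price of data being up to one interval old when it becomes readable.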
  • 27. Rainbird Data Structures
        struct Event {
          1: i32 timestamp,                                     // Unix timestamp of event
          2: string category,                                   // Stat category name
          3: list<string> key,                                  // Stat keys (hierarchical)
          4: i64 value,                                         // Actual count (diff)
          5: optional set<Property> properties,                 // More later
          6: optional map<Property, i64> propertiesWithCounts   // More later
        }
  • 33. Hierarchical Aggregation ‣ Say we’re counting Promoted Tweet impressions ‣ category = pti ‣ keys = [advertiser_id, campaign_id, tweet_id] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [advertiser_id, campaign_id, tweet_id] ‣ [advertiser_id, campaign_id] ‣ [advertiser_id] ‣ Means fast queries over each level of hierarchy ‣ Configurable in rainbird.conf, or dynamically via ZK
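As a rough sketch of the Promoted Tweet impressions example above, one event can be fanned out into an increment for every prefix of its key. The function names and the in-memory `Counter` are assumptions made for illustration; in Rainbird the increments go to Cassandra counters.

```python
from collections import Counter
from typing import Iterator, Sequence, Tuple


def key_prefixes(key: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty prefix of a hierarchical key, longest first."""
    for depth in range(len(key), 0, -1):
        yield tuple(key[:depth])


def increment(counters: Counter, category: str, key: Sequence[str], value: int = 1) -> None:
    """Apply one event to the counter of each level of the hierarchy."""
    for prefix in key_prefixes(key):
        counters[(category, prefix)] += value


counters: Counter = Counter()
increment(counters, "pti", ["adv7", "camp3", "tweet42"])
increment(counters, "pti", ["adv7", "camp3", "tweet43"])
# counters[("pti", ("adv7",))] is now 2, while each tweet-level counter is 1.
```

Because every level is written at ingest time, a query for an advertiser's total is a single counter read rather than a scan over its campaigns and tweets.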
  • 34. Hierarchical Aggregation ‣ Another example: tracking URL shortener tweets/clicks ‣ full URL = http://music.amazon.com/some_really_long_path ‣ keys = [com, amazon, music, full URL] ‣ count = 1 ‣ Rainbird automatically increments the count for ‣ [com, amazon, music, full URL] (how many people tweeted the full URL?) ‣ [com, amazon, music] (how many people tweeted any music.amazon.com URL?) ‣ [com, amazon] (how many people tweeted any amazon.com URL?) ‣ [com] (how many people tweeted any .com URL?) ‣ Means we can count clicks on full URLs ‣ And automatically aggregate over domains and subdomains!
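One way to derive the key list in the URL example is to reverse the hostname labels and append the full URL. This is a sketch under that assumption; the slides show only the resulting key, not how Twitter computes it, and `url_to_key` is a hypothetical helper.

```python
from typing import List
from urllib.parse import urlsplit


def url_to_key(url: str) -> List[str]:
    """Build [tld, domain, subdomain..., full URL] from a URL."""
    host = urlsplit(url).hostname or ""
    # Reverse the dot-separated labels so the most general level comes first.
    labels = [label for label in reversed(host.split(".")) if label]
    return labels + [url]


key = url_to_key("http://music.amazon.com/some_really_long_path")
# key == ["com", "amazon", "music", "http://music.amazon.com/some_really_long_path"]
```

Ordering the labels from most to least general is what lets prefix aggregation answer the domain and subdomain questions with no extra work.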
  • 39. Temporal Aggregation ‣ Rainbird also does (configurable) temporal aggregation ‣ Each count is kept minutely, but also denormalized hourly, daily, and all time ‣ Gives us quick counts at varying granularities with no large scans at read time ‣ Trading storage for latency
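The denormalization described on the Temporal Aggregation slide can be sketched as writing each event into one bucket per granularity. The bucket names, the UTC-aligned day boundary, and the dictionary standing in for storage are assumptions for illustration only.

```python
from typing import Dict, Optional, Tuple

# Bucket widths in seconds for the per-minute, hourly, and daily counters.
BUCKET_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}


def time_buckets(timestamp: int) -> Dict[str, Optional[int]]:
    """Map a Unix timestamp to the start of each bucket that contains it."""
    buckets: Dict[str, Optional[int]] = {
        name: timestamp - timestamp % width
        for name, width in BUCKET_SECONDS.items()
    }
    buckets["all_time"] = None  # a single bucket with no start time
    return buckets


def increment_all(counters: Dict[tuple, int], key: Tuple[str, ...],
                  timestamp: int, value: int = 1) -> None:
    """Write one event into every granularity (storage traded for read latency)."""
    for granularity, start in time_buckets(timestamp).items():
        cell = (key, granularity, start)
        counters[cell] = counters.get(cell, 0) + value
```

Each event costs four writes here instead of one, but an hourly or all-time read touches a single cell instead of scanning the per-minute cells.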
  • 40. Multiple Formulas ‣ So far we have talked about sums ‣ Could also store counts (1 for each event) ‣ ... which gives us a mean ‣ And sums of squares (count * count for each event) ‣ ... which gives us a standard deviation ‣ And min/max as well ‣ Configure this per-category in rainbird.conf
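To illustrate how the stored aggregates on the Multiple Formulas slide yield a mean and a standard deviation, the sketch below keeps a count, a sum, and a sum of squares and derives the rest at read time. `Stats` is a hypothetical name, and the population form of the standard deviation is an assumption; the slide does not say which form Rainbird uses.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    count: int = 0             # 1 per event
    total: int = 0             # sum of event values
    sum_squares: int = 0       # sum of value * value
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value: int) -> None:
        self.count += 1
        self.total += value
        self.sum_squares += value * value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count  # undefined until at least one event is added

    @property
    def stddev(self) -> float:
        # Population variance: E[x^2] - (E[x])^2, clamped against rounding error.
        variance = self.sum_squares / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))


stats = Stats()
for v in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(v)
# stats.mean == 5.0 and stats.stddev == 2.0
```

All of the stored fields except min and max are plain sums, so they can be maintained with the same increment-only counters as everything else.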
  • 41. Rainbird ‣ Write 100,000s of events per second, each with hierarchical structure ‣ Query with minutely granularity over any level of the hierarchy, get back a time series ‣ Or query all time values ‣ Or query all time means, standard deviations ‣ Latency < 100ms
  • 42. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 43. Production Uses ‣ It turns out we need to count things all the time ‣ As soon as we had this service, we started finding all sorts of use cases for it ‣ Promoted Products ‣ Tweeted URLs, by domain/subdomain ‣ Per-user Tweet interactions (fav, RT, follow) ‣ Arbitrary terms in Tweets ‣ Clicks on t.co URLs
  • 44. Use Cases ‣ Promoted Tweet Analytics
  • 45. Production Uses ‣ Promoted Tweet Analytics ‣ Each different metric is part of the key hierarchy
  • 46. Production Uses ‣ Promoted Tweet Analytics ‣ Uses the temporal aggregation to quickly show different levels of granularity
  • 47. Production Uses ‣ Promoted Tweet Analytics ‣ Data can be historical, or from 60 seconds ago
  • 48. Production Uses ‣ Internal Monitoring and Alerting ‣ We require operational reporting on all internal services ‣ Needs to be real-time, but also want longer-term aggregates ‣ Hierarchical, too: [stat, datacenter, service, machine]
  • 49. Production Uses ‣ Tweet Button Counts ‣ Tweet Button counts are requested many many times each day from across the web ‣ Uses the all time field
  • 50. Agenda ‣ Why Real-time Analytics? ‣ Rainbird and Cassandra ‣ Production Uses at Twitter ‣ Open Source
  • 57. Open Source? ‣ Yes! ... but not yet ‣ Relies on unreleased version of Cassandra ‣ ... but the counters patch is committed in trunk (0.8) ‣ ... also relies on some internal frameworks we need to open source ‣ It will happen ‣ See http://github.com/twitter for proof of how much Twitter open sources
  • 58. Team ‣ John Corwin (@johnxorz) ‣ Adam Samet (@damnitsamet) ‣ Johan Oskarsson (@skr) ‣ Kelvin Kakugawa (@kelvin) ‣ Chris Goffinet (@lenn0x) ‣ Steve Jiang (@sjiang) ‣ Kevin Weil (@kevinweil)
  • 59. If You Only Remember One Slide... ‣ Rainbird is a distributed, high-volume counting service built on top of Cassandra ‣ Write 100,000s of events per second, query them by hierarchy and at multiple time granularities, and get results back in <100 ms ‣ Used by Twitter for multiple products internally, including our Promoted Products, operational monitoring and the Tweet Button ‣ Will be open sourced so the community can use and improve it!
  • 60. Questions? Follow me: @kevinweil
