Real-Time Data Processing
Emerging Business Meetup - 07/24/13
Bryan Warner @ Traackr
About Me
● Bryan Warner - Engineer @ Traackr
○ bwarner@traackr.com
● Primary background is in Java
○ Breaking into Scala development this past year
● Interested in search, data scalability, and distributed
computing
About Traackr
● Influencer search engine
○ Platform for discovering and engaging online
individuals who matter
● We track content and metrics for our database of
influential people
○ Both in real time (RT) and in daily processes
● Some of our back-end stack includes: ElasticSearch,
MongoDB, Java/Spring, Scala/Akka, etc.
● Looking for developers to use our API!
Overview
● Review Traackr's use case for real-time data
processing
● Technical solution we decided on
● Questions
Traackr Use Case
1. Real-time content stream for a targeted group of
influencers within our platform
a. Primarily to show real-time tweets via our Twitter
data provider (GNIP)
2. On-demand content tracking and searching for new
influencers
a. Users can add up to a hundred people at once
b. Users expect new influencer content to be searchable
in near real time
Traackr Use Case
Data Processing Requirements
1. Incoming data is not lost
2. Data needs to be analyzed and enriched
3. Each type of data has its own processing component
* Blog Posts, Tweets, Videos, Images, etc.
4. Components should be configurable for maximum
throughput!
5. Components should act like small building blocks
Bird's Eye View
[Architecture diagram. Labels: Tracking App, GNIP Listener App, MongoDB, Initial Persist, "Post" Payload, RabbitMQ Broker, Queue, Queue Listener, Content "Enrichment" Pipeline, ElasticSearch, Make content searchable]
Content Pipeline
● Apache Camel (http://camel.apache.org/)
○ Integration framework based on Enterprise
Integration Patterns (EIP)
● Flexible route building
○ Supports direct and asynchronous components
○ Integrates with DI frameworks (e.g. Spring, Guice)
○ Tons of native support for various transports (http,
jms, amqp, tcp, imap, etc.)
● Good support for unit testing
○ org.apache.camel.component.mock.MockEndpoint
Content Pipeline
[Pipeline diagram. Labels: Queue, Queue Listener, Routing Filter, Tweet Processor, Blog Processor, Image Processor, Search Indexer, ROUTE]
Content Pipeline
Initial Approach:
● Route(s) live within a CamelContext in your JVM
● The initial route uses direct components, so every step runs serially
from(<queue.uri>).routeId("my-route")
    .choice()
        .when(simple("${in.body.isTweet()}"))
            .to("bean:languageAnalyzer?method=detectLanguage")
            .to("bean:tweetAnalyzer?method=extractMentions")
        .when(simple("${in.body.isBlog()}"))
            .to("bean:httpService?method=fetchFullContent")
            .to("bean:languageAnalyzer?method=detectLanguage")
        .otherwise()
            .to("bean:imageAnalyzer?method=categorizeImage")
    .end()
    .to("bean:searchService?method=indexContent");
But there's a throughput problem...
Content Pipeline
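Queue contents as drawn on the slide (one end is labeled HEAD):
TWEET, TWEET, IMAGE, TWEET, TWEET, BLOG, TWEET, TWEET, TWEET, BLOG, TWEET, TWEET
● If tweets come into the system at 5/sec, then the tweet processing rate has to be >= 5/sec
● If a blog post takes 5 seconds to process (on average)...
● And an image takes 30 seconds to process (on average)...
then the serial route stalls on every blog post and image, and tweets back up behind them.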
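To put rough numbers on this slide, the sketch below adds up how long a single serial consumer would need for the twelve queued items shown. The 5 s blog and 30 s image figures come from the slide; the 0.2 s per tweet figure is an assumption (the slowest rate that still keeps up with 5 tweets/sec), and the class and method names are made up for illustration.

```java
import java.util.List;

// Back-of-the-envelope model of the serial route: one consumer handles every
// message in queue order, so slow items delay everything queued behind them.
class SerialBacklog {

    // Seconds a single serial consumer needs to work through the given batch.
    static double drainSeconds(List<String> batch, double tweetSec, double blogSec, double imageSec) {
        double total = 0.0;
        for (String type : batch) {
            switch (type) {
                case "TWEET": total += tweetSec; break;
                case "BLOG":  total += blogSec;  break;
                case "IMAGE": total += imageSec; break;
                default: throw new IllegalArgumentException("unknown type: " + type);
            }
        }
        return total;
    }

    // New tweets that arrive while the consumer is busy for busySeconds.
    static long tweetsArrivedWhileBusy(double busySeconds, double tweetsPerSecond) {
        return Math.round(busySeconds * tweetsPerSecond);
    }

    public static void main(String[] args) {
        // The twelve queued items, in the order they appear on the slide.
        List<String> batch = List.of(
            "TWEET", "TWEET", "IMAGE", "TWEET", "TWEET", "BLOG",
            "TWEET", "TWEET", "TWEET", "BLOG", "TWEET", "TWEET");
        // Assumption: 0.2 s per tweet. Blog and image timings are the slide's averages.
        double busy = drainSeconds(batch, 0.2, 5.0, 30.0);
        System.out.println("seconds to drain the batch: " + busy);
        System.out.println("tweets arrived meanwhile: " + tweetsArrivedWhileBusy(busy, 5.0));
    }
}
```

Under these assumptions the batch takes about 41.8 seconds, and roughly 209 more tweets arrive in that time, so a serial route cannot catch up.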
Content Pipeline
Expanded Approach:
● Use SEDA components (http://camel.apache.org/seda.html)
○ Backed by a thread pool with a BlockingQueue
// Dispatch each message to a per-type SEDA endpoint
from(<queue.uri>).routeId("my-route")
    .choice()
        .when(simple("${in.body.isTweet()}")).to("seda:tweetEnricher")
        .when(simple("${in.body.isBlog()}")).to("seda:blogEnricher")
        .otherwise().to("seda:imageEnricher")
    .end();
// Each type is enriched by its own pool of concurrent consumers
from("seda:tweetEnricher?concurrentConsumers=10").routeId("tweet-route")
    .to("bean:languageAnalyzer?method=detectLanguage")
    .to("bean:tweetAnalyzer?method=extractMentions")
    .to("seda:searchService");
from("seda:blogEnricher?concurrentConsumers=2").routeId("blog-route") ...
from("seda:imageEnricher?concurrentConsumers=2").routeId("img-route") ...
// Routes re-join
from("seda:searchService?concurrentConsumers=X").routeId("s-indexer")
    .to("bean:searchService?method=indexContent");
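For readers unfamiliar with the "thread pool with BlockingQueue" idea, here is a minimal plain-JDK sketch of two chained stages. It is not Camel's implementation: the `Stage` and `SedaSketch` classes are hypothetical, and the consumer counts simply mirror the `concurrentConsumers` values in the routes above (2 for the indexer is an arbitrary stand-in for X).

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class SedaSketch {
    // Sends `messages` fake tweets through an enrich stage into an index stage
    // and returns how many documents reached the indexer.
    static int run(int messages) {
        AtomicInteger indexed = new AtomicInteger();
        Stage<String> indexer = new Stage<>(2, doc -> indexed.incrementAndGet());
        Stage<String> tweets = new Stage<>(10, tweet -> indexer.send(tweet.toLowerCase()));
        for (int i = 0; i < messages; i++) {
            tweets.send("TWEET-" + i);
        }
        tweets.drainAndStop();  // stop upstream first so nothing new reaches the indexer
        indexer.drainAndStop();
        return indexed.get();
    }

    public static void main(String[] args) {
        System.out.println("indexed: " + run(100));
    }
}

// A stage is an in-memory queue drained by a fixed pool of consumer threads.
class Stage<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger pending = new AtomicInteger(); // queued + in-flight messages
    private final ExecutorService pool;

    Stage(int concurrentConsumers, Consumer<T> handler) {
        pool = Executors.newFixedThreadPool(concurrentConsumers);
        Runnable worker = () -> {
            try {
                while (true) {
                    T message = queue.take(); // blocks until a message is available
                    try {
                        handler.accept(message);
                    } catch (RuntimeException e) {
                        System.err.println("handler failed: " + e);
                    } finally {
                        pending.decrementAndGet();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdownNow() ends the loop
            }
        };
        for (int i = 0; i < concurrentConsumers; i++) {
            pool.execute(worker);
        }
    }

    void send(T message) {
        pending.incrementAndGet();
        queue.add(message);
    }

    // Waits until every sent message has been handled, then stops the workers.
    void drainAndStop() {
        try {
            while (pending.get() > 0) {
                Thread.sleep(5);
            }
            pool.shutdownNow();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
    }
}
```

Because each stage has its own queue and workers, a slow stage only delays its own message type. The trade-off is that the queues live in JVM memory.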
Content Pipeline
[Pipeline diagram. Labels: Queue, Queue Listener, Routing Filter, Tweet Processor, Blog Processor, Image Processor, ThreadPool + BlockingQueue, Search Indexer, ROUTE]
Content Pipeline
Caveats:
● No visibility into the state of SEDA's thread pool (e.g. how many objects are on its internal queue?)
● If the JVM crashes, payloads still waiting on the SEDA blocking queue are lost
● Our route assumes that each payload consists of only one message
○ In reality, our payloads are a mix of different post types ... how to handle this efficiently?
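One way to picture the last caveat is to group a mixed payload by post type before routing, so each group can go to the processor for its type. The sketch below is only an illustration in plain Java: the `Post` and `PostType` types and the `splitByType` method are hypothetical and are not taken from Traackr's code.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class PayloadSplitSketch {
    enum PostType { TWEET, BLOG, IMAGE }

    // Hypothetical stand-in for one item inside an incoming payload.
    static final class Post {
        final PostType type;
        final String body;

        Post(PostType type, String body) {
            this.type = type;
            this.body = body;
        }
    }

    // Groups a mixed payload into one list per post type, keeping arrival order
    // within each group.
    static Map<PostType, List<Post>> splitByType(List<Post> payload) {
        Map<PostType, List<Post>> groups = new EnumMap<>(PostType.class);
        for (Post post : payload) {
            groups.computeIfAbsent(post.type, t -> new ArrayList<>()).add(post);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Post> payload = List.of(
            new Post(PostType.TWEET, "t1"), new Post(PostType.BLOG, "b1"),
            new Post(PostType.TWEET, "t2"), new Post(PostType.IMAGE, "i1"));
        splitByType(payload).forEach((type, posts) ->
            System.out.println(type + ": " + posts.size() + " post(s)"));
    }
}
```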
Content Pipeline
Final Solution:
// Split each mixed payload, then forward the parts to per-type queues
from(<queue.uri>).routeId("my-route")
    .split().method("payloadSplitterService", "splitMessage")
        .choice()
            .when(header("enrichTweets").isEqualTo(true)).to(<queue.uri.tweet>)
            .when(header("enrichBlogs").isEqualTo(true)).to(<queue.uri.blogs>)
            .otherwise().to(<queue.uri.images>)
    .end();
// Consume the tweet queue and hand each message to the SEDA endpoint
from(<queue.uri.tweet>).routeId("queue-in-tweet-route")
    .to("seda:tweetEnricher?timeout=0");
from("seda:tweetEnricher?concurrentConsumers=10&size=0&blockWhenFull=true").routeId("tweet-route")
    .to("bean:languageAnalyzer?method=detectLanguage")
    .to("bean:tweetAnalyzer?method=extractMentions")
    .to("seda:searchService");
from("seda:searchService?concurrentConsumers=X").routeId("s-indexer")
    .to("bean:searchService?method=indexContent");
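The `blockWhenFull=true` option in the route above suggests the goal is back-pressure: the producer waits instead of piling messages up in JVM memory. The plain-JDK sketch below illustrates that general idea with a bounded queue. It is not Camel's implementation and makes no claim about what `size=0` does; the class name, the capacity of 2, and the 1 ms "enrichment" delay are all arbitrary assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedHandOff {
    // Pushes `messages` items through a bounded in-memory buffer.
    // Returns {number consumed, largest buffer depth the producer observed}.
    static int[] run(int messages, int capacity) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        AtomicInteger maxDepth = new AtomicInteger();

        // Producer: stands in for the route that reads from the per-type queue.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < messages; i++) {
                    buffer.put(i); // blocks while the buffer is full
                    maxDepth.accumulateAndGet(buffer.size(), Math::max);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer: stands in for a slow enrichment step.
        int consumed = 0;
        try {
            while (consumed < messages) {
                buffer.take();
                Thread.sleep(1); // simulated processing time
                consumed++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { consumed, maxDepth.get() };
    }

    public static void main(String[] args) {
        int[] result = run(20, 2);
        System.out.println("consumed=" + result[0] + ", max in-memory depth=" + result[1]);
    }
}
```

In this model, at most `capacity` messages are held in memory at any moment, so a crash would lose only those; the rest would still be waiting upstream.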
Questions
