2. About Me
● Bryan Warner - Engineer @ Traackr
○ bwarner@traackr.com
● Primary background is in Java
○ Breaking into Scala development this past year
● Interested in search, data scalability, and distributed computing
3. About Traackr
● Influencer search engine
○ Platform for discovering and engaging online individuals who matter
● We track content and metrics for our database of influential people
○ Both in real time and via daily processes
● Some of our back-end stack includes: Elasticsearch, MongoDB, Java/Spring, Scala/Akka, etc.
● Looking for developers to use our API!
4. Overview
● Review Traackr's use case for real-time data
processing
● Technical solution we decided on
● Questions
5. Traackr Use Case
1. Real-time content stream for a targeted group of influencers within our platform
a. Primarily to show real-time tweets via our Twitter data provider (GNIP)
2. On-demand content tracking and searching for new influencers
a. Users can add up to a hundred people at once
b. New influencer content is expected to be searchable in near real time
6. Traackr Use Case
Data Processing Requirements
1. Incoming data is not lost
2. Data needs to be analyzed and enriched
3. Each type of data has its own processing component
* Blog Posts, Tweets, Videos, Images, etc.
4. Components should be configurable for maximum throughput!
5. Components should act like small building blocks
8. Content Pipeline
● Apache Camel (http://camel.apache.org/)
○ Integration framework based on Enterprise Integration Patterns (EIP)
● Flexible route building
○ Supports direct and asynchronous components
○ Integrates with DI frameworks (e.g. Spring, Guice)
○ Tons of native support for various transports (http, jms, amqp, tcp, imap, etc.)
● Good support for unit testing
○ org.apache.camel.component.mock.MockEndpoint
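The MockEndpoint support mentioned above can be sketched as a small JUnit test. This is an illustrative example, not Traackr's actual test suite: the route and class names are hypothetical, and it assumes the camel-test (JUnit 4) module on the classpath.

```java
// Hypothetical sketch of a Camel unit test using MockEndpoint.
// A stand-in route forwards from "direct:start" to "mock:result";
// the test asserts the payload arrives intact.
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.mock.MockEndpoint;
import org.apache.camel.test.junit4.CamelTestSupport;
import org.junit.Test;

public class PipelineRouteTest extends CamelTestSupport {

    @Override
    protected RouteBuilder createRouteBuilder() {
        return new RouteBuilder() {
            public void configure() {
                // Stand-in for a real pipeline route
                from("direct:start").to("mock:result");
            }
        };
    }

    @Test
    public void testPayloadReachesEndpoint() throws Exception {
        MockEndpoint mock = getMockEndpoint("mock:result");
        mock.expectedMessageCount(1);
        mock.expectedBodiesReceived("hello");

        // template is provided by CamelTestSupport
        template.sendBody("direct:start", "hello");

        mock.assertIsSatisfied();
    }
}
```

MockEndpoint lets expectations be declared before the message is sent, then verified with a single assertIsSatisfied() call.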
10. Content Pipeline
Initial Approach:
● Route(s) live within a CamelContext in your JVM
● Initial route uses direct components (serial, synchronous execution)
from(<queue.uri>).routeId("my-route")
.choice()
.when(simple("${in.body.isTweet()}"))
.to("bean:languageAnalyzer?method=detectLanguage")
.to("bean:tweetAnalyzer?method=extractMentions")
.when(simple("${in.body.isBlog()}"))
.to("bean:httpService?method=fetchFullContent")
.to("bean:languageAnalyzer?method=detectLanguage")
.otherwise()
.to("bean:imageAnalyzer?method=categorizeImage")
.end()
.to("bean:searchService?method=indexContent");
But there's a throughput problem...
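One way to attack the throughput problem (and the setup the SEDA caveats on a later slide refer to) is to swap the direct components for seda: endpoints, giving each stage its own thread pool. A hedged sketch, where the endpoint names, queue sizes, and consumer counts are illustrative rather than the actual configuration:

```java
// Sketch: the serial route split across seda: stages so work is
// handed off to in-memory queues drained by dedicated thread pools.
from("<queue.uri>").routeId("my-route")
    .to("seda:analyze?size=1000");           // async hand-off to stage 1

from("seda:analyze?concurrentConsumers=5")   // 5 threads drain this queue
    .choice()
        .when(simple("${in.body.isTweet()}"))
            .to("bean:languageAnalyzer?method=detectLanguage")
            .to("bean:tweetAnalyzer?method=extractMentions")
        .when(simple("${in.body.isBlog()}"))
            .to("bean:httpService?method=fetchFullContent")
            .to("bean:languageAnalyzer?method=detectLanguage")
        .otherwise()
            .to("bean:imageAnalyzer?method=categorizeImage")
    .end()
    .to("seda:index?size=1000");             // async hand-off to stage 2

from("seda:index?concurrentConsumers=2")
    .to("bean:searchService?method=indexContent");
```

The seda: component's size option bounds the in-memory queue, and concurrentConsumers sets how many threads drain it, which is exactly the thread pool whose state and durability the caveats below call out.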
14. Content Pipeline
Caveats:
● No visibility into SEDA's thread pool state (e.g. how many objects are on its internal queue?)
● If the JVM crashes, the payloads on the SEDA blocking queue are lost
● Our route assumes each payload consists of only one message
○ In reality, our payloads are a mix of different post types ... how do we handle this efficiently?
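For the mixed-payload caveat, Camel's Splitter EIP is one natural fit: break the composite payload into one exchange per post, then route each by type. A sketch only; getPosts() is a hypothetical accessor on the payload class, and the per-type seda: endpoints are illustrative:

```java
// Sketch: split a mixed payload into individual posts, then route
// each post to a type-specific stage. Names are hypothetical.
from("seda:incoming")
    .split(simple("${in.body.getPosts()}"))  // one new exchange per post
        .choice()
            .when(simple("${in.body.isTweet()}"))
                .to("seda:tweets")
            .when(simple("${in.body.isBlog()}"))
                .to("seda:blogs")
            .otherwise()
                .to("seda:images")
        .end();
```

Each split-out exchange then flows through only the processing components relevant to its post type, instead of forcing one route to handle every type inline.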