Streaming Kafka Search Utility for Bagheera
Varunkumar Manohar
Metrics Engineering Intern, Summer 2013
San Francisco Commons, 20th August
Apache Kafka
Why use Kafka?
Mozilla’s Bagheera System
Search Utility
Practical Usage
Other Projects
Apache Kafka
A high-throughput distributed messaging system.
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Centralized data pipeline
[Diagram: Producer1, Producer2, and Producer3 feed a centralized persistent data pipeline (Apache Kafka), which in turn feeds Consumer1, Consumer2, and Consumer3.]
Since it's persistent, consumers can lag behind.
Producers and consumers do not know each other.
Consumer maintenance is easy.
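As a present-day illustration of that decoupling, here is a minimal consumer sketch using the modern Kafka Java client (the deck's own code predates this API; the broker address, group id, and topic name are placeholders). Because the log is persistent, a consumer that falls behind simply resumes from its own offset:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LaggingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("group.id", "group1");                 // each group tracks its own offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metrics"));  // placeholder topic
            while (true) {
                // polling late just means reading older offsets from the persistent log
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.println(r.offset() + ": " + r.value());
                }
            }
        }
    }
}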
High Throughput
Partitioning of data allows production, consumption, and brokering to be handled by clusters of machines. Scaling horizontally is easy.
Messages are batched and sent as large chunks at once.
The filesystem page cache is used, delaying the flush to disk.
Shrinkage of data (compression).
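These throughput knobs surface directly in the modern Kafka Java producer; a minimal sketch, assuming placeholder broker and topic names (the 0.7-era client the deck targets exposed similar batching and compression settings under different names):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "gzip");  // shrink data on the wire and on disk
        props.put("batch.size", "65536");       // batch messages into large chunks
        props.put("linger.ms", "50");           // wait up to 50 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("metrics", "message_" + i));
            }
        }
    }
}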
Metrics
[Diagram: real-time data flow for the Metrics topic (in production). Four Kafka nodes (KafkaNode 1-4), each holding partitions Ptn 1-4 of the Kafka commit log.]
Each partition = commit log.
At offset 0 we have message_37; that can be a JSON document, for example.
Underlying principle
Use a persistent log as a messaging system.
This parallels the concept of a commit log: an append-only commit log keeps track of the incoming messages.
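A toy sketch of that principle (illustrative only, not Kafka's actual storage engine): every append is assigned the next offset, and reads never mutate the log.

import java.util.ArrayList;
import java.util.List;

/** Toy append-only log: each append gets the next offset; reads never mutate. */
class AppendOnlyLog {
    private final List<byte[]> messages = new ArrayList<>();

    /** Append a message and return the offset it was assigned. */
    synchronized long append(byte[] message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Read the message at a given offset; the log itself is untouched. */
    synchronized byte[] read(long offset) {
        return messages.get((int) offset);
    }
}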
Mozilla’s Bagheera System
Some real-time numbers!
[Chart: messages per minute; y-axis ticks 0.9k, 1.7k, 2.6k, 3.5k.]
8.7K messages per minute on week 31.
Some questions!
Can we be more granular in finding out the counts?
Can I get the count of messages that were pushed 3 days back?
Can I get the count of messages between Sunday and Tuesday?
Can I get the total count of messages that came in 3 days back and belong to updatechannel='release'?
Can I get the count of messages that came in from the UK two days ago?
We could go into Hadoop or HBase and scan the data.
But Hadoop/HBase is a massive data store; crunching through that much data in real time is not at all efficient.
Can we instead search the Kafka queue, which retains a fair amount of data as per the retention policy?
Yes! You can query only the data retained in the Kafka logs, but typically our queries range within those bounds.
Yes! We can, more efficiently.
Use the Kafka offsets and the data associated with each offset efficiently.
The data we store carries a timestamp (the time of insertion into the queue); check the timestamp to know whether a message fits our filter conditions.
We can selectively export the data we have retrieved.
Concurrent execution across partitions

// pool is a Guava ListeningExecutorService
for (int i = 0; i < totalPartitions; i++) {
    /* create a callable object and submit it to the executor to trigger execution */
    Callable<Long> callable = new KafkaQueueOps(brokerNodes.get(brokerIndex),
            topics.get(topicIndex), i, noDays);
    final ListenableFuture<Long> future = pool.submit(callable);
    computationResults.add(future);
}
// results come back in partition order; a failed partition yields null in the list
ListenableFuture<List<Long>> successfulResults =
        Futures.successfulAsList(computationResults);
long[] sparseLst = consumer.getOffsetsBefore(topicName,
        partitionNumber, -1, Integer.MAX_VALUE);
/* sparseLst is a sparse list of offsets only (one per log file) */
long checkpoint = -1;
for (int i = 1; i < sparseLst.length; i++) {
    // fetch the message at sparseLst[i]
    // de-serialize it using Google protocol buffers and read its timestamp
    if (messageTimestamp <= timeRange) {
        checkpoint = sparseLst[i];
        break;
    }
}
/* start fetching the data from checkpoint, skipping through every offset
   until the precise offset value is obtained; the coarse scan above touches
   only one message per log file, so the search stays cheap */
State of consumers in Zookeeper
[Diagram: three Kafka broker nodes coordinated through Zookeeper.]
/consumers/group1/offsets/topic1/0-2:119914
/consumers/group1/offsets/topic1/0-1:127994
/consumers/group1/offsets/topic1/0-0:130760
Consumers read the state of their consumption from Zookeeper.
What if we change the offset values to something we want them to be?
We can go back in time and gracefully make the consumer start reading from that point.
In effect, we are setting the seek cursor on a distributed log so that the consumers read from that point onward.
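A minimal sketch of such a rewind using the plain ZooKeeper Java client; the connect string and target offset are placeholders, and it assumes the consumer is stopped while its offset node is rewritten:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class OffsetRewind {
    public static void main(String[] args) throws Exception {
        // placeholder connect string and session timeout
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, event -> { });

        // the offset node for group1 / topic1 / partition 0-0, as on the slide
        String path = "/consumers/group1/offsets/topic1/0-0";
        long newOffset = 119914L;  // placeholder: the point to rewind to

        // overwrite the stored offset; version -1 means "any version"
        zk.setData(path, Long.toString(newOffset).getBytes(StandardCharsets.UTF_8), -1);
        zk.close();
    }
}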
Do Not Track Dashboard
Hive data processing for DNT Dashboards
[Diagram: a JDBC application talks to the Thrift Service; the Driver (compiler, executor) runs queries against the Metastore.]
Threads execute several Hive queries, which in turn start map-reduce jobs.
The processed data is converted into JSON.
All the older JSON records and the newly processed JSON records are merged suitably.
The JSON data is used by web APIs for data binding.
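A minimal sketch of issuing one such query over JDBC; the HiveServer2 driver class and URL scheme shown are the standard modern ones, while the host, table, and column names are hypothetical placeholders (the deck does not show the original code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DntHiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (present-day standard)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hivehost:10000/default");  // placeholder host/db
             Statement stmt = conn.createStatement();
             // hypothetical query: one of the several run on worker threads
             ResultSet rs = stmt.executeQuery(
                     "SELECT submission_date, geo, dnt_ratio FROM dnt_metrics")) {
            while (rs.next()) {
                // each row is later converted to JSON and merged with older records
                System.out.println(rs.getString(1) + " " + rs.getString(2)
                        + " " + rs.getDouble(3));
            }
        }
    }
}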
Sample Hive output rows:
2013-04-01  AR  0.11265908876536693  0.12200304892132859
2013-04-01  AS  0.1590909090909091   0.5
[Diagram: JSON conversion; the new records are merged and sorted with the existing JSON data.]
Thank you!
Daniel Einspanjer
Anurag
Harsha
Mark Reid
