Streaming Kafka Search Utility for Bagheera
Varunkumar Manohar
Metrics Engineering Intern, Summer 2013
San Francisco Commons, 20th August
Apache Kafka
Why use Kafka?
Mozilla’s Bagheera System
Search Utility
Practical Usage
Other Projects
Apache Kafka
A high-throughput distributed messaging system.
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
Centralized data pipeline
[Diagram: Producer1, Producer2, and Producer3 feed a centralized persistent data pipeline (Apache Kafka), which in turn feeds Consumer1, Consumer2, and Consumer3.]
Since it's persistent, consumers can lag behind.
Producers and consumers do not know each other.
Consumer maintenance is easy.
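As a present-day illustration of that decoupling, here is a minimal consumer sketch using the modern Kafka Java client (the deck's own code predates this API; the broker address, group id, and topic name are placeholders). Because the log is persistent, a consumer that falls behind simply resumes from its own offset:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LaggingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("group.id", "group1");                 // each group tracks its own offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metrics"));  // placeholder topic
            while (true) {
                // polling late just means reading older offsets from the persistent log
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.println(r.offset() + ": " + r.value());
                }
            }
        }
    }
}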
High Throughput
Partitioning of data allows production, consumption, and brokering to be handled by clusters of machines. Scaling horizontally is easy.
Messages are batched and sent as large chunks at once.
The filesystem page cache is used, delaying the flush to disk.
Shrinkage of data (compression).
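These throughput knobs surface directly in the modern Kafka Java producer; a minimal sketch, assuming placeholder broker and topic names (the 0.7-era client the deck targets exposed similar batching and compression settings under different names):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "gzip");  // shrink data on the wire and on disk
        props.put("batch.size", "65536");       // batch messages into large chunks
        props.put("linger.ms", "50");           // wait up to 50 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("metrics", "message_" + i));
            }
        }
    }
}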
Metrics
[Diagram: real-time data flow for the Metrics topic (in production). Four Kafka nodes (KafkaNode 1-4), each holding partitions Ptn 1-4 of the Kafka commit log.]
Each partition = commit log.
At offset 0 we have message_37; that can be a JSON document, for example.
Underlying principle
Use a persistent log as a messaging system.
This parallels the concept of a commit log: an append-only commit log keeps track of the incoming messages.
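A toy sketch of that principle (illustrative only, not Kafka's actual storage engine): every append is assigned the next offset, and reads never mutate the log.

import java.util.ArrayList;
import java.util.List;

/** Toy append-only log: each append gets the next offset; reads never mutate. */
class AppendOnlyLog {
    private final List<byte[]> messages = new ArrayList<>();

    /** Append a message and return the offset it was assigned. */
    synchronized long append(byte[] message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Read the message at a given offset; the log itself is untouched. */
    synchronized byte[] read(long offset) {
        return messages.get((int) offset);
    }
}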
Mozilla’s Bagheera System
Some real-time numbers!
[Chart: messages per minute; y-axis ticks 0.9k, 1.7k, 2.6k, 3.5k.]
8.7K messages per minute on week 31.
Some questions!
Can we be more granular in finding out the counts?
Can I get the count of messages that were pushed 3 days back?
Can I get the count of messages between Sunday and Tuesday?
Can I get the total count of messages that came in 3 days back and belong to updatechannel='release'?
Can I get the count of messages that came in from the UK two days ago?
We could go into Hadoop or HBase and scan the data.
But Hadoop/HBase is a massive data store; crunching through that much data in real time is not at all efficient.
Can we instead search the Kafka queue, which retains a fair amount of data as per the retention policy?
Yes! You can query only the data retained in the Kafka logs, but typically our queries range within those bounds.
Yes! We can, more efficiently.
Use the Kafka offsets and the data associated with each offset efficiently.
The data we store carries a timestamp (the time of insertion into the queue); check the timestamp to know whether a message fits our filter conditions.
We can selectively export the data we have retrieved.
Concurrent execution across partitions

// pool is a Guava ListeningExecutorService
for (int i = 0; i < totalPartitions; i++) {
    /* create a callable object and submit it to the executor to trigger execution */
    Callable<Long> callable = new KafkaQueueOps(brokerNodes.get(brokerIndex),
            topics.get(topicIndex), i, noDays);
    final ListenableFuture<Long> future = pool.submit(callable);
    computationResults.add(future);
}
// results come back in partition order; a failed partition yields null in the list
ListenableFuture<List<Long>> successfulResults =
        Futures.successfulAsList(computationResults);
long[] sparseLst = consumer.getOffsetsBefore(topicName,
        partitionNumber, -1, Integer.MAX_VALUE);
/* sparseLst is a sparse list of offsets only (one per log file) */
long checkpoint = -1;
for (int i = 1; i < sparseLst.length; i++) {
    // fetch the message at sparseLst[i]
    // de-serialize it using Google protocol buffers and read its timestamp
    if (messageTimestamp <= timeRange) {
        checkpoint = sparseLst[i];
        break;
    }
}
/* start fetching the data from checkpoint, skipping through every offset
   until the precise offset value is obtained; the coarse scan above touches
   only one message per log file, so the search stays cheap */
State of consumers in Zookeeper
[Diagram: three Kafka broker nodes coordinated through Zookeeper.]
/consumers/group1/offsets/topic1/0-2:119914
/consumers/group1/offsets/topic1/0-1:127994
/consumers/group1/offsets/topic1/0-0:130760
Consumers read the state of their consumption from Zookeeper.
What if we change the offset values to something we want them to be?
We can go back in time and gracefully make the consumer start reading from that point.
In effect, we are setting the seek cursor on a distributed log so that the consumers read from that point onward.
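A minimal sketch of such a rewind using the plain ZooKeeper Java client; the connect string and target offset are placeholders, and it assumes the consumer is stopped while its offset node is rewritten:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class OffsetRewind {
    public static void main(String[] args) throws Exception {
        // placeholder connect string and session timeout
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, event -> { });

        // the offset node for group1 / topic1 / partition 0-0, as on the slide
        String path = "/consumers/group1/offsets/topic1/0-0";
        long newOffset = 119914L;  // placeholder: the point to rewind to

        // overwrite the stored offset; version -1 means "any version"
        zk.setData(path, Long.toString(newOffset).getBytes(StandardCharsets.UTF_8), -1);
        zk.close();
    }
}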
Do Not Track Dashboard
Hive data processing for DNT Dashboards
[Diagram: a JDBC application talks to the Thrift Service; the Driver (compiler, executor) runs queries against the Metastore.]
Threads execute several Hive queries, which in turn start map-reduce jobs.
The processed data is converted into JSON.
All the older JSON records and the newly processed JSON records are merged suitably.
The JSON data is used by web APIs for data binding.
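A minimal sketch of issuing one such query over JDBC; the HiveServer2 driver class and URL scheme shown are the standard modern ones, while the host, table, and column names are hypothetical placeholders (the deck does not show the original code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DntHiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (present-day standard)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hivehost:10000/default");  // placeholder host/db
             Statement stmt = conn.createStatement();
             // hypothetical query: one of the several run on worker threads
             ResultSet rs = stmt.executeQuery(
                     "SELECT submission_date, geo, dnt_ratio FROM dnt_metrics")) {
            while (rs.next()) {
                // each row is later converted to JSON and merged with older records
                System.out.println(rs.getString(1) + " " + rs.getString(2)
                        + " " + rs.getDouble(3));
            }
        }
    }
}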
Sample Hive output rows:
2013-04-01  AR  0.11265908876536693  0.12200304892132859
2013-04-01  AS  0.1590909090909091   0.5
[Diagram: JSON conversion; the new records are merged and sorted with the existing JSON data.]
Thank you!
Daniel Einspanjer
Anurag
Harsha
Mark Reid
