1. NEAR REAL TIME STREAMING AND DATA PROCESSING OF CEILOMETER DATA USING KAFKA
Kafka Summit Europe 2021
2. Acknowledgement
This project is funded by NASA. Some of the brightest minds at the University of Maryland, Baltimore County (UMBC) are invested in this project.
3. Kafka Team
Samit Shivadekar – PhD candidate in Computer Science at UMBC
Dr. Milton Halem – Research Professor at UMBC
Dr. Phuong Nguyen – Research Assistant Professor at UMBC
Rahul Gite – Master's in Data Science, Graduate Research Assistant at UMBC
5. Planetary Boundary Layer
The Planetary Boundary Layer (PBL) is the lowest part of the troposphere. It plays an important role in atmospheric and pollution studies.
6. Data Collection
A ceilometer is a device that uses a laser to measure the aerosol concentration in the atmosphere and the cloud base height.
Satellite instruments (ICESat-2, ESA's ADM-Aeolus)
Model output from the National Oceanic and Atmospheric Administration (PBLH, HRRR hourly product)
8. About Data
Aerosol backscatter is the portion of laser light reflected back toward the ceilometer from a given distance. It is calculated from the radiation profile using a retrieval algorithm.
9. Tools and Technologies
Kafka – data streaming
Python – data processing
TensorFlow – artificial intelligence system
Compressive sensing – data fusion
JavaScript – interactive web development
11. Solutions with Kafka
Kafka streaming
Data organization with Kafka topics
Building real-time data pipelines by developing Kafka producers and consumer streaming applications
Fault tolerance through partitioning and replication
12. Project Architecture
The project is divided into three layers:
• Data Ingestion Layer
• Deep Learning AI Layer
• User Interface Layer
13. Data Ingestion Layer
[Diagram: streaming data flows through Kafka into the Shared Business Layer; archived data is batch-processed and fed through Kafka into the Machine Learning Layer]
• We collect the data in two forms: streaming and batch processing.
• The first task is to archive the data.
• Streaming data goes through Kafka into the Shared Business Layer, where all data preprocessing is done.
• Once the data is ready, Kafka is used again to feed it to the machine learning models, where predictions are made.
14. Kafka Architecture
• Data is received from multiple ceilometer sensors as files.
• A Kafka producer writes the data to a Kafka topic, which is partitioned across the cluster.
• Consumers receive data from the topic, preprocess it, and store it in the respective layer.
• Fault tolerance is achieved by replicating data across two clusters.
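The producer side described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the topic name, the station IDs, and the kafka-python usage in the comment are assumptions; only the 50-partition figure comes from the throughput experiments later in the deck. Keying each message by station ID is what keeps one sensor's readings in a consistent partition.

```python
import hashlib
import json

NUM_PARTITIONS = 50  # partition count used in the deck's throughput experiments


def serialize_record(station_id, timestamp, backscatter):
    """Encode one ceilometer reading as JSON bytes for the Kafka topic."""
    payload = {
        "station": station_id,
        "timestamp": timestamp,
        "backscatter": backscatter,
    }
    return json.dumps(payload).encode("utf-8")


def partition_for(station_id, num_partitions=NUM_PARTITIONS):
    """Stable station -> partition mapping, mimicking key-based partitioning,
    so one station's readings always land in the same partition."""
    digest = hashlib.md5(station_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# With kafka-python, the producer call would look roughly like
# (broker address and topic name are assumptions):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("ceilometer-readings", key=b"UMBC-01",
#                 value=serialize_record("UMBC-01", ts, profile))
```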
15. Data Processing
As soon as a consumer receives the data, its first task is data pre-processing:
• Integrating data from different file formats (csv, dat, his, etc.) into a common NetCDF file format.
• Managing noisy signals, missing data, and outliers, using an LSTM to impute missing data.
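The missing-data step above can be illustrated with a toy imputer. The deck uses an LSTM for imputation; as a stand-in that shows where imputation fits in the pipeline, this sketch fills gaps in a backscatter profile by linear interpolation between the nearest valid neighbours. The function name and the `None`-marks-missing convention are assumptions for the example.

```python
def interpolate_missing(profile):
    """Fill missing samples (None) in a backscatter profile by linear
    interpolation between the nearest valid neighbours. A simple stand-in
    for the LSTM imputer described in the deck."""
    filled = list(profile)
    n = len(profile)
    for i, value in enumerate(profile):
        if value is not None:
            continue
        # Nearest valid sample below and above the gap, if any.
        lo = next((j for j in range(i - 1, -1, -1) if profile[j] is not None), None)
        hi = next((j for j in range(i + 1, n) if profile[j] is not None), None)
        if lo is None and hi is None:
            filled[i] = 0.0                  # no information at all
        elif lo is None:
            filled[i] = profile[hi]          # extend first valid value backwards
        elif hi is None:
            filled[i] = profile[lo]          # extend last valid value forwards
        else:
            frac = (i - lo) / (hi - lo)
            filled[i] = profile[lo] + frac * (profile[hi] - profile[lo])
    return filled
```

For example, `interpolate_missing([1.0, None, 3.0])` fills the gap with 2.0.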
16. Experiments - Latency Performance
• Latency performance is recorded as the total time to process all streams of data.
• The horizontal axis shows the number of consumers; the vertical axis shows the time to process 18,000 files.
• Files are sent to the system at a fixed rate of 200 files per second.
• The results show that the total time to process all the streams decreases almost linearly as the number of consumers on a single node increases.
Time to process the files vs. number of consumers on a single node (predicted speed for 18,000 files at 200.0 files per second).
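The near-linear speedup above follows from a simple model: consumers share the processing work evenly, but the total time can never drop below the ingestion time, since files arrive at a fixed send rate. The sketch below encodes that model; the 18,000 files and 200 files/s come from the slide, while the per-file processing cost (0.02 s in the test) is an illustrative assumption.

```python
def predicted_total_time(num_files, send_rate, per_file_seconds, num_consumers):
    """Idealized latency model: total time is the larger of
    (a) the ingestion floor imposed by the fixed send rate, and
    (b) the processing work divided evenly among consumers."""
    ingestion_floor = num_files / send_rate            # seconds to send everything
    parallel_work = num_files * per_file_seconds / num_consumers
    return max(ingestion_floor, parallel_work)
```

With enough consumers the processing term drops below the ingestion floor, which is why the measured curve flattens rather than shrinking forever.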
17. Experiment - Throughput Performance
• Throughput measures the number of files transferred per second through Kafka with respect to the number of partitions and the number of consumers.
• The results show that throughput increases until the number of consumers reaches the number of partitions.
• Beyond that point, all new consumers wait in idle mode until an existing consumer unsubscribes from a partition.
Throughput performance with the configuration set to 50 partitions on a single node, varying the number of consumers.
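The plateau described above reflects Kafka's consumer-group semantics: within one group, each partition is assigned to at most one consumer. A one-line model (names are my own, for illustration) captures why the 51st consumer against 50 partitions adds nothing:

```python
def active_consumers(num_consumers, num_partitions):
    """Within a single consumer group, each partition is read by at most
    one consumer, so consumers beyond the partition count sit idle."""
    return min(num_consumers, num_partitions)
```

Under this model, throughput scales with `active_consumers(...)`, which stops growing once consumers outnumber partitions.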
18. Experiment - Throughput Performance on Cluster
• Throughput performance when only one broker is present across a cluster of two nodes.
• Performance decreases as the number of consumers on the two nodes increases.
• The reason is a network bottleneck: when multiple consumers on a single node connect to the broker, the bandwidth available to each is reduced.
Throughput performance results for the 2-node cluster; the broker is running on Node 1.
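The bottleneck argument above can be made concrete with a crude bandwidth-sharing model (entirely my own framing, not the deck's): every consumer fetching from the single broker shares that broker's network link, so per-consumer bandwidth shrinks as consumers are added.

```python
def per_consumer_bandwidth(link_capacity_mbps, num_consumers):
    """Crude model: consumers fetching from one broker split its link evenly."""
    return link_capacity_mbps / num_consumers
```

For a 1000 Mbps link, two consumers each see roughly 500 Mbps, ten see roughly 100 Mbps, matching the downward trend in the single-broker results.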
19. Experiment - Throughput Performance on Cluster
• Throughput performance when two brokers are present across a cluster of two nodes.
• Performance is much better than when running a single broker.
• Inter-broker traffic is limited to partition replication over a single connection, which uses much less bandwidth.
Throughput performance results for the 2-node cluster; brokers are running on Node 1 and Node 2.
20. Multi-Station Processing and Integration
Data preprocessing, storage, and visualization
• Edge streaming AI system results with integrated multi-station processing.
• Using multiple sources of data to estimate the PBLH.
• Experimenting with a multi-source stacked convolutional LSTM that learns the PBLH over time for given geographical locations using a combination of source data:
- WRF-CHEM model backscatter
- Ceilometer-based backscatter
- Satellite-based backscatter
21. Future Work
• Incorporate more data from different ceilometer sites.
• Increase fault tolerance by including commodity hardware to duplicate data across Kafka using the replication factor.
• Integrate Kafka with Spark for big data processing.