This document summarizes a webinar about integrating Apache Kafka and MongoDB for data streaming. The webinar covered:
- An overview of Apache Kafka and how it can be used for data transport and integration as well as real-time stream processing.
- How MongoDB can be used as both a Kafka producer, to stream data into Kafka topics, and as a Kafka consumer, to retrieve streamed data from Kafka for storage, querying, and analytics in MongoDB.
- Various use cases for integrating Kafka and MongoDB, including handling real-time updates, storing raw and processed event data, and powering real-time applications with analytics models built from streamed data.
Webinar: Data Streaming with Apache Kafka & MongoDB
1. #MongoDBWebinar | @mongodb
Data Streaming with
Apache Kafka &
MongoDB
Andrew Morgan – MongoDB Product Marketing
David Tucker – Director, Partner Engineering and Alliances at Confluent
13th September 2016
10. #MongoDBWebinar | @mongodb
What does Kafka
do?
Producers
Consumers
Kafka Connect
Kafka Connect
Topic
Your interfaces to the world
Connected to your systems in real time
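The producer, topic, and consumer roles on this slide can be pictured with a toy in-memory model. This is only a sketch in plain Python, not the Kafka client API: the `Topic` class and its `produce` and `consume` methods are hypothetical names, and a real topic is partitioned, replicated, and persisted by the brokers.

```python
from collections import defaultdict


class Topic:
    """Toy stand-in for a Kafka topic: an append-only log plus per-group offsets."""

    def __init__(self, name):
        self.name = name
        self.log = []                      # messages are appended, never modified
        self.offsets = defaultdict(int)    # next offset to read, per consumer group

    def produce(self, message):
        """Append a message and return the offset it was written at."""
        self.log.append(message)
        return len(self.log) - 1

    def consume(self, group, max_messages=10):
        """Return unread messages for this group and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


clicks = Topic("clickstream")
clicks.produce({"user": 1, "page": "/home"})
clicks.produce({"user": 2, "page": "/pricing"})

# Each consumer group reads the full log independently of the others.
dashboard_batch = clicks.consume("dashboard")
archive_batch = clicks.consume("archive", max_messages=1)
```

Because each group tracks its own offset, a slow consumer (here the hypothetical "archive" group) does not hold back a fast one.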
11. #MongoDBWebinar | @mongodb
What is Streaming Data
Synchronous Req/Response: 0 – 100s ms
Near Real Time: > 100s ms
Offline Batch: > 1 hour
KAFKA
Stream Data Platform
Search
RDBMS
Apps
Monitoring
Real-time Analytics
NoSQL
Stream Processing
HADOOP
Data Lake
Impala
DWH
Hive
Spark Map-Reduce
13. #MongoDBWebinar | @mongodb
Confluent Platform: It’s Kafka ++
Feature | Benefit
Apache Kafka | High throughput, low latency, high availability, secure distributed message system
Kafka Connect | Advanced framework for connecting external sources/destinations into Kafka
Java Client | Provides easy integration into Java applications
Kafka Streams | Simple library that enables streaming application development within the Kafka framework
Additional Clients | Supports non-Java clients; C, C++, Python, etc.
REST Proxy | Provides universal access to Kafka from any network connected device via HTTP
Schema Registry | Central registry for the format of Kafka data – guarantees all data is always consumable
Pre-Built Connectors | HDFS, JDBC and other connectors fully certified and fully supported by Confluent
Confluent Control Center | Includes Connector Management and Stream Monitoring
Support | Enterprise class support to keep your Kafka environment running at top performance

Edition | Support | Cost
Apache Kafka | Community | Free
Confluent Platform | Community | Free
Confluent Platform Enterprise | 24x7x365 | Subscription
14. #MongoDBWebinar | @mongodb
Common Kafka Use Cases
Data transport and integration
• Log data
• Database changes
• Sensors and device data
• Monitoring streams
• Call data records
• Stock ticker data
Real-time stream processing
• Monitoring
• Asynchronous applications
• Fraud and security
15. #MongoDBWebinar | @mongodb
People Using Kafka Today
Financial Services
Entertainment & Media
Consumer Tech
Travel & Leisure
Enterprise Tech
Telecom
Retail
29. #MongoDBWebinar | @mongodb
Design Pattern: Operationalized Data Lake
Sensors, User Data, Clickstreams, Logs
Message Queue
Raw Data, Processed Events
Real-Time Access: Millisecond latency. Expressive querying & flexible indexing against subsets of data. Updates in place. In-database aggregations & transformations
Batch Processing, Batch Views: Multi-minute latency with scans across TB/PB of data. No indexes. Data stored in 128MB blocks. Write-once-read-many & append-only storage model
Distributed Processing Frameworks
Kafka Streams
Customer Data Mgmt, Mobile App, IoT App, Live Dashboards
Churn Analysis, Enriched Customer Profiles, Risk Modeling, Predictive Analytics
30. #MongoDBWebinar | @mongodb
Design Pattern: Operationalized Data Lake
Configure where to land incoming data
31. #MongoDBWebinar | @mongodb
Design Pattern: Operationalized Data Lake
Raw data processed to generate analytics models
32. #MongoDBWebinar | @mongodb
Design Pattern: Operationalized Data Lake
MongoDB exposes analytics models to operational apps. Handles real time updates
33. #MongoDBWebinar | @mongodb
Design Pattern: Operationalized Data Lake
Compute new models against MongoDB & HDFS
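The landing step in this design pattern can be sketched with plain Python containers standing in for the real systems. The names `land_event`, `raw_store`, and `operational_store`, and the `device_id` key, are hypothetical; in the pattern on these slides the append-only side would be HDFS and the update-in-place side MongoDB.

```python
raw_store = []            # append-only: every event is kept exactly as received
operational_store = {}    # update in place: latest state per device


def land_event(event):
    """Land one incoming event in both stores."""
    raw_store.append(dict(event))                       # write-once copy for batch processing
    current = operational_store.setdefault(event["device_id"], {})
    current.update(event)                               # overwrite fields with the newest values


for event in [
    {"device_id": "s1", "temp": 20.5},
    {"device_id": "s2", "temp": 18.0},
    {"device_id": "s1", "temp": 21.0},
]:
    land_event(event)
```

The raw store keeps all three events for later batch scans, while the operational store holds one current record per device for low-latency reads.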
41. #MongoDBWebinar | @mongodb
MongoDB Atlas
Database as a service for MongoDB
MongoDB Atlas is…
• Automated: The easiest way to build, launch, and scale apps on MongoDB
• Flexible: The only database as a service with all you need for modern applications
• Secured: Multiple levels of security available to give you peace of mind
• Scalable: Deliver massive scalability with zero downtime as you grow
• Highly available: Your deployments are fault-tolerant and self-healing by default
• High performance: The performance you need for your most demanding workloads
42. #MongoDBWebinar | @mongodb
MongoDB Atlas Features
Database as a service for MongoDB
Run for You
• Spin up a cluster in seconds
• Replicated & always-on deployments
• Fully elastic: scale out or up in a few clicks with zero downtime
• Automatic patches & simplified upgrades for the newest MongoDB features
Safe & Secure
• Authenticated & encrypted
• Continuous backup with point-in-time recovery
• Fine-grained monitoring & custom alerts
No Lock-In
• On-demand pricing model; billed by the hour
• Multi-cloud support (AWS available with others coming soon)
• Part of a suite of products & services designed for all phases of your app; migrate easily to different environments (private cloud, on-prem, etc.) when needed
43. #MongoDBWebinar | @mongodb
MongoDB Enterprise Advanced
Tooling
• MongoDB Ops Manager or MongoDB Cloud Manager Premium
• MongoDB Compass
• MongoDB Connector for BI
Security
• Encrypted Storage Engine
• LDAP / Kerberos Integration
• DDL & DML Auditing
• FIPS 140-2 Support
Support
• 24 x 7 Support
• 1 hr SLA
• Emergency Patches
• Customer Success Program
• On-Demand Training
License
• Commercial License
44. #MongoDBWebinar | @mongodb
Resources
• Data Streaming with Apache Kafka & MongoDB
• https://www.mongodb.com/collateral/data-streaming-with-apache-kafka-and-mongodb
• Implementing a Kafka Consumer for MongoDB
• https://www.mongodb.com/blog/post/mongodb-and-data-streaming-implementing-a-mongodb-kafka-consumer
• Tailing the Oplog on a sharded MongoDB Cluster
• https://www.mongodb.com/blog/post/tailing-mongodb-oplog-sharded-clusters
45. #MongoDBWebinar | @mongodb
Old Billingsgate, London
15th November
mongodb.com/europe
Use my discount code for 20% off: andrewmorgan20
46. #MongoDBWebinar | @mongodb
Document Data Model
MongoDB
{ customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }]
}
Relational
Customer ID | First Name | Last Name | City
0 | John | Doe | New York
1 | Mark | Smith | San Francisco
2 | Jay | Black | Newark
3 | Meagan | White | London
4 | Edward | Daniels | Boston

Phone Number | Type | DNC | Customer ID
1-212-555-1212 | home | T | 0
1-212-555-1213 | home | T | 0
1-212-555-1214 | cell | F | 0
1-212-777-1212 | home | T | 1
1-212-777-1213 | cell | (null) | 1
1-212-888-1212 | home | F | 2
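As a rough illustration of how the two relational tables on this slide map onto the document model, the sketch below embeds each customer's phone rows inside that customer's document. The rows are taken from the slide; the function name `to_documents` is hypothetical, and dropping a null DNC value (rather than storing it) is an assumption that matches the slide's example document.

```python
customers = [
    (0, "John", "Doe", "New York"),
    (1, "Mark", "Smith", "San Francisco"),
    (2, "Jay", "Black", "Newark"),
    (3, "Meagan", "White", "London"),
    (4, "Edward", "Daniels", "Boston"),
]

# (number, type, dnc, customer_id); None stands for the (null) DNC value
phones = [
    ("1-212-555-1212", "home", True, 0),
    ("1-212-555-1213", "home", True, 0),
    ("1-212-555-1214", "cell", False, 0),
    ("1-212-777-1212", "home", True, 1),
    ("1-212-777-1213", "cell", None, 1),
    ("1-212-888-1212", "home", False, 2),
]


def to_documents(customers, phones):
    """Embed each customer's phone rows as a list inside the customer document."""
    docs = {}
    for customer_id, first, last, city in customers:
        docs[customer_id] = {
            "customer_id": customer_id,
            "first_name": first,
            "last_name": last,
            "city": city,
            "phones": [],
        }
    for number, phone_type, dnc, customer_id in phones:
        phone = {"number": number, "type": phone_type}
        if dnc is not None:          # a null column simply becomes an absent field
            phone["dnc"] = dnc
        docs[customer_id]["phones"].append(phone)
    return list(docs.values())


documents = to_documents(customers, phones)
mark = documents[1]
```

Here `mark` holds the same fields as the Mark Smith document on the slide, so reading a customer together with their phones needs no join.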
47. #MongoDBWebinar | @mongodb
Document Model Benefits
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }]
}
Agility and flexibility
• Data model supports business change
• Rapidly iterate to meet new requirements
Intuitive, natural data representation
• Eliminates ORM layer
• Developers are more productive
Reduces the need for joins, disk seeks
• Programming is simpler
• Performance delivered at scale
48. #MongoDBWebinar | @mongodb
Rich Functionality
MongoDB
Expressive Queries
• Find anyone with phone # "1-212…"
• Check if the person with number "555…" is on the "do not call" list
Geospatial
• Find the best offer for the customer at geo coordinates of 42nd St. and 6th Ave
Text Search
• Find all tweets that mention the firm within the last 2 days
Aggregation
• Count and sort number of customers by city
Native Binary JSON support
• Add an additional phone number to Mark Smith's document without rewriting the document
• Select just the mobile phone number in the list
• Sort on the modified date
Left outer join ($lookup)
• Query for all San Francisco residents, look up their transactions, and sum the amount by person
{ customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [ {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }]
}
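A few of the operations listed on this slide can be sketched as MongoDB query and pipeline documents, written here as Python dictionaries in the form that drivers such as PyMongo accept. The block only builds the dictionaries and does not connect to a database. Field names come from the example document; the `transactions` collection, its `amount` field, and the added phone number are assumptions for illustration.

```python
# Find anyone with a phone number starting "1-212": a regex match on an array field
phone_prefix_filter = {"phones.number": {"$regex": "^1-212"}}

# Add a phone to Mark Smith's document in place ($push appends to the array)
add_phone_filter = {"first_name": "Mark", "last_name": "Smith"}
add_phone_update = {
    "$push": {"phones": {"number": "1-212-777-1214", "type": "work"}}  # hypothetical number
}

# Count customers by city and sort by the count, largest first
customers_by_city = [
    {"$group": {"_id": "$city", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

# Left outer join: San Francisco customers with their transactions, summed per person
spend_per_customer = [
    {"$match": {"city": "San Francisco"}},
    {"$lookup": {
        "from": "transactions",            # hypothetical collection
        "localField": "customer_id",
        "foreignField": "customer_id",
        "as": "transactions",
    }},
    # Keep customers with no transactions so the join stays "left outer"
    {"$unwind": {"path": "$transactions", "preserveNullAndEmptyArrays": True}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$transactions.amount"}}},
]
```

With PyMongo, the filter would typically be passed to `find`, the filter and update pair to `update_one`, and each pipeline list to `aggregate` on a collection object.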