SlideShare une entreprise Scribd logo
1  sur  17
Télécharger pour lire hors ligne
Machine Learning for
Real Time Streaming Data with
Kafka and TensorFlow
Yong Tang
Maintainer & SIG IO Lead,
TensorFlow
GitHub: yongtang
Stream Data Processing & Machine Learning
Apache Kafka & Event Driven Microservices
Service
Service
Service
Service
Service
TensorFlow & 2.0
• Most popular machine learning framework
• 1.x
• 2.0
• Keras high level API
• Eager execution
• Data processing: tf.data
• Distribution Strategy
Training Workflow (TensorFlow 2.0)
Read and Preprocess
Data
Model Building Training Saving
tf.data
tf.keras
tf.estimator
eager execution
autograph
distribution strategy
saved model
# <= read data into tf.data
dataset = …. , test_dataset = ….
# <= build keras model
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(512, activation=tf.nn.relu),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation=tf.nn.softmax) ])
model.compile(optimizer='adam’, loss='sparse_categorical_crossentropy’,
metrics=['accuracy’])
# <= train the model
model.fit(dataset, epochs=5)
# <= inference
predictions = model.predict(test_dataset, callbacks=[callback])
Steam Data + Machine Learning: Challenges
Streaming & big data
• Hadoop/Zookeeper
• Kafka, Spark, Flink
• Java/Scala
• Parquet/Avro/Arrow
Machine leaning
• CPU/GPU/TPU
• Python (C++, CUDA)
• Numpy
• TFRecord/CSV/Text/JPEG/PNG
TensorFlow TFRecord Format
• Sequence of binary strings
• Protocol buffer for serialization
• Flexible and simple
• ONLY used by TensorFlow
• NO ecosystem support like parquet, etc.
Infrastructure Solution
Service
Service
TensorFlow
Convert To
TFRecord File
Write Back to
Kafka
Persistent
Storage
/ File System
tf.data in TensorFlow
• Part of the graph in TensorFlow
• Support Eager and Graph mode
• Focus on data transformation
tf.data.TFRecordDataset
tf.data.TextLineDataset
tf.data.FixedLengthRecordDataset
KafkaDataset for TensorFlow
• Subclass of tf.data.Dataset
• Written in C++ (linked with librdkafka)
• Part of the graph in TensorFlow
• Support Eager and Graph mode
KafkaDataset for TensorFlow
• Part of tensorflow package in TF 1.x
• tf.contrib.kafka
• Part of tensorflow-io package in TF 2.0
• KafkaDataset (Read)
• KafkaOutputSequence (Write)
• Python and R (other language support possible)
$ pip install tensorflow-io
import tensorflow_io.kafka as kafka_io
dataset = kafka_io.KafkaDataset( 'topic', server='localhost', group= '')
# <= pre-process data (if needed), batch/repeat/etc.
dataset = dataset.map(
lambda x: ……)
# <= build keras model
model = tf.keras.models…
model.compile(…)
# <= train the model
model.fit(dataset, epochs=5)
# continue with inference…
# build keras callback
class OutputCallback(tf.keras.callbacks.Callback):
def __init__(self, batch_size, topic, servers):
self._sequence = kafka_io.KafkaOutputSequence(
topic=topic, servers=servers)
self._batch_size = batch_size
def on_predict_batch_end(self, batch, logs=None):
...
self._sequence.setitem(index, class_names[np.argmax(output)])
# <= inference with callback for streaming input and output
model.predict(
test_dataset, callbacks=[OutputCallback(32, 'topic', 'localhost')])
KafkaDataset for TensorFlow
Service
Service
TensorFlow
Service
TensorFlow SIG I/O
• https://github.com/tensorflow/io
• Special Interest Group under TensorFlow org
• Focus on I/O, streaming, and data formats
• Streaming:
• Apache Kafka, AWS Kinesis, Google Cloud PubSub
• Memory, caching and serialization
• Apache Ignite, Apache Arrow, Apache Parquet
• File formats:
• MNIST, WebP/TIFF/etc, Video, Audio
• Cloud Storage
• GCS, S3, Azure
Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Kafka Summit NYC 2019

Contenu connexe

Tendances

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 

Tendances (20)

Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debezium
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 
Common issues with Apache Kafka® Producer
Common issues with Apache Kafka® ProducerCommon issues with Apache Kafka® Producer
Common issues with Apache Kafka® Producer
 
Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with Python
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Exactly once with spark streaming
Exactly once with spark streamingExactly once with spark streaming
Exactly once with spark streaming
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 

Similaire à Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Kafka Summit NYC 2019

Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Chris Fregly
 
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Raffi Khatchadourian
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 

Similaire à Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Kafka Summit NYC 2019 (20)

Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
 
TensorFlow Dev Summit 2017 요약
TensorFlow Dev Summit 2017 요약TensorFlow Dev Summit 2017 요약
TensorFlow Dev Summit 2017 요약
 
Advanced Spark and TensorFlow Meetup May 26, 2016
Advanced Spark and TensorFlow Meetup May 26, 2016Advanced Spark and TensorFlow Meetup May 26, 2016
Advanced Spark and TensorFlow Meetup May 26, 2016
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Keras and TensorFlow
Keras and TensorFlowKeras and TensorFlow
Keras and TensorFlow
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
slide-keras-tf.pptx
slide-keras-tf.pptxslide-keras-tf.pptx
slide-keras-tf.pptx
 
running Tensorflow in Production
running Tensorflow in Productionrunning Tensorflow in Production
running Tensorflow in Production
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr...
 
Certification Study Group -Professional ML Engineer Session 2 (GCP-TensorFlow...
Certification Study Group -Professional ML Engineer Session 2 (GCP-TensorFlow...Certification Study Group -Professional ML Engineer Session 2 (GCP-TensorFlow...
Certification Study Group -Professional ML Engineer Session 2 (GCP-TensorFlow...
 
Introduction to TensorFlow 2
Introduction to TensorFlow 2Introduction to TensorFlow 2
Introduction to TensorFlow 2
 
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning PerformanceAnirudh Koul. 30 Golden Rules of Deep Learning Performance
Anirudh Koul. 30 Golden Rules of Deep Learning Performance
 
Introduction to TensorFlow 2 and Keras
Introduction to TensorFlow 2 and KerasIntroduction to TensorFlow 2 and Keras
Introduction to TensorFlow 2 and Keras
 
A Tour of Tensorflow's APIs
A Tour of Tensorflow's APIsA Tour of Tensorflow's APIs
A Tour of Tensorflow's APIs
 
Boosting machine learning workflow with TensorFlow 2.0
Boosting machine learning workflow with TensorFlow 2.0Boosting machine learning workflow with TensorFlow 2.0
Boosting machine learning workflow with TensorFlow 2.0
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 

Plus de confluent

Plus de confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Santander Stream Processing with Apache Flink
Santander Stream Processing with Apache FlinkSantander Stream Processing with Apache Flink
Santander Stream Processing with Apache Flink
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...
 
AWS Immersion Day Mapfre - Confluent
AWS Immersion Day Mapfre   -   ConfluentAWS Immersion Day Mapfre   -   Confluent
AWS Immersion Day Mapfre - Confluent
 
Eventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalkEventos y Microservicios - Santander TechTalk
Eventos y Microservicios - Santander TechTalk
 
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent CloudQ&A with Confluent Experts: Navigating Networking in Confluent Cloud
Q&A with Confluent Experts: Navigating Networking in Confluent Cloud
 
Citi TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep DiveCiti TechTalk Session 2: Kafka Deep Dive
Citi TechTalk Session 2: Kafka Deep Dive
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
Q&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service MeshQ&A with Confluent Professional Services: Confluent Service Mesh
Q&A with Confluent Professional Services: Confluent Service Mesh
 
Citi Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka MicroservicesCiti Tech Talk: Event Driven Kafka Microservices
Citi Tech Talk: Event Driven Kafka Microservices
 
Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3Confluent & GSI Webinars series - Session 3
Confluent & GSI Webinars series - Session 3
 
Citi Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging ModernizationCiti Tech Talk: Messaging Modernization
Citi Tech Talk: Messaging Modernization
 
Citi Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time dataCiti Tech Talk: Data Governance for streaming and real time data
Citi Tech Talk: Data Governance for streaming and real time data
 
Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2Confluent & GSI Webinars series: Session 2
Confluent & GSI Webinars series: Session 2
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Confluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with SynthesisConfluent Partner Tech Talk with Synthesis
Confluent Partner Tech Talk with Synthesis
 
The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023The Future of Application Development - API Days - Melbourne 2023
The Future of Application Development - API Days - Melbourne 2023
 
The Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data StreamsThe Playful Bond Between REST And Data Streams
The Playful Bond Between REST And Data Streams
 

Dernier

Dernier (20)

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Real Time Streaming Data with Kafka and TensorFlow (Yong Tang, MobileIron) Kafka Summit NYC 2019

  • 1. Machine Learning for Real Time Streaming Data with Kafka and TensorFlow Yong Tang Maintainer & SIG IO Lead, TensorFlow GitHub: yongtang
  • 2. Stream Data Processing & Machine Learning
  • 3. Apache Kafka & Event Driven Microservices Service Service Service Service Service
  • 4. TensorFlow & 2.0 • Most popular machine learning framework • 1.x • 2.0 • Keras high level API • Eager execution • Data processing: tf.data • Distribution Strategy
  • 5. Training Workflow (TensorFlow 2.0) Read and Preprocess Data Model Building Training Saving tf.data tf.keras tf.estimator eager execution autograph distribution strategy saved model
  • 6. # <= read data into tf.data dataset = …. , test_dataset = …. # <= build keras model model = tf.keras.models.Sequential([ tf.keras.layers.Flatten(input_shape=(28, 28)), tf.keras.layers.Dense(512, activation=tf.nn.relu), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(10, activation=tf.nn.softmax) ]) model.compile(optimizer='adam’, loss='sparse_categorical_crossentropy’, metrics=['accuracy’]) # <= train the model model.fit(dataset, epochs=5) # <= inference predictions = model.predict(test_dataset, callbacks=[callback])
  • 7. Steam Data + Machine Learning: Challenges Streaming & big data • Hadoop/Zookeeper • Kafka, Spark, Flink • Java/Scala • Parquet/Avro/Arrow Machine leaning • CPU/GPU/TPU • Python (C++, CUDA) • Numpy • TFRecord/CSV/Text/JPEG/PNG
  • 8. TensorFlow TFRecord Format • Sequence of binary strings • Protocol buffer for serialization • Flexible and simple • ONLY used by TensorFlow • NO ecosystem support like parquet, etc.
  • 9. Infrastructure Solution Service Service TensorFlow Convert To TFRecord File Write Back to Kafka Persistent Storage / File System
  • 10. tf.data in TensorFlow • Part of the graph in TensorFlow • Support Eager and Graph mode • Focus on data transformation tf.data.TFRecordDataset tf.data.TextLineDataset tf.data.FixedLengthRecordDataset
  • 11. KafkaDataset for TensorFlow • Subclass of tf.data.Dataset • Written in C++ (linked with librdkafka) • Part of the graph in TensorFlow • Support Eager and Graph mode
  • 12. KafkaDataset for TensorFlow • Part of tensorflow package in TF 1.x • tf.contrib.kafka • Part of tensorflow-io package in TF 2.0 • KafkaDataset (Read) • KafkaOutputSequence (Write) • Python and R (other language support possible)
  • 13. $ pip install tensorflow-io import tensorflow_io.kafka as kafka_io dataset = kafka_io.KafkaDataset( 'topic', server='localhost', group= '') # <= pre-process data (if needed), batch/repeat/etc. dataset = dataset.map( lambda x: ……) # <= build keras model model = tf.keras.models… model.compile(…) # <= train the model model.fit(dataset, epochs=5)
  • 14. # continue with inference… # build keras callback class OutputCallback(tf.keras.callbacks.Callback): def __init__(self, batch_size, topic, servers): self._sequence = kafka_io.KafkaOutputSequence( topic=topic, servers=servers) self._batch_size = batch_size def on_predict_batch_end(self, batch, logs=None): ... self._sequence.setitem(index, class_names[np.argmax(output)]) # <= inference with callback for streaming input and output model.predict( test_dataset, callbacks=[OutputCallback(32, 'topic', 'localhost')])
  • 16. TensorFlow SIG I/O • https://github.com/tensorflow/io • Special Interest Group under TensorFlow org • Focus on I/O, streaming, and data formats • Streaming: • Apache Kafka, AWS Kinesis, Google Cloud PubSub • Memory, caching and serialization • Apache Ignite, Apache Arrow, Apache Parquet • File formats: • MNIST, WebP/TIFF/etc, Video, Audio • Cloud Storage • GCS, S3, Azure