SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
LESSONS LEARNED:
USING SPARK AND MICROSERVICES
(TO EMPOWER DATA SCIENTISTS AND DATA ENGINEERS)
Alexis Seigneurin
Who I am
• Software engineer for 15+ years
• Consultant at Ippon USA, previously at Ippon France
• Favorite subjects: Spark, Machine Learning, Cassandra
• Spark trainer
• @aseigneurin
• 200 software engineers in France, the US and Australia
• In the US: offices in DC, NYC and Richmond,Virginia
• Digital, Big Data and Cloud applications
• Java & Agile expertise
• Open-source projects: JHipster,Tatami, etc.
• @ipponusa
The project
• Analyze records from customers → Give feedback to the customer on their data
• High volume of data
• 25 millions records per day (average)
• Need to keep at least 60 days of history = 1.5 Billion records
• Seasonal peaks...
• Need an hybrid platform
• Batch processing for some types of analysis
• Streaming for other analyses
• Hybrid team
• Data Scientists: more familiar with Python
• Software Engineers: Java
Technical Overview
Processing technology - Spark
• Mature platform
• Supports batch jobs and streaming jobs
• Support for multiple programming languages
• Python → Data Scientists
• Scala/Java → Software Engineers
Architecture - Real time platform
• New use cases are implemented by Data Scientists all the time
• Need the implementations to be independent from each other
• One Spark Streaming job per use case
• Microservice-inspired architecture
• Diamond-shaped
• Upstream jobs are written in Scala
• Core is made of multiple Python jobs, one per use case
• Downstream jobs are written in Scala
• Plumbing between the jobs → Kafka
1/2
Architecture - Real time platform 2/2
Messaging technology - Kafka
From kafka.apache.org
• “A high-throughput distributed messaging system”
• Messaging: between 2 Spark jobs
• Distributed: fits well with Spark, can be scaled up or down
• High-throughput: so as to handle an average of 300 messages/second, peaks at 2000 m/s
• “Apache Kafka is publish-subscribe messaging rethought as a distributed
commit log”
• Commit log so that you can go back in time and reprocess data
• Only used as such when a job crashes, for resilience purposes
Storage
• Currently PostgreSQL:
• SQL databases are well known by developers and easy to work with
• PostgreSQL is available “as-a-service” on AWS
• Working on transitioning to Cassandra (more on that
later)
Deployment platform
• Amazon AWS
• Company standard - Everything in the cloud
• Easy to scale up or down, ability to choose the hardware
• Some limitations
• Requirement to use company-crafted AMIs
• Cannot use some services (EMR…)
• AMIs are renewed every 2 months → need to recreate the platform
continuously
Strengths of the platform
Modularity
• One Spark job per use case
• Hot deployments: can roll out new use cases (= new jobs) without
stopping existing jobs
• Can roll out updated code without affecting other jobs
• Able to measure the resources consumed by a single job
• Shared services are provided by upstream and
downstream jobs
A/B testing
• A/B testing of updated features
• Run 2 implementations of the code in parallel
• Let each filter process the data of all the customers
• Post-filter to let the customers receive A or B
• (Measure…)
• Can be used to slowly roll out new features
Data Scientists can contribute
• Spark in Python → pySpark
• Data Scientists know Python (and don’t want to hear about Java/
Scala!)
• Business logic implemented in Python
• Code is easy to write and to read
• Data Scientists are real contributors → quick iterations to production
Challenges
Data Scientist code in production
• Shipping code written by Data Scientists is not ideal
• Need production-grade code (error handling, logging…)
• Code is less tested than Scala code
• Harder to deploy than a JAR file → PythonVirtual Environments
• blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-
hadoop-cluster-for-pyspark-jobs/
Allocation of resources in Spark
• With Spark Streaming, resources (CPU & memory) are allocated per job
• Resources are allocated when the job is submitted and cannot be updated on the
fly
• Have to allocate 1 core to the Driver of the job → unused resource
• Have to allocate extra resources to each job to handle variations in traffic →
unused resources
• For peak periods, easy to add new Spark Workers but jobs have to restarted
• Idea to be tested:
• Over allocation of real resources, e.g let Spark know it has 6 cores on a 4-cores server
Micro-batches
Spark streaming processes events in micro-batches
• Impact on the latency
• Spark Streaming micro-batches → hard to achieve sub-second latency
• See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads
• Total latency of the system = sum of the latencies of each stage
• In this use case, events are independent from each other - no need for windowing computation → a
real streaming framework would be more appropriate
• Impact on memory usage
• Kafka+Spark using the direct approach = 1 RDD partition per Kafka partition
• If you start the Spark with lots of unprocessed data in Kafka, RDD partitions can exceed the size of
the memory
Resilience of Spark jobs
• Spark Streaming application = 1 Driver + 1 Application
• Application = N Executors
• If an Executor dies → restarted (seamless)
• If the Driver dies, the whole Application must be restarted
• Scala/Java jobs → “supervised” mode
• Python jobs → not supported with Spark Standalone
Resilience with Spark & Kafka
• Connecting Spark to Kafka, 2 methods:
• Receiver-based approach: not ideal for parallelism
• Direct approach: better for parallelism but have to deal with Kafka offsets
• Dealing with Kafka offsets
• Default: consumes from the end of the Kafka topic (or the beginning)
• Documentation → Use checkpoints
• Tasks have to be Serializable (not always possible: dependent libraries)
• Harder to deploy the application (classes are serialized) → run a new instance in parallel and
kill the first one (harder to automate; messages consumed twice)
• Requires a shared file system (HDFS, S3) → big latency on these FS that forces to increase the
micro-batch interval
1/2
Resilience with Spark & Kafka
• Dealing with Kafka offsets
• Solution: deal with offsets in the Spark Streaming application
• Write the offsets to a reliable storage: ZooKeeper, Kafka…
• Write after processing the data
• Read the offsets on startup (if no offsets, start from the end)
• ippon.tech/blog/spark-kafka-achieving-zero-data-loss/
2/2
Writing to Kafka
• Spark Streaming comes with a library to read from Kafka
but none to write to Kafka!
• Flink or Kafka Streams do that out-of-the-box
• Cloudera provides an open-source library:
• github.com/cloudera/spark-kafka-writer
• (Has been removed by now!)
Idempotence
Spark and fault-tolerance semantics:
• Spark can provide exactly once guarantee only for the transformation
of the data
• Writing the data is at least once with non-transactional systems
(including Kafka in our case)
• See spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-semantics
→The overall system has to be idempotent
Message format & schemas
• Spark jobs are decoupled, but each depends on the upstream job
• Message formats have to be agreed upon
• JSON
• Pros: flexible
• Cons: flexible! (missing fields)
• Avro
• Pros: enforces a structure (named fields + types)
• Cons: hard to propagate the schemas
→ Confluent’s Schema Registry (more on that later)
Potential & upcoming
improvements
Confluent’s Schema Registry
docs.confluent.io/3.0.0/schema-registry/docs/index.html
• Separate (web) server to manage & enforce Avro schemas
• Stores schemas, versions them, and can perform compatibility checks
(configurable: backward or forward)
• Makes life simpler:
✓ no need to share schemas (“what version of the schema is this?”)
✓ no need to share generated classes
✓ can update the producer with backward-compatible messages without affecting the
consumers
1/2
Confluent’s Schema Registry
• Comes with:
• A Kafka Serializer (for the producer): sends the schema of the object to the Schema Registry before sending the record to Kafka
• Message sending fails if schema compatibility fails
• A Kafka Decoder (for the consumer): retrieves the schema from the Schema Registry when a message comes in
2/2
Kafka Streams
docs.confluent.io/3.0.0/streams/index.html
• “powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed stream
processing applications on top of Apache Kafka”
• Perfect fit for micro-services on top of Kafka
• Natively consumes messages from Kafka
• Natively pushes produced messages to Kafka
• Processes messages one at a time → very low latency
1/2
• Pros
• API is very similar to Spark’s API
• Deploy new instances of the application to scale out
• Cons
• JVM languages only - no support for Python
• Outside of Spark - one more thing to manage
Kafka Streams
Properties props = new Properties();

props.put(StreamsConfig.APPLICATION_ID_CONFIG, "xxx");

props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");

props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2182");

props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());



props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");



KStreamBuilder builder = new KStreamBuilder();



KStream<String, String> kafkaInput = builder.stream(“INPUT-TOPIC");

KStream<String, RealtimeXXX> auths = kafkaInput.mapValues(value -> ...);

KStream<String, byte[]> serializedAuths = auths.mapValues(a -> AvroSerializer.serialize(a));



serializedAuths.to(Serdes.String(), Serdes.ByteArray(), “OUTPUT-TOPIC");



KafkaStreams streams = new KafkaStreams(builder, props);

streams.start();
2/2
Example (Java)
Database migration
• The database stores the state
• Client settings or analyzed behavior
• Historical data (up to 60 days)
• Produced outputs
• Some technologies can store a state (e.g. Samza) but hardly 60 days of data
• Initially used PostgreSQL
• Easy to start with
• Available on AWS “as-a-service”: RDS
• Cannot scale to 60 days of historical data, though
• Cassandra is a good fit
• Scales out for the storage of historical data
• Connects to Spark
• Load Cassandra data into Spark, or saves data from Spark to Cassandra
• Can be used to reprocess existing data for denormalization purposes
Summary
&
Conclusion
Summary
Is the microservices architecture adequate?
• Interesting to separate the implementations of the use cases
• Overhead for the other services
Is Spark adequate?
• Supports Python (not supported by Kafka Streams)
• Micro-batches not adequate
Thank you!
@aseigneurin

Contenu connexe

Tendances

Helm - Application deployment management for Kubernetes
Helm - Application deployment management for KubernetesHelm - Application deployment management for Kubernetes
Helm - Application deployment management for KubernetesAlexei Ledenev
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
FIWARE Training: JSON-LD and NGSI-LD
FIWARE Training: JSON-LD and NGSI-LDFIWARE Training: JSON-LD and NGSI-LD
FIWARE Training: JSON-LD and NGSI-LDFIWARE
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsLightbend
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Aura Framework Overview
Aura Framework OverviewAura Framework Overview
Aura Framework Overviewrajdeep
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Service Mesh @Lara Camp Myanmar - 02 Sep,2023
Service Mesh @Lara Camp Myanmar - 02 Sep,2023Service Mesh @Lara Camp Myanmar - 02 Sep,2023
Service Mesh @Lara Camp Myanmar - 02 Sep,2023Hello Cloud
 
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20Phil Wilkins
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with KeycloakJulien Pivotto
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18CodeOps Technologies LLP
 
Kuberntes Ingress with Kong
Kuberntes Ingress with KongKuberntes Ingress with Kong
Kuberntes Ingress with KongNebulaworks
 
Introduction to Swagger
Introduction to SwaggerIntroduction to Swagger
Introduction to SwaggerKnoldus Inc.
 
Behavior Driven Development and Automation Testing Using Cucumber
Behavior Driven Development and Automation Testing Using CucumberBehavior Driven Development and Automation Testing Using Cucumber
Behavior Driven Development and Automation Testing Using CucumberKMS Technology
 
Fluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API DetailsFluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API DetailsSATOSHI TAGOMORI
 
seven-ways-to-run-flink-on-aws.pdf
seven-ways-to-run-flink-on-aws.pdfseven-ways-to-run-flink-on-aws.pdf
seven-ways-to-run-flink-on-aws.pdfSergioBruno21
 

Tendances (20)

Helm - Application deployment management for Kubernetes
Helm - Application deployment management for KubernetesHelm - Application deployment management for Kubernetes
Helm - Application deployment management for Kubernetes
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
FIWARE Training: JSON-LD and NGSI-LD
FIWARE Training: JSON-LD and NGSI-LDFIWARE Training: JSON-LD and NGSI-LD
FIWARE Training: JSON-LD and NGSI-LD
 
API Management
API ManagementAPI Management
API Management
 
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming ApplicationsRunning Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
Running Kafka On Kubernetes With Strimzi For Real-Time Streaming Applications
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Aura Framework Overview
Aura Framework OverviewAura Framework Overview
Aura Framework Overview
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Service Mesh @Lara Camp Myanmar - 02 Sep,2023
Service Mesh @Lara Camp Myanmar - 02 Sep,2023Service Mesh @Lara Camp Myanmar - 02 Sep,2023
Service Mesh @Lara Camp Myanmar - 02 Sep,2023
 
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
gRPC, GraphQL, REST - Which API Tech to use - API Conference Berlin oct 20
 
Kubernetes PPT.pptx
Kubernetes PPT.pptxKubernetes PPT.pptx
Kubernetes PPT.pptx
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with Keycloak
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18
 
Kuberntes Ingress with Kong
Kuberntes Ingress with KongKuberntes Ingress with Kong
Kuberntes Ingress with Kong
 
Introduction to Swagger
Introduction to SwaggerIntroduction to Swagger
Introduction to Swagger
 
Behavior Driven Development and Automation Testing Using Cucumber
Behavior Driven Development and Automation Testing Using CucumberBehavior Driven Development and Automation Testing Using Cucumber
Behavior Driven Development and Automation Testing Using Cucumber
 
Fluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API DetailsFluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API Details
 
Ingress overview
Ingress overviewIngress overview
Ingress overview
 
Jenkins-CI
Jenkins-CIJenkins-CI
Jenkins-CI
 
seven-ways-to-run-flink-on-aws.pdf
seven-ways-to-run-flink-on-aws.pdfseven-ways-to-run-flink-on-aws.pdf
seven-ways-to-run-flink-on-aws.pdf
 

Similaire à Lessons Learned: Using Spark and Microservices

Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Exampleconfluent
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine LearningChris Fregly
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationshadooparchbook
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationKnoldus Inc.
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafkaconfluent
 

Similaire à Lessons Learned: Using Spark and Microservices (20)

Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Top 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applicationsTop 5 mistakes when writing Streaming applications
Top 5 mistakes when writing Streaming applications
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams Presentation
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
 

Plus de Alexis Seigneurin

Data Quality Monitoring in Realtime and at Scale
Data Quality Monitoring in Realtime and at ScaleData Quality Monitoring in Realtime and at Scale
Data Quality Monitoring in Realtime and at ScaleAlexis Seigneurin
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)Alexis Seigneurin
 
Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)Alexis Seigneurin
 
Spark - Alexis Seigneurin (English)
Spark - Alexis Seigneurin (English)Spark - Alexis Seigneurin (English)
Spark - Alexis Seigneurin (English)Alexis Seigneurin
 
Spark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclairSpark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclairAlexis Seigneurin
 

Plus de Alexis Seigneurin (8)

Data Quality Monitoring in Realtime and at Scale
Data Quality Monitoring in Realtime and at ScaleData Quality Monitoring in Realtime and at Scale
Data Quality Monitoring in Realtime and at Scale
 
0712_Seigneurin
0712_Seigneurin0712_Seigneurin
0712_Seigneurin
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)Spark (v1.3) - Présentation (Français)
Spark (v1.3) - Présentation (Français)
 
Spark - Ippevent 19-02-2015
Spark - Ippevent 19-02-2015Spark - Ippevent 19-02-2015
Spark - Ippevent 19-02-2015
 
Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)Spark - Alexis Seigneurin (Français)
Spark - Alexis Seigneurin (Français)
 
Spark - Alexis Seigneurin (English)
Spark - Alexis Seigneurin (English)Spark - Alexis Seigneurin (English)
Spark - Alexis Seigneurin (English)
 
Spark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclairSpark, ou comment traiter des données à la vitesse de l'éclair
Spark, ou comment traiter des données à la vitesse de l'éclair
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Lessons Learned: Using Spark and Microservices

  • 1. LESSONS LEARNED: USING SPARK AND MICROSERVICES (TO EMPOWER DATA SCIENTISTS AND DATA ENGINEERS) Alexis Seigneurin
  • 2. Who I am • Software engineer for 15+ years • Consultant at Ippon USA, previously at Ippon France • Favorite subjects: Spark, Machine Learning, Cassandra • Spark trainer • @aseigneurin
  • 3. • 200 software engineers in France, the US and Australia • In the US: offices in DC, NYC and Richmond,Virginia • Digital, Big Data and Cloud applications • Java & Agile expertise • Open-source projects: JHipster,Tatami, etc. • @ipponusa
  • 4. The project • Analyze records from customers → Give feedback to the customer on their data • High volume of data • 25 millions records per day (average) • Need to keep at least 60 days of history = 1.5 Billion records • Seasonal peaks... • Need an hybrid platform • Batch processing for some types of analysis • Streaming for other analyses • Hybrid team • Data Scientists: more familiar with Python • Software Engineers: Java
  • 6. Processing technology - Spark • Mature platform • Supports batch jobs and streaming jobs • Support for multiple programming languages • Python → Data Scientists • Scala/Java → Software Engineers
  • 7. Architecture - Real time platform • New use cases are implemented by Data Scientists all the time • Need the implementations to be independent from each other • One Spark Streaming job per use case • Microservice-inspired architecture • Diamond-shaped • Upstream jobs are written in Scala • Core is made of multiple Python jobs, one per use case • Downstream jobs are written in Scala • Plumbing between the jobs → Kafka 1/2
  • 8. Architecture - Real time platform 2/2
  • 9. Messaging technology - Kafka From kafka.apache.org • “A high-throughput distributed messaging system” • Messaging: between 2 Spark jobs • Distributed: fits well with Spark, can be scaled up or down • High-throughput: so as to handle an average of 300 messages/second, peaks at 2000 m/s • “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log” • Commit log so that you can go back in time and reprocess data • Only used as such when a job crashes, for resilience purposes
  • 10. Storage • Currently PostgreSQL: • SQL databases are well known by developers and easy to work with • PostgreSQL is available “as-a-service” on AWS • Working on transitioning to Cassandra (more on that later)
  • 11. Deployment platform • Amazon AWS • Company standard - Everything in the cloud • Easy to scale up or down, ability to choose the hardware • Some limitations • Requirement to use company-crafted AMIs • Cannot use some services (EMR…) • AMIs are renewed every 2 months → need to recreate the platform continuously
  • 12. Strengths of the platform
  • 13. Modularity • One Spark job per use case • Hot deployments: can roll out new use cases (= new jobs) without stopping existing jobs • Can roll out updated code without affecting other jobs • Able to measure the resources consumed by a single job • Shared services are provided by upstream and downstream jobs
  • 14. A/B testing • A/B testing of updated features • Run 2 implementations of the code in parallel • Let each filter process the data of all the customers • Post-filter to let the customers receive A or B • (Measure…) • Can be used to slowly roll out new features
  • 15. Data Scientists can contribute • Spark in Python → pySpark • Data Scientists know Python (and don’t want to hear about Java/ Scala!) • Business logic implemented in Python • Code is easy to write and to read • Data Scientists are real contributors → quick iterations to production
  • 17. Data Scientist code in production • Shipping code written by Data Scientists is not ideal • Need production-grade code (error handling, logging…) • Code is less tested than Scala code • Harder to deploy than a JAR file → PythonVirtual Environments • blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache- hadoop-cluster-for-pyspark-jobs/
  • 18. Allocation of resources in Spark • With Spark Streaming, resources (CPU & memory) are allocated per job • Resources are allocated when the job is submitted and cannot be updated on the fly • Have to allocate 1 core to the Driver of the job → unused resource • Have to allocate extra resources to each job to handle variations in traffic → unused resources • For peak periods, easy to add new Spark Workers but jobs have to restarted • Idea to be tested: • Over allocation of real resources, e.g let Spark know it has 6 cores on a 4-cores server
  • 19. Micro-batches Spark streaming processes events in micro-batches • Impact on the latency • Spark Streaming micro-batches → hard to achieve sub-second latency • See spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads • Total latency of the system = sum of the latencies of each stage • In this use case, events are independent from each other - no need for windowing computation → a real streaming framework would be more appropriate • Impact on memory usage • Kafka+Spark using the direct approach = 1 RDD partition per Kafka partition • If you start the Spark with lots of unprocessed data in Kafka, RDD partitions can exceed the size of the memory
  • 20. Resilience of Spark jobs • Spark Streaming application = 1 Driver + 1 Application • Application = N Executors • If an Executor dies → restarted (seamless) • If the Driver dies, the whole Application must be restarted • Scala/Java jobs → “supervised” mode • Python jobs → not supported with Spark Standalone
  • 21. Resilience with Spark & Kafka • Connecting Spark to Kafka, 2 methods: • Receiver-based approach: not ideal for parallelism • Direct approach: better for parallelism but have to deal with Kafka offsets • Dealing with Kafka offsets • Default: consumes from the end of the Kafka topic (or the beginning) • Documentation → Use checkpoints • Tasks have to be Serializable (not always possible: dependent libraries) • Harder to deploy the application (classes are serialized) → run a new instance in parallel and kill the first one (harder to automate; messages consumed twice) • Requires a shared file system (HDFS, S3) → big latency on these FS that forces to increase the micro-batch interval 1/2
  • 22. Resilience with Spark & Kafka • Dealing with Kafka offsets • Solution: deal with offsets in the Spark Streaming application • Write the offsets to a reliable storage: ZooKeeper, Kafka… • Write after processing the data • Read the offsets on startup (if no offsets, start from the end) • ippon.tech/blog/spark-kafka-achieving-zero-data-loss/ 2/2
  • 23. Writing to Kafka • Spark Streaming comes with a library to read from Kafka but none to write to Kafka! • Flink or Kafka Streams do that out-of-the-box • Cloudera provides an open-source library: • github.com/cloudera/spark-kafka-writer • (Has been removed by now!)
  • 24. Idempotence Spark and fault-tolerance semantics: • Spark can provide exactly once guarantee only for the transformation of the data • Writing the data is at least once with non-transactional systems (including Kafka in our case) • See spark.apache.org/docs/latest/streaming-programming- guide.html#fault-tolerance-semantics →The overall system has to be idempotent
  • 25. Message format & schemas • Spark jobs are decoupled, but each depends on the upstream job • Message formats have to be agreed upon • JSON • Pros: flexible • Cons: flexible! (missing fields) • Avro • Pros: enforces a structure (named fields + types) • Cons: hard to propagate the schemas → Confluent’s Schema Registry (more on that later)
  • 27. Confluent’s Schema Registry docs.confluent.io/3.0.0/schema-registry/docs/index.html • Separate (web) server to manage & enforce Avro schemas • Stores schemas, versions them, and can perform compatibility checks (configurable: backward or forward) • Makes life simpler: ✓ no need to share schemas (“what version of the schema is this?”) ✓ no need to share generated classes ✓ can update the producer with backward-compatible messages without affecting the consumers 1/2
  • 28. Confluent’s Schema Registry • Comes with: • A Kafka Serializer (for the producer): sends the schema of the object to the Schema Registry before sending the record to Kafka • Message sending fails if schema compatibility fails • A Kafka Decoder (for the consumer): retrieves the schema from the Schema Registry when a message comes in 2/2
  • 29. Kafka Streams docs.confluent.io/3.0.0/streams/index.html • “powerful, easy-to-use library for building highly scalable, fault-tolerant, distributed stream processing applications on top of Apache Kafka” • Perfect fit for micro-services on top of Kafka • Natively consumes messages from Kafka • Natively pushes produced messages to Kafka • Processes messages one at a time → very low latency 1/2 • Pros • API is very similar to Spark’s API • Deploy new instances of the application to scale out • Cons • JVM languages only - no support for Python • Outside of Spark - one more thing to manage
  • 30. Kafka Streams Properties props = new Properties();
 props.put(StreamsConfig.APPLICATION_ID_CONFIG, "xxx");
 props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9093");
 props.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2182");
 props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
 props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
 
 props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
 
 KStreamBuilder builder = new KStreamBuilder();
 
 KStream<String, String> kafkaInput = builder.stream(“INPUT-TOPIC");
 KStream<String, RealtimeXXX> auths = kafkaInput.mapValues(value -> ...);
 KStream<String, byte[]> serializedAuths = auths.mapValues(a -> AvroSerializer.serialize(a));
 
 serializedAuths.to(Serdes.String(), Serdes.ByteArray(), “OUTPUT-TOPIC");
 
 KafkaStreams streams = new KafkaStreams(builder, props);
 streams.start(); 2/2 Example (Java)
  • 31. Database migration • The database stores the state • Client settings or analyzed behavior • Historical data (up to 60 days) • Produced outputs • Some technologies can store a state (e.g. Samza) but hardly 60 days of data • Initially used PostgreSQL • Easy to start with • Available on AWS “as-a-service”: RDS • Cannot scale to 60 days of historical data, though • Cassandra is a good fit • Scales out for the storage of historical data • Connects to Spark • Load Cassandra data into Spark, or saves data from Spark to Cassandra • Can be used to reprocess existing data for denormalization purposes
  • 33. Summary Is the microservices architecture adequate? • Interesting to separate the implementations of the use cases • Overhead for the other services Is Spark adequate? • Supports Python (not supported by Kafka Streams) • Micro-batches not adequate