Apache Big Data 2017, Miami (Florida/USA): Talk by Josef Adersberger (@adersberger, CTO at QAware)
Abstract:
We see a big data processing pattern emerging using the Microservice approach to build an integrated, flexible, and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we'll introduce into Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some architectures from real-life.
3. BIG DATA FAST DATA
SMART DATA
All things distributed:
‣distributed
processing
‣distributed
databases
Data to information:
‣machine (deep) learning
‣advanced statistics
‣natural language processing
‣semantic web
Low latency and
high throughput:
‣stream processing
‣messaging
‣event-driven
9. BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Microservice
(aka Dataservice)
Message
Queue
Sources Processors Sinks
DIRECTED GRAPH OF MICROSERVICES EXCHANGING DATA VIA MESSAGING
10. BASIC IDEA: COHERENT PLATFORM FOR MICRO- AND DATASERVICES
CLUSTER OPERATING SYSTEM
IAAS ON PREM LOCAL
MICROSERVICES
DATASERVICES
MICROSERVICES PLATFORM
DATASERVICES PLATFORM
11. OPEN SOURCE DATASERVICE PLATFORMS
‣ Open source project based on the Spring stack
‣ Microservices: Spring Boot
‣ Messaging: Kafka 0.9, Kafka 0.10, RabbitMQ
‣ Standardized API with several open source implementations
‣ Microservices: JavaEE micro container
‣ Messaging: JMS
‣ Open source by Lightbend (part. commercialised & proprietary)
‣ Microservices: Lagom, Play
‣ Messaging: akka
13. BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Sources Processors Sinks
DIRECTED GRAPH OF SPRING BOOT MICROSERVICES EXCHANGING DATA VIA MESSAGING
Stream
App
Message
Broker
Channel
14. THE BIG PICTURE
SPRING CLOUD DATA FLOW SERVER (SCDF SERVER)
TARGET RUNTIME
SPI
API
LOCAL
SCDF Shell
SCDF Admin UI
Flo Stream Designer
15. THE BIG PICTURE
SPRING CLOUD DATA FLOW SERVER (SCDF SERVER)
TARGET RUNTIME
MESSAGE BROKER
APP
SPRING BOOT
SPRING FRAMEWORK
SPRING CLOUD STREAM
SPRING INTEGRATION
BINDER
APP
APP
APP
CHANNELS
(input/output)
16. THE VEINS: SCALABLE DATA LOGISTICS WITH MESSAGING
Sources Processors Sinks
STREAM PARTITIONING: TO BE ABLE TO SCALE MICROSERVICES
BACK PRESSURE HANDLING: TO BE ABLE TO COPE WITH PEEKS
18. BACK PRESSURE HANDLING
1
3
2
1. Signals if (message) pressure is too high
2. Regulates inbound (message) flow
3. (Data) retention lake
19. DISCLAIMER: THERE IS ALSO A TASK EXECUTION MODEL (WE WILL IGNORE)
‣ short-living
‣finite data set
‣programming model = Spring Cloud Task
‣starters available for JDBC and Spark
as data source/sink
20. CONNECTED CAR PLATFORM
EDGE SERVICE
MQTT Broker
(apigee Link)
MQTT Source Data
Cleansing
Realtime traffic
analytics
KPI ANALYTICS
Spark
DASHBOARD
react-vis
Presto
Masterdata
Blending
Camel
KafkaKafka
ESB
gPRC
35. PRO CON
specialized programming
model -> efficient
specialized execution
environment -> efficient
support for all types of data
(big, fast, smart)
disjoint programming model
(data processing <-> services)
maybe a disjoint execution
environment
(data stack <-> service stack)
BEST USED
further on: as default for {big,fast,smart} data processing
36. PRO CON
coherent execution
environment (runs on
microservice stack)
coherent programming
model with emphasis on
separation of concerns
bascialy supports all types of
data (big, fast, smart)
has limitations on throughput
(big & fast data) due to less
optimization (like data affinity,
query optimizer, …) and
message-wise processing
technology immature in certain
parts (e.g. diagnosability)
BEST USED FOR
hybrid applications of data processing, system integration, API, UI
moderate throughput data applications with existing dev team
Message by message processing
43. ARCHITECT’S VIEW
THE SECRET OF BIG DATA PERFORMANCE
Rule 1: Be as close to the data as possible!
(CPU cache > memory > local disk > network)
Rule 2: Reduce data volume as early as possible!
(as long as you don’t sacrifice parallelization)
Rule 3: Parallelize as much as possible!
Rule 4: Premature diagnosability and optimization