End to End Streaming Architectures

© Cloudera, Inc. All rights reserved.
Beyond ETL: End to End
Streaming Architectures

Your Speakers
Amandeep Khurana
Solutions Architect
Sean Anderson
Product Marketing

Agenda
Trends
I. Traditional Architectures
II. New Solutions
Primer on Streaming Systems
I. Kafka
II. Flume
III. Storm
IV. Spark Streaming
V. Flink
Typical Architectures
I. Bus centric
II. File system centric
III. Hybrid

Poll
How many are familiar with the Hadoop ecosystem components?
• Little/no familiarity
• Starting out
• Very familiar, not in production
• Using in production

Disruption in the Space
Explosion of Unstructured Data
16 billion connected devices generating more data than ever. Data is driving modern businesses.
Data Warehouse Limitations
Popular data warehouse platforms were not designed to handle the scale of modern data.
Increased Processing Demands
Processing technology is struggling to keep up with increased data and new real-time formats.

Traditional Pipelines

Challenges with a Traditional Solution
1) Limited Data
2) Poor Performance
3) Operational Complexity
[Diagram: a traditional pipeline with data sources (structured and unstructured data), ingest, ETL/ELT, staging environments, enterprise data warehouse, storage and archive, BI system (modeling, reporting), and applications. Stages are labeled Ingest, Load, Process, Model, and Serve; numbered callouts 1 to 3 mark where each challenge applies.]

What is the impact to the business?
Limited Data Access
• Archived data is inaccessible
• Streaming data not captured
• Unstructured data not captured
Missed Processing SLAs
• Data ingest/transformations exceeding
processing windows
• New projects abandoned due to workload
• Decreased Engineering and Data Science
productivity
Poor ROI
• High Data Warehouse/RDBMS Costs
• Data archived to save cost
• Increased performance drag on analytic systems
Operational Fragmentation
• Separate platforms for batch and
stream processing
• Separate security, governance, and
management
• Insufficient access for developers

Data Ingestion and Processing at
Cloudera

Cloudera Enterprise, A New Way Forward

Poll
Which streaming systems have you heard of?
• Kafka
• Storm
• Spark Streaming
• Samza
• Flink
• Kafka Streams
• Flume

Streaming Systems and
Architectures

Ingestion
The foundation of your data platform
Data can come from a variety of “siloed” sources
▪ Existing databases
▪ Sensor data
▪ Server logs
▪ Chat transcripts
Value of data is multiplied when combined and
correlated with other data
▪ “40% value improvement from combining data from
multiple IoT sources” McKinsey Global Institute

Apache Sqoop
SQL to Hadoop
Efficiently exchange data between
database and Hadoop
• Bidirectional
• Import all or partial/new data
• Export for shared data access across
systems
Easily get started with high
performance connectors
• Free to use
• Optimized connectors for popular RDBMS,
EDW, and NoSQL options
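The "partial/new data" import can be pictured as a high-water-mark query. The sketch below is not Sqoop itself; it only illustrates the idea behind an incremental append import (Sqoop exposes it through a check column and a last value), using an in-memory SQLite table. The table name `orders`, its columns, and the helper `import_new_rows` are assumptions made for illustration.

```python
import sqlite3

def import_new_rows(conn, table, check_column, last_value):
    """Return rows whose check column exceeds last_value, plus the new high-water mark."""
    # Table and column names are interpolated directly; this sketch assumes
    # they come from trusted configuration, not from user input.
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cursor = conn.execute(query, (last_value,))
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    idx = columns.index(check_column)
    new_last = rows[-1][idx] if rows else last_value
    return rows, new_last

# Hypothetical source table standing in for an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0), (3, 7.25)])

# First run imports everything; later runs import only rows added since.
first_batch, mark = import_new_rows(conn, "orders", "id", 0)
conn.execute("INSERT INTO orders VALUES (4, 12.0)")
second_batch, mark2 = import_new_rows(conn, "orders", "id", mark)
```

Storing the returned mark between runs is what keeps each import limited to new data.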
[Platform diagram layers: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]

Streaming Systems in Hadoop
Tightly coupled ingestion: Flume
General purpose bus: Kafka
Processing: Spark Streaming, Storm, Flink, Samza

Apache Flume
Log & Event Aggregation for Hadoop
• Efficiently move large amounts of
streaming/log data
• Easily collect data from multiple systems
(sources)
• Built-in sources, sinks, and channels
• Customize data flow to transform data on-
the-fly
• Reliable, scalable, and extensible
for production
• Manage and monitor with Cloudera
Manager
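The source, channel, and sink roles named above can be sketched in a few lines. This is a toy model, not Flume's API: the `Channel` class, the two interceptor functions, and the host name are assumptions for illustration, and a plain list stands in for a destination such as HDFS.

```python
from collections import deque

class Channel:
    """Bounded in-memory buffer that sits between a source and a sink."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        if len(self._events) >= self.capacity:
            raise OverflowError("channel full")
        self._events.append(event)

    def take(self, max_events):
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch

def drop_debug(event):
    # Interceptor that filters events on the fly; returning None drops the event.
    return None if event["body"].startswith("DEBUG") else event

def tag_host(event):
    # Interceptor that enriches events; the host name is a made-up example.
    event["headers"]["host"] = "web-01"
    return event

def run_source(lines, interceptors, channel):
    """Wrap each line as an event, apply interceptors in order, buffer the survivors."""
    for line in lines:
        event = {"headers": {}, "body": line}
        for intercept in interceptors:
            event = intercept(event)
            if event is None:
                break
        if event is not None:
            channel.put(event)

def run_sink(channel, destination, batch_size=2):
    """Drain the channel in batches into the destination."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        destination.extend(batch)

channel = Channel(capacity=10)
run_source(["INFO start", "DEBUG noisy", "ERROR disk full"], [drop_debug, tag_host], channel)
delivered = []  # stand-in for HDFS, HBase, or Search
run_sink(channel, delivered)
```

The channel is the piece that lets the source and the sink run at different speeds without losing events in between.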

Typical pipeline

Apache Kafka
Pub-Sub Messaging for Hadoop
Backbone for real-time architectures
• Fast, flexible messaging for a wide range of use
cases
• Scale to support more data sources and
growing data volumes
• Zero data loss durability and always-on fault-
tolerance
• Built-in security and data protection
Seamless integration across the platform
• Connect to Flume, Spark Streaming, HBase, and
more
• Manage and monitor with Cloudera Manager

Kafka is a Publish-Subscribe Messaging System
What is a pub-sub messaging system?
• Acts as a broker between producers of data and consumers of data
• Producers don’t need to know who will consume the data or to make sure consumers receive it
• Consumers consume from the Broker and don’t talk to the producers
• Broker makes sure data is delivered fast and reliably
Decouple Data Pipelines
[Diagram: producers and consumers decoupled by a pub-sub broker]
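A minimal in-memory sketch of the decoupling described on this slide, under the assumption that the broker keeps each topic as an append-only list and tracks a read position per consumer. The `Broker` class and the consumer names are hypothetical; this is not Kafka's client API.

```python
from collections import defaultdict

class Broker:
    """Toy pub-sub broker: producers publish, consumers pull at their own pace."""
    def __init__(self):
        self._topics = defaultdict(list)   # topic -> append-only message list
        self._offsets = defaultdict(int)   # (consumer, topic) -> next position to read

    def publish(self, topic, message):
        # The producer never learns who will read the message.
        self._topics[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        # Each consumer pulls from the broker and advances only its own position.
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch

broker = Broker()
broker.publish("clicks", "a")
broker.publish("clicks", "b")
alerts_first = broker.poll("alerting", "clicks")
broker.publish("clicks", "c")
alerts_second = broker.poll("alerting", "clicks")
archive_all = broker.poll("archiver", "clicks")   # a late consumer still sees everything
```

Adding the second consumer required no change on the producer side, which is the point of putting a broker in the middle.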

The Basics
• Messages are organized into topics
• Topics are broken into partitions
• Partitions are replicated across the brokers as replicas
• Kafka runs in a cluster. Nodes are called brokers
• Producers push messages
• Consumers pull messages

Beyond Basics…
Replicas
• A partition has 1 leader replica. The others are followers.
• Followers are considered in-sync when:
  • The replica is alive
  • The replica is not “too far” behind the leader (configurable)
• The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)
• Replicas map to physical locations on a broker
Messages
• Can optionally be keyed in order to map to a static partition
  • Used if ordering within a partition is needed
  • Avoid otherwise (extra complexity, skew, etc.)
• Location of a message is denoted by its topic, partition & offset
• A partition’s offset increases as messages are appended
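Two ideas from this slide lend themselves to a short sketch: a key always maps to the same partition, and a follower counts as in-sync only if it is alive and not too far behind. The code below is an illustration under assumptions: CRC32 stands in for the hash (Kafka's own clients use a different one), lag is measured in seconds, and the `Topic` class and `in_sync_replicas` function are hypothetical names.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a fixed partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append to the key's partition and return the message location."""
        p = partition_for(key, len(self.partitions))
        log = self.partitions[p]
        log.append(value)
        return (self.name, p, len(log) - 1)   # topic, partition, offset

def in_sync_replicas(leader, followers, now, max_lag_seconds):
    """followers maps broker id -> (alive, time the replica last caught up)."""
    isr = [leader]
    for broker_id, (alive, last_caught_up) in followers.items():
        if alive and now - last_caught_up <= max_lag_seconds:
            isr.append(broker_id)
    return isr

topic = Topic("payments", num_partitions=4)
first = topic.append("user-42", "debit 10")
second = topic.append("user-42", "debit 25")

isr = in_sync_replicas(
    leader=1,
    followers={2: (True, 95.0), 3: (True, 60.0), 4: (False, 99.0)},
    now=100.0,
    max_lag_seconds=10.0,
)
```

Because both messages for `user-42` land in one partition, their relative order is preserved by increasing offsets; messages with different keys carry no such guarantee.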

Beyond Basics…
Brokers
• Heavily rely on Linux PageCache
  • The I/O scheduler will batch together consecutive small writes into bigger physical writes which improves throughput.
  • The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head which improves throughput.
  • It automatically uses all the free memory on the machine
Clients
• Batch messages
  • Reduce network overhead
  • Allow efficient compression
• Load balance across the cluster via partitions
  • They talk to multiple nodes
• Utilize zero copy I/O using sendfile
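Why batching helps compression can be shown with a small experiment. The sketch below assumes JSON-encoded records and zlib compression; `BatchingProducer` is a hypothetical class, not a Kafka client, and the `send` callback stands in for a network call.

```python
import json
import zlib

class BatchingProducer:
    """Buffers records and sends them as one compressed payload per batch."""
    def __init__(self, send, batch_size=50):
        self._send = send
        self._batch_size = batch_size
        self._pending = []

    def produce(self, record):
        self._pending.append(record)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        payload = zlib.compress(json.dumps(self._pending).encode("utf-8"))
        self._send(payload)   # one request for the whole batch
        self._pending = []

records = [{"user": f"user-{i % 5}", "event": "page_view", "seq": i} for i in range(100)]

sent = []
producer = BatchingProducer(sent.append, batch_size=50)
for record in records:
    producer.produce(record)
producer.flush()

batched_bytes = sum(len(payload) for payload in sent)
unbatched_bytes = sum(
    len(zlib.compress(json.dumps(record).encode("utf-8"))) for record in records
)
```

Similar records share field names and values, so compressing them together removes redundancy that per-message compression cannot see, and the number of requests drops from 100 to 2.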

Kafka Cardinality: What is large?
• Brokers: 3->15 per Cluster
  • Common to start with 3-5
  • Largest clusters ~30-40 nodes
  • Having many clusters is common
• Topics: 1->100s per Cluster
• Partitions: 1->1000s per Topic
  • Clusters with up to 10k total partitions are workable. Beyond that we don't aggressively test.
• Consumer Groups: 1->100s active per Cluster
  • Could consume 1 to all topics

Large Messages
• Kafka is not designed for very large messages
  • Optimal performance ~10KB
• Could consider breaking up the messages/files into smaller chunks
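One way to break a large payload into smaller chunks is sketched below. The chunk envelope (`message_id`, `index`, `total`, `data`) and the 10 KB default are assumptions chosen to match the slide's rule of thumb; they are not a Kafka feature.

```python
import uuid

def split_message(payload, chunk_size=10 * 1024):
    """Split bytes into chunks that carry enough metadata to be reassembled."""
    message_id = uuid.uuid4().hex
    pieces = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)] or [b""]
    return [
        {"message_id": message_id, "index": i, "total": len(pieces), "data": piece}
        for i, piece in enumerate(pieces)
    ]

def reassemble(chunks):
    """Rebuild the original payload from all chunks of one message, in any order."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    if len({c["message_id"] for c in ordered}) != 1:
        raise ValueError("chunks belong to different messages")
    return b"".join(c["data"] for c in ordered)

payload = bytes(range(256)) * 100            # 25,600 bytes
chunks = split_message(payload)
restored = reassemble(list(reversed(chunks)))
```

Using `message_id` as the message key would send every chunk of a payload to the same partition, so a consumer sees them in order.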

Typical pipeline

Kafka + Apache Flume
• Kafka can be configured as a fast, reliable Flume Channel
• Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
Flume Sinks Consume from Kafka:
Write data to HDFS, HBase, or Search
Flume Sources Write to Kafka:
Read from logs, files, JMS, HTTP, RPC, Thrift, etc. and write events to Kafka

Data Processing
Leverage the right processing for your job
Data may require unique processing characteristics
▪ Batch
▪ Streaming
▪ Real-time
Hadoop arose to address one and now the ecosystem
is answering the rest.
▪ “We’re doubling down on Spark. We invested earliest,
and we’ve invested most, in making Hadoop
enterprise-grade” Doug Cutting

Processing System Latencies
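[Chart: processing systems placed along a latency scale marked ~50 ms, >500 ms, >30,000 ms, and >90,000 ms. Systems shown: custom applications, Samza/Storm, Flume Interceptors, Trident, Flink, Spark Streaming, Spark, Impala, Hive, and MR. Part of the scale is labeled Near Real-Time Processing.]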
Storm 101
• Open source project
• Fundamental abstraction - Streams, consisting of tuples
• Deployment - Nimbus (master) and Supervisors (workers)
• ZooKeeper (ZK) for membership and state
• Storm processes are stateless
• Applications defined by topologies
• Topologies consist of Spouts and Bolts
• Spout - source of stream
• Bolt - consumer of stream from Spouts. Outputs a stream
• Topology runs till you terminate it
• When nodes fail, Storm restarts them. You can set parallelism
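The spout and bolt roles can be pictured with plain Python generators. This is only a sketch of the data flow: it is not Storm's API, it runs in one process, and it uses a finite list of sentences where a real topology would run until terminated. The function names are hypothetical.

```python
from collections import Counter

def sentence_spout():
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in ["the quick fox", "the lazy dog", "the fox"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: consumes word tuples and emits a running count for each word."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def run_topology():
    # The topology is the wiring: spout -> split bolt -> count bolt.
    return list(count_bolt(split_bolt(sentence_spout())))

emitted = run_topology()
```

Each tuple flows through the bolts as soon as it is emitted, which is the tuple-at-a-time model that distinguishes Storm from micro-batch systems.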

Storm 101
• Work happens in a bolt at a tuple level
• 3 levels of guarantees
• At-least-once
• At-most-once
• Exactly-once (most expensive; needs Trident)
• New project @ Twitter - Heron

Typical pipeline

Flink 101
• Fundamentally a stream processing system
• Core abstraction: DataStreams
• Consume events from any streaming source
• Transformation operators - Map, FlatMap, Filter, Reduce, Fold, Window etc
• Fault tolerance based on Chandy-Lamport distributed snapshots
• Ensures exactly-once semantics
• Optimizes internally by chaining consecutive transformations together where possible
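The idea of chaining consecutive transformations can be sketched as function fusion: several per-event operators become one function, so an event passes through all of them in a single step. The helper names (`chain`, `map_op`, `filter_op`, `flat_map_op`) are hypothetical and this is not Flink's API; it only illustrates the concept.

```python
def map_op(fn):
    """Operator that emits exactly one output per input event."""
    return lambda event: [fn(event)]

def filter_op(predicate):
    """Operator that emits the event only if the predicate holds."""
    return lambda event: [event] if predicate(event) else []

def flat_map_op(fn):
    """Operator that emits zero or more outputs per input event."""
    return lambda event: list(fn(event))

def chain(*operators):
    """Fuse consecutive operators into a single per-event function."""
    def fused(event):
        events = [event]
        for op in operators:
            events = [out for e in events for out in op(e)]
        return events
    return fused

pipeline = chain(
    flat_map_op(str.split),
    map_op(str.lower),
    filter_op(lambda word: len(word) > 3),
)

lines = ["Stream Processing with Flink", "is fun"]
words = [out for line in lines for out in pipeline(line)]
```

Running the fused function avoids handing events between separate stages, which is the kind of saving the slide alludes to.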

Spark Streaming
What is it?
• Run continuous processing of data using
Spark’s core API
• Extends Spark concepts to fault-tolerant,
transformable streams
• Adds “rolling window” operations
• Example: Compute rolling averages or counts
for data over last five minutes
Benefits:
• Reuse knowledge and code in both contexts
• Same programming paradigm for streaming and
batch
• Simplicity of development
• High-level API with automatic DAG generation
• Excellent throughput
• Scale easily to support large volumes of data ingest
Common Use Cases:
• “On-the-fly” ETL as data is ingested into
Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for
incoming data
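The rolling-average example from this slide, sketched in plain Python rather than Spark's API. The `RollingAverage` class is hypothetical; it keeps only the events from the last `window_seconds` (five minutes by default) and assumes events arrive in timestamp order.

```python
from collections import deque

class RollingAverage:
    """Average of values seen in the last window_seconds, updated per event."""
    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()   # (timestamp, value) pairs inside the window
        self._total = 0.0

    def add(self, timestamp, value):
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)

rolling = RollingAverage(window_seconds=300)
avg_a = rolling.add(0, 10)     # window holds [10]
avg_b = rolling.add(100, 20)   # window holds [10, 20]
avg_c = rolling.add(350, 30)   # the event at t=0 has expired
```

In Spark Streaming the window slides over micro-batches instead of single events, but the bookkeeping is the same: add what entered the window and subtract what left it.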

DStreams

Typical pipeline

Key difference in approach
• Spark is a batch processing system that can approximate stream processing.
• Flink is a stream processing system that can look like a batch processor.
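The two approaches can be contrasted on the same input. In this sketch, which uses neither Spark nor Flink, the micro-batch path groups events into fixed intervals and runs a batch function on each group, while the per-event path updates state as each event arrives. All function names and the sample events are assumptions for illustration.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    return [batches[i] for i in sorted(batches)]

def process_as_batches(events, batch_seconds, batch_fn):
    """Batch engine approximating a stream: one result per interval."""
    return [batch_fn(batch) for batch in micro_batches(events, batch_seconds)]

def process_per_event(events, update, state):
    """Stream engine: one result per event, available as soon as it arrives."""
    results = []
    for _, value in events:
        state = update(state, value)
        results.append(state)
    return results

events = [(0.2, 1), (0.9, 2), (1.1, 3), (2.5, 4), (2.7, 5)]
batch_sums = process_as_batches(events, batch_seconds=1, batch_fn=sum)
running_sums = process_per_event(events, update=lambda total, v: total + v, state=0)
```

The batch interval sets a floor on latency in the first approach; the second has no such floor but must manage state event by event.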

The Spark Ecosystem & Hadoop
[Platform diagram]
Batch & Stream: Spark (Spark Streaming, Spark SQL, DataFrames, MLlib, …)
SQL: Impala
Search: Solr
SDK: Kite
Unified Services: Resource Management (YARN); Security (Sentry, RecordService)
Store: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase)
Integrate: Structured (Sqoop); Unstructured (Kafka, Flume)

Architectural Patterns

One source, one destination

Multiple sources, one destination

Multiple sources, multiple destinations

Hypothetical Anomaly Detection System
• Definition of anomalous activity:
• Amount > previous Max amount (per event decision)
• Location is different than what mobile device suggests (per event
decision)
• >2 transactions in the last 10 seconds (window based decision)

Architecture

Architectural patterns
• Bus centric: Kafka is front and center
• Data hub centric: HDFS is front and center
• Hybrid: Best tool for the job

Real world - Hybrid
[Architecture diagram: Cloudera Enterprise Data Hub]
Data Sources: EDW, OLTP DB, Custom Apps, Analytical Tools
Ingestion: Kafka (stream ingestion via pub-sub); Sqoop (RDBMS integration)
Preparation and Analytics: Spark (stream processing, iterative processing, machine learning); MapReduce (deep, batch processing); Impala (fast analytics); Cloudera Search (real-time search); HBase (NoSQL)
Unified Cluster Management Suite: Cluster Management (Cloudera Manager); Security (Sentry, Record Service); Metadata & Governance (Cloudera Navigator)
Flexible Deployment Options: On-Premise; Cloud (Cloudera Director)

Cloudera makes Data Processing Fast, Easy, & Secure.
The leading big data platform from the leaders in enterprise Hadoop.
Fast
Leadership in Kafka and Spark to help turn processing windows from hours to minutes.
Easy
Deliver optimum system utilization and meet SLA commitments, on-premises or in the cloud, with minimum effort.
Secure
End-to-end Security, Governance, and Data Management

Getting Started is Easy
1. Visit our Data Engineering webpage
2. Sign up for Spark Training
3. Contact us to start a POC

Thank you
