1© Cloudera, Inc. All rights reserved.
Beyond ETL: End to End
Streaming Architectures
Your Speakers
Amandeep Khurana
Solutions Architect
Sean Anderson
Product Marketing
Agenda
Trends
I. Traditional Architectures
II. New Solutions
Primer on Streaming Systems
I. Kafka
II. Flume
III. Storm
IV. Spark Streaming
V. Flink
Typical Architectures
I. Bus centric
II. File system centric
III. Hybrid
Poll
How many are familiar with the Hadoop ecosystem components?
• Little/no familiarity
• Starting out
• Very familiar, not in production
• Using in production
Disruption in the Space
Explosion of Unstructured Data
16 billion connected devices generating more data than ever. Data is driving modern businesses.
Data Warehouse Limitations
Popular data warehouse platforms were not designed to handle the scale of modern data.
Increased Processing Demands
Processing technology struggling to keep up with increased data and new real-time formats.
Traditional Pipelines
Challenges with a Traditional Solution
[Diagram: traditional pipeline. Labels: Data Sources, Structured Data, Unstructured Data, Ingest, Staging Environments, ETL, ELT, Process, Load, Enterprise Data Warehouse, Model, Serve, Applications, BI System, Modeling, Reporting, Storage, Archive. Numbered callouts 1 to 3 mark the challenges below.]
1) Limited Data
2) Poor Performance
3) Operational Complexity
What is the impact to the business?
Limited Data Access
• Archived data is inaccessible
• Streaming data not captured
• Unstructured data not captured
Missed Processing SLAs
• Data ingest/transformations exceeding
processing windows
• New projects abandoned due to workload
• Decreased Engineering and Data Science
productivity
Poor ROI
• High Data Warehouse/RDBMS Costs
• Data archived to save cost
• Increased performance drag on analytic systems
Operational Fragmentation
• Separate platforms for batch and
stream processing
• Separate security, governance, and
management
• Insufficient access for developers
Data Ingestion and Processing at
Cloudera
Cloudera Enterprise, A New Way Forward
Poll
Which streaming systems have you heard of?
• Kafka
• Storm
• Spark Streaming
• Samza
• Flink
• Kafka Streams
• Flume
Streaming Systems and
Architectures
Ingestion
The foundation of your data platform
Data can come from a variety of “siloed” sources
▪ Existing databases
▪ Sensor data
▪ Server logs
▪ Chat transcripts
Value of data is multiplied when combined and
correlated with other data
▪ “40% value improvement from combining data from
multiple IoT sources” McKinsey Global Institute
Apache Sqoop
SQL to Hadoop
Efficiently exchange data between
database and Hadoop
• Bidirectional
• Import all or partial/new data
• Export for shared data access across
systems
Easily get started with high
performance connectors
• Free to use
• Optimized connectors for popular RDBMS,
EDW, and NoSQL options
[Platform diagram layers: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]
Streaming Systems in Hadoop
• Tightly coupled ingestion: Flume
• General purpose bus: Kafka
• Processing: Spark Streaming, Storm, Flink, Samza
Apache Flume
Log & Event Aggregation for Hadoop
• Efficiently move large amounts of streaming/log data
• Easily collect data from multiple systems (sources)
• Built-in sources, sinks, and channels
• Customize data flow to transform data on-the-fly
• Reliable, scalable, and extensible for production
• Manage and monitor with Cloudera Manager
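The source, channel, and sink roles listed on this slide can be illustrated with a small in-memory sketch. This is not Flume code and does not use Flume's API; the function names, the `web-01` host value, and the list used as a destination are assumptions made up for illustration. It only shows the shape of the flow: a source wraps input in events, interceptors transform or drop them on the fly, a channel buffers them, and a sink drains the channel.

```python
from collections import deque


def drop_debug(event):
    # Interceptor: returning None drops the event.
    return None if event["body"].startswith("DEBUG") else event


def tag_host(event):
    # Interceptor: enrich the event with a header (the host name is made up).
    event["headers"]["host"] = "web-01"
    return event


def run_agent(lines, interceptors):
    channel = deque()  # buffer between the source side and the sink side
    # Source side: wrap each line in an event and pass it through interceptors.
    for line in lines:
        event = {"headers": {}, "body": line}
        for intercept in interceptors:
            event = intercept(event)
            if event is None:
                break
        if event is not None:
            channel.append(event)
    # Sink side: drain the channel into the destination (a plain list here).
    destination = []
    while channel:
        destination.append(channel.popleft())
    return destination


delivered = run_agent(
    ["INFO user login", "DEBUG cache miss", "ERROR disk full"],
    [drop_debug, tag_host],
)
```

In a real agent the channel would be durable and the source and sink would run concurrently; the sketch runs them one after the other to stay short.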
Typical pipeline
Apache Kafka
Pub-Sub Messaging for Hadoop
Backbone for real-time architectures
• Fast, flexible messaging for a wide range of use
cases
• Scale to support more data sources and
growing data volumes
• Zero data loss durability and always-on fault-tolerance
• Built-in security and data protection
Seamless integration across the platform
• Connect to Flume, Spark Streaming, HBase, and
more
• Manage and monitor with Cloudera Manager
Kafka is a Publish-Subscribe Messaging System
What is a pub-sub messaging system?
• The broker sits between producers of data and consumers of data
• Producers don't need to know who will consume the data or make sure consumers receive it
• Consumers read from the broker and don't talk to the producers
• The broker makes sure data is delivered fast and reliably
Decouple Data Pipelines
[Diagram: two layouts of four producers and four consumers; one of them places a Pub-Sub Broker between the producers and the consumers.]
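The decoupling described on this slide can be sketched with a toy in-memory broker. The `Broker` class, its `publish` and `poll` methods, and the `clicks` topic are hypothetical names for illustration only and are not Kafka's client API. The point is that producers only address a topic, and each consumer pulls from the broker at its own position.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: keeps an append-only list of messages per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers only know the topic name, never the consumers.
        self._topics[topic].append(message)

    def poll(self, topic, position):
        # Consumers pull everything from the position they have reached.
        return self._topics[topic][position:]


broker = Broker()
broker.publish("clicks", "page=/home")
broker.publish("clicks", "page=/cart")

# Two independent consumers read the same topic at their own pace.
reporting_view = broker.poll("clicks", 0)
alerting_view = broker.poll("clicks", 1)
```

Adding a third consumer requires no change on the producer side, which is the property the slide calls decoupling.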
The Basics
• Messages are organized into topics
• Topics are broken into partitions
• Partitions are replicated across the brokers as replicas
• Kafka runs in a cluster. Nodes are called brokers
• Producers push messages
• Consumers pull messages
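To make the relationship between partitions, replicas, and brokers concrete, here is a minimal sketch that spreads each partition's replicas over distinct brokers in round-robin order. The function name `assign_replicas` and the broker names are assumptions; Kafka's actual assignment logic is more involved, so treat this only as an illustration of the idea that every partition lives on several brokers.

```python
def assign_replicas(num_partitions, brokers, replication_factor):
    """Place each partition's replicas on distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed broker count")
    layout = {}
    for partition in range(num_partitions):
        layout[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return layout


layout = assign_replicas(
    num_partitions=4, brokers=["b0", "b1", "b2"], replication_factor=2
)
```

With three brokers and a replication factor of 2, losing any single broker still leaves one copy of every partition.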
Beyond Basics…
Replicas
• A partition has 1 leader replica. The others are followers.
• Followers are considered in-sync when:
  • The replica is alive
  • The replica is not “too far” behind the leader (configurable)
• The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)
• Replicas map to physical locations on a broker
Messages
• Can optionally be keyed in order to map to a static partition
  • Used if ordering within a partition is needed
  • Avoid otherwise (extra complexity, skew, etc.)
• Location of a message is denoted by its topic, partition & offset
• A partition's offset increases as messages are appended
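The two ideas on this slide, keyed messages and in-sync replicas, can be sketched as follows. Everything here is an assumption for illustration: the partition count, the use of CRC32 as the hash (Kafka's Java client uses a different hash), and a message-count lag threshold as the meaning of "too far behind". The sketch shows that the same key always lands in the same partition, that the offset grows by one per append, and that a follower drops out of the ISR when it is dead or lagging.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical topic size
log = {p: [] for p in range(NUM_PARTITIONS)}


def partition_for(key):
    # Stable hash, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def append(topic, key, value):
    partition = partition_for(key)
    log[partition].append(value)
    offset = len(log[partition]) - 1  # offsets grow as messages are appended
    return (topic, partition, offset)


def in_sync_followers(leader_end_offset, follower_offsets, max_lag):
    # A follower is in sync if it is alive (offset is not None)
    # and not more than max_lag messages behind the leader.
    return [
        replica
        for replica, offset in follower_offsets.items()
        if offset is not None and leader_end_offset - offset <= max_lag
    ]


first = append("payments", "card-42", "authorize")
second = append("payments", "card-42", "capture")
isr = in_sync_followers(100, {"b1": 98, "b2": 40, "b3": None}, max_lag=10)
```

Because both `card-42` messages share a partition, a consumer of that partition sees them in the order they were written.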
Beyond Basics…
Brokers
• Heavily rely on Linux PageCache
  • The I/O scheduler will batch together consecutive small writes into bigger physical writes, which improves throughput.
  • The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head, which improves throughput.
  • The page cache automatically uses all the free memory on the machine
Clients
• Batch messages
  • Reduce network overhead
  • Allow efficient compression
• Load balance across the cluster via partitions
  • They talk to multiple nodes
• Utilize zero copy I/O using sendfile
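The claim that batching allows efficient compression can be checked with a short experiment. The message shape below is invented, and `zlib` stands in for whatever codec a real client would use. The sketch compresses 200 small messages one by one and then as a single batch, and compares the total sizes.

```python
import json
import zlib

# 200 small, similar messages (field names are made up for the example).
messages = [
    json.dumps({"sensor": "s-%d" % (i % 5), "status": "ok", "reading": i}).encode("utf-8")
    for i in range(200)
]

# Compressing each message alone pays the codec overhead 200 times
# and cannot exploit repetition across messages.
individually = sum(len(zlib.compress(m)) for m in messages)

# Compressing the whole batch lets the codec reuse repeated field names.
batch = zlib.compress(b"\n".join(messages))
batched = len(batch)

# The batch still decodes back to the original messages.
restored = zlib.decompress(batch).split(b"\n")
```

The batch also needs only one network request instead of 200, which is the other saving the slide lists.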
Kafka Cardinality: What is large?
• Brokers: 3->15 per Cluster
  • Common to start with 3-5
  • Largest clusters ~30-40 nodes
  • Having many clusters is common
• Topics: 1->100s per Cluster
• Partitions: 1->1000s per Topic
  • Clusters with up to 10k total partitions are workable. Beyond that we don't aggressively test.
• Consumer Groups: 1->100s active per Cluster
  • Could consume 1 to all topics
Large Messages
• Kafka is not designed for very large messages
  • Optimal performance ~10KB
• Could consider breaking up the messages/files into smaller chunks
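One way to break a large payload into smaller messages, as the slide suggests, is sketched below. The chunk envelope (`id`, `index`, `total`, `data`) and the function names are assumptions for illustration, not a Kafka feature. The 10 KB chunk size follows the figure quoted on the slide.

```python
import uuid

CHUNK_SIZE = 10 * 1024  # ~10 KB, the size the slide quotes for optimal performance


def split_into_chunks(payload, chunk_size=CHUNK_SIZE):
    """Split a payload into chunks that carry enough metadata to be rejoined."""
    message_id = uuid.uuid4().hex
    pieces = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)] or [b""]
    return [
        {"id": message_id, "index": i, "total": len(pieces), "data": piece}
        for i, piece in enumerate(pieces)
    ]


def reassemble(chunks):
    """Rebuild the payload; chunks may arrive in any order."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(c["data"] for c in ordered)


payload = bytes(range(256)) * 100  # 25,600 bytes
chunks = split_into_chunks(payload)
rebuilt = reassemble(list(reversed(chunks)))
```

If the chunks are produced with the shared `id` as the message key, they land in one partition and keep their order, which simplifies reassembly on the consumer side.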
Typical pipeline
Kafka + Apache Flume
• Kafka can be configured as a fast, reliable Flume Channel
• Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
Flume Sinks Consume from Kafka:
Write data to HDFS, HBase, or Search
Flume Sources Write to Kafka:
Read from logs, files, jms, http, rpc, thrift, etc and
write events to Kafka
Data Processing
Leverage the right processing for your job
Data may require unique processing characteristics
▪ Batch
▪ Streaming
▪ Real-time
Hadoop arose to address one and now the ecosystem
is answering the rest.
▪ “We’re doubling down on Spark. We invested earliest,
and we’ve invested most, in making Hadoop
enterprise-grade” Doug Cutting
Processing System Latencies
[Figure: processing systems arranged by latency, with axis marks at ~50 ms, >500 ms, >30,000 ms and >90,000 ms and a region labeled Near Real-Time Processing. Systems shown: Custom, Samza/Storm, Flume Interceptors, Trident, Flink, Spark Streaming, Hive, Spark, Impala, MR.]
Storm 101
• Open source project
• Fundamental abstraction - Streams, consisting of tuples
• Deployment - Nimbus (master) and Supervisors (workers)
• ZK for membership and state
• Storm processes are stateless
• Applications defined by topologies
• Topologies consist of Spouts and Bolts
• Spout - source of stream
• Bolt - consumer of stream from Spouts. Outputs a stream
• Topology runs till you terminate it
• When nodes fail, Storm restarts them. You can set parallelism
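The spout and bolt vocabulary above can be illustrated with plain Python generators. This is not Storm's API; the function names and the word-count task are assumptions chosen for the example. A spout produces tuples, each bolt consumes one stream and emits another, and the topology is the wiring between them.

```python
def sentence_spout():
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in ["the quick brown fox", "the lazy dog"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits (word, count)."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# The topology is the wiring: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

A real topology runs until it is terminated and spreads each spout and bolt over many parallel tasks; this sketch uses a finite input and a single thread so that it ends.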
Storm 101
• Work happens in a bolt at a tuple level
• 3 levels of guarantees
• At-least-once
• At-most-once
• Exactly-once (most expensive; needs Trident)
• New project @ Twitter - Heron
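The difference between at-most-once and at-least-once can be shown with a simulated bolt that fails in two ways. The failure pattern and all names are hypothetical and exist only to make the outcome deterministic: item `b` fails before producing output on its first attempt, and item `c` fails after producing output but before acknowledging.

```python
def make_bolt(output):
    def bolt(item, attempt):
        if item == "b" and attempt == 1:
            raise RuntimeError("crashed before emitting")
        output.append(item)
        if item == "c" and attempt == 1:
            raise RuntimeError("crashed after emitting, before ack")
    return bolt


def run(items, replay_on_failure):
    output = []
    bolt = make_bolt(output)
    for item in items:
        attempt = 1
        while True:
            try:
                bolt(item, attempt)
                break  # processed and acknowledged
            except RuntimeError:
                if not replay_on_failure:
                    break  # at-most-once: the tuple is never replayed
                attempt += 1  # at-least-once: replay until acknowledged
    return output


at_most_once = run(["a", "b", "c"], replay_on_failure=False)
at_least_once = run(["a", "b", "c"], replay_on_failure=True)
```

At-most-once loses `b`, while at-least-once delivers everything but emits `c` twice. Exactly-once has to add deduplication or transactional state on top of replay, which is why the slide calls it the most expensive option.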
Typical pipeline
Flink 101
• Fundamentally a stream processing system
• Core abstraction: DataStreams
• Consume events from any streaming source
• Transformation operators - Map, FlatMap, Filter, Reduce, Fold, Window etc
• Fault tolerance based on Chandy Lamport distributed snapshots
• Ensures exactly-once semantics
• Optimizes internally by chaining consecutive transformations together where possible
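The last bullet, combining consecutive transformations, can be sketched in plain Python. This does not use Flink's API; `chain`, `map_op`, `filter_op`, and `flat_map_op` are hypothetical helpers. The idea shown is that several operators are fused into one function, so each event passes through all of them in a single step instead of being handed between separate stages.

```python
def map_op(fn):
    return lambda item: [fn(item)]


def filter_op(predicate):
    return lambda item: [item] if predicate(item) else []


def flat_map_op(fn):
    return lambda item: list(fn(item))


def chain(*operators):
    """Fuse operators into one function applied to each incoming event."""
    def fused(event):
        outputs = [event]
        for op in operators:
            outputs = [out for item in outputs for out in op(item)]
        return outputs
    return fused


pipeline = chain(
    flat_map_op(str.split),             # line -> words
    map_op(str.lower),                  # normalize case
    filter_op(lambda w: len(w) > 3),    # keep longer words only
)

result = [word for line in ["Beyond ETL", "End to End Streaming"] for word in pipeline(line)]
```

Each operator returns a list of zero or more outputs, which lets map, filter, and flat-map share one calling convention inside the fused function.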
Spark Streaming
What is it?
• Run continuous processing of data using
Spark’s core API
• Extends Spark concepts to fault-tolerant,
transformable streams
• Adds “rolling window” operations
• Example: Compute rolling averages or counts
for data over last five minutes
Benefits:
• Reuse knowledge and code in both contexts
• Same programming paradigm for streaming and
batch
• Simplicity of development
• High-level API with automatic DAG generation
• Excellent throughput
• Scale easily to support large volumes of data ingest
Common Use Cases:
• “On-the-fly” ETL as data is ingested into
Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for
incoming data
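The rolling-window example mentioned on this slide (an average over the last five minutes) can be sketched without Spark. The `RollingAverage` class is an assumption for illustration and works event by event, whereas Spark Streaming evaluates windows over micro-batches. Timestamps are in seconds.

```python
from collections import deque


class RollingAverage:
    """Average of the values seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value), oldest first
        self._total = 0.0

    def add(self, timestamp, value):
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that have fallen out of the window.
        while self._events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


window = RollingAverage(window_seconds=300)
averages = [window.add(t, v) for t, v in [(0, 10), (100, 20), (250, 30), (400, 40)]]
```

By the fourth event the first two readings are more than five minutes old, so only the last two contribute to the average.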
DStreams
Typical pipeline
Key difference in approach
• Spark is a batch processing system that can approximate stream processing.
• Flink is a stream processing system that can look like a batch processor.
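The contrast between the two approaches can be sketched with a few timestamped events. The function name and the one-second batch interval are assumptions for illustration. A micro-batch engine groups events into small fixed-interval batches and processes each batch as a unit, while a per-event engine handles every event as it arrives.

```python
from itertools import groupby


def micro_batches(events, batch_interval):
    """Group time-ordered (timestamp, value) events into fixed-interval batches."""
    keyed = groupby(events, key=lambda e: int(e[0] // batch_interval))
    return [(batch_id, [value for _, value in group]) for batch_id, group in keyed]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]

# Micro-batch view: results appear once per interval.
batches = micro_batches(events, batch_interval=1.0)

# Per-event view: each event is processed on its own as it arrives.
per_event = [value for _, value in events]
```

In the micro-batch view event `a` is not processed until its batch closes, so the batch interval sets a floor on latency; the per-event view has no such floor.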
The Spark Ecosystem & Hadoop
[Diagram: platform layers]
• INTEGRATE: Structured (Sqoop); Unstructured (Kafka, Flume)
• STORE: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase)
• UNIFIED SERVICES: Resource Management (YARN); Security (Sentry, RecordService)
• Batch & Stream: Spark; SQL: Impala; Search: Solr; SDK: Kite
• Spark components: Spark Streaming, Spark SQL, DataFrames, MLlib, …
Architectural Patterns
One source, one destination
Multiple sources, one destination
Multiple sources, multiple destinations
Hypothetical Anomaly Detection System
• Definition of anomalous activity:
• Amount > previous Max amount (per event decision)
• Location is different than what mobile device suggests (per event
decision)
• >2 transactions in the last 10 seconds (window based decision)
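The three rules above can be sketched as a small stateful detector. The class name, field names, and reason strings are assumptions for illustration; the thresholds follow the slide. The first two rules are decided per event, and the third needs a per-account window of recent timestamps.

```python
from collections import defaultdict, deque


class AnomalyDetector:
    def __init__(self, window_seconds=10, max_in_window=2):
        self.window_seconds = window_seconds
        self.max_in_window = max_in_window
        self._max_amount = {}                 # account -> largest amount seen so far
        self._recent = defaultdict(deque)     # account -> recent timestamps

    def check(self, account, timestamp, amount, txn_location, device_location):
        reasons = []

        # Rule 1 (per event): amount exceeds the previous maximum.
        previous_max = self._max_amount.get(account)
        if previous_max is not None and amount > previous_max:
            reasons.append("amount_above_previous_max")
        if previous_max is None or amount > previous_max:
            self._max_amount[account] = amount

        # Rule 2 (per event): location differs from what the device suggests.
        if txn_location != device_location:
            reasons.append("location_mismatch")

        # Rule 3 (window based): more than 2 transactions in the last 10 seconds.
        recent = self._recent[account]
        recent.append(timestamp)
        while recent[0] <= timestamp - self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_in_window:
            reasons.append("too_many_transactions")

        return reasons


detector = AnomalyDetector()
r1 = detector.check("acct-1", 0, 50, "NYC", "NYC")
r2 = detector.check("acct-1", 3, 80, "NYC", "NYC")
r3 = detector.check("acct-1", 5, 20, "NYC", "SFO")
r4 = detector.check("acct-1", 30, 10, "NYC", "NYC")
```

The per-event rules only need a small lookup per account, while the window rule needs state that is updated on every event, which is the kind of work a stream processor is meant to host.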
Architecture
Architectural patterns
• Bus centric: Kafka is front and center
• Data hub centric: HDFS is front and center
• Hybrid: Best tool for the job
Real world - Hybrid
[Diagram: Cloudera Enterprise Data Hub in a hybrid architecture. Recoverable labels:]
• Stages: Ingestion, Preparation, Analytics
• Data Sources, Custom Apps, EDW, OLTP DB, Analytical Tools
• Kafka: stream ingestion via pub-sub
• Sqoop: RDBMS integration
• Spark: stream processing, iterative processing, machine learning
• MapReduce: deep, batch processing
• Impala: fast analytics
• Cloudera Search: real-time search
• HBase: NoSQL
• Unified Cluster Management Suite: Cluster Management (Cloudera Manager), Security (Sentry, RecordService), Metadata & Governance (Cloudera Navigator)
• Flexible Deployment Options: On-Premise, Cloud (Cloudera Director)
Cloudera makes Data Processing Fast, Easy, & Secure.
Fast
Leadership in Kafka and Spark to help turn
processing windows from hours to
minutes.
Secure
End-to-end Security, Governance, and
Data Management
The leading big data platform from the leaders in enterprise Hadoop.
Easy
Deliver optimum system utilization and
meet SLA commitments, on-premises or
in the cloud, with minimum effort.
Getting Started is Easy
1. Visit our Data Engineering webpage
2. Sign up for Spark training
3. Contact us to start a POC
Thank you

More Related Content

What's hot

Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTCloudera, Inc.
 
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Cloudera, Inc.
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Cloudera, Inc.
 
Data Drive Applications_Webinar
Data Drive Applications_WebinarData Drive Applications_Webinar
Data Drive Applications_WebinarSean Spediacci
 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersCloudera, Inc.
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...Cloudera, Inc.
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...Cloudera, Inc.
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18Cloudera, Inc.
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Cloudera, Inc.
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchCloudera, Inc.
 
Relying on Data for Strategic Decision-Making--Financial Services Experience
Relying on Data for Strategic Decision-Making--Financial Services ExperienceRelying on Data for Strategic Decision-Making--Financial Services Experience
Relying on Data for Strategic Decision-Making--Financial Services ExperienceCloudera, Inc.
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureCloudera, Inc.
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseCloudera, Inc.
 
Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%
Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%
Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%Cloudera, Inc.
 
RecordService for Unified Access Control
RecordService for Unified Access ControlRecordService for Unified Access Control
RecordService for Unified Access ControlCloudera, Inc.
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Cloudera, Inc.
 

What's hot (20)

Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BT
 
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
 
Data Drive Applications_Webinar
Data Drive Applications_WebinarData Drive Applications_Webinar
Data Drive Applications_Webinar
 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game Changers
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)Building a Data Hub that Empowers Customer Insight (Technical Workshop)
Building a Data Hub that Empowers Customer Insight (Technical Workshop)
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with Search
 
Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Relying on Data for Strategic Decision-Making--Financial Services Experience
Relying on Data for Strategic Decision-Making--Financial Services ExperienceRelying on Data for Strategic Decision-Making--Financial Services Experience
Relying on Data for Strategic Decision-Making--Financial Services Experience
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the Enterprise
 
Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%
Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%
Kelley Blue Book Uses Big Data to Increase User Engagement Over 100%
 
RecordService for Unified Access Control
RecordService for Unified Access ControlRecordService for Unified Access Control
RecordService for Unified Access Control
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8Building a Modern Analytic Database with Cloudera 5.8
Building a Modern Analytic Database with Cloudera 5.8
 

Viewers also liked

Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)ucelebi
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallSpark Summit
 
Building financial systems in scala
Building financial systems in scalaBuilding financial systems in scala
Building financial systems in scalaoxbow_lakes
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
 
Non-blocking I/O, Event loops and node.js
Non-blocking I/O, Event loops and node.jsNon-blocking I/O, Event loops and node.js
Non-blocking I/O, Event loops and node.jsMarcus Frödin
 
Introduction to node.js
Introduction to node.jsIntroduction to node.js
Introduction to node.jsjacekbecela
 
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...Amazon Web Services
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduCloudera, Inc.
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 
(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014
(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014
(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014Amazon Web Services
 
Node js presentation
Node js presentationNode js presentation
Node js presentationmartincabrera
 

Viewers also liked (20)

Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Building financial systems in scala
Building financial systems in scalaBuilding financial systems in scala
Building financial systems in scala
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Non-blocking I/O, Event loops and node.js
Non-blocking I/O, Event loops and node.jsNon-blocking I/O, Event loops and node.js
Non-blocking I/O, Event loops and node.js
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Introduction to node.js
Introduction to node.jsIntroduction to node.js
Introduction to node.js
 
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016  Webi...
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
 
Spark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug GrallSpark Summit EU talk by Tug Grall
Spark Summit EU talk by Tug Grall
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014
(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014
(PFC403) Maximizing Amazon S3 Performance | AWS re:Invent 2014
 
Node js introduction
Node js introductionNode js introduction
Node js introduction
 
Node js presentation
Node js presentationNode js presentation
Node js presentation
 







End to End Streaming Architectures

  • 1. 1© Cloudera, Inc. All rights reserved. Beyond ETL: End to End Streaming Architectures
  • 2. Your Speakers
      • Amandeep Khurana, Solutions Architect
      • Sean Anderson, Product Marketing
  • 3. Agenda
      • Trends: I. Traditional Architectures; II. New Solutions
      • Primer on Streaming Systems: I. Kafka; II. Flume; III. Storm; IV. Spark Streaming; V. Flink
      • Typical Architectures: I. Bus centric; II. File system centric; III. Hybrid
  • 4. Poll: How many are familiar with the Hadoop ecosystem components?
      • Little/no familiarity
      • Starting out
      • Very familiar, not in production
      • Using in production
  • 5. Disruption in the Space
      • Explosion of Unstructured Data: 16 billion connected devices generating more data than ever. Data is driving modern businesses.
      • Data Warehouse Limitations: Popular data warehouse platforms were not designed to handle the scale of modern data.
      • Increased Processing Demands: Processing technology struggling to keep up with increased data and new real-time formats.
  • 6. Traditional Pipelines
  • 7. Challenges with a Traditional Solution: 1) Limited Data
      [Diagram of a traditional pipeline. Labels: Data Sources, Structured Data, Unstructured Data, Ingest, Staging Environments, ETL, ELT, Process, Load, Enterprise Data Warehouse, Model, Serve, Applications, BI System, Modeling, Reporting, Storage, Archive. Callout 1 is marked on the diagram.]
  • 8. Challenges with a Traditional Solution: 1) Limited Data; 2) Poor Performance
      [Same diagram as slide 7, with callouts 1 and 2 marked.]
  • 9. Challenges with a Traditional Solution: 1) Limited Data; 2) Poor Performance; 3) Operational Complexity
      [Same diagram as slide 7, with callouts 1, 2, and 3 marked.]
  • 10. What is the impact to the business?
      • Limited Data Access
          ▪ Archived data is inaccessible
          ▪ Streaming data not captured
          ▪ Unstructured data not captured
      • Missed Processing SLAs
          ▪ Data ingest/transformations exceeding processing windows
          ▪ New projects abandoned due to workload
          ▪ Decreased Engineering and Data Science productivity
      • Poor ROI
          ▪ High Data Warehouse/RDBMS costs
          ▪ Data archived to save cost
          ▪ Increased performance drag on analytic systems
      • Operational Fragmentation
          ▪ Separate platforms for batch and stream processing
          ▪ Separate security, governance, and management
          ▪ Insufficient access for developers
  • 11. Data Ingestion and Processing at Cloudera
  • 12. Cloudera Enterprise, A New Way Forward
  • 13. Poll: Which streaming systems have you heard of?
      • Kafka
      • Storm
      • Spark Streaming
      • Samza
      • Flink
      • Kafka Streams
      • Flume
  • 14. Streaming Systems and Architectures
  • 15. Ingestion: The foundation of your data platform
      • Data can come from a variety of “siloed” sources
          ▪ Existing databases
          ▪ Sensor data
          ▪ Server logs
          ▪ Chat transcripts
      • Value of data is multiplied when combined and correlated with other data
          ▪ “40% value improvement from combining data from multiple IoT sources” (McKinsey Global Institute)
  • 16. Apache Sqoop: SQL to Hadoop
      • Efficiently exchange data between database and Hadoop
          ▪ Bidirectional
          ▪ Import all or partial/new data
          ▪ Export for shared data access across systems
      • Easily get started with high performance connectors
          ▪ Free to use
          ▪ Optimized connectors for popular RDBMS, EDW, and NoSQL options
      [Platform diagram labels: Operations; Data Management; Unified Services; Process, Analyze, Serve; Store; Integrate]
  • 17. Streaming Systems in Hadoop
      • Tightly coupled ingestion: Flume
      • General purpose bus: Kafka
      • Processing: Spark Streaming, Storm, Flink, Samza
  • 18. Apache Flume: Log & Event Aggregation for Hadoop
      • Efficiently move large amounts of streaming/log data
      • Easily collect data from multiple systems (sources)
      • Built-in sources, sinks, and channels
      • Customize data flow to transform data on-the-fly
      • Reliable, scalable, and extensible for production
      • Manage and monitor with Cloudera Manager
  • 19. Typical pipeline
  • 20. Apache Kafka: Pub-Sub Messaging for Hadoop
      • Backbone for real-time architectures
          ▪ Fast, flexible messaging for a wide range of use cases
          ▪ Scale to support more data sources and growing data volumes
          ▪ Zero data loss durability and always-on fault-tolerance
          ▪ Built-in security and data protection
      • Seamless integration across the platform
          ▪ Connect to Flume, Spark Streaming, HBase, and more
          ▪ Manage and monitor with Cloudera Manager
  • 21. Kafka is a Publish-Subscribe Messaging System
      • What is a pub-sub messaging system?
          ▪ It acts as a broker between producers of data and consumers of data
          ▪ Producers don’t worry about who will consume the data, or about making sure consumers get it
          ▪ Consumers consume from the broker and don’t talk to the producers
          ▪ The broker makes sure data is delivered fast and reliably
      • Decouple Data Pipelines
      [Diagram labels: Producer (×8), Consumer (×8), Pub-Sub Broker]
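
The decoupling described on slide 21 can be illustrated with a toy in-memory broker. This is only a sketch for intuition, not Kafka's API: the `Broker` class, its `publish` and `poll` methods, and the topic and consumer names are all hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker (hypothetical, not Kafka's API).

    Keeps one append-only log per topic and one read position per
    (consumer, topic) pair, so producers and consumers never interact.
    """

    def __init__(self):
        self._logs = defaultdict(list)      # topic -> list of messages
        self._positions = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        # Producers only append; they do not know who will read.
        self._logs[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        # Consumers pull from the broker at their own pace.
        start = self._positions[(consumer, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._positions[(consumer, topic)] = start + len(batch)
        return batch


broker = Broker()
broker.publish("clicks", "page=/home")
broker.publish("clicks", "page=/cart")
print(broker.poll("billing", "clicks"))                     # both messages
print(broker.poll("analytics", "clicks", max_messages=1))   # first message only
```

Each consumer keeps its own position, so adding a new consumer requires no change on the producer side, which is the point the slide makes about decoupled pipelines.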
  • 22. The Basics
      • Messages are organized into topics
      • Topics are broken into partitions
      • Partitions are replicated across the brokers as replicas
      • Kafka runs in a cluster. Nodes are called brokers
      • Producers push messages
      • Consumers pull messages
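
As a rough mental model of the topic, partition, and offset vocabulary on slide 22, the sketch below treats each partition as an append-only list whose index is the offset. The `Topic` class and its method names are hypothetical, and replication across brokers is deliberately not modeled.

```python
class Topic:
    """Toy model of a topic split into partitions (hypothetical, single process)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an independent append-only log.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Producer side: push a message and return the offset it was stored at."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def fetch(self, partition, offset, max_messages=100):
        """Consumer side: pull messages starting from a given offset."""
        return self.partitions[partition][offset:offset + max_messages]


events = Topic("events", num_partitions=2)
first = events.append(0, b"login")
second = events.append(0, b"logout")
other = events.append(1, b"purchase")
print(first, second, other)      # offsets are per partition
print(events.fetch(0, offset=1))
```

Offsets are counted separately in each partition, which is why a message is located by topic, partition, and offset together.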
  • 23. Beyond Basics…
      • Replicas
          • A partition has 1 leader replica. The others are followers.
          • Followers are considered in-sync when the replica is alive and the replica is not “too far” behind the leader (configurable)
          • The group of in-sync replicas for a partition is called the ISR (In-Sync Replicas)
          • Replicas map to physical locations on a broker
      • Messages
          • Can optionally be keyed in order to map to a static partition
          • Keys are used if ordering within a partition is needed
          • Avoid keys otherwise (extra complexity, skew, etc.)
          • Location of a message is denoted by its topic, partition & offset
          • A partition's offset increases as messages are appended
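A rough sketch of how keying pins messages to a partition, and how a message's location ends up as (topic, partition, offset). The `partition_for` helper is hypothetical, and `zlib.crc32` is only a stand-in for a stable hash; real Kafka clients use their own partitioning function.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (crc32 is a stand-in)."""
    return zlib.crc32(key) % num_partitions


num_partitions = 4
logs = {p: [] for p in range(num_partitions)}  # partition -> append-only log

events = [
    (b"user-1", "login"),
    (b"user-2", "login"),
    (b"user-1", "purchase"),
    (b"user-1", "logout"),
]

locations = []
for key, value in events:
    p = partition_for(key, num_partitions)
    logs[p].append((key, value))
    offset = len(logs[p]) - 1           # the offset grows as messages are appended
    locations.append(("events", p, offset))
```

All `user-1` events land in the same partition with increasing offsets, which is what preserves their relative order. The same mechanism causes skew when a few keys carry most of the traffic.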
  • 24. Beyond Basics…
      • Brokers
          • Heavily rely on Linux PageCache
          • The I/O scheduler will batch together consecutive small writes into bigger physical writes, which improves throughput.
          • The I/O scheduler will attempt to re-sequence writes to minimize movement of the disk head, which improves throughput.
          • It automatically uses all the free memory on the machine
      • Clients
          • Batch messages, which reduces network overhead and allows efficient compression
          • Load balance across the cluster via partitions (they talk to multiple nodes)
          • Utilize zero copy I/O using sendfile
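The point that batching allows efficient compression can be checked with a few lines of Python. This is an illustration only: the message shape is made up, and `zlib` from the standard library stands in for whichever codec a real client is configured with.

```python
import json
import zlib

# 100 small, similar messages, as an event stream typically produces.
messages = [
    json.dumps({"user_id": i, "event": "page_view", "url": "/products"}).encode()
    for i in range(100)
]

# Compressing each message alone pays the codec overhead 100 times and
# cannot exploit the repetition between messages.
individually = sum(len(zlib.compress(m)) for m in messages)

# Compressing one batch lets the codec reuse the shared structure.
batch = b"\n".join(messages)
batched = len(zlib.compress(batch))

print(f"individually: {individually} bytes, batched: {batched} bytes")
```

The batch also travels in a single request, which is where the reduced network overhead comes from.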
  • 25. Kafka Cardinality: What is large?
      • Brokers: 3 to 15 per cluster
          • Common to start with 3-5
          • Largest clusters ~30-40 nodes
          • Having many clusters is common
      • Topics: 1 to 100s per cluster
      • Partitions: 1 to 1000s per topic
          • Clusters with up to 10k total partitions are workable. Beyond that we don't aggressively test. [src]
      • Consumer Groups: 1 to 100s active per cluster
          • Could consume 1 to all topics
  • 26. Large Messages
      • Kafka is not designed for very large messages
      • Optimal performance ~10KB
      • Could consider breaking up the messages/files into smaller chunks
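One way to break a large payload into smaller chunks is sketched below. The tuple layout (message id, index, total, chunk) and the helper names `split_into_chunks` and `reassemble` are assumptions made for this example, not a Kafka feature; the 10KB chunk size simply follows the figure on the slide.

```python
import uuid

CHUNK_SIZE = 10 * 1024  # ~10KB, the size the slide cites as optimal


def split_into_chunks(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (message_id, index, total, chunk) tuples for one large payload."""
    message_id = uuid.uuid4().hex
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division
    for index in range(total):
        start = index * chunk_size
        yield message_id, index, total, payload[start:start + chunk_size]


def reassemble(chunks):
    """Rebuild the payload once every chunk of a message has arrived."""
    ordered = sorted(chunks, key=lambda c: c[1])  # sort by chunk index
    total = ordered[0][2]
    if len(ordered) != total:
        raise ValueError("missing chunks")
    return b"".join(c[3] for c in ordered)


payload = bytes(range(256)) * 100           # 25,600 bytes
chunks = list(split_into_chunks(payload))   # three chunks at 10KB each
restored = reassemble(list(reversed(chunks)))
```

Using the message id as the key for every chunk would keep all chunks of one payload in the same partition, so they arrive in order.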
  • 27. Typical pipeline
  • 28. Kafka + Apache Flume
      • Kafka can be configured as a fast, reliable Flume Channel
      • Flume Sources and Sinks can be used as out-of-the-box Kafka Producers and Consumers
      • Flume Sources write to Kafka: read from logs, files, JMS, HTTP, RPC, Thrift, etc. and write events to Kafka
      • Flume Sinks consume from Kafka: write data to HDFS, HBase, or Search
  • 29. Data Processing: Leverage the right processing for your job
      • Data may require unique processing characteristics
          • Batch
          • Streaming
          • Real-time
      • Hadoop arose to address one and now the ecosystem is answering the rest.
      • “We’re doubling down on Spark. We invested earliest, and we’ve invested most, in making Hadoop enterprise-grade” (Doug Cutting)
  • 30. Processing System Latencies (figure: processing systems placed on a latency scale marked ~50 ms, >500 ms, >30,000 ms, and >90,000 ms, with a Near Real-Time Processing region; systems shown are Custom, Samza/Storm, Flume Interceptors, Trident, Flink, Spark Streaming, Spark, Impala, Hive, and MR)
  • 31. Storm 101
      • Open source project
      • Fundamental abstraction: Streams, consisting of tuples
      • Deployment: Nimbus (master) and Supervisors (workers)
      • ZooKeeper (ZK) for membership and state
      • Storm processes are stateless
      • Applications defined by topologies
      • Topologies consist of Spouts and Bolts
          • Spout: source of stream
          • Bolt: consumer of stream from Spouts. Outputs a stream
      • Topology runs till you terminate it
      • When nodes fail, Storm restarts them. You can set parallelism
  • 32. Storm 101
      • Work happens in a bolt at a tuple level
      • 3 levels of guarantees
          • At-least-once
          • At-most-once
          • Exactly-once (most expensive; needs Trident)
      • New project @ Twitter: Heron
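The difference between at-most-once and at-least-once comes down to whether a message is acknowledged before or after it is processed. The simulation below is a generic illustration and does not use Storm's API: the `deliver` function, its parameters, and the single simulated crash are all assumptions made for the example. Exactly-once is not modeled.

```python
def deliver(messages, ack_first, crash_at):
    """Process messages with one simulated crash between the two steps.

    ack_first=True  acknowledges, then processes (at-most-once).
    ack_first=False processes, then acknowledges (at-least-once).
    After the crash, work resumes from the last acknowledged position.
    """
    processed = []
    committed = 0      # index of the next message to deliver after a restart
    crashed = False
    i = 0
    while i < len(messages):
        crash_now = i == crash_at and not crashed
        if ack_first:
            committed = i + 1
            if crash_now:              # acknowledged but never processed
                crashed = True
                i = committed
                continue
            processed.append(messages[i])
        else:
            processed.append(messages[i])
            if crash_now:              # processed but never acknowledged
                crashed = True
                i = committed
                continue
            committed = i + 1
        i += 1
    return processed


at_most_once = deliver(["a", "b", "c"], ack_first=True, crash_at=1)
at_least_once = deliver(["a", "b", "c"], ack_first=False, crash_at=1)
```

With the crash on the second message, the at-most-once run loses `"b"` and the at-least-once run processes it twice, which is why downstream logic for at-least-once pipelines usually needs to tolerate duplicates.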
  • 33. Typical pipeline
  • 34. Flink 101
      • Fundamentally a stream processing system
      • Core abstraction: DataStreams
      • Consume events from any streaming source
      • Transformation operators: Map, FlatMap, Filter, Reduce, Fold, Window, etc.
      • Fault tolerance based on Chandy-Lamport distributed snapshots
      • Ensures exactly-once semantics
      • Optimizes internally by combining consecutive transformations where possible
  • 35. Spark Streaming
      • What is it?
          • Run continuous processing of data using Spark’s core API
          • Extends Spark concepts to fault-tolerant, transformable streams
          • Adds “rolling window” operations
          • Example: Compute rolling averages or counts for data over last five minutes
      • Benefits:
          • Reuse knowledge and code in both contexts
          • Same programming paradigm for streaming and batch
          • Simplicity of development
          • High-level API with automatic DAG generation
          • Excellent throughput
          • Scale easily to support large volumes of data ingest
      • Common Use Cases:
          • “On-the-fly” ETL as data is ingested into Hadoop/HDFS
          • Detect anomalous behavior and trigger alerts
          • Continuous reporting of summary metrics for incoming data
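The rolling-window idea can be sketched without Spark by treating the stream as a sequence of micro-batches and recomputing counts over the most recent few. This is a plain-Python illustration, not Spark code: `windowed_counts` is a hypothetical helper, and the window is measured in batches rather than minutes. Spark Streaming exposes the same idea through windowed DStream operations such as `reduceByKeyAndWindow`.

```python
from collections import Counter, deque


def windowed_counts(batches, window_batches):
    """Yield per-key counts over the most recent `window_batches` micro-batches."""
    window = deque(maxlen=window_batches)  # the oldest batch falls out automatically
    for batch in batches:
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total.update(counts)           # adds counts key by key
        yield dict(total)


# Four micro-batches of event keys; the window covers the last two batches.
batches = [["a", "b"], ["a"], ["b", "b"], []]
results = list(windowed_counts(batches, window_batches=2))
```

Each result reflects only the batches still inside the window, so the count for `"a"` rises to 2 and then drops back as older batches slide out.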
  • 36. DStreams
  • 37. Typical pipeline
  • 38. Key difference in approach
      • Spark is a batch processing system that can approximate stream processing.
      • Flink is a stream processing system that can look like a batch processor.
  • 39. The Spark Ecosystem & Hadoop
      • Process: Batch & Stream (Spark, Spark Streaming, Spark SQL, DataFrames, MLlib, …), SQL (Impala), Search (Solr), SDK (Kite)
      • Unified Services: Resource Management (YARN), Security (Sentry, RecordService)
      • Store: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase)
      • Integrate: Structured (Sqoop), Unstructured (Kafka, Flume)
  • 40. Architectural Patterns
  • 41. One source, one destination
  • 42. Multiple sources, one destination
  • 43. Multiple sources, multiple destinations
  • 44. Hypothetical Anomaly Detection System
      • Definition of anomalous activity:
          • Amount > previous max amount (per event decision)
          • Location is different than what mobile device suggests (per event decision)
          • >2 transactions in the last 10 seconds (window based decision)
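The three rules can be sketched as a small stateful checker. Everything here is an assumption made for illustration: the `AnomalyDetector` class, its field names, the reason labels, the choice that an account's first transaction has no previous maximum to exceed, and the expectation that timestamps (in seconds) arrive in non-decreasing order per account.

```python
from collections import defaultdict, deque


class AnomalyDetector:
    def __init__(self, window_seconds=10, max_in_window=2):
        self.window_seconds = window_seconds
        self.max_in_window = max_in_window
        self.max_amount = {}               # account -> largest amount seen so far
        self.recent = defaultdict(deque)   # account -> timestamps inside the window

    def check(self, account, amount, txn_location, device_location, timestamp):
        """Return the list of rules this transaction violates."""
        reasons = []

        # Rule 1 (per event): amount exceeds the previous maximum.
        previous_max = self.max_amount.get(account)
        if previous_max is not None and amount > previous_max:
            reasons.append("amount")
        self.max_amount[account] = (
            amount if previous_max is None else max(amount, previous_max)
        )

        # Rule 2 (per event): location differs from what the device reports.
        if txn_location != device_location:
            reasons.append("location")

        # Rule 3 (window based): more than 2 transactions in the last 10 seconds.
        times = self.recent[account]
        times.append(timestamp)
        while timestamp - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) > self.max_in_window:
            reasons.append("velocity")

        return reasons


detector = AnomalyDetector()
r1 = detector.check("acct", 50.0, "NYC", "NYC", timestamp=0)
r2 = detector.check("acct", 40.0, "NYC", "NYC", timestamp=3)
r3 = detector.check("acct", 45.0, "NYC", "NYC", timestamp=5)
r4 = detector.check("acct", 80.0, "SFO", "NYC", timestamp=60)
```

The first two rules need only the current event plus a small amount of per-account state, while the third needs a window of recent events, which is the distinction the slide draws between per event and window based decisions.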
  • 45. Architecture
  • 46. Architectural patterns
      • Bus centric: Kafka is front and center
      • Data hub centric: HDFS is front and center
      • Hybrid: Best tool for the job
  • 47. Real world - Hybrid (figure: Cloudera Enterprise Data Hub)
      • Kafka: Stream ingestion via Pub-Sub
      • Sqoop: RDBMS integration
      • Spark: Stream processing, iterative processing, machine learning
      • MapReduce: Deep, batch processing
      • Impala: Fast analytics
      • HBase: NoSQL
      • Cloudera Search: Real-time search
      • Stages: Ingestion, Preparation, Analytics
      • Unified Cluster Management Suite: Cluster Management (Cloudera Manager), Security (Sentry, Record Service), Metadata & Governance (Cloudera Navigator)
      • Flexible Deployment Options: On-Premise, Cloud (Cloudera Director)
      • Other labels: Data Sources, Custom Apps, EDW, OLTP DB, Analytical Tools
  • 48. Cloudera makes Data Processing Fast, Easy, & Secure. The leading big data platform from the leaders in enterprise Hadoop.
      • Fast: Leadership in Kafka and Spark to help turn processing windows from hours to minutes.
      • Easy: Deliver optimum system utilization and meet SLA commitments, on-premises or in the cloud, with minimum effort.
      • Secure: End-to-end Security, Governance, and Data Management
  • 49. Getting Started is Easy
      1. Visit our Data Engineering Webpage
      2. Sign up for Spark Training
      3. Contact Us to start a POC
  • 50. Thank you