SlideShare une entreprise Scribd logo
1  sur  36
Moving Towards a
Streaming Architecture
Garry Turkington (CTO), Gabriele Modena (Data Scientist)
Improve Digital
Real Time Adtech
SSP
http://improvedigital.com/
@ImproveDigital
Data sources
 A geographically distributed fleet of ad servers push out several types of data,
primarily:
- 1 minute metrics packets
- 15 minute full log files
 Previously used ad hoc system to process the former, Hadoop for the latter
Conceptualizing the data
 We think of event streams when considering the operational system
 But have this sliced into many log files for practical reasons
 Recreating the stream-based view gets us back to what is happening
 And brings data creation and processing closer together
How did we get here
 Need to process more data, faster
- Reporting
- Decision Making
 Evolution of the system to meet scalability requirements
 Stream processing by definition
 It is changing the way we think about data collection and distribution
 Not everything is obvious
 It opened new doors
Let’s take a step back
 Aggregates grow in number and size
 Keep a copy in HDFS
 Run job, then go and make coffee
 Generate new / other aggregates / ad hoc datasets
 Analytics datawarehouse (column store), denormalized data model
 BI, reporting
 Hadoop for logs and event processing (eg. ETL, simulation, learning, optimization)
Hadoop workflows
Incremental processing
 Ingest data every eg. 24 hours (/data/raw/event/d=YYYY-MM-DD)
 Scheduling and partitioning are straightforward
 Schema migrations: Avro + HCatalog
 Fixed length time windows
- What is a good length?
- Ok if it takes 1 hour to process 15 minutes of data
 Need to reprocess data
 It can be made efficient (eg. DataFu Hourglass)
 MapReduce
A more interactive system
We added tools to circumvent MapReduce
- Latency, really
- Still a “batch” mindset
Cloudera Impala (+ Parquet)
- Fast, familiar
Move computation closer to data
- Spark (among other things) for prototyping
- Python/R integration, batch and “streaming”
Outcome
 Multiple overlapping data sources, processing tools
- point-to-point
 Different pipelines for analytics and production models
- Divide
- Where is the “correct” data ?
All in all
 Data collection and distribution is critical
 We are still looking at “yesterday’s” data
 We need a different level of abstraction
Challenges of near-realtime data
 Instructive to ask why near-realtime data analysis is hard
 In Hadoop batch world we’d be happy processing 24hrs of data in 4 hrs
 So why do we flinch to process 24hrs in 24hrs?
 What allows batch workloads to process in sub-linear time:
- Scheduling
- Parallelization
- Partitioning
The streaming view
 Consider these from the streaming perspective:
- Scheduling is challenging because of unknown and variable data rates
- Parallelization breaks our nice logical model of a single data stream
- Partitioning can’t be simple file size/no of blocks
- The method of partitioning the stream is fundamental
Kafka
 We use Kafka (http://kafka.apache.org) for all data we view as streams
 It is resilient, performant and provides a very clear topic-based interface
 Producers write to topics, consumers read from topics
 Topics can be partitioned based on a producer-provided key
 Provides at-least once message delivery semantics
 Messages are persistently stored – topics can be re-read
Samza
 We use Samza (http://samza.incubator.apache.org) for stream processing
 Analogous to MapReduce atop YARN atop HDFS
 We have Samza atop YARN atop Kafka
E L E M O D E N A L E A R N IN G H A D O O P 2
A p a c h e S a m z a
afka for streaming !
arn for resource
anagement and exec!
amzaAPI for
ocessing!
weet spot: second,
inutes
Y A R N
S a m z a A P I
K a fk a
Samza API
A simple call-back API that should be familiar to MapReduce developers:
public interface StreamTask{
void process(IncomingMessageEnvelope envelope,
MessageCollector collector, TaskCoordinator) ;
}
public interface WindowableTask{
void window(MessageCollector collector,
StreamCoordinator coordinator);
}
Kafka/Samza integration
 A Samza job is comprised of multiple tasks
 Each Kafka partition topic is consumed by only one Samza task
 Each YARN container runs multiple Samza tasks
 The output of a task is often written to a different topic
 Overall applications are then constructed by composition of jobs
Stream job output
 You’ve processed the data, now what?
 Need consider how the output of each stream partition will be processed
 A downstream destination with an idempotent interface is ideal
 Otherwise try to have expressions of unique state
 Hardest is when the outputs need applied in sequence
Example: log data
 It’s a common model: server process rolls a log every x minutes
 Each resultant file is ingested into Kafka and read by multiple Samza jobs
 Each Samza task receives a series of potentially interleaved log chunks
 Consider a task receiving data from servers S1, S2 and S3 across time periods
T1..T3
 It’s view may become something like:
S1[T1..T2], S2[T1..T2], S1[T2..T3], S3[T1..T2]…
Example - Streaming aggregation
 Input is a Stream of records with the form (key, value)
 Output is a series of (key, aggregate)
 3 approaches to producing these outputs:
- Each task pushes local counts downstream
- Each task pushes to another stream aggregated by a different job
- Each task uses local state to produce final aggregate values
Comparing stream aggregation
 Option 1: each stream partition outputs local values for each key it sees across a
time window
- Each partial result is an update record (e.g. increment key k at time t by x)
- Downstream consumer has the responsibility of combining partial results
 Option 2 : aggregate partial results in a second stream processing job
- The output becomes a series of state changes
(set the value of key k at time t to x)
Comparing stream aggregation contd
 Option 3 - push state into the state processors
- Samza tasks can hold persistent local state
- Creates a key/value store modelled as a Kafka topic
- Ensure that all records for each key are sent to the same partition
- State processing task can then create unique state aggregates locally
A side effect of Kafka
 A firehose for event data
- Can tap into with Samza
- Can tap into with Spark
 Unifies how we access and publish data
- Streams are defined by application / use case
- These can come from other teams
- Single entry point, multiple outputs
Analytics on streams
Incremental processing pipelines as a basis
Tried to reuse logic
 Not that simple
Conceptualization vs. implementation
Time windows
 Temporal constraints
 When is data ready?
- The key is defining “ready”, which is weaker than “complete”
- Enough to make an informed decision (this varies case by case)
- We can update the dataset to improve a confidence interval
 When are we done processing?
- Tradeoff between timeliness and accuracy
Schema evolution
 Schema changes appear within / across time windows
- Not a solved problem
 Stop streaming, propagate changes, replay or resume
 SAMZA-317
- Avro SerDe
- Schema stream
Challenges
 Data completeness
 Windows length
 Tradeoff between finding answers / satisfying temporal constraints
 Need to adjust our approach (representation and data structures)
- approximate
- probabilistic
- iterations to convergence
 This is a work in progress
Model based outlier detection
Previously:
Batch and real time datasets = different approaches
Now:
Same dataset
Model based anomaly detection
 Reuse knowledge and infrastructure
What is normal?
 How do we set thresholds ?
 Literature review (q-digest, t-digest)
Testing and alerting
 Some features are hard to test
 Functionally correct but unexpected consequences
 unexpected ~ hard to predict
 advertisment is a complex system
 Bugs are outliers we can detect (sort of)
Testing as predictive modelling
 Monitoring (next to testing)
 Make software monitorable
 Think of testing as an experiment
 Establish short, continuous, feedback loops
Implications
 How do we use real time output
- “live” dashboards
- Instrumentation and alerting (predict failure)
- feed optimization models
- back propagating methods into ETL
 How we think about feedback loops
 Make data more consumable
Conclusion
 Streaming is part of the broader system
 We are fine tuning how the two fit together
 Data collection and distribution is important
 publish results follows
 Kafka + Samza
 data and processing model that fits our applications well
 Conceptualization vs. implementation
 Real time != incremental
 A certain degree of uncertainty
 Tradeoffs

Contenu connexe

Tendances

Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapKostas Tzoumas
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsZubair Nabi
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream ProcessingGyula Fóra
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseKostas Tzoumas
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)Apache Apex
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingFlink Forward
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Introduction to near real time computing
Introduction to near real time computingIntroduction to near real time computing
Introduction to near real time computingTao Li
 
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...confluent
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkFlink Forward
 

Tendances (20)

Apache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmapApache Flink: API, runtime, and project roadmap
Apache Flink: API, runtime, and project roadmap
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Stateful Distributed Stream Processing
Stateful Distributed Stream ProcessingStateful Distributed Stream Processing
Stateful Distributed Stream Processing
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Flink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San JoseFlink Streaming Hadoop Summit San Jose
Flink Streaming Hadoop Summit San Jose
 
Apache Beam (incubating)
Apache Beam (incubating)Apache Beam (incubating)
Apache Beam (incubating)
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Introduction to near real time computing
Introduction to near real time computingIntroduction to near real time computing
Introduction to near real time computing
 
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache flink
Apache flinkApache flink
Apache flink
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
 

Similaire à Moving Towards a Streaming Architecture

Time Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming PlatformTime Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming Platformconfluent
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingPalani Kumar
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixJeff Magnusson
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Hakka Labs
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data AnalysisJ Singh
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationGeorge Long
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark StreamingKnoldus Inc.
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platformDavid Walker
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase Türkiye
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 

Similaire à Moving Towards a Streaming Architecture (20)

Time Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming PlatformTime Series Analysis… using an Event Streaming Platform
Time Series Analysis… using an Event Streaming Platform
 
Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
 
Lipstick On Pig
Lipstick On Pig Lipstick On Pig
Lipstick On Pig
 
Putting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at NetflixPutting Lipstick on Apache Pig at Netflix
Putting Lipstick on Apache Pig at Netflix
 
Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson Netflix - Pig with Lipstick by Jeff Magnusson
Netflix - Pig with Lipstick by Jeff Magnusson
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poli
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
High Throughput Data Analysis
High Throughput Data AnalysisHigh Throughput Data Analysis
High Throughput Data Analysis
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Building an analytical platform
Building an analytical platformBuilding an analytical platform
Building an analytical platform
 
Sybase IQ ile Analitik Platform
Sybase IQ ile Analitik PlatformSybase IQ ile Analitik Platform
Sybase IQ ile Analitik Platform
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 

Plus de Gabriele Modena

Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed DatasetsGabriele Modena
 
Estimating mental states of a depressed person with Bayesian Networks
Estimating mental states of a depressed person with Bayesian NetworksEstimating mental states of a depressed person with Bayesian Networks
Estimating mental states of a depressed person with Bayesian NetworksGabriele Modena
 
Analysis of grid log data with Affinity Propagation
Analysis of grid log data with Affinity PropagationAnalysis of grid log data with Affinity Propagation
Analysis of grid log data with Affinity PropagationGabriele Modena
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingGabriele Modena
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
Analysis of interrupt latencies in a real-time kernel
Analysis of interrupt latencies in a real-time kernelAnalysis of interrupt latencies in a real-time kernel
Analysis of interrupt latencies in a real-time kernelGabriele Modena
 

Plus de Gabriele Modena (6)

Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Estimating mental states of a depressed person with Bayesian Networks
Estimating mental states of a depressed person with Bayesian NetworksEstimating mental states of a depressed person with Bayesian Networks
Estimating mental states of a depressed person with Bayesian Networks
 
Analysis of grid log data with Affinity Propagation
Analysis of grid log data with Affinity PropagationAnalysis of grid log data with Affinity Propagation
Analysis of grid log data with Affinity Propagation
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processing
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Analysis of interrupt latencies in a real-time kernel
Analysis of interrupt latencies in a real-time kernelAnalysis of interrupt latencies in a real-time kernel
Analysis of interrupt latencies in a real-time kernel
 

Dernier

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Dernier (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Moving Towards a Streaming Architecture

  • 1. Moving Towards a Streaming Architecture Garry Turkington (CTO), Gabriele Modena (Data Scientist) Improve Digital
  • 3.
  • 4.
  • 5.
  • 6. Data sources  A geographically distributed fleet of ad servers push out several types of data, primarily: - 1 minute metrics packets - 15 minute full log files  Previously used ad hoc system to process the former, Hadoop for the latter
  • 7. Conceptualizing the data  We think of event streams when considering the operational system  But have this sliced into many log files for practical reasons  Recreating the stream-based view gets us back to what is happening  And brings data creation and processing closer together
  • 8. How did we get here  Need to process more data, faster - Reporting - Decision Making  Evolution of the system to meet scalability requirements  Stream processing by definition  It is changing the way we think about data collection and distribution  Not everything is obvious  It opened new doors
  • 9. Let’s take a step back  Aggregates grow in number and size  Keep a copy in HDFS  Run job, then go and make coffee  Generate new / other aggregates / ad hoc datasets  Analytics datawarehouse (column store), denormalized data model  BI, reporting  Hadoop for logs and event processing (eg. ETL, simulation, learning, optimization)
  • 11. Incremental processing  Ingest data every eg. 24 hours (/data/raw/event/d=YYYY-MM-DD)  Scheduling and partitioning are straightforward  Schema migrations: Avro + HCatalog  Fixed length time windows - What is a good length? - Ok if it takes 1 hour to process 15 minutes of data  Need to reprocess data  It can be made efficient (eg. DataFu Hourglass)  MapReduce
  • 12. A more interactive system We added tools to circumvent MapReduce - Latency, really - Still a “batch” mindset Cloudera Impala (+ Parquet) - Fast, familiar Move computation closer to data - Spark (among other things) for prototyping - Python/R integration, batch and “streaming”
  • 13. Outcome  Multiple overlapping data sources, processing tools - point-to-point  Different pipelines for analytics and production models - Divide - Where is the “correct” data ?
  • 14. All in all  Data collection and distribution is critical  We are still looking at “yesterday’s” data  We need a different level of abstraction
  • 15.
  • 16. Challenges of near-realtime data  Instructive to ask why near-realtime data analysis is hard  In Hadoop batch world we’d be happy processing 24hrs of data in 4 hrs  So why do we flinch to process 24hrs in 24hrs?  What allows batch workloads to process in sub-linear time: - Scheduling - Parallelization - Partitioning
  • 17. The streaming view  Consider these from the streaming perspective: - Scheduling is challenging because of unknown and variable data rates - Parallelization breaks our nice logical model of a single data stream - Partitioning can’t be simple file size/no of blocks - The method of partitioning the stream is fundamental
  • 18. Kafka  We use Kafka (http://kafka.apache.org) for all data we view as streams  It is resilient, performant and provides a very clear topic-based interface  Producers write to topics, consumers read from topics  Topics can be partitioned based on a producer-provided key  Provides at-least once message delivery semantics  Messages are persistently stored – topics can be re-read
  • 19. Samza  We use Samza (http://samza.incubator.apache.org) for stream processing  Analogous to MapReduce atop YARN atop HDFS  We have Samza atop YARN atop Kafka E L E M O D E N A L E A R N IN G H A D O O P 2 A p a c h e S a m z a afka for streaming ! arn for resource anagement and exec! amzaAPI for ocessing! weet spot: second, inutes Y A R N S a m z a A P I K a fk a
  • 20. Samza API A simple call-back API that should be familiar to MapReduce developers: public interface StreamTask{ void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator) ; } public interface WindowableTask{ void window(MessageCollector collector, StreamCoordinator coordinator); }
  • 21. Kafka/Samza integration  A Samza job is comprised of multiple tasks  Each Kafka partition topic is consumed by only one Samza task  Each YARN container runs multiple Samza tasks  The output of a task is often written to a different topic  Overall applications are then constructed by composition of jobs
  • 22. Stream job output  You’ve processed the data, now what?  Need consider how the output of each stream partition will be processed  A downstream destination with an idempotent interface is ideal  Otherwise try to have expressions of unique state  Hardest is when the outputs need applied in sequence
  • 23. Example: log data  It’s a common model: server process rolls a log every x minutes  Each resultant file is ingested into Kafka and read by multiple Samza jobs  Each Samza task receives a series of potentially interleaved log chunks  Consider a task receiving data from servers S1, S2 and S3 across time periods T1..T3  It’s view may become something like: S1[T1..T2], S2[T1..T2], S1[T2..T3], S3[T1..T2]…
  • 24. Example - Streaming aggregation  Input is a Stream of records with the form (key, value)  Output is a series of (key, aggregate)  3 approaches to producing these outputs: - Each task pushes local counts downstream - Each task pushes to another stream aggregated by a different job - Each task uses local state to produce final aggregate values
  • 25. Comparing stream aggregation  Option 1: each stream partition outputs local values for each key it sees across a time window - Each partial result is an update record (e.g. increment key k at time t by x) - Downstream consumer has the responsibility of combining partial results  Option 2 : aggregate partial results in a second stream processing job - The output becomes a series of state changes (set the value of key k at time t to x)
  • 26. Comparing stream aggregation contd  Option 3 - push state into the state processors - Samza tasks can hold persistent local state - Creates a key/value store modelled as a Kafka topic - Ensure that all records for each key are sent to the same partition - State processing task can then create unique state aggregates locally
  • 27. A side effect of Kafka  A firehose for event data - Can tap into with Samza - Can tap into with Spark  Unifies how we access and publish data - Streams are defined by application / use case - These can come from other teams - Single entry point, multiple outputs
  • 28. Analytics on streams Incremental processing pipelines as a basis Tried to reuse logic  Not that simple Conceptualization vs. implementation
  • 29. Time windows  Temporal constraints  When is data ready? - The key is defining “ready”, which is weaker than “complete” - Enough to make an informed decision (this varies case by case) - We can update the dataset to improve a confidence interval  When are we done processing? - Tradeoff between timeliness and accuracy
  • 30. Schema evolution  Schema changes appear within / across time windows - Not a solved problem  Stop streaming, propagate changes, replay or resume  SAMZA-317 - Avro SerDe - Schema stream
  • 31. Challenges  Data completeness  Windows length  Tradeoff between finding answers / satisfying temporal constraints  Need to adjust our approach (representation and data structures) - approximate - probabilistic - iterations to convergence  This is a work in progress
  • 32. Model based outlier detection Previously: Batch and real time datasets = different approaches Now: Same dataset Model based anomaly detection  Reuse knowledge and infrastructure What is normal?  How do we set thresholds ?  Literature review (q-digest, t-digest)
  • 33. Testing and alerting  Some features are hard to test  Functionally correct but unexpected consequences  unexpected ~ hard to predict  advertisment is a complex system  Bugs are outliers we can detect (sort of)
  • 34. Testing as predictive modelling  Monitoring (next to testing)  Make software monitorable  Think of testing as an experiment  Establish short, continuous, feedback loops
  • 35. Implications  How do we use real time output - “live” dashboards - Instrumentation and alerting (predict failure) - feed optimization models - back propagating methods into ETL  How we think about feedback loops  Make data more consumable
  • 36. Conclusion  Streaming is part of the broader system  We are fine tuning how the two fit together  Data collection and distribution is important  publish results follows  Kafka + Samza  data and processing model that fits our applications well  Conceptualization vs. implementation  Real time != incremental  A certain degree of uncertainty  Tradeoffs