SlideShare une entreprise Scribd logo
1  sur  35
Functional architectural
patterns
Lars Albertsson
1
Who’s talking?
Swedish Institute of Comp. Sc. (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
2
Why functional?
Verbs
... has made ... expanding ...
... flourishes ... merged ... has been unable to escape lingering .. built ...
... are ... placed ... say ... are ... to explode ...
.. are considering ... to reopen … to recall ...
3
Or object-oriented?
Nouns, pronouns
... bankruptcy ... government bailout ... automaker Chrysler ... comeback ... sales ... Jeep sport utility vehicles.
... Chrysler ... part ... Fiat Chrysler Automobiles, it ... concerns ... the safety ... Jeeps ...
... Jeeps ... gas tanks ... regulators ... safety advocates ... rear-end crash.
... regulators ... an investigation ... those Jeeps ... Fiat Chrysler’s agreement ... models.
4
Functional benefits? My version.
Matches a few problems
Data processing
Matches a few computer properties
Consistency through immutability
Deterministic - replay for resilience
5
Local vs distributed properties
Local
Hardware provides
strong consistency
Faults -> death
6
Distributed
Eventual consistency
Faults must be
survived
Architectural functional patterns
Personal anti-pattern experiences
Strive to look for
Immutability
Reexecution
7
MapReduce
Discovered pattern, not invention
Well known, enough said
Succeeded by Spark RDD paradigm
8
Data flows
9
Users
Page
views
Sales
Sales
reports
Views with
demographics
Sales with
demographics
Conversion
analytics
Conversion
analytics
Views with
demographics
Dataset artifacts, typically files with date
parameter.
Raw Derived
Anti-pattern - isolated batch jobs
Get data (more on that later)
Cron an ETL batch job (function)
Output solidifies. Mostly.
Steps in isolation - often different teams
What to do on ETL code changes?
10
Sales with
demographics
Views with
demographics
Pattern: data pipeline
End-to-end sequences/DAG of jobs
Not only exist, but treated end-to-end
Input is raw, original data
Separate raw data from generated
11
Users
Page
views
Sales with
demographics
Conversio
n analytics
Conversion
analytics
Views with
demographics
Lambda architecture, part 1
Save all collected data without preprocessing
But timestamp on generation, register,
arrival
Rerun everything downstream on code change
Human fault tolerance
In conflict with privacy management?
12
Pipeline workflow orchestration
Ideally: Good old make + cluster + IDE + xUnit
Test end-to-end
Rebuild on upstream changes (but not all)
State of practice: Luigi, Pinball, Azkaban
Don’t take you all the way :-(
13
Lambda architecture, part 2
Parallel batch and real-time pipelines
Batch more accurate, overrides
Real-time for window of recent data
14
Obtaining data
Log things. Conceptually stable, but collection
is challenging at scale.
Have legacy code and master data in
databases? Let us have a look.
15
Database dimensioned for online traffic
Hadoop = herd of elephants
Load spike
Height = #mapper nodes
Area = #users
Anti-pattern: direct dump
16
API
Direct dumps in the trenches
Company successful - #users increasing
More Sqoop mappers - higher DB load
Daily dump jobs went to 25h
Devops firewalled off Hadoop to recover
17
Anti-pattern: dump through API
SOA/microservice culture
DB protected by throttling
API not used to elephants
Query area is still large
Herd of elephants through gate - 1-2 weeks
18
API
Anti-pattern: slave dump
Protect live service by mirroring to a dump
slave
No online service risk, good!
Why anti-pattern?
19
All dumps are non-deterministic
HDFS down? Dump later.
State is gone - dump not accurate
Slave replication down?
Dump not accurate
20
Anti-pattern: deterministic mirror
Replay commit log until full day/hour
Discovered through archaeology :-)
Not scalable, point of failure
Hourly dump took 45 minutes, increasing...
2121
(Anti-)pattern: better dumping
Netflix Aegisthus
Snapshot Cassandra (fast, atomic,
reliable)
Transfer SSTables to HDFS
Replicate compaction in MapReduce
Other DBs? Depends on atomic snapshot.
22
All dumps are anti-patterns?
Typical use: Join activity events with user info
Event time != dump time
Aggregation discards information
Which users enabled X, tried, and disabled?
23
Pattern: Event source
All facts are events. Immutable, timestamped
Event stream is source of truth
No explicit “current state”
The functional data architecture?
24
Event source incarnated: unified log
Pour events into pub/sub bus, with long history.
Kafka de-facto standard.
Tap from bus to HDFS/S3 in time buckets.
Camus/Secor
Stream processing pipelines to dest topics
Replay on code changes
25
Unified log, practical considerations
Long history necessary
Must have time to fix stream process bugs
Use 3+ months and use stream as temp
DB
Unified log also useful for meta and control
Tweak Kafka for low latency
26
Event source + views
View = snapshot of aggregated state @ time
For ETL, choice of hourly/daily aggregates or
exact views
27
Logs
View View
Event source + database
Business logic may demand “current state”
Event stream is truth, keep DB in sync
28
Event source, synced database
A. Service interface generates
events and DB transactions
B. Generate stream from DB
commit log.
Postgres, MySQL -> Kafka
C.Build DB with stream
processing
29
APIAPIAPI
Deployment & orchestration
System = many machines
Desired system state = code + config
Actual state = Orchestrator(current, desired)
30
Anti-pattern: stateful orchestration
Orchestrator = Puppet|Chef|Ansible {
current.changeSomeProperties(desired)
return current
// current.otherProperties unchanged
}
31
Stateful orchestration in the trench
Desired = { case roleA: install(x,y)
case roleB: install(z) }
Current = x installed on roleB. Old x. Zombie
woke up when B load decreased.
Puppet+apt = No simple way to remove
undesired state
32
Pattern: artifacts from source
Orchestrator = Docker|Packer {
delete current
return Image(desired)
}
No state leak from existing state. Sort of.
33
Deterministic, predictable?
Image building leaky on purpose
E.g. “apt-get update && apt-get install”
Imports external state
Ephemeral databases preserve state
Ability to rebuild from unified log is
valuable
34
Jay Kreps, Confluent: Unified log
Martin Kleppman: Unified log, Bottled Water
Nathan Marz: Lambda
Sander Mak @ Jfokus: Event sourcing
Datomic
Questions?
More?
35

Contenu connexe

Tendances

Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseBig Data Spain
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ IndixRajesh Muppalla
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Modern Data Stack France
 
The Revolution Will be Streamed
The Revolution Will be StreamedThe Revolution Will be Streamed
The Revolution Will be StreamedDatabricks
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONUJerome Boulon
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Tin Ho
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniquesLars Albertsson
 
Realtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingRealtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingSantosh Sahoo
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 

Tendances (20)

Introduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas WeiseIntroduction to Apache Apex by Thomas Weise
Introduction to Apache Apex by Thomas Weise
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
The Revolution Will be Streamed
The Revolution Will be StreamedThe Revolution Will be Streamed
The Revolution Will be Streamed
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Hadoop summit 2010, HONU
Hadoop summit 2010, HONUHadoop summit 2010, HONU
Hadoop summit 2010, HONU
 
Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture Speed layer : Real time views in LAMBDA architecture
Speed layer : Real time views in LAMBDA architecture
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Omid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBaseOmid: A Transactional Framework for HBase
Omid: A Transactional Framework for HBase
 
Hugfr SPARK & RIAK -20160114_hug_france
Hugfr  SPARK & RIAK -20160114_hug_franceHugfr  SPARK & RIAK -20160114_hug_france
Hugfr SPARK & RIAK -20160114_hug_france
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
Scalable real-time processing techniques
Scalable real-time processing techniquesScalable real-time processing techniques
Scalable real-time processing techniques
 
Realtime Reporting using Spark Streaming
Realtime Reporting using Spark StreamingRealtime Reporting using Spark Streaming
Realtime Reporting using Spark Streaming
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 

En vedette

Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data SuccessLars Albertsson
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Basic concepts in environmental engineering
Basic concepts in environmental engineeringBasic concepts in environmental engineering
Basic concepts in environmental engineeringjoefreim
 
Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...
Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...
Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...Galala University
 
Early christian architecture
Early christian architectureEarly christian architecture
Early christian architectureGoby Cracked
 
Chinese gen arch. characteristics
Chinese gen arch. characteristicsChinese gen arch. characteristics
Chinese gen arch. characteristicsbenazirmohamedkhan
 

En vedette (6)

Organising for Data Success
Organising for Data SuccessOrganising for Data Success
Organising for Data Success
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Basic concepts in environmental engineering
Basic concepts in environmental engineeringBasic concepts in environmental engineering
Basic concepts in environmental engineering
 
Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...
Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...
Architectural Design Concepts Approaches - كونسيبت التصميم المعمارى و الفكرة ...
 
Early christian architecture
Early christian architectureEarly christian architecture
Early christian architecture
 
Chinese gen arch. characteristics
Chinese gen arch. characteristicsChinese gen arch. characteristics
Chinese gen arch. characteristics
 

Similaire à Functional architectural patterns

Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, BarcelonaReal-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, BarcelonaDobo Radichkov
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Big Data Spain
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Stavros Kontopoulos
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineScyllaDB
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgDavid Pilato
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero Lars Albertsson
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
Synapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipelineSynapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipelineCalvin French-Owen
 
Data Analytics at Altocloud
Data Analytics at Altocloud Data Analytics at Altocloud
Data Analytics at Altocloud Altocloud
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Stavros Kontopoulos
 

Similaire à Functional architectural patterns (20)

Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, BarcelonaReal-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...
 
Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB PipelineLearning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
Learning Rust the Hard Way for a Production Kafka + ScyllaDB Pipeline
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
 
Data pipelines from zero
Data pipelines from zero Data pipelines from zero
Data pipelines from zero
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
Synapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipelineSynapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipeline
 
Data Analytics at Altocloud
Data Analytics at Altocloud Data Analytics at Altocloud
Data Analytics at Altocloud
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 

Plus de Lars Albertsson

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 

Plus de Lars Albertsson (20)

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Data democratised
Data democratisedData democratised
Data democratised
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 

Dernier

8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...software pro Development
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 

Dernier (20)

8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 

Functional architectural patterns

  • 2. Who’s talking? Swedish Institute of Comp. Sc. (test tools) Sun Microsystems (very large machines) Google (Hangouts, productivity) Recorded Future (NLP startup) Cinnober Financial Tech. (trading systems) Spotify (data processing & modelling) Schibsted (data processing & modelling) 2
  • 3. Why functional? Verbs ... has made ... expanding ... ... flourishes ... merged ... has been unable to escape lingering .. built ... ... are ... placed ... say ... are ... to explode ... .. are considering ... to reopen … to recall ... 3
  • 4. Or object-oriented? Nouns, pronouns ... bankruptcy ... government bailout ... automaker Chrysler ... comeback ... sales ... Jeep sport utility vehicles. ... Chrysler ... part ... Fiat Chrysler Automobiles, it ... concerns ... the safety ... Jeeps ... ... Jeeps ... gas tanks ... regulators ... safety advocates ... rear-end crash. ... regulators ... an investigation ... those Jeeps ... Fiat Chrysler’s agreement ... models. 4
  • 5. Functional benefits? My version. Matches a few problems Data processing Matches a few computer properties Consistency through immutability Deterministic - replay for resilience 5
  • 6. Local vs distributed properties Local Hardware provides strong consistency Faults -> death 6 Distributed Eventual consistency Faults must be survived
  • 7. Architectural functional patterns Personal anti-pattern experiences Strive to look for Immutability Reexecution 7
  • 8. MapReduce Discovered pattern, not invention Well known, enough said Succeeded by Spark RDD paradigm 8
  • 9. Data flows 9 Users Page views Sales Sales reports Views with demographics Sales with demographics Conversion analytics Conversion analytics Views with demographics Dataset artifacts, typically files with date parameter. Raw Derived
  • 10. Anti-pattern - isolated batch jobs Get data (more on that later) Cron an ETL batch job (function) Output solidifies. Mostly. Steps in isolation - often different teams What to do on ETL code changes? 10 Sales with demographics Views with demographics
  • 11. Pattern: data pipeline End-to-end sequences/DAG of jobs Not only exist, but treated end-to-end Input is raw, original data Separate raw data from generated 11 Users Page views Sales with demographics Conversio n analytics Conversion analytics Views with demographics
  • 12. Lambda architecture, part 1 Save all collected data without preprocessing But timestamp on generation, register, arrival Rerun everything downstream on code change Human fault tolerance In conflict with privacy management? 12
  • 13. Pipeline workflow orchestration Ideally: Good old make + cluster + IDE + xUnit Test end-to-end Rebuild on upstream changes (but not all) State of practice: Luigi, Pinball, Azkaban Don’t take you all the way :-( 13
  • 14. Lambda architecture, part 2 Parallel batch and real-time pipelines Batch more accurate, overrides Real-time for window of recent data 14
  • 15. Obtaining data Log things. Conceptually stable, but collection is challenging at scale. Have legacy code and master data in databases? Let us have a look. 15
  • 16. Database dimensioned for online traffic Hadoop = herd of elephants Load spike Height = #mapper nodes Area = #users Anti-pattern: direct dump 16 API
  • 17. Direct dumps in the trenches Company successful - #users increasing More Sqoop mappers - higher DB load Daily dump jobs went to 25h Devops firewalled off Hadoop to recover 17
  • 18. Anti-pattern: dump through API SOA/microservice culture DB protected by throttling API not used to elephants Query area is still large Herd of elephants through gate - 1-2 weeks 18 API
  • 19. Anti-pattern: slave dump Protect live service by mirroring to a dump slave No online service risk, good! Why anti-pattern? 19
  • 20. All dumps are non-deterministic HDFS down? Dump later. State is gone - dump not accurate Slave replication down? Dump not accurate 20
  • 21. Anti-pattern: deterministic mirror Replay commit log until full day/hour Discovered through archaeology :-) Not scalable, point of failure Hourly dump took 45 minutes, increasing... 2121
  • 22. (Anti-)pattern: better dumping Netflix Aegisthus Snapshot Cassandra (fast, atomic, reliable) Transfer SSTables to HDFS Replicate compaction in MapReduce Other DBs? Depends on atomic snapshot. 22
  • 23. All dumps are anti-patterns? Typical use: Join activity events with user info Event time != dump time Aggregation discards information Which users enabled X, tried, and disabled? 23
  • 24. Pattern: Event source All facts are events. Immutable, timestamped Event stream is source of truth No explicit “current state” The functional data architecture? 24
  • 25. Event source incarnated: unified log Pour events into pub/sub bus, with long history. Kafka de-facto standard. Tap from bus to HDFS/S3 in time buckets. Camus/Secor Stream processing pipelines to dest topics Replay on code changes 25
  • 26. Unified log, practical considerations Long history necessary Must have time to fix stream process bugs Use 3+ months and use stream as temp DB Unified log also useful for meta and control Tweak Kafka for low latency 26
  • 27. Event source + views View = snapshot of aggregated state @ time For ETL, choice of hourly/daily aggregates or exact views 27 Logs View View
  • 28. Event source + database Business logic may demand “current state” Event stream is truth, keep DB in sync 28
  • 29. Event source, synced database A. Service interface generates events and DB transactions B. Generate stream from DB commit log. Postgres, MySQL -> Kafka C.Build DB with stream processing 29 APIAPIAPI
  • 30. Deployment & orchestration System = many machines Desired system state = code + config Actual state = Orchestrator(current, desired) 30
  • 31. Anti-pattern: stateful orchestration Orchestrator = Puppet|Chef|Ansible { current.changeSomeProperties(desired) return current // current.otherProperties unchanged } 31
  • 32. Stateful orchestration in the trench Desired = { case roleA: install(x,y) case roleB: install(z) } Current = x installed on roleB. Old x. Zombie woke up when B load decreased. Puppet+apt = No simple way to remove undesired state 32
  • 33. Pattern: artifacts from source Orchestrator = Docker|Packer { delete current return Image(desired) } No state leak from existing state. Sort of. 33
  • 34. Deterministic, predictable? Image building leaky on purpose E.g. “apt-get update && apt-get install” Imports external state Ephemeral databases preserve state Ability to rebuild from unified log is valuable 34
  • 35. Jay Kreps, Confluent: Unified log Martin Kleppman: Unified log, Bottled Water Nathan Marz: Lambda Sander Mak @ Jfokus: Event sourcing Datomic Questions? More? 35