Near realtime analytics - technology choice
✦ pavlo.baron@codecentric.de
✦ @pavlobaron
Wile E. Coyote
✦ pretty slow
✦ runs only on demand
✦ very wide field of vision
✦ very long memory
✦ purely proactive
✦ thoroughly analysing and preparing
✦ always loses
Road Runner
✦ fast as hell
✦ always running
✦ very narrow field of vision
✦ very short memory
✦ purely reactive
✦ forced to decide immediately
✦ always wins
Coyote: slow
✦ too much mumbo-jumbo, too many tools, totally dependent on ACME
✦ needs a complex, partially distributed setup
✦ complex decisions, depending on Runner, weather, environment etc.
Runner: fast
✦ zero hoo-ha, zero tools, just his own body
✦ road-bound
✦ simple decisions like run | halt | step aside | beep beep
Coyote: offline
✦ mostly stands around, observing and planning
✦ only sprints on demand, when Runner passes by
Runner: non-stop
✦ never stops fully, just occasionally halts for food and to fool Coyote
✦ continuously runs the road in search of food
Coyote: wide vision
✦ sees the whole environment
✦ tries to use the whole environment to catch Runner, predicting his paths
Runner: narrow vision
✦ only sees what’s in front of his nose on the road
✦ due to speed and short-term predictions, is comfortable with the narrow, momentary vision
Coyote: long memory
✦ as far as possible, learns from previous failures
✦ continuously improves tricks to catch Runner
Runner: short memory
✦ ultimate carpe diem
✦ predicts Coyote’s actions at the last minute, avoiding harm right before the fact
Coyote: proactive
✦ plans and tries out, looks for new ways to catch Runner
Runner: reactive
✦ doesn’t plan, just reacts to Coyote’s actions
Coyote: thorough
✦ thoroughly analyses the situation
✦ thoroughly plans ahead, prepares for one single shot
Runner: spontaneous
✦ decides immediately and spontaneously, depending on what Coyote does
✦ makes the best immediate decision to fool Coyote as thoroughly as possible
Coyote: loses
✦ no matter how hard he tries, he’s never fast or savvy enough to catch Runner
✦ never gives up though
Runner: wins
✦ doesn’t even try to win, but always does, thanks to speed and immediate situation analysis followed by reaction, and also thanks to Coyote’s continuous failure
✦ has fun fooling Coyote every time
Coyote is batch.
Runner is near realtime.
Batch (analytics)
✦ is when you have plenty of time for analysis
✦ is when you explore patterns and models in historic data
✦ is when you try to fit any sort of data into a hypothetical model
✦ is when you plan and forecast the future instead of (re)acting immediately
Batch (architecture)
✦ is when you (synchronously) query previously stored data
✦ is when you use main memory primarily for temporary caches
✦ is when you do ETL and the like, even on Hadoop’s rails
✦ is when you split large amounts of historic data into smaller portions for distributed / parallel analysis
Batch (technology)
✦ is when you build on (R)DBMS or (soft-schema) NoSQL data stores in a classic way
✦ is when you store in HDFS and process with Hadoop & Co.
✦ is when you generally rely on disks / storage
Near realtime (analytics)
✦ is when you don’t have time
✦ is when you analyse data as it comes
✦ is when you already have a fixed model, and data flying in fits it 100%
✦ is when you (re)act immediately, based on patterns you learned online and in the batch analysis
Near realtime (architecture)
✦ is when you don’t query data, but expect / assume it
✦ is when you use main memory as primary data storage
✦ is when you process event streams
✦ is when you distribute and parallelise only independent computations (it’s hairy enough even on one machine: explicit loop tiling, skewing etc.)
Near realtime (technology)
✦ is when you build on DSMS, event processing systems and the like
✦ is when you store (almost) only for archiving reasons
✦ is when you don’t hit disks or speak of “storage”
✦ is when you do your best to avoid horizontal network gossip
✦ is when you must go for accelerators such as GPUs in case of complex math
Near realtime - non-stop, immediate analytics cannot be done as / in batch.
Near realtime is tricky
✦ you need to build event-driven, non-blocking, lock-free, reactive programs (buzzword award!)
✦ you need to work time-bound, penalising or compensating late events
✦ you need to keep everything (sliced, auto-expiring) in main memory
✦ you need to completely utilise the resources of one single machine (speaking of mechanical sympathy), without waste
✦ you need to fix your model and work with fixed-size (binary) events
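
A minimal sketch of the last two ideas, fixed-size binary events and penalising late ones. The event layout (timestamp, sensor id, value), the 500 ms tolerance and all function names are assumptions made for illustration, not something the slides prescribe. Dropping is only one possible penalty; compensating late events would need extra logic.

```python
import struct

# Hypothetical fixed-size layout: timestamp in ms (uint64),
# sensor id (uint32), value (float64). "<" means little-endian
# without padding, so every event is exactly 8 + 4 + 8 = 20 bytes.
EVENT = struct.Struct("<QId")

MAX_LATENESS_MS = 500  # assumed tolerance for this sketch


def encode(ts_ms, sensor_id, value):
    """Pack one event into its fixed-size binary form."""
    return EVENT.pack(ts_ms, sensor_id, value)


def decode(raw):
    """Unpack a fixed-size binary event into a tuple."""
    return EVENT.unpack(raw)


def accept(raw, now_ms):
    """Return the decoded event, or None if it arrived too late."""
    ts_ms, sensor_id, value = decode(raw)
    if now_ms - ts_ms > MAX_LATENESS_MS:
        return None  # penalise the late event by dropping it
    return ts_ms, sensor_id, value
```

Because every event has the same size, a receiver can read exactly `EVENT.size` bytes per event and never has to parse delimiters or length prefixes.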
Scaling near realtime
✦ scaling near realtime analytics is pretty hard. The challenges are similar whether you parallelise on one machine or scale out in a distributed way
✦ you scale through logical or physical stream splitting, online scatter-gather and the like
✦ you keep distributed / parallel computation independent, until you have to merge in the next processing stage. And so on.
✦ you scale through receive-and-forward, fire-and-forget, cascading, pipelining, multicast, redundant (who’s first, role-based etc.) processing
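
Logical stream splitting with a later merge can be sketched as follows. This is a single-process illustration only: the `(key, value)` event shape, the CRC32-based partitioning and the function names are assumptions. In a real system each partition would be consumed by its own thread, core or node.

```python
import zlib
from collections import Counter


def split_stream(events, n_partitions):
    """Scatter: route each (key, value) event to one partition by key hash."""
    partitions = [[] for _ in range(n_partitions)]
    for key, value in events:
        index = zlib.crc32(key.encode("utf-8")) % n_partitions
        partitions[index].append((key, value))
    return partitions


def partial_sums(partition):
    """Independent computation: per-key sums within one partition."""
    sums = Counter()
    for key, value in partition:
        sums[key] += value
    return sums


def merge(partials):
    """Gather: combine the independent partial results in the next stage."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

Since all events for one key land in the same partition, the partial computations never need to talk to each other before the merge stage.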
Surviving near realtime
✦ building a restlessly event-oriented, in-memory analytics system brings some challenges
✦ disaster recovery: yet again, splitting streams (for storage), redundant (role-based) computation
✦ short-term failure recovery: upfront temporary, auto-expiring storage, auto-replay or penalising events
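
The short-term recovery idea, a temporary, auto-expiring store placed in front of processing that can replay recent events, might look like this in miniature. The class name, the time-to-live parameter and the assumption that events arrive in timestamp order are all illustrative.

```python
from collections import deque


class ReplayBuffer:
    """Keeps recent events for a limited time so a restarted stage can replay them.

    Assumes events are appended in non-decreasing timestamp order.
    """

    def __init__(self, ttl_ms):
        self.ttl_ms = ttl_ms
        self._events = deque()  # (timestamp_ms, payload), oldest first

    def append(self, ts_ms, payload):
        self._events.append((ts_ms, payload))
        self._expire(ts_ms)

    def _expire(self, now_ms):
        # Auto-expiry: drop everything older than the time-to-live.
        while self._events and now_ms - self._events[0][0] > self.ttl_ms:
            self._events.popleft()

    def replay(self, since_ms):
        """Return payloads of all retained events at or after since_ms."""
        return [payload for ts, payload in self._events if ts >= since_ms]
```

A stage that crashed at time `t` could call `replay(t)` after restarting, as long as the outage was shorter than the time-to-live.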
Near realtime is limited
✦ you need to run most of analytics on event windows of some size
✦ you switch from exact to probabilistic / approximate results
✦ you can only predict the near future, cluster based on relatively short time periods and recognise short-term patterns and anomalies only
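
As an illustration of analytics on an event window, here is a count-based sliding mean that only ever sees the last `size` values. The class name and the choice of a count-based (rather than time-based) window are assumptions of this sketch.

```python
from collections import deque


class SlidingMean:
    """Mean over the most recent `size` values, updated in O(1) per event."""

    def __init__(self, size):
        self.size = size
        self._window = deque()
        self._total = 0.0

    def push(self, value):
        """Add a value, evict the oldest one if needed, return the current mean."""
        self._window.append(value)
        self._total += value
        if len(self._window) > self.size:
            self._total -= self._window.popleft()
        return self._total / len(self._window)
```

The result describes only the window, not the full history, which is exactly the limitation the slide points out.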
Near realtime mining
✦ you mine live streams instead of passive data sources
✦ typical algorithms such as Apriori, 1-class-SVM, k-means, regressions etc. are easily possible, but on stream portions only
✦ NLP can be done by giving words identifiers and dealing with binary messages instead of text
✦ as long as it fits into main memory, it’s comparable to classic mining, but is much faster
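
The NLP point, replacing words with identifiers and shipping binary messages instead of text, can be sketched like this. The `Vocabulary` class, the whitespace tokenisation and the 32-bit id width are assumptions. Padding or truncating messages to a fixed size is left out for brevity.

```python
import struct


class Vocabulary:
    """Assigns each distinct word a small integer id, in order of first appearance."""

    def __init__(self):
        self._ids = {}
        self._words = []

    def id_for(self, word):
        if word not in self._ids:
            self._ids[word] = len(self._words)
            self._words.append(word)
        return self._ids[word]

    def word_for(self, word_id):
        return self._words[word_id]


def encode(vocab, text):
    """Turn text into a binary message of little-endian uint32 word ids."""
    ids = [vocab.id_for(word) for word in text.lower().split()]
    return struct.pack(f"<{len(ids)}I", *ids)


def decode(vocab, message):
    """Turn a binary message back into the list of words."""
    count = len(message) // 4
    return [vocab.word_for(i) for i in struct.unpack(f"<{count}I", message)]
```

Downstream stages can then count, compare and cluster integers without touching strings at all.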
Near realtime + batch?
✦ the combination of both is what can make a winning solution. Example reference architecture: Lambda, but it’s even more
✦ exploratory, offline analytics, baseline analysis, pattern mining, algorithm training and the like you do in the batch
✦ you apply batch analytics’ results to near realtime and prove or reject hypotheses, detect anomalies, run forecasts, derive trends etc.
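
One simple way to picture the hand-over from batch to near realtime: the batch side derives a baseline from historic values, and the streaming side only compares each incoming value against it. The z-score rule, the threshold of 3 and the function names are assumptions chosen for this sketch, not the method the slides describe.

```python
import statistics


def train_baseline(history):
    """Batch step: derive mean and standard deviation from historic values."""
    return statistics.mean(history), statistics.pstdev(history)


def is_anomaly(value, baseline, threshold=3.0):
    """Near realtime step: flag values more than `threshold` deviations from the mean."""
    mean, stdev = baseline
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold
```

The streaming check is a couple of arithmetic operations per event. All the expensive work happened offline.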
Near realtime, no batch?
✦ it’s possible to do some of this completely without batch, just on streams - even more than basic counters and stats
✦ you need to keep every single historic event in a data store
✦ you need to replay historic events instead of querying / mining your data store
✦ don’t query your database - let the database stream what it has to you
Near realtime example tools
✦ query/store-oriented/passively-adapting: Spark/Shark, Impala, Drill, ParStream, Splunk
✦ full-blown CEP engines / continuous querying DSMSs: Esper, TIBCO/StreamBase
✦ more pragmatic stream processors: Storm, S4, Samza
✦ event-oriented, continuous analysers: keen.io, also speaker’s current WIP
✦ etc. etc. etc...
Near realtime - DIY
✦ in the end, you’ll have to build it (or core parts of it) yourself
✦ you’ll have to work with circular / ring buffers and / or zero-overhead queuing software: Disruptor, 0MQ
✦ ideally, you keep everything in one single OS process - multi-threading is still hairy enough then
✦ managing and using machine’s overall memory is the tricky part
✦ for GPUs: OpenCL, Rootbeer
✦ embed analytics / statistics into the process
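
To make the ring buffer idea concrete, here is a deliberately small, single-threaded sketch with preallocated slots and monotonically increasing read and write sequences. It is not how Disruptor or 0MQ are implemented. The class and method names are assumptions, and real implementations add memory barriers, cache-line padding and batching.

```python
class RingBuffer:
    """Fixed-capacity circular buffer; slots are allocated once and reused."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0  # sequence number of the next item to read
        self._tail = 0  # sequence number of the next slot to write

    def offer(self, item):
        """Store an item; return False instead of blocking when the buffer is full."""
        if self._tail - self._head == self._capacity:
            return False
        self._slots[self._tail % self._capacity] = item
        self._tail += 1
        return True

    def poll(self):
        """Take the oldest item, or return None when the buffer is empty."""
        if self._head == self._tail:
            return None
        item = self._slots[self._head % self._capacity]
        self._head += 1
        return item
```

Returning `False` on a full buffer pushes the back-pressure decision (drop, retry, penalise) to the caller rather than hiding it behind a blocking call.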
Near realtime - DIY
✦ picking the base platform has less to do with personal taste than with what it offers
✦ C is a good and valid choice, but very “manual”
✦ Erlang/OTP is great for glue, but hard for analytics and integration. In the end, it’s C, but pretty tricky here
✦ Node.js is C in the end at this point, but it’s not for single-process / multi-threading and still maturing
✦ JVM is a good compromise. Managed / GC-controlled memory with object wrappers will be sacrificed for off-heap memory with primitives though
✦ most of the rest doesn’t apply to this sort of task
Near realtime - DIY
✦ programming paradigms and thus languages are the essential, secret sauce
✦ functional programming is ideal for analytics and event-processing
✦ (functional) reactive programming, Reactor (as pattern or framework), RX are good for building this sort of system
✦ JavaScript is partially there, Erlang, Clojure, Scala & Co. are further, but can be uncontrollable in runtime behaviour
✦ pure Java can be (later) a healthy trade-off though - now with RX or Reactor, Netty etc.
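
The appeal of the functional style for event processing is that stages are small functions that compose. The sketch below uses plain pull-based Python generators. It is not a reactive (push-based) library like RX or Reactor, and the helper names `pipeline`, `only` and `each` are made up for this example.

```python
from functools import reduce


def pipeline(source, *stages):
    """Compose stages left to right; each stage maps an iterator to an iterator."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


def only(predicate):
    """Stage that keeps events for which predicate(event) is true."""
    return lambda stream: (event for event in stream if predicate(event))


def each(transform):
    """Stage that applies transform to every event."""
    return lambda stream: (transform(event) for event in stream)


def running_total(stream):
    """Stateful stage: emit the cumulative sum after every event."""
    total = 0
    for event in stream:
        total += event
        yield total
```

For example, `pipeline(iter(values), only(is_valid), each(normalise), running_total)` processes events one at a time without materialising intermediate lists, where `is_valid` and `normalise` stand for whatever functions the application supplies.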
Time in near realtime
✦ realtime still means real time, even if “near”
✦ the platform of your choice might not be ideal for hard or soft realtime, since the difference is primarily in what happens with late events and under high load
✦ Erlang will do its best to trigger a timer. Same with Node.js. But they don’t interrupt hard and schedule on their own, thus leaving you with an approximation
✦ JVM comes close, but still no easy way to interrupt explicitly. Alternative: hashed timing wheel, own scheduler on dedicated core
✦ C is the winner, OS support essential (RTOS and the like)
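
The wheel-based timer alternative mentioned above can be sketched as a hashed timing wheel: timers are hashed into a fixed number of slots by their expiry tick, and a dedicated loop advances one slot per tick. The class below is an assumption-laden illustration (the names, the rounds counter, and the fact that `tick()` is driven by the caller rather than a pinned scheduler thread).

```python
class TimingWheel:
    """Hashed timing wheel: O(1) scheduling, work per tick limited to one slot."""

    def __init__(self, slots, tick_ms):
        self.tick_ms = tick_ms
        self._wheel = [[] for _ in range(slots)]
        self._current = 0  # slot processed by the most recent tick

    def schedule(self, delay_ms, callback):
        """Register callback to fire after delay_ms, rounded down to whole ticks (min. 1)."""
        ticks = max(1, delay_ms // self.tick_ms)
        slot = (self._current + ticks) % len(self._wheel)
        rounds = (ticks - 1) // len(self._wheel)  # full wheel turns still to wait
        self._wheel[slot].append([rounds, callback])

    def tick(self):
        """Advance one slot and fire every timer in it whose rounds are used up."""
        self._current = (self._current + 1) % len(self._wheel)
        bucket, self._wheel[self._current] = self._wheel[self._current], []
        for entry in bucket:
            if entry[0] == 0:
                entry[1]()
            else:
                entry[0] -= 1
                self._wheel[self._current].append(entry)
```

Timer resolution is bounded by the tick length, so the wheel itself is still an approximation. How exact the ticks are depends on whatever drives `tick()`.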
Near realtime + data store?
✦ near realtime analytics systems need to store data in different stages: short-term replay, disaster protection, history
✦ the trick is to turn around the way you work with the data store
✦ your data store knows model and queries beforehand, and only waits for events to start streaming historic data satisfying the static query / view
✦ most NoSQL stores, but also classic RDBMS have implantable workers / jobs / coprocessors as built-in feature: Oracle, Riak, HBase etc.
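
A toy model of the "turned around" data store: views are registered before any event arrives, and each incoming event triggers the store to push matching historic records to a sink, instead of the application querying for them. This does not use any real database API. `StreamingStore`, `register_view` and `on_event` are hypothetical names for this sketch.

```python
class StreamingStore:
    """In-memory stand-in for a store with static views and push-style results."""

    def __init__(self):
        self._history = []
        self._views = []  # (matches, sink) pairs, fixed before events arrive

    def register_view(self, matches, sink):
        """matches(event, record) selects history; sink(event, record) receives it."""
        self._views.append((matches, sink))

    def on_event(self, event):
        """Stream historic records related to the event to each sink, then archive the event."""
        for matches, sink in self._views:
            for record in self._history:
                if matches(event, record):
                    sink(event, record)
        self._history.append(event)
```

In a real store this role would be played by built-in workers, jobs or coprocessors running next to the data, as the last bullet suggests.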
Near realtime business cases
✦ anomaly / novelty / outlier detection in any sort of system
✦ fraud, attack detection based on patterns
✦ situational pricing, product placement
✦ stock, inventory control and forecast
✦ online bidding, trading
✦ automated traffic optimization
✦ semi-automated operations
✦ immediate visualization and tracing
Why speed?
✦ why be slow if it’s possible, with comparable effort, to be fast in making decisions and automating them? If not you, then your competitor
✦ since everybody can mine data, speed and quality are the only technical success factors left
✦ it’s about how fast you can decide based on data. The best way is to start very early, at the source of data
✦ “new economy” is all about speed, not (only) lobbies
✦ cartoon images were found on the internet and are directly or indirectly property/copyright of or related to Time Warner

Near realtime analytics - technology choice (@pavlobaron)

  • 3. Wile E. Coyote ✦ pretty slow ✦ running on own demand ✦ very wide field of vision ✦ very long memory ✦ purely proactive ✦ ✦ thoroughly analysing and preparing always loses
  • 4. Road Runner ✦ hell fast ✦ ever running ✦ very narrow field of vision ✦ very short memory ✦ purely reactive ✦ ✦ forced to immediately decide always wins
  • 5. Coyote: slow ✦ ✦ ✦ too much mumbo-jumbo, too many tools, totally dependent on ACME needs a complex, partially distributed setup complex decisions, depending on Runner, weather, environment etc.
  • 6. Runner: fast ✦ ✦ ✦ zero hoo-ha, zero tools, just own body road bound simple decisions like run | halt | step aside | beep beep
  • 7. Coyote: offline ✦ ✦ mostly stands around, observing and planning only sprints on demand, when Runner passes by
  • 8. Runner: non-stop ✦ ✦ never stops fully, just occasionally halts for food and to fool Coyote continuously runs the road in search for food
  • 9. Coyote: wide vision ✦ ✦ sees the whole environment tries to use the whole environment to catch Runner, predicting his paths
  • 10. Runner: narrow vision ✦ ✦ only sees what’s in front of his nose on the road due to speed and short-time predictions, feels well with the narrow, momentary vision
  • 11. Coyote: long memory ✦ ✦ as far as possible, learns from previous failures continuously improves tricks to catch Runner
  • 12. Runner: short memory ✦ ✦ ultimate carpe diem predicts Coyote’s actions in last minute, avoiding being harmed right before the fact
  • 13. Coyote: proactive ✦ plans and tries out, looks for new ways to catch Runner
  • 14. Runner: reactive ✦ doesn’t plan, just reacts on Coyote’s actions
  • 15. Coyote: thorough ✦ ✦ thoroughly analyses the situation throughly plans ahead, prepares for one single shot
  • 16. Runner: spontaneous ✦ ✦ decides immediately and spontaneously, depending on what Coyote does makes the best immediate decision to achieve the highest level of Coyote fooling
  • 17. Coyote: loses ✦ ✦ no matter how hard he tries, he’s never fast or savvy enough to catch Runner never gives up though
  • 18. Runner: wins ✦ ✦ doesn’t even try to win, but always does thanks to speed and immediate situation analysis, followed by reaction. Also, due to Coyote’s continuous failure every time has fun fooling Coyote
  • 19. Coyote is batch. Runner is near realtime.
  • 20. Batch (analytics) ✦ ✦ ✦ ✦ is when you have plenty of time for analysis is when you explore patterns and models in historic data is when you try to fit any sort of data into a hypothetic model is when you plan and forecast the future instead of (re)acting immediately
  • 21. Batch (architecture) ✦ ✦ ✦ ✦ is when you (synchronously) query previously stored data is when you use main memory primarily for temporary caches is when you do ETL and alike, even on Hadoop’s rails is when you split large amounts of historic data in smaller portions for distributed / parallel analysis
  • 22. Batch (technology) ✦ ✦ ✦ is when you build on (R)DBMS or (softschema) NoSQL data stores in a classic way is when you store in HDFS and process with Hadoop & Co. is when you generally rely on disks / storage
  • 23. Near realtime (analytics) ✦ ✦ ✦ ✦ is when you don’t have time is when you analyse data as it comes is when you already have a fixed model, and data flying in fits it 100% is when you (re)act immediately, based on patterns you learned online and in the batch analysis
  • 24. Near realtime (architecture) ✦ ✦ ✦ ✦ is when you don’t query data, but expect / assume it is when you use main memory as primary data storage is when you process event streams is when you distribute and parallelise only independent computations (it’s hairy enough even on one machine - explicit loop tiling, skewing etc.)
  • 25. Near realtime (technology) ✦ ✦ ✦ ✦ ✦ is when you build on DSMS, event processing systems and alike is when you store (almost) only for archiving reasons is when you don’t hit disks or speak of “storage” is when you do your best to avoid horizontal network gossip is when you must go for accelerators such as GPUs in case of complex math
  • 26. Near realtime - non-stop, immediate analytics cannot be done as / in batch.
  • 27. Near realtime is tricky ✦ ✦ ✦ ✦ ✦ you need to build event-driven, non-blocking, lock-free, reactive programs (buzzword award!) you need to work time-bound, penalising or compensating late events you need to keep everything (sliced, autoexpiring) in main memory you need to completely utilise resources of one single machine (speaking of mechanical sympathy), without waste you need to fix your model and work with fixed-size (binary) events
  • 28. Scaling near realtime ✦ ✦ ✦ ✦ scaling near realtime analytics is pretty hard. Similar challenges parallelising on one machine or scaling out in a distributed way you scale through logical or physical stream splitting, online scatter-gather and alike you keep distributed / parallel computation independent, until you have to merge in the next processing stage. And so on. you scale through receive-and-forward, fireand-forget, cascading, pipelining, multicast, redundant (who’s first, role-based etc.) processing
  • 29. Surviving near realtime ✦ ✦ ✦ building a restlessly eventoriented, in-memory analytics system brings some challenges disaster recovery: yet again, splitting streams (for storage), redundant (role-based) computation short-term failure recovery: upfront temporary, auto-expiring storage, auto-replay or penalising events
  • 30. Near realtime is limited ✦ ✦ ✦ you need to run most of analytics on event windows of some size you switch from exact to probabilistic / approximate results you can only predict near future, cluster based on relatively short time periods and recognise short-term patterns and anomalies only
  • 31. Near realtime mining
    ✦ you mine live streams instead of passive data sources
    ✦ typical algorithms such as Apriori, 1-class-SVM, k-means, regressions etc. are easily possible, but on stream portions only
    ✦ NLP can be done by giving words identifiers and dealing with binary messages instead of text
    ✦ as long as it fits into main memory, it’s comparable to classic mining, but is much faster
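The idea of giving words identifiers and handling binary messages instead of text can be sketched like this. The `Vocabulary` class, the whitespace tokenisation and the 32-bit little-endian id format are assumptions for illustration only.

```python
import struct


class Vocabulary:
    """Assigns each distinct word a small integer id, in order of first use."""

    def __init__(self):
        self.ids = {}
        self.words = []

    def id_of(self, word):
        if word not in self.ids:
            self.ids[word] = len(self.words)
            self.words.append(word)
        return self.ids[word]


def encode_message(vocab, text):
    """Turn text into a binary message of 4-byte word ids."""
    ids = [vocab.id_of(word) for word in text.lower().split()]
    return struct.pack(f"<{len(ids)}I", *ids)


def decode_message(vocab, raw):
    """Inverse of encode_message, given the same vocabulary."""
    count = len(raw) // 4
    return [vocab.words[i] for i in struct.unpack(f"<{count}I", raw)]
```

Downstream stages can then count, compare and window integer ids without touching strings at all.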
  • 32. Near realtime + batch?
    ✦ the combination of both is what can make a winning solution. Example reference architecture: Lambda, but it’s even more
    ✦ exploratory, offline analytics, baseline analysis, pattern mining, algorithm training and the like are what you do in batch
    ✦ you apply the batch analytics’ results to near realtime and prove or reject hypotheses, detect anomalies, run forecasts, derive trends etc.
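A small sketch of that division of labour: a baseline is computed offline from history, then applied to a live stream to flag anomalies. The z-score rule and the threshold of 3 standard deviations are assumptions chosen for illustration, not something the slides specify.

```python
import statistics


def train_baseline(history):
    """Batch side: summarise historic values as (mean, sample stdev)."""
    return statistics.mean(history), statistics.stdev(history)


def detect_anomalies(stream, baseline, threshold=3.0):
    """Near realtime side: yield values far outside the batch baseline."""
    mean, stdev = baseline
    for value in stream:
        if abs(value - mean) > threshold * stdev:
            yield value
```

The stream side holds no history of its own here; everything it knows about "normal" comes from the batch result.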
  • 33. Near realtime, no batch?
    ✦ it’s possible to do some of this completely without batch, just on streams - even more than basic counters and stats
    ✦ you need to keep every single historic event in a data store
    ✦ you need to replay historic events instead of querying / mining your data store
    ✦ don’t query your database - let the database stream what it has to you
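Replaying instead of querying can be sketched as a fold over the event log: the same handler that processes live events also rebuilds state from history. The event shape (`{"type": ...}`) and the function names are hypothetical.

```python
def replay(log, handler, state):
    """Rebuild state by feeding every historic event through the handler."""
    for event in log:
        state = handler(state, event)
    return state


def count_by_type(state, event):
    """Handler used for both replayed and live events; returns a new state."""
    kind = event["type"]
    return {**state, kind: state.get(kind, 0) + 1}
```

After the replay, live events go through `count_by_type` unchanged, so there is no separate query path to keep consistent with the streaming path.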
  • 34. Near realtime example tools
    ✦ query/store-oriented/passively-adapting: Spark/Shark, Impala, Drill, ParStream, Splunk
    ✦ full-blown CEP engines / continuous querying DSMSs: Esper, TIBCO/StreamBase
    ✦ more pragmatic stream processors: Storm, S4, Samza
    ✦ event-oriented, continuous analysers: keen.io, also speaker’s current WIP
    ✦ etc. etc. etc...
  • 35. Near realtime - DIY
    ✦ in the end, you’ll have to build it (or core parts of it) yourself
    ✦ you’ll have to work with circular / ring buffers and / or zero-overhead queuing software: Disruptor, 0MQ
    ✦ ideally, you keep everything in one single OS process - multi-threading is still hairy enough
    ✦ then managing and using the machine’s overall memory is the tricky part
    ✦ for GPUs: OpenCL, Rootbeer
    ✦ embed analytics / statistics into the process
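For readers unfamiliar with ring buffers, here is a minimal single-producer, single-consumer sketch with monotonically increasing sequence counters. It only illustrates the data structure; it is not the Disruptor API, has no memory barriers, and the class and method names are made up.

```python
class RingBuffer:
    """Fixed-capacity circular buffer; slots are allocated once and reused."""

    def __init__(self, capacity):
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        self.slots = [None] * capacity
        self.mask = capacity - 1   # cheap modulo for power-of-two sizes
        self.head = 0              # sequence of the next item to read
        self.tail = 0              # sequence of the next slot to write

    def offer(self, item):
        """Store an item; returns False instead of blocking when full."""
        if self.tail - self.head == len(self.slots):
            return False
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def poll(self):
        """Take the oldest item; returns None when empty."""
        if self.head == self.tail:
            return None
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item
```

Refusing new items when full, rather than growing, is what keeps memory use fixed and predictable.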
  • 36. Near realtime - DIY
    ✦ picking the base platform has less to do with personal taste than with what the platform offers
    ✦ C is a good and valid choice, but very “manual”
    ✦ Erlang/OTP is great for glue, but hard for analytics and integration. In the end, it’s C, but pretty tricky here
    ✦ Node.js is C in the end at this point, but it’s not for single-process / multi-threading and still maturing
    ✦ JVM is a good compromise. Managed / GC-controlled memory with object wrappers will be sacrificed for off-heap memory with primitives though
    ✦ Most of the rest doesn’t apply to this sort of task
  • 37. Near realtime - DIY
    ✦ programming paradigms and thus languages are the essential, secret sauce
    ✦ functional programming is ideal for analytics and event-processing
    ✦ (functional) reactive programming, Reactor (as pattern or framework), RX are good for building this sort of system
    ✦ JavaScript is partially there, Erlang, Clojure, Scala & Co. are further, but can be uncontrollable in runtime behaviour
    ✦ pure Java can be (later) a healthy tradeoff though - now with RX or Reactor, Netty etc.
  • 38. Time in near realtime
    ✦ realtime still means real time, even if “near”
    ✦ the platform of your choice might not be ideal for hard or soft realtime, since the difference is primarily in what happens with late events and under high load
    ✦ Erlang will do its best to trigger a timer. Same with Node.js. But they don’t interrupt hard, they schedule on their own and thus leave you with an approximation
    ✦ JVM comes close, but there is still no easy way to interrupt explicitly. Alternative: hashed wheel timer, own scheduler on a dedicated core
    ✦ C is the winner, OS support essential (RTOS or similar)
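A hashed wheel timer can be sketched as below. To keep the example deterministic it is driven by explicit `advance()` calls rather than a real clock or a dedicated core, which is a simplification; the class and method names are hypothetical and do not mirror any particular library.

```python
class HashedWheelTimer:
    """Timeouts are hashed into a fixed ring of buckets by their due tick."""

    def __init__(self, slots=8):
        self.wheel = [[] for _ in range(slots)]
        self.tick = 0

    def schedule(self, delay_ticks, callback):
        if delay_ticks < 1:
            raise ValueError("delay must be at least one tick")
        due = self.tick + delay_ticks
        # Timeouts more than one revolution away share a bucket with nearer
        # ones; the stored due tick tells them apart.
        self.wheel[due % len(self.wheel)].append((due, callback))

    def advance(self):
        """Move the wheel by one tick and fire everything due at that tick."""
        self.tick += 1
        bucket = self.wheel[self.tick % len(self.wheel)]
        due_now = [cb for due, cb in bucket if due == self.tick]
        bucket[:] = [(due, cb) for due, cb in bucket if due != self.tick]
        for callback in due_now:
            callback()
```

Scheduling and firing cost is independent of the total number of pending timeouts, which is why this structure suits high event rates; precision is limited to the tick length.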
  • 39. Near realtime + data store?
    ✦ near realtime analytics systems need to store data in different stages: short-term replay, disaster protection, history
    ✦ the trick is to turn around the way you work with the data store
    ✦ your data store knows the model and queries beforehand, and only waits for events to start streaming historic data satisfying the static query / view
    ✦ most NoSQL stores, but also classic RDBMS have implantable workers / jobs / coprocessors as a built-in feature: Oracle, Riak, HBase etc.
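The inverted relationship with the data store can be sketched with a toy in-memory store. The views are fixed up front, and a subscriber first receives the matching history and then every matching new event. `StreamingStore` and its methods are invented for this illustration and do not correspond to the coprocessor APIs of Oracle, Riak or HBase.

```python
class StreamingStore:
    """Toy store that knows its views beforehand and pushes matching events."""

    def __init__(self, views):
        # views: name -> predicate, defined before any event arrives
        self.views = views
        self.history = []
        self.subscribers = {name: [] for name in views}

    def append(self, event):
        """Store the event, then stream it to subscribers of matching views."""
        self.history.append(event)
        for name, matches in self.views.items():
            if matches(event):
                for callback in self.subscribers[name]:
                    callback(event)

    def subscribe(self, name, callback):
        """Stream matching history first, then keep streaming new events."""
        matches = self.views[name]
        for event in self.history:
            if matches(event):
                callback(event)
        self.subscribers[name].append(callback)
```

The consumer never issues an ad hoc query; it only names a view that the store already understands.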
  • 40. Near realtime business cases
    ✦ anomaly / novelty / outlier detection in any sort of system
    ✦ fraud, attack detection based on patterns
    ✦ situational pricing, product placement
    ✦ stock, inventory control and forecast
    ✦ online bidding, trading
    ✦ automated traffic optimization
    ✦ semi-automated operations
    ✦ immediate visualization and tracing
  • 41. Why speed?
    ✦ why be slow if it’s possible, with comparable effort, to be fast in making decisions and automating them? If not you, then your competitor
    ✦ since everybody can mine data, speed and quality are the only technical success factors left
    ✦ it’s about how fast you can decide based on data. The best way is to start very early, at the source of data
    ✦ “new economy” is all about speed, not (only) lobbies
  • 42.
    ✦ cartoon images were found on the internet and are directly or indirectly property/copyright of or related to Time Warner