2. “We promised to count live...
...but since you can’t do that, we used historical
numbers and this cool math to extrapolate.”
?!?
3. Stream counting is simple
You already have the building blocks
Yet many wait for batch execution
Or jump through estimation hoops
4. Accurate counting
[Architecture diagram: Servers → Server Bus → Bucketiser × 3 → Aggregator]
● Straightforward, with some plumbing.
● Heavier than you need.
5. Now or later? Exact or rough?
Approximation now >> accurate later
6. Basic scenarios
● How many distinct items in last x minutes?
● What are the top k items in last x minutes?
● How many Ys in last x minutes?
These basic techniques are sufficient to
implement e.g. personalisation and
recommendation algorithms.
8. Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter
+ counter.
9. Counting in context
● Look backward, different time windows,
compare.
● Count for a small time quantum, keep
history.
● Aggregate old windows.
● Monoid representations are desirable.
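The monoid requirement above just means that per-quantum counts can be merged associatively, so old windows can be aggregated in any grouping without losing accuracy. A minimal Python sketch (the names and sample data are illustrative, not from the talk):

```python
from collections import Counter

# A per-quantum count is a Counter; Counter addition is the monoid
# operation (associative, with the empty Counter as identity).

def merge(*windows):
    """Combine any number of per-quantum counts into one window."""
    total = Counter()
    for w in windows:
        total += w
    return total

minute1 = Counter({"U2": 3, "Gaga": 1})
minute2 = Counter({"U2": 2, "Avicii": 4})
minute3 = Counter({"Gaga": 5})

# Grouping doesn't matter: (m1 + m2) + m3 == m1 + (m2 + m3),
# so old windows can be rolled up lazily.
hour = merge(minute1, minute2, minute3)
```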
12. Cardinality - distinct stream count
● Naive: Set of hashes. X bits per item.
● Naive 2: Set approximation with Bloom filter
+ counter.
● Naive 3: Hash to bitmap. Count bits.
● Attempt 4: Hash, bitmap, count + collision
compensation. Linear Probabilistic Counter.
● Read the papers… → HyperLogLog counter
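The "Attempt 4" above, the Linear Probabilistic Counter, can be sketched in a few lines: hash each item into a bitmap, then compensate for collisions with n ≈ −m · ln(empty/m). The bitmap size and hash choice below are illustrative:

```python
import hashlib
import math

class LinearCounter:
    """Linear Probabilistic Counter: hash items into an m-bit bitmap,
    then estimate distinct count from the fraction of still-empty bits."""

    def __init__(self, m=1024):
        self.m = m
        self.bits = [0] * m

    def add(self, item):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) % self.m
        self.bits[h] = 1

    def estimate(self):
        empty = self.bits.count(0)
        if empty == 0:
            return float("inf")  # bitmap saturated; a bigger m is needed
        # Collision compensation: expected empty fraction is e^(-n/m).
        return -self.m * math.log(empty / self.m)

lc = LinearCounter()
for i in range(500):
    lc.add(f"item-{i}")
# lc.estimate() lands near 500 for m=1024
```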
14. Top K counting
Before          Peps arrives    Dolly returns
U2      65      U2      65      U2      65
Gaga    46      Gaga    46      Gaga    46
Avicii  23      Avicii  23      Avicii  23
Eminem  21      Eminem  21      Eminem  21
Dolly   18      Peps    19      Dolly   20
● Keep only k items; assume an absent item
has the current lowest value.
● Accurate at the top, overcounting at the bottom.
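The evict-and-inherit scheme above (a simplified Space-Saving-style counter; class and method names are illustrative) can be sketched as:

```python
class TopK:
    """Keep only k counters; a newcomer evicts the current minimum and
    inherits its count + 1, so the head of the list stays accurate
    while the tail may overcount."""

    def __init__(self, k=5):
        self.k = k
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Absentee: assume it had the lowest value, replace that entry.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self):
        return sorted(self.counts.items(), key=lambda kv: -kv[1])

# Replaying the slide: Peps evicts Dolly (18) and inherits 19,
# then Dolly returns, evicts Peps (19) and inherits 20.
tk = TopK(k=5)
tk.counts = {"U2": 65, "Gaga": 46, "Avicii": 23, "Eminem": 21, "Dolly": 18}
tk.add("Peps")
tk.add("Dolly")
```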
15. Approx counting - Count-Min Sketch
● Compute n hashes for the key, one per row.
● Increment one cell per row: column = hash mod width.
● Retrieve by taking min() over the rows.
[4 × 10 CMS table — one cell incremented per row for the same key:]
 3  7 20  3   11  6  3+1  4    1  1
 3  8  6  2+1 17 13  1    0    4  5
12  7  6 14    2  0  2    3  6+1  7
 3  2 12  8+1 10  2  7    2   11  2
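The three steps above, as a minimal sketch (the md5-based row hashes and the 4 × 10 size are illustrative choices):

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: d rows x w columns; each row has its own hash.
    Increment one cell per row; read back the minimum over rows, which
    bounds the overcount caused by hash collisions."""

    def __init__(self, d=4, w=10):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _cols(self, key):
        # Derive d independent-ish hashes by salting with the row index.
        for row in range(self.d):
            h = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.w

    def add(self, key, n=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += n

    def count(self, key):
        # min() over rows: collisions can only inflate, never deflate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(key)))
```

Note that a CMS never undercounts: every collision adds to a cell, so min() over the rows is an upper bound that equals the true count when at least one row is collision-free.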
16. Top K with Count-Min Sketch
Before          Peps arrives    Dolly returns
U2      65      U2      65      U2      65
Gaga    46      Gaga    46      Gaga    46
Avicii  23      Avicii  23      Avicii  23
Eminem  21      Eminem  21      Eminem  21
Dolly   18      Peps     2      Dolly   19
● Keep Heavy Hitters list.
● Lookup absentees in CMS.
● Risk of overcount is smaller and spread out.
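A sketch of the combination: every item goes into the CMS, and only items whose estimate beats the current minimum enter the heavy-hitters list. The CMS is inlined here to keep the example self-contained, and all parameters are illustrative:

```python
import hashlib

class HeavyHitters:
    """Top-k list backed by a Count-Min Sketch: absentees are looked up
    in the CMS instead of inheriting the evicted minimum, so overcount
    comes only from hash collisions and is smaller and spread out."""

    def __init__(self, k=5, d=4, w=64):
        self.k = k
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]
        self.heavy = {}  # item -> estimated count

    def _cells(self, key):
        for row in range(self.d):
            h = int(hashlib.md5(f"{row}:{key}".encode()).hexdigest(), 16)
            yield row, h % self.w

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1
        est = min(self.table[r][c] for r, c in self._cells(item))
        if item in self.heavy or len(self.heavy) < self.k:
            self.heavy[item] = est
        else:
            # Promote only if the CMS estimate beats the current minimum.
            victim = min(self.heavy, key=self.heavy.get)
            if est > self.heavy[victim]:
                del self.heavy[victim]
                self.heavy[item] = est
```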
17. Cubic CMS
● Decorate song with geo, age, etc. Pour into
CMS.
● Keep heavy hitters per geo, age group.
*:*:<U2>        +1
SE:*:<U2>       +1
*:31-40:<U2>    +1
SE:31-40:<U2>   +1
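The decoration step can be sketched as a key expansion: one play becomes one key per dimension subset, with `*` as the wildcard, and each key gets +1 in the CMS. The function name below is illustrative; the key syntax follows the slide:

```python
def decorated_keys(song, geo, age_group):
    """Expand one play of `song` into a decorated key per subset of
    dimensions, so any slice (*:*, SE:*, *:31-40, SE:31-40) can be
    queried from the CMS later."""
    return [
        f"*:*:{song}",
        f"{geo}:*:{song}",
        f"*:{age_group}:{song}",
        f"{geo}:{age_group}:{song}",
    ]

# A Swedish 35-year-old plays U2: four CMS cells each get +1.
keys = decorated_keys("U2", "SE", "31-40")
```

Note the cost: d extra dimensions mean 2^d decorated increments per event, which is why the slide keeps heavy hitters per geo and age group rather than per full combination.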
18. Machinery
O(10^4) messages/s per machine.
You probably only need one. If not, use Storm.
Read and write to pub/sub channel, e.g. Kafka
or ZeroMQ.
19. Brute force alternative
Dump every single message into
ElasticSearch.
Suitable for high dimensionality cubes.
22. Hungry for more?
Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and-recommendation-stream-mining
Ted Dunning on deep learning for real-time anomaly detection: http://www.berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases
Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20
Open source: stream-lib, Algebird