Stream processing

Slides for my talk at Øredev 2016. Introduces stream processing, some techniques, and example uses. Also introduces technologies like Kafka, Cassandra, and Spark, with their pros and cons. Video available at https://vimeo.com/191056269

STREAM PROCESSING
@ASHIC
HTTP://WWW.HEARTYSOFT.COM

BIG DATA
• Hadoop
• Map-Reduce
• Spark

BIG DATA
• Optimisations
• Parquet, etc.

STREAMING DATA
• Cheaper?
• Timely results?
• Approximations?

EXAMPLES
• Statistical Summaries
Mean, Standard Deviation

EXAMPLES
• Statistical Summaries
Hold n, sum, and sum of squares
=> Mean, Standard Deviation
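
A minimal Python sketch of the summary above, assuming a numeric stream. The class name `RunningStats` and its methods are illustrative and not from the talk. It keeps only n, the sum, and the sum of squares, and derives the mean and the population standard deviation on demand.

```python
import math


class RunningStats:
    """Keeps only n, the sum, and the sum of squares of the items seen."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def std_dev(self):
        # Population variance is E[x^2] - (E[x])^2; clamp at 0 to absorb
        # floating-point rounding.
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))


stats = RunningStats()
for item in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(item)
print(stats.mean(), stats.std_dev())
```

This form can lose precision when the values are large relative to their spread; Welford's method is a common alternative if that matters.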

EXAMPLES
• Statistical Summaries
Approximation of Median

EXAMPLES
• Statistical Summaries
* Start with an initial estimate
* If item > estimate, add the learning rate
* If item < estimate, subtract the learning rate
=> Approximation of Median
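
The steps above can be sketched in a few lines of Python. The function name, the starting value, and the learning rate of 0.1 are assumptions chosen for illustration; the synthetic data is uniform on 0 to 100, so the true median is close to 50.

```python
import random


def approximate_median(stream, start=0.0, learning_rate=0.1):
    """Nudge the estimate towards each item by a fixed step."""
    estimate = start
    for item in stream:
        if item > estimate:
            estimate += learning_rate
        elif item < estimate:
            estimate -= learning_rate
    return estimate


rng = random.Random(42)
data = [rng.uniform(0, 100) for _ in range(20000)]
print(approximate_median(data, start=0.0, learning_rate=0.1))
```

A larger learning rate reaches the neighbourhood of the median sooner but wobbles more once it gets there; a smaller one is steadier but slower to converge.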

EXAMPLES
• Taking Representative Samples
- From weblogs (i.e. ip-timestamp tuples), approximate the average percentage of users who have revisited.
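
One common way to take such a sample, and not necessarily the one presented in the talk, is to sample users rather than individual log lines: hash each IP into a fixed number of buckets and keep every record for the IPs that land in the chosen buckets. The sketch below assumes that approach; `in_sample`, `revisit_fraction`, and the synthetic weblog are illustrative names and data.

```python
import hashlib


def in_sample(ip, keep=1, buckets=10):
    """Keep every record for roughly keep/buckets of all IPs."""
    digest = hashlib.sha256(ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets < keep


def revisit_fraction(weblog, keep=1, buckets=10):
    """Fraction of sampled IPs that appear more than once in the weblog."""
    visits = {}
    for ip, _timestamp in weblog:
        if in_sample(ip, keep, buckets):
            visits[ip] = visits.get(ip, 0) + 1
    if not visits:
        return None
    return sum(1 for count in visits.values() if count > 1) / len(visits)


# Synthetic weblog: 2000 distinct IPs, half of which come back once.
weblog = []
for i in range(2000):
    ip = f"10.0.{i // 250}.{i % 250}"
    weblog.append((ip, 1000 + i))
    if i % 2 == 0:
        weblog.append((ip, 5000 + i))

print(revisit_fraction(weblog, keep=1, buckets=10))
```

Sampling one log line in ten instead would break up each user's history and understate revisits, which is why the decision is made per IP.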

EXAMPLES
• Filtering Streams
Filter Out (or In) Things That May Not Be Needed

EXAMPLES
• Filtering Streams
Bloom Filter
• Hash based on criterion
• Matching hash means entry may be in there
• Non-matching hash means it’s definitely not
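
A small Bloom filter sketch in Python, to make the "may be in there / definitely not" behaviour concrete. The sizes, the use of SHA-256 with a seed prefix as the hash family, and the class and method names are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive several hash positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))


allowed = BloomFilter()
for address in ["alice@example.com", "bob@example.com"]:
    allowed.add(address)

print(allowed.might_contain("alice@example.com"))
print(allowed.might_contain("mallory@example.com"))
```

Items that were added always report as possibly present. Items that were never added usually report as absent, but can occasionally collide, and the false-positive rate grows as the bit array fills up.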

EXAMPLES
How Many Distinct Things Did We Get?

EXAMPLES
• Approximate Distinct Elements
Flajolet-Martin Algorithm
• Hash element (or identifier) to longs using many hash functions. For each hash function, count the trailing zeroes of the hash; call it r.
• Approximation for distinct elements = 2^R, where R = max(r) seen for that hash function
• Combine the estimates in groups: take the average within each group, then take the median of the averages.
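
A rough Python sketch of the Flajolet-Martin steps listed above. The hash family (SHA-256 with a seed prefix, truncated to 64 bits), the group sizes, and the function names are assumptions for illustration, and the result is a coarse estimate rather than an exact count.

```python
import hashlib
import statistics


def trailing_zeroes(value):
    if value == 0:
        return 64  # an all-zero 64-bit hash; treated as 64 trailing zeroes
    return (value & -value).bit_length() - 1


def hash64(item, seed):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_distinct(stream, groups=9, hashes_per_group=8):
    num_hashes = groups * hashes_per_group
    max_r = [0] * num_hashes  # R for each hash function
    for item in stream:
        for seed in range(num_hashes):
            r = trailing_zeroes(hash64(item, seed))
            if r > max_r[seed]:
                max_r[seed] = r
    estimates = [2 ** r for r in max_r]
    # Average within each group, then take the median of the averages.
    averages = [
        sum(estimates[g * hashes_per_group:(g + 1) * hashes_per_group])
        / hashes_per_group
        for g in range(groups)
    ]
    return statistics.median(averages)


items = [f"user-{i}" for i in range(1000)]
print(estimate_distinct(items * 3))  # duplicates do not change the estimate
```

Only one small integer per hash function is stored, so memory use does not depend on how many items flow past.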

EXAMPLES
• Clustering
• Bradley, Fayyad, Reina (BFR) approach.
• BDMO algorithm.
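
The slide only names the algorithms. As a hedged illustration of the idea commonly associated with BFR, the sketch below summarises a cluster by N, SUM, and SUMSQ per dimension, so that points can be discarded once added, and uses a variance-normalised distance to decide whether a new point is close to the cluster. The class and method names are assumptions, and this covers only the summary, not the full algorithm.

```python
import math


class ClusterSummary:
    """Summarises a cluster by N, SUM and SUMSQ per dimension."""

    def __init__(self, dims):
        self.n = 0
        self.sums = [0.0] * dims
        self.sums_sq = [0.0] * dims

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sums[i] += x
            self.sums_sq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sums]

    def variances(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sums, self.sums_sq)]

    def distance(self, point):
        # Variance-normalised distance, assuming independent axes.
        total = 0.0
        for x, c, v in zip(point, self.centroid(), self.variances()):
            total += (x - c) ** 2 / max(v, 1e-12)
        return math.sqrt(total)


cluster = ClusterSummary(dims=2)
for point in [(0.0, 0.0), (2.0, 2.0)]:
    cluster.add(point)
print(cluster.centroid(), cluster.distance((3.0, 1.0)))
```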

USEFUL TECHNOLOGY
• Apache Kafka
• Apache Cassandra
• Apache Spark

KAFKA
• Scale out, clustered, durable message broker.
• Fault tolerant, replicated.
• Uses topics, which have partitions.
• Messages within partitions have guaranteed ordering.
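
To illustrate why ordering is a per-partition guarantee, here is a toy Python model of a topic, not the Kafka client API. It assumes messages are routed by hashing their key; `zlib.crc32` is a stand-in hash chosen for the sketch, not Kafka's actual partitioner, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stand-in hash for illustration; not Kafka's real partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(messages, num_partitions=3):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
partitions = publish(messages)
for number, log in sorted(partitions.items()):
    print(number, log)
```

All messages with the same key land in the same partition in the order they were sent, while nothing is promised about the relative order of messages in different partitions.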

KAFKA
• Kafka Streams: Lightweight Kafka => [x] library
• Kafka Connect: Enables streaming large amounts of data reliably between Kafka and other systems
• Schema Registry: Well…registry for schemas

KAFKA - GOTCHAS
• Messages in a partition are ordered; message processing may not be.
• At least once… downstream idempotence required.
• Disk.
• Rebalances.
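
A sketch of what downstream idempotence can look like under at-least-once delivery, written in plain Python rather than against a Kafka client. It assumes every message carries a unique id; the `IdempotentSink` class and the account example are hypothetical.

```python
class IdempotentSink:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.balances = {}
        self.applied = set()  # ids of messages already applied

    def apply(self, message_id, account, amount):
        if message_id in self.applied:
            return False  # redelivery: ignore
        self.balances[account] = self.balances.get(account, 0) + amount
        self.applied.add(message_id)
        return True


sink = IdempotentSink()
deliveries = [("m1", "acc-1", 10), ("m2", "acc-1", 5), ("m1", "acc-1", 10)]
for message_id, account, amount in deliveries:
    sink.apply(message_id, account, amount)
print(sink.balances)
```

In a real system the set of applied ids and the state change would need to be stored together, for example in the same write, so that a crash between the two cannot reintroduce duplicates.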

CASSANDRA
• Partitioned row store.
• Fault tolerant, Masterless.
• Very fast writes, fast reads.
• Tunable consistency.
• Multi-datacentre aware.
• OLTP + OLAP (via Spark).

CASSANDRA – SCHEMA
• Collection Types
• User defined types
• Static Columns
• Materialised Views

CASSANDRA – DATA MODELLING
• NOT a relational database
• KNOW YOUR QUERIES
• Model for queries, not normalisation
• Consolidate to minimal number of tables that get the job done
• Unbounded partition growth will bring down nodes, then quorum
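
One common way to keep partitions bounded is to add a time bucket to the partition key. The Python sketch below assumes a hypothetical sensor-readings table partitioned by sensor id and UTC day; the function name and the choice of a daily bucket are illustrative, and the right bucket size depends on the write rate.

```python
from datetime import datetime, timezone


def partition_key(sensor_id, timestamp):
    """Compound partition key: one partition per sensor per UTC day."""
    day = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)


reading_time = datetime(2016, 11, 9, 13, 30, tzinfo=timezone.utc)
print(partition_key("sensor-1", reading_time))
```

A query for a time range then reads one partition per day in that range, instead of one partition that grows for as long as the sensor keeps reporting.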

SPARK
• General purpose data processing
• Ability to cache things in memory, and re-use across steps.

SPARK STREAMING
• Microbatches
• Similar API to non-streaming Spark
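
The micro-batch idea can be shown without Spark. The plain-Python sketch below, which is not the Spark Streaming API, groups timestamped events into fixed-length batches and runs an ordinary batch computation (a word count) on each one. It assumes events arrive sorted by time; the function name and the sample events are made up for the example.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events, sorted by time, into fixed windows."""
    batch, batch_start = [], None
    for ts, value in events:
        if batch_start is None:
            batch_start = ts - ts % batch_seconds
        # Close finished windows (possibly empty ones) before adding the event.
        while ts >= batch_start + batch_seconds:
            yield batch_start, batch
            batch, batch_start = [], batch_start + batch_seconds
        batch.append(value)
    if batch_start is not None:
        yield batch_start, batch


events = [(0, "kafka"), (1, "spark"), (5, "kafka"), (12, "kafka")]
for start, words in micro_batches(events, batch_seconds=5):
    print(start, Counter(words))
```

Because each batch is just a small collection, the same counting code would work on a non-streaming dataset, which is the point of the similar API.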

SPARK + KAFKA
* Kafka Direct Stream

SPARK + CASSANDRA
* rdd.saveToCassandra
* sc.cassandraTable

KAFKA + CASSANDRA
* Cassandra Sink
* Cassandra Connect

STREAM PROCESSING
• Lots of open problems
• RISE Labs (Real-time, Intelligent, and Secure Execution)

THANK YOU
@ashic
http://github/Heartysoft/cassy-up
