Slides for my talk at Oredev 2016. Introduces stream processing, some techniques, and example uses. Also introduces technologies like Kafka, Cassandra, Spark, with their pros and cons.
Video available at https://vimeo.com/191056269 .
14. EXAMPLES
• Statistical Summaries
* Start with a value
* If item > value, add learning rate
* If item < value, subtract learning rate
=> Approximation of Median
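The three steps above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the function name `approximate_median` and the `start` and `learning_rate` defaults are assumptions made for the example, and the estimate only settles near the median when the learning rate is small relative to the spread of the data.

```python
def approximate_median(stream, start=0.0, learning_rate=0.1):
    """Nudge a running estimate towards the median, one item at a time."""
    estimate = start
    for item in stream:
        if item > estimate:
            estimate += learning_rate   # item above the estimate: step up
        elif item < estimate:
            estimate -= learning_rate   # item below the estimate: step down
        # equal items leave the estimate unchanged
    return estimate
```

Only one number is kept in memory regardless of stream length, which is the point of the technique. A larger learning rate reacts faster but wobbles more around the true median.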
15. EXAMPLES
• Taking Representative Samples
- From weblogs (i.e. ip-timestamp tuples), approximate the average percentage of users who have revisited.
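One way to take such a sample, sketched below, is to sample users rather than individual log lines: hash the IP into buckets and keep every entry for the users that land in the chosen buckets. The slide does not spell out the method, so this approach is an assumption, as are the names `in_sample` and `revisit_fraction` and the bucket counts.

```python
import hashlib
from collections import Counter

def in_sample(user_key, keep_buckets=1, total_buckets=10):
    """Decide by hash whether a user belongs to the sample (stable per user)."""
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total_buckets
    return bucket < keep_buckets

def revisit_fraction(log_entries, keep_buckets=1, total_buckets=10):
    """Fraction of sampled users that appear more than once in the log."""
    visits = Counter()
    for ip, _timestamp in log_entries:
        if in_sample(ip, keep_buckets, total_buckets):
            visits[ip] += 1
    if not visits:
        return None  # no sampled users seen yet
    returning = sum(1 for count in visits.values() if count > 1)
    return returning / len(visits)
```

Sampling random log lines instead would lose most repeat visits of the same user and underestimate the revisit rate. Sampling by user keeps each sampled user's history complete.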
17. EXAMPLES
• Filtering Streams
Bloom Filter
• Hash based on criterion
• Matching hash means entry may be in there
• Non matching hash means it’s definitely not
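A minimal Bloom filter sketch of the behaviour described above. The class name, the default sizes, and the use of seeded SHA-256 as the hash family are assumptions chosen for readability. A production filter would use a packed bit array and faster hashes.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item):
        # Derive several bit positions per item by seeding the hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True: the item may be present (false positives are possible).
        # False: the item was definitely never added.
        return all(self.bits[pos] for pos in self._positions(item))
```

To filter a stream, add the allowed keys once, then drop any element for which `might_contain` returns False. The few false positives that get through can be checked against the real set downstream.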
19. EXAMPLES
• Approximate Distinct Elements
Flajolet-Martin Algorithm
• Hash element (or identifier) to longs using many hash functions. Count trailing zeroes of hash. Let it be r.
• Approximation for distinct elements = 2^R where R = max(r)
• Combine groups of hashes: Take average for each group, then take median of the averages.
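The steps above can be sketched as follows. The helper names, the group sizes, and the use of seeded SHA-256 as a stand-in for "many hash functions" are assumptions for illustration. The estimate is coarse by nature, since each hash can only report a power of two.

```python
import hashlib
import statistics

def trailing_zeros(n, width=64):
    """Number of trailing zero bits in n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1

def hash64(item, seed):
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def estimate_distinct(stream, num_groups=8, hashes_per_group=8):
    num_hashes = num_groups * hashes_per_group
    max_r = [0] * num_hashes              # R = max(r), tracked per hash function
    for item in stream:
        for seed in range(num_hashes):
            r = trailing_zeros(hash64(item, seed))
            if r > max_r[seed]:
                max_r[seed] = r
    estimates = [2 ** r for r in max_r]   # 2^R for each hash function
    group_means = [
        statistics.mean(estimates[g * hashes_per_group:(g + 1) * hashes_per_group])
        for g in range(num_groups)
    ]
    return statistics.median(group_means) # median of the group averages
```

Repeated elements hash to the same values, so duplicates never change the result. Memory use is one small integer per hash function, independent of the stream length.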
23. KAFKA
• Scale out, clustered, durable message broker.
• Fault tolerant, replicated.
• Uses topics, which have partitions.
• Messages within partitions have guaranteed ordering.
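A toy model of topics and partitions, to make the ordering guarantee concrete. It does not use Kafka or reproduce Kafka's real partitioner. `ToyTopic`, `choose_partition`, and the CRC32 hash are hypothetical stand-ins, and routing by message key is an assumption added for the example.

```python
import zlib

def choose_partition(key, num_partitions):
    # Stand-in hash, chosen only because it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

class ToyTopic:
    """A topic as a list of append-only partition logs."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        p = choose_partition(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)
```

Messages with the same key always land in the same partition log and keep their relative order there. Nothing orders messages across different partitions.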
24. KAFKA
• Kafka Streams: Lightweight Kafka => [x] library
• Kafka Connect: Enables streaming large amounts of data reliably between Kafka and other systems
• Schema Registry: Well…registry for schemas
26. KAFKA - GOTCHAS
• Messages in a partition are ordered; message processing may not be.
• At-least-once delivery… downstream idempotence required.
• Disk.
• Rebalances.
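A sketch of what downstream idempotence can look like under at-least-once delivery: remember which messages have already been applied and skip redeliveries. No Kafka client is used here. The `Message` and `IdempotentLedger` types are hypothetical, and the in-memory set stands in for a durable store that a real consumer would need.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    partition: int
    offset: int
    account: str
    amount: int

class IdempotentLedger:
    def __init__(self):
        self.balances = {}
        self.applied = set()  # (partition, offset) pairs already processed

    def handle(self, msg):
        key = (msg.partition, msg.offset)
        if key in self.applied:
            return False      # redelivered message: applying it again would double-count
        self.balances[msg.account] = self.balances.get(msg.account, 0) + msg.amount
        self.applied.add(key)
        return True
```

In practice the balance update and the record of the applied message should be stored atomically. Otherwise a crash between the two steps reintroduces the duplicate.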
27. CASSANDRA
• Partitioned row store.
• Fault tolerant, Masterless.
• Very fast writes, fast reads.
• Tunable consistency.
• Multi-datacentre aware.
• OLTP + OLAP (via Spark).
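Tunable consistency comes down to simple arithmetic over the replication factor. The helper names below are made up for illustration. The rule itself is the usual one: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor.

```python
def quorum(replication_factor):
    """Majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of 3, quorum reads and quorum writes (2 + 2 > 3) overlap. Reading and writing a single replica each (1 + 1) does not, which trades consistency for latency.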
31. CASSANDRA - DATA MODELLING
• NOT a relational database
• KNOW YOUR QUERIES
• Model for queries, not normalisation
• Consolidate to minimal number of tables that get the job done
• Unbounded partition growth will bring down nodes, then quorum
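One common way to keep partitions bounded, sketched here as an assumption rather than something the slides prescribe, is to add a time bucket to the partition key so that each partition only ever holds a limited window of data. The sensor scenario and the `partition_key` helper are hypothetical.

```python
from datetime import datetime, timezone

def partition_key(sensor_id, timestamp):
    """Compose a partition key of (sensor, UTC day) from a timezone-aware timestamp."""
    day = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)
```

A key of `sensor_id` alone grows forever as readings arrive. With the day added, a query for a time range has to visit one partition per day, which is the "model for queries" trade-off in miniature.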