The document summarizes a talk on the evolution of architectures for building real-time analytics systems. It describes moving from a single-node architecture using Node.js and Socket.io to a distributed architecture built on Erlang processes and libraries such as gproc (pub/sub) and Bullet (on top of Cowboy). The latest architecture, Swirl, is a lightweight distributed stream processing system that uses Erlang terms and processes to filter, aggregate, and reduce streams of events in real time across multiple nodes.
2. AdGear is a full-stack ad platform for publishers and advertisers, with advanced analytics, attribution measurement, ad serving, and real-time bidding technology.
4. Real-time reporting... why?
• help clients to make informed decisions
  • should I increase the bid price?
  • should I bid on exchange X?
• inventory control (brand safety)
• debugging (bot detection, creative audits)
8. Problems
• no SMP support
• each process needs to be monitored
• requires load-balancing (nginx)
• duplicated state (per process)
• duplicated work (de-serialization)
• bad error handling (event loop explodes)
• callbacks...
11. Architecture #2
1. receive buffered events, split and de-serialize
2. each event is sent to a collector process (3) using gproc (pub/sub) for filtering
3. collector (gen_server) aggregates messages using ETS counters and flushes every second
4. bullet handler serializes the aggregates (tab2list to JSON)
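The collector in step 3 can be sketched in a few lines. This is an illustrative reconstruction, not AdGear's code: the module and stream names are invented, and it uses the OTP 18+ `ets:update_counter/4` default-object form for brevity.

```erlang
%% Illustrative sketch of architecture #2's collector: subscribes
%% to a stream via a gproc pub/sub property, counts events in ETS,
%% and flushes the aggregates every second.
-module(collector_sketch).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(FLUSH_INTERVAL, 1000).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% subscribe: any process can publish with
    %% gproc:send({p, l, {stream, bids}}, {stream_event, Key})
    gproc:reg({p, l, {stream, bids}}),
    Tid = ets:new(counters, [set, private]),
    erlang:send_after(?FLUSH_INTERVAL, self(), flush),
    {ok, Tid}.

handle_info({stream_event, Key}, Tid) ->
    %% bump the counter for Key, creating it on first use
    ets:update_counter(Tid, Key, 1, {Key, 0}),
    {noreply, Tid};
handle_info(flush, Tid) ->
    Aggregates = ets:tab2list(Tid),
    ets:delete_all_objects(Tid),
    %% the real system hands Aggregates to the bullet handler
    %% for JSON serialization (tab2list to json)
    io:format("aggregates: ~p~n", [Aggregates]),
    erlang:send_after(?FLUSH_INTERVAL, self(), flush),
    {noreply, Tid}.

handle_call(_Request, _From, Tid) -> {reply, ok, Tid}.
handle_cast(_Msg, Tid) -> {noreply, Tid}.
```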
12. Problems
• ssh_channel process and collector process are bottlenecks
• number of messages increases with the number of clients
• requires lots of bandwidth for large streams
• limited filtering (match specs)
13. Improvements...
(6 months ago)
• optimize collector's message loop (gen_server to proc_lib)
• use SSH compression
  • added support for OpenSSH zlib compression * (R16B02)
* https://github.com/lpgauth/otp/tree/openssh_zlib
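The gen_server-to-proc_lib optimization trades the generic callback machinery for a bare receive loop on the hot path. A minimal sketch of that pattern, with illustrative names:

```erlang
%% Sketch of a proc_lib-based collector loop: same counting
%% logic as a gen_server collector, but a plain receive avoids
%% gen_server's per-message decode/dispatch overhead.
-module(collector_loop).
-export([start_link/0, init/1]).

start_link() ->
    proc_lib:start_link(?MODULE, init, [self()]).

init(Parent) ->
    proc_lib:init_ack(Parent, {ok, self()}),
    Tid = ets:new(counters, [set, private]),
    erlang:send_after(1000, self(), flush),
    loop(Tid).

loop(Tid) ->
    receive
        {event, Key} ->
            ets:update_counter(Tid, Key, 1, {Key, 0}),
            loop(Tid);
        flush ->
            %% hand the aggregates off, then start a fresh interval
            _Aggregates = ets:tab2list(Tid),
            ets:delete_all_objects(Tid),
            erlang:send_after(1000, self(), flush),
            loop(Tid)
    end.
```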
17. What did I just agree to...
• I only have 3 days to build this...
• the bid request stream is too large to aggregate in a central location (1+ Gbit/s, 80K+ events/s)
18. Strategy for demo
1. move aggregation upstream
2. use ETS match select to find table ids (filtering)
3. increment counters in-process (no message!)
4. periodically flush aggregates via message to the collector node
5. collector node increments local counters and periodically flushes aggregates to the bullet handler
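Steps 2–3 above can be sketched as follows. The flow-table layout, the `register_flow/4` helper, and representing a filter as a fun are assumptions made for illustration, not Swirl's actual data model:

```erlang
%% Sketch of in-process aggregation: the emitting process looks
%% up matching flows with an ETS match spec and increments
%% counters locally, so no per-event message is sent.
-module(upstream_agg).
-export([register_flow/4, emit/3]).

%% FlowTable holds objects {{flow, StreamName, Filter}, CountersTid}
register_flow(FlowTable, StreamName, Filter, CountersTid) ->
    ets:insert(FlowTable, {{flow, StreamName, Filter}, CountersTid}).

emit(FlowTable, StreamName, Event) ->
    %% match spec: all {Filter, CountersTid} pairs registered
    %% under this stream name
    MatchSpec = [{{{flow, StreamName, '$1'}, '$2'}, [],
                  [{{'$1', '$2'}}]}],
    Flows = ets:select(FlowTable, MatchSpec),
    lists:foreach(
        fun({Filter, CountersTid}) ->
            case Filter(Event) of
                true ->
                    %% keying counters by exchange is illustrative
                    Key = maps:get(exchange, Event, undefined),
                    ets:update_counter(CountersTid, Key, 1, {Key, 0});
                false ->
                    ok
            end
        end, Flows).
```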
25. Mapper Node
1. process “emits” an event
2. look up in ETS whether a flow matches the stream name and filter
3. if there's a match, call flow_mod:map/4
4. if map returns counters, increment them in ETS
5. swirl_mapper periodically flushes aggregates to the reducer node
26. Reducer Node
1. swirl_tracker receives mapper aggregates and forwards them to the reducer
2. reducer increments counters in ETS
3. reducer flushes counters to flow_mod:reduce/4
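Both the mapper and reducer slides revolve around a user-supplied flow module. The arities (flow_mod:map/4, flow_mod:reduce/4) come from the slides, but the argument names and return shape below are assumptions; consult Swirl's source for the real callback contract:

```erlang
%% Hypothetical flow module in the shape the slides describe:
%% map/4 runs on mapper nodes per matching event; reduce/4 runs
%% periodically on the reducer node with the merged counters.
-module(bid_count_flow).
-export([map/4, reduce/4]).

%% returning a counter tuple tells the mapper what to increment
%% in ETS (argument names and return shape are assumed)
map(_FlowId, _StreamName, Event, _MapperOpts) ->
    Exchange = maps:get(exchange, Event, undefined),
    {{Exchange, bids}, 1}.

%% receives the aggregates the reducer flushes each period
reduce(_FlowId, _Period, Aggregates, _ReducerOpts) ->
    io:format("aggregates: ~p~n", [Aggregates]).
```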
27. Swirl-ql
• SQL WHERE-clause-like syntax
• supported operators:
  • AND / OR
  • <, <=, =, >, <>
  • IN (x, y) / NOT IN (x, y, z)
  • IS NULL / IS NOT NULL (undefined)
* https://github.com/lpgauth/swirl-ql
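Putting the operators together, a filter over a bid-request stream might look like the snippet below. The parse/evaluate function names are assumptions about the swirl-ql API; the repository above has the authoritative interface.

```erlang
%% Hypothetical usage: compile a WHERE-clause-style filter once,
%% then evaluate it against each event's variables.
{ok, ExpTree} = swirl_ql:parse(
    "exchange_id IN (10, 12) AND bid_price > 150"
    " AND category IS NOT NULL"),
true = swirl_ql:evaluate(ExpTree, [
    {exchange_id, 12},
    {bid_price, 160},
    {category, <<"news">>}
]).
```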