Joining Infinity — Windowless Stream Processing with Flink
Sanjar Akhmedov, Software Engineer, ResearchGate
The extensive set of high-level Flink primitives makes it easy to join windowed streams. However, use cases that don’t have windows can prove to be more complicated, making it necessary to leverage operator state and low-level primitives to manually implement a continuous join. This talk will focus on the anomalies that present themselves when performing streaming joins with infinite windows, and the problems encountered operating topologies that back user-facing data. We will describe the approach taken at ResearchGate to implement and maintain a consistent join result of change data capture streams.

1. Joining Infinity — Windowless Stream Processing with Flink. Sanjar Akhmedov, Software Engineer, ResearchGate.
2. It started when two researchers discovered first-hand that collaborating with a friend or colleague on the other side of the world was no easy task. ResearchGate is a social network for scientists.
3. Connect the world of science. Make research open to all.
4. We have changed, and are continuing to change, how scientific knowledge is shared and discovered.
5. 11,000,000+ members, 110,000,000+ publications, 1,300,000,000+ citations.
6. Feature: Research Timeline
7. Feature: Research Timeline
8. Diverse data sources: a Proxy and Frontend in front of many Services, backed by memcache, MongoDB, Solr, PostgreSQL, Infinispan, and HBase.
9. Big data pipeline: change data capture, import into a Hadoop cluster, and export back out.
10. Data model: an Account claims Authors (Claim, 1 to many), and a Publication credits Authors (Authorship, 1 to many).
11. Hypothetical SQL:

    CREATE TABLE publications (
      id SERIAL PRIMARY KEY,
      author_ids INTEGER[]
    );

    CREATE TABLE accounts (
      id SERIAL PRIMARY KEY,
      claimed_author_ids INTEGER[]
    );

    CREATE MATERIALIZED VIEW account_publications REFRESH FAST ON COMMIT AS
    SELECT accounts.id     AS account_id,
           publications.id AS publication_id
    FROM accounts
    JOIN publications
      ON ANY (accounts.claimed_author_ids) = ANY (publications.author_ids);
12. Challenges:
    • Data sources are distributed across different DBs
    • Dataset doesn't fit in memory on a single machine
    • Join process must be fault tolerant
    • Deploy changes fast
    • Up-to-date join result in near real-time
    • Join result must be accurate
13–18. Change data capture (CDC): instead of the microservice synchronously writing to its DB and then syncing the Cache, Solr/ES, and HBase/HDFS itself, it writes only to its own DB. Changes are extracted from the DB into a log of keyed records (K2 → 1, K1 → 4, a delete K1 → Ø, … KN → 42), and the Cache, Solr/ES, and HBase/HDFS are synced from that log.
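To make the K → value notation concrete, here is a minimal sketch (illustrative names and shapes, not ResearchGate's actual schema) of one record in such a change log: a key plus the new value, where a null value plays the role of the Ø tombstone that marks a delete.

    /** One entry in the extracted change log. */
    public final class ChangeRecord<K, V> {
        public final K key;    // primary key of the changed row, e.g. K1
        public final V value;  // new row image, or null (Ø) for a delete
        public ChangeRecord(K key, V value) {
            this.key = key;
            this.value = value;
        }
        public boolean isDelete() {
            return value == null;
        }
    }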
19. Join two CDC streams into one: a NoSQL1 source and a SQL source each feed a Kafka topic; a Flink Streaming job joins the two topics and writes the result to a third Kafka topic, which is synced into NoSQL2.
20. Flink job topology: the Accounts stream and the Publications stream are keyed by author (Author 1, Author 2, … Author N) and joined in a CoFlatMap that emits Account Publications.
21–27. Prototype implementation (the same code is shown on slides 21–27, highlighted step by step):

    DataStream<Account> accounts = kafkaTopic("accounts");
    DataStream<Publication> publications = kafkaTopic("publications");
    DataStream<AccountPublication> result = accounts.connect(publications)
        // Key both inputs by author: accounts by the claimed author,
        // publications by the publication's author.
        .keyBy("claimedAuthorId", "publicationAuthorId")
        .flatMap(new RichCoFlatMapFunction<Account, Publication, AccountPublication>() {
            // Per-author operator state: the latest account and publication seen.
            transient ValueState<String> authorAccount;
            transient ValueState<String> authorPublication;

            public void open(Configuration parameters) throws Exception {
                authorAccount = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("authorAccount", String.class, null));
                authorPublication = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("authorPublication", String.class, null));
            }

            public void flatMap1(Account account, Collector<AccountPublication> out) throws Exception {
                authorAccount.update(account.id);
                if (authorPublication.value() != null) {
                    out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
                }
            }

            public void flatMap2(Publication publication, Collector<AccountPublication> out) throws Exception {
                authorPublication.update(publication.id);
                if (authorAccount.value() != null) {
                    out.collect(new AccountPublication(authorAccount.value(), authorPublication.value()));
                }
            }
        });
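The kafkaTopic(...) helper is not shown in the talk. A minimal sketch of what it might look like, assuming the Flink Kafka 0.9 connector of that era, a shared StreamExecutionEnvironment, and a hypothetical JsonDeserializationSchema (the talk's version presumably obtains the element type some other way; a Class token is added here so the generics work):

    import java.util.Properties;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;

    static final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    static <T> DataStream<T> kafkaTopic(String topic, Class<T> type) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");      // assumed broker address
        props.setProperty("group.id", "account-publication-join"); // assumed consumer group
        // JsonDeserializationSchema is hypothetical; any DeserializationSchema<T> fits here.
        return env.addSource(new FlinkKafkaConsumer09<>(topic, new JsonDeserializationSchema<>(type), props));
    }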
28–38. Example dataflow, built up over slides 28–38: the Accounts stream delivers (Alice, 2), which is stored in the join state of the Author 2 instance; then (Bob, 1) arrives and is stored at Author 1. Next the Publications stream delivers (Paper1, authored by 1); it is stored at Author 1, where it meets Bob's claim, and the join emits its first result, K1 (Bob, Paper1).
39. Challenges revisited:
    • ✔ Data sources are distributed across different DBs
    • ✔ Dataset doesn't fit in memory on a single machine
    • ✔ Join process must be fault tolerant
    • ✔ Deploy changes fast
    • ✔ Up-to-date join result in near real-time
    • ? Join result must be accurate
40–46. Paper1 gets deleted: a tombstone (Paper1, Ø) arrives on the Publications stream. The Author 1 join instance still holds Bob and Paper1, but the delete by itself does not say which join results must be retracted; the operator needs the previous value. Inserting a "Diff with Previous State" step lets the job emit the retraction K1 → Ø, and producing that retraction requires a deterministic key for the old result, e.g. K1 = f(Bob, Paper1).
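A hedged sketch of what such a "Diff with Previous State" operator could look like (not the talk's actual code): Command and its Kind enum are illustrative types, ChangeRecord is the sketch from the CDC section above, and the input stream is assumed to be keyed by the record key.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    /** Illustrative command type: upserts and deletes flowing between operators. */
    class Command<T> {
        enum Kind { UPSERT, DELETE }
        final Kind kind;
        final T element;
        Command(Kind kind, T element) { this.kind = kind; this.element = element; }
    }

    /** Remembers the last value per key and rewrites the raw changelog into
     *  explicit commands, so a later change can retract exactly the join
     *  input that the previous value contributed. */
    class DiffWithPreviousState<T> extends RichFlatMapFunction<ChangeRecord<String, T>, Command<T>> {
        private final TypeInformation<T> type;
        private transient ValueState<T> previous;

        DiffWithPreviousState(TypeInformation<T> type) { this.type = type; }

        @Override
        public void open(Configuration parameters) {
            previous = getRuntimeContext().getState(new ValueStateDescriptor<>("previous", type));
        }

        @Override
        public void flatMap(ChangeRecord<String, T> change, Collector<Command<T>> out) throws Exception {
            T old = previous.value();
            if (old != null) {
                out.collect(new Command<>(Command.Kind.DELETE, old)); // retract the old value first
            }
            if (!change.isDelete()) {
                out.collect(new Command<>(Command.Kind.UPSERT, change.value));
                previous.update(change.value);
            } else {
                previous.clear(); // tombstone: the key was deleted upstream
            }
        }
    }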
47–51. Paper1 gets updated: Paper1's author changes from 1 to 2, so the update is also stored at the Author 2 instance, where it meets Alice's claim and the join emits (Alice, Paper1). But under which key? K1 still names the now-stale (Bob, Paper1) result, and the slides mark the new pair's key as "??".
52–62. Alice claims Paper1 via a different author: with Paper1 now authored by (1, 2), the output holds K1 (Bob, Paper1) and K2 (Alice, Paper1). Bob then unclaims Author 1 (Bob → Ø), retracting K1 → Ø, and Alice's claim moves from Author 2 to Author 1 (Alice, 1). At the Author 1 instance the join emits (Alice, Paper1) again, while the Author 2 instance retracts its copy with K2 → Ø. If the new pair reused K2 = f(Alice, Paper1), that upsert and the retraction would carry the same key and the final state would depend on arrival order. The fix is to pick natural IDs that include the author the pair was derived through, e.g. K3 = f(Alice, Author1, Paper1).
63. How to solve deletes and updates (a sketch follows this list):
    • Keep previous element state to update the previous join result
    • Stream elements are not domain entities but commands such as delete or upsert
    • The joined stream must have natural IDs to propagate deletes and updates
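Pulling those three points together, a hedged sketch of a command-based join stage, keyed by author ID. Assumptions beyond the talk: Command is the illustrative type from the previous sketch; Account carries id and claimedAuthorId fields and Publication an id; and AccountPublication is given a constructor that also takes the natural ID.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    class CommandJoin
            extends RichCoFlatMapFunction<Command<Account>, Command<Publication>, Command<AccountPublication>> {
        private transient ValueState<Account> accountSide;
        private transient ValueState<Publication> publicationSide;

        @Override
        public void open(Configuration parameters) {
            accountSide = getRuntimeContext().getState(new ValueStateDescriptor<>("account", Account.class));
            publicationSide = getRuntimeContext().getState(new ValueStateDescriptor<>("publication", Publication.class));
        }

        @Override
        public void flatMap1(Command<Account> cmd, Collector<Command<AccountPublication>> out) throws Exception {
            Publication pub = publicationSide.value();
            if (cmd.kind == Command.Kind.DELETE) {
                accountSide.clear();
                if (pub != null) out.collect(pair(Command.Kind.DELETE, cmd.element, pub)); // retract old pair
            } else {
                accountSide.update(cmd.element);
                if (pub != null) out.collect(pair(Command.Kind.UPSERT, cmd.element, pub));
            }
        }

        @Override
        public void flatMap2(Command<Publication> cmd, Collector<Command<AccountPublication>> out) throws Exception {
            Account acc = accountSide.value();
            if (cmd.kind == Command.Kind.DELETE) {
                publicationSide.clear();
                if (acc != null) out.collect(pair(Command.Kind.DELETE, acc, cmd.element));
            } else {
                publicationSide.update(cmd.element);
                if (acc != null) out.collect(pair(Command.Kind.UPSERT, acc, cmd.element));
            }
        }

        private Command<AccountPublication> pair(Command.Kind kind, Account a, Publication p) {
            // The natural ID includes the author the pair was derived through, so
            // (Alice, Paper1) via Author1 and via Author2 stay distinct: K3 = f(account, author, publication).
            String naturalId = a.id + ":" + a.claimedAuthorId + ":" + p.id;
            return new Command<>(kind, new AccountPublication(naturalId, a.id, p.id));
        }
    }

Because the upstream diff supplies the old element inside each DELETE command, the join can reconstruct and retract exactly the pair it previously emitted.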
64. Generic join graph: each input (Accounts stream, Publications stream) first passes through a Diff operator keyed by entity (Alice, Bob, … / Paper1, … PaperN), and then into the Join keyed by author (Author1, … AuthorM).
65. Generic join graph, operating on commands: the same topology, with every edge carrying upsert/delete commands rather than raw entities.
66. Memory requirements: the Diff operators keep a full copy of the Accounts stream and a full copy of the Publications stream, and the Join keeps another full copy of each (Accounts on its left side, Publications on its right).
67. Network load: the data crosses the network twice, reshuffled once into the Diff operators (keyed by entity) and again into the Join (keyed by author).
68. Resource considerations:
    • In addition to handling Kafka traffic, we need to reshuffle all data twice over the network
    • We need to keep two full copies of each joined stream in memory
69. Questions. We are hiring: www.researchgate.net/careers
