Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
@nicolas_frankel
A CDC use-case:
Designing an Evergreen
Cache
Nicolas Fränkel
@nicolas_frankel
Me, myself and I
 Former developer, team lead, architect,
blah-blah
 Developer Advocate
 Interested in...
@nicolas_frankel
Hazelcast
HAZELCAST IMDG is an operational,
in-memory, distributed computing
platform that manages data u...
@nicolas_frankel
Agenda
1. Why cache?
2. Alternatives to keeping the cache in sync
3. Change-Data-Capture (CDC)
4. Debeziu...
@nicolas_frankel
The caching trade-off
 Improved performance/availability
 Stale data
@nicolas_frankel
The initial state
1. The application
2. The RDBMS
3. The cache
@nicolas_frankel
Aye, there’s the rub!
 A new component
writes to the
database
 E.g.: a table holding
references needs t...
@nicolas_frankel
How to keep the cache in sync with the DB?
@nicolas_frankel
Cache invalidation
“There are two hard things in computer
science:
1. Naming things
2. Cache invalidation...
@nicolas_frankel
Cache eviction vs Time-To-Live
 Cache eviction: which entities to evict
when the cache is full
• Least R...
@nicolas_frankel
Choosing the “correct” TTL
 Less frequent than the update frequency
• Miss updates
 More frequent than ...
@nicolas_frankel
Polling process
Same issue regarding the frequency
@nicolas_frankel
Event-driven for the win!
1. If no writes happen, there's no need to
update the cache
2. If a write happe...
@nicolas_frankel
RDMBS triggers
 Not all RDBMS implement triggers
 How to call an external process from the
trigger?
@nicolas_frankel
The example of MySQL: User-defined function
 Functions must be written in C++
 The OS must support dyna...
@nicolas_frankel
lib_mysqludf_sys
UDF library with functions to interact with the operating system
CREATE TRIGGER MyTrigge...
@nicolas_frankel
Cons
 Implementation-dependent
 Fragile
 Who maintains/debugs it?
 Resource-consuming if done frequen...
@nicolas_frankel
Change-Data-Capture
“In databases, Change Data Capture is a set
of software design patterns used to deter...
@nicolas_frankel
CDC implementation options
1. Timestamps on rows
2. Version numbers on rows
3. Status indicators on rows
...
@nicolas_frankel
“Turning the database inside out” - Martin Kleppman
-- https://www.confluent.io/blog/turning-the-database...
@nicolas_frankel
What is a transaction/binary/etc. log?
“The binary log contains ‘events’ that
describe database changes s...
@nicolas_frankel
Reasons for the log
1. Data recovery
2. Replication
@nicolas_frankel
What if we “hacked” the log?
@nicolas_frankel
Sample MySQL binlog
### UPDATE `test`.`t`
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2...
@nicolas_frankel
Kind reminder…
 Implementation-dependent
 Fragile
 Who maintains/debugs it?
@nicolas_frankel
Debezium to the rescue
 Java-based abstraction layer for CDC
 Provided by Red Hat
 Apache v2 licensed
...
@nicolas_frankel
Debezium connector plugins
 Production-ready
• MySQL
• PostrgreSQL
• MongoDB
• SQL Server
 Incubating
•...
@nicolas_frankel
Debezium
“Debezium records all
row-level changes
within each database
table in a change
event stream”
-- ...
@nicolas_frankel
Hazelcast Jet
 Stream Processing Engine (SPE)
 Distributed
 In-memory
 Embeds Hazelcast IMDG
 Apache...
@nicolas_frankel
Jet overview
Stream Processor
Data SinkData Source
Hazelcast IMDG
Map, Cache, List,
Change Events
Live St...
@nicolas_frankel
Deployment modes
// Create new cluster member
JetInstance jet = Jet.newJetInstance();
// Connect to runni...
@nicolas_frankel
Pipeline Job
 Declarative code that
defines and links sources,
transforms, and sinks
 Platform-specific...
@nicolas_frankel
Back to our use-case
A Jet job:
1. Watches change events in the
database
2. Analyzes the change event
3. ...
@nicolas_frankel
@nicolas_frankel
Recap
 The caching trade-off
 Event-based architectures FTW
 Change-Data-Capture
• Integration through...
@nicolas_frankel
Thanks for your attention!
 https://blog.frankel.ch/
 @nicolas_frankel
 https://jet-start.sh/docs/tuto...
Vous avez terminé ce document.
Télécharger et lire hors ligne.
Prochain SlideShare
What to Upload to SlideShare
Suivant
Prochain SlideShare
What to Upload to SlideShare
Suivant
Télécharger pour lire hors ligne et voir en mode plein écran

1

Partager

JOnConf - A CDC use-case: designing an Evergreen Cache

Télécharger pour lire hors ligne

When one’s app is challenged with poor performances, it’s easy to set up a cache in front of one’s SQL database. It doesn’t fix the root cause (e.g. bad schema design, bad SQL query, etc.) but it gets the job done. If the app is the only component that writes to the underlying database, it’s a no-brainer to update the cache accordingly, so the cache is always up-to-date with the data in the database.

Things start to go sour when the app is not the only component writing to the DB. Among other sources of writes, there are batches, other apps (shared databases exist unfortunately), etc. One might think about a couple of ways to keep data in sync i.e. polling the DB every now and then, DB triggers, etc. Unfortunately, they all have issues that make them unreliable and/or fragile.

You might have read about Change-Data-Capture before. It’s been described by Martin Kleppmann as turning the database inside out: it means the DB can send change events (SELECT, DELETE and UPDATE) that one can register to. Just opposite to Event Sourcing that aggregates events to produce state, CDC is about getting events out of states. Once CDC is implemented, one can subscribe to its events and update the cache accordingly. However, CDC is quite in its early stage, and implementations are quite specific.

In this talk, I’ll describe an easy-to-setup architecture that leverages CDC to have an evergreen cache.

Livres associés

Gratuit avec un essai de 30 jours de Scribd

Tout voir

JOnConf - A CDC use-case: designing an Evergreen Cache

  1. 1. @nicolas_frankel A CDC use-case: Designing an Evergreen Cache Nicolas Fränkel
  2. 2. @nicolas_frankel Me, myself and I  Former developer, team lead, architect, blah-blah  Developer Advocate  Interested in CDC and data streaming
  3. 3. @nicolas_frankel Hazelcast HAZELCAST IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs execution for breakthrough and scale. HAZELCAST JET is the ultra fast, application embeddable, 3rd generation stream processing engine for low latency batch and stream processing.
  4. 4. @nicolas_frankel Agenda 1. Why cache? 2. Alternatives to keeping the cache in sync 3. Change-Data-Capture (CDC) 4. Debezium, a CDC implementation 5. Hazelcast Jet + Debezium 6. Demo!
  5. 5. @nicolas_frankel The caching trade-off  Improved performance/availability  Stale data
  6. 6. @nicolas_frankel The initial state 1. The application 2. The RDBMS 3. The cache
  7. 7. @nicolas_frankel Aye, there’s the rub!  A new component writes to the database  E.g.: a table holding references needs to be updated every now and then
  8. 8. @nicolas_frankel How to keep the cache in sync with the DB?
  9. 9. @nicolas_frankel Cache invalidation “There are two hard things in computer science: 1. Naming things 2. Cache invalidation 3. And off-by-one errors”
  10. 10. @nicolas_frankel Cache eviction vs Time-To-Live  Cache eviction: which entities to evict when the cache is full • Least Recently Used • Least Frequently Used  TTL: how long will an entity be kept in the cache
  11. 11. @nicolas_frankel Choosing the “correct” TTL  Less frequent than the update frequency • Miss updates  More frequent than the update frequency • Waste resources
  12. 12. @nicolas_frankel Polling process Same issue regarding the frequency
  13. 13. @nicolas_frankel Event-driven for the win! 1. If no writes happen, there's no need to update the cache 2. If a write happens, then the relevant cache item should be updated accordingly
  14. 14. @nicolas_frankel RDMBS triggers  Not all RDBMS implement triggers  How to call an external process from the trigger?
  15. 15. @nicolas_frankel The example of MySQL: User-defined function  Functions must be written in C++  The OS must support dynamic loading  Becomes part of the running server • Bound by all constraints that apply to writing server code  Etc. -- https://dev.mysql.com/doc/refman/8.0/en/adding-udf.html
  16. 16. @nicolas_frankel lib_mysqludf_sys UDF library with functions to interact with the operating system CREATE TRIGGER MyTrigger AFTER INSERT ON MyTable FOR EACH ROW BEGIN DECLARE cmd CHAR(255); DECLARE result INT(10); SET cmd = CONCAT('update_row', '1'); SET result = sys_exec(cmd); END; -- https://github.com/mysqludf/lib_mysqludf_sys
  17. 17. @nicolas_frankel Cons  Implementation-dependent  Fragile  Who maintains/debugs it?  Resource-consuming if done frequently
  18. 18. @nicolas_frankel Change-Data-Capture “In databases, Change Data Capture is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data. CDC is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.” -- https://en.wikipedia.org/wiki/Change_data_capture
  19. 19. @nicolas_frankel CDC implementation options 1. Timestamps on rows 2. Version numbers on rows 3. Status indicators on rows 4. Triggers on tables 5. Log scanners
  20. 20. @nicolas_frankel “Turning the database inside out” - Martin Kleppman -- https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
  21. 21. @nicolas_frankel What is a transaction/binary/etc. log? “The binary log contains ‘events’ that describe database changes such as table creation operations or changes to table data.” -- https://dev.mysql.com/doc/refman/8.0/en/binary-log.html
  22. 22. @nicolas_frankel Reasons for the log 1. Data recovery 2. Replication
  23. 23. @nicolas_frankel What if we “hacked” the log?
  24. 24. @nicolas_frankel Sample MySQL binlog ### UPDATE `test`.`t` ### WHERE ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2='apple' /* VARSTRING(20) meta=20 nullable=0 is_null=0 */ ### @3=NULL /* VARSTRING(20) meta=0 nullable=1 is_null=1 */ ### SET ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2='pear' /* VARSTRING(20) meta=20 nullable=0 is_null=0 */ ### @3='2009:01:01' /* DATE meta=0 nullable=1 is_null=0 */ # at 569 #150112 21:40:14 server id 1 end_log_pos 617 CRC32 0xf134ad89 #Table_map: `test`.`t` mapped to number 251 # at 617 #150112 21:40:14 server id 1 end_log_pos 665 CRC32 0x87047106 #Delete_rows: table id 251 flags: STMT_END_F
  25. 25. @nicolas_frankel Kind reminder…  Implementation-dependent  Fragile  Who maintains/debugs it?
  26. 26. @nicolas_frankel Debezium to the rescue  Java-based abstraction layer for CDC  Provided by Red Hat  Apache v2 licensed  Very skewed toward Kafka
  27. 27. @nicolas_frankel Debezium connector plugins  Production-ready • MySQL • PostrgreSQL • MongoDB • SQL Server  Incubating • Oracle • DB2 (!) • Cassandra
  28. 28. @nicolas_frankel Debezium “Debezium records all row-level changes within each database table in a change event stream” -- https://debezium.io/
  29. 29. @nicolas_frankel Hazelcast Jet  Stream Processing Engine (SPE)  Distributed  In-memory  Embeds Hazelcast IMDG  Apache v2 licensed  (Hazelcast Jet Enterprise offering)
  30. 30. @nicolas_frankel Jet overview Stream Processor Data SinkData Source Hazelcast IMDG Map, Cache, List, Change Events Live Streams Kafka, JMS, Sensors, Feeds Databases JDBC, Relational, NoSQL, Change Events Files HDFS, Flat Files, Logs, File watcher Applications Sockets Ingest In-Memory Operational Storage Combine Join, Enrich, Group, Aggregate Stream Windowing, Event-Time Processing Compute Distributed and Parallel Computations Transform Filter, Clean, Convert Publish In-Memory, Subscriber Notifications Stream Stream
  31. 31. @nicolas_frankel Deployment modes // Create new cluster member JetInstance jet = Jet.newJetInstance(); // Connect to running cluster JetInstance jet = Jet.newJetClient(); Client/ServerEmbedded Java API Application Java API Application Java API Application Client API Application Client API Application Client API Application Client API Application
  32. 32. @nicolas_frankel Pipeline Job  Declarative code that defines and links sources, transforms, and sinks  Platform-specific SDK  Client submits pipeline to the SPE  Running instance of pipeline in SPE  SPE executes the pipeline • Code execution • Data routing • Flow control
  33. 33. @nicolas_frankel Back to our use-case A Jet job: 1. Watches change events in the database 2. Analyzes the change event 3. Updates the cache accordingly
  34. 34. @nicolas_frankel
  35. 35. @nicolas_frankel Recap  The caching trade-off  Event-based architectures FTW  Change-Data-Capture • Integration through Hazelcast Jet
  36. 36. @nicolas_frankel Thanks for your attention!  https://blog.frankel.ch/  @nicolas_frankel  https://jet-start.sh/docs/tutorials/cdc  https://bit.ly/evergreen-cache
  • nfrankel

    Jun. 26, 2020

When one’s app is challenged with poor performances, it’s easy to set up a cache in front of one’s SQL database. It doesn’t fix the root cause (e.g. bad schema design, bad SQL query, etc.) but it gets the job done. If the app is the only component that writes to the underlying database, it’s a no-brainer to update the cache accordingly, so the cache is always up-to-date with the data in the database. Things start to go sour when the app is not the only component writing to the DB. Among other sources of writes, there are batches, other apps (shared databases exist unfortunately), etc. One might think about a couple of ways to keep data in sync i.e. polling the DB every now and then, DB triggers, etc. Unfortunately, they all have issues that make them unreliable and/or fragile. You might have read about Change-Data-Capture before. It’s been described by Martin Kleppmann as turning the database inside out: it means the DB can send change events (SELECT, DELETE and UPDATE) that one can register to. Just opposite to Event Sourcing that aggregates events to produce state, CDC is about getting events out of states. Once CDC is implemented, one can subscribe to its events and update the cache accordingly. However, CDC is quite in its early stage, and implementations are quite specific. In this talk, I’ll describe an easy-to-setup architecture that leverages CDC to have an evergreen cache.

Vues

Nombre de vues

136

Sur Slideshare

0

À partir des intégrations

0

Nombre d'intégrations

0

Actions

Téléchargements

2

Partages

0

Commentaires

0

Mentions J'aime

1

×