This document discusses how a financial technology company achieved resiliency in its microservices architecture for processing money transfers and transactions. It describes how the company moved to an event-driven architecture using message queues to decouple services and ensure transactions are processed reliably even if an individual service fails. It provides examples of how this approach was implemented for interest payment processing and automatic investment diversification. The document emphasizes the importance of reliability, auditability, and flexibility to recover from errors in financial systems where mistakes cannot be tolerated.
2. Overview
• Introduction
• How is working in Financial Services different?
• General Processes
• How did we achieve resiliency?
• Case Studies
• Conclusion
4. $ whoami
• Thomas Badie
• Software Engineer @ Landbay since January
• Worked mainly in Advertising before
• From Marseille, France
• linkedin.com/in/thomasbadie
7. In A Previous Life…
• Advertising, 250M requests a day
• Service Oriented Architecture
• Each request has a very low value associated with it (<£0.01)
• Communication over HTTP (behind a proper HTTPS gateway)
• A server crashing would lose ~250 requests
• Consequences:
• Monetary: ~£2.50
• Customer: Negligible
8. Life at Landbay
• Consequences of a similar scenario with a similar setup:
• Could lose money movements
• Customer loses trust
• A big mistake => Lose FCA license
• Long story short: We cannot make any mistakes and environment failures are not acceptable
10. Before Reaching Prod
• Agile/Scrum
• Design each ticket with at least one more person
• Unit + Service testing (minimum coverage required for build to pass)
• Code review
• Continuous deployment
• More about our overall infrastructure and pipelines with Chris on Tuesday at 2pm
11. Ensuring That It Works As Intended
• Monitoring (Dynatrace)
• Alerting (Dynatrace, Slack, email)
• Reconciliation (Business side)
• Reconciliation (tech side – across services)
• Reconciliation (third parties)
13. Model of our Money Transfers
[Diagram: Cash Balance, Investment]
14. About Transactions (The Money Kind)
• We store everything in pennies (+10000)
• Each transaction represents a change to the account (+301.50000)
• Concurrent inserts are fine, but grouped selects need locks to stay consistent
• We take several daily snapshots to prevent the system from slowing down over time, and for archiving
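The model above (signed changes plus periodic snapshots) can be sketched as follows. This is a minimal illustration, not the actual Landbay schema: the `Transaction` and `Snapshot` types, the `seq` ordering field, and the use of whole integer pennies are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    seq: int             # hypothetical monotonically increasing id
    account: str
    change_pennies: int  # signed change: positive credits, negative debits


@dataclass
class Snapshot:
    up_to_seq: int       # every transaction with seq <= this is already included
    balances: dict       # account -> balance in pennies at snapshot time


def balance(account, snapshot, transactions):
    """Balance = snapshot value + changes recorded after the snapshot."""
    start = snapshot.balances.get(account, 0)
    later = sum(
        t.change_pennies
        for t in transactions
        if t.account == account and t.seq > snapshot.up_to_seq
    )
    return start + later


snapshot = Snapshot(up_to_seq=2, balances={"alice": 10000})
log = [
    Transaction(1, "alice", 10000),   # already folded into the snapshot
    Transaction(2, "bob", 500),       # already folded into the snapshot
    Transaction(3, "alice", -2500),
    Transaction(4, "alice", 30150),
]
alice_balance = balance("alice", snapshot, log)
```

Because only the changes after the snapshot are summed, the cost of a balance lookup stays bounded as the transaction log grows.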
15. How Do We Ensure Resiliency?
• Services are replicated and hosted in different availability zones
• Database transactions
• Event-driven
• RabbitMQ transactions (DB commit)
• In case of failure, each event retries 5 times before going to a dead-letter queue (which requires manual processing by us)
• Cluster of rabbits (colony?) in different AWS availability zones
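The retry-then-dead-queue rule can be sketched with in-memory queues. This is only an illustration of the policy, not RabbitMQ code: the `drain` function and its queues are hypothetical, and it assumes "retries 5 times" means five attempts in total.

```python
from collections import deque

MAX_ATTEMPTS = 5  # figure from the slide; read here as five attempts in total


def drain(queue, dead_queue, handler):
    """Process every message, requeueing failures until MAX_ATTEMPTS is reached."""
    while queue:
        message, attempts = queue.popleft()
        try:
            handler(message)
        except Exception:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                # Give up: park the message for manual handling.
                dead_queue.append(message)
            else:
                queue.append((message, attempts))
```

Messages that land in `dead_queue` are not lost; they wait there until someone inspects and replays them by hand.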
17. On The Way To Resiliency – Simple Case
[Diagram: API Gateway, Cash Balance]
18. On The Way To Resiliency
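• HTTP: only call Investment if Cash Balance worked
• Obviously doesn’t work: if the Investment call fails after Cash Balance succeeded, the two services are out of sync
[Diagram: API Gateway, Cash Balance, Investment]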
19. On The Way To Resiliency
• Distributed transaction management
• Open the same transaction in multiple services simultaneously
• Hard to set up and maintain
[Diagram: API Gateway, Cash Balance, Investment]
20. On The Way To Resiliency
• Cash Balance sends message to Investment only if its transaction works
[Diagram: API Gateway, Cash Balance, Investment]
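One common way to get "send the message only if the transaction works" is an outbox table written in the same database transaction as the balance change. The slides tie RabbitMQ transactions to the DB commit instead, so treat this as a swapped-in sketch of the same guarantee: the table names, `move_to_investment`, and the relay that would later publish the outbox rows are all hypothetical.

```python
import json
import sqlite3


def open_db():
    """Create an in-memory stand-in for the Cash Balance database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE transactions (account TEXT, change_pennies INTEGER)")
    db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT)")
    return db


def current_balance(db, account):
    row = db.execute(
        "SELECT COALESCE(SUM(change_pennies), 0) FROM transactions WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]


def move_to_investment(db, account, pennies):
    """Debit the account and record the Investment message in ONE transaction."""
    # "with db" commits if the block succeeds and rolls back if it raises.
    with db:
        db.execute("INSERT INTO transactions VALUES (?, ?)", (account, -pennies))
        if current_balance(db, account) < 0:
            raise ValueError("insufficient funds")
        payload = json.dumps({"type": "invest", "account": account, "pennies": pennies})
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def pending_messages(db):
    """Messages a separate (hypothetical) relay would publish to the broker."""
    rows = db.execute("SELECT payload FROM outbox ORDER BY id")
    return [json.loads(payload) for (payload,) in rows]
```

If the debit is rolled back, the outbox row is rolled back with it, so Investment never hears about a transfer that did not happen.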
21. On The Way To Resiliency
• API Gateway may send messages listened to by different services
[Diagram: Service 1, API Gateway, Service 2]
22. On The Way To Resiliency - Downsides
• The user may not immediately see the result of the operation they ran, e.g. Registration and AML checks
• Requires human intervention to recover
• Need to keep transactions small, leading to code generating many concurrent transactions
24. Case Study 1 - Interest Payment Version 1
[Diagram: Investment, Interest Calculation, Cash Balance]
25. Case Study 1 - Interest Payment Version 1
• Triggered on the 1st of the month: the Cash Balance service receives the amounts that need to go to each account
• Overall process estimated at a few hundred thousand messages
• A lot of strain on the network/platform
• It was working, but wasn’t optimal
26. Case Study 1 - Interest Payment Version 2
[Diagram: Interest, Investment, Cash Balance]
27. Case Study 1 - Interest Payment Version 2
• Simpler and easier to maintain
• Fewer (bigger) messages
• In case of failure, we need to replay the whole message. Heavier DB transaction, but light on the whole infrastructure.
• Use daily snapshots to figure out the amount to pay back
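The "fewer, bigger messages" trade-off can be sketched as one batch applied in a single database transaction: either every payment in the message lands, or none does and the whole message is replayed. The schema, the `apply_interest_batch` name, and the validation rule are assumptions made for the example.

```python
import sqlite3


def open_ledger():
    """In-memory stand-in for the Cash Balance transaction table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transactions (account TEXT NOT NULL, change_pennies INTEGER NOT NULL)"
    )
    return db


def apply_interest_batch(db, payments):
    """Apply one big message of (account, pennies) pairs, all or nothing."""
    # One transaction for the whole batch: any failure rolls back every row,
    # so replaying the full message later cannot pay anyone twice.
    with db:
        for account, pennies in payments:
            if pennies <= 0:
                raise ValueError(f"invalid interest amount for {account}")
            db.execute("INSERT INTO transactions VALUES (?, ?)", (account, pennies))


def total_paid(db):
    row = db.execute("SELECT COALESCE(SUM(change_pennies), 0) FROM transactions").fetchone()
    return row[0]
```

The cost is a heavier transaction per message, which matches the slide's observation; the benefit is that a failed batch leaves nothing half-applied.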
28. Case Study 2 - Diversification
[Diagram: Diversification, Investment]
29. Case Study 2 - Diversification
• Preventing our investors from ever saying “Where has my money gone?!” is not only a tech problem
• We want to diversify everybody’s investments as much as possible
• An average investor is funding 25 loans
• Built to accept failure:
• Works off stale data. If it’s incorrect, we ignore the transaction
• Process runs at night, so it’s unlikely that we will see many user-generated transactions running in parallel
• Runs every night; if we have reached equilibrium, the process does nothing
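"Built to accept failure" can be sketched as follows: moves are planned from a stale view, re-checked against current holdings when applied, and silently skipped if they no longer fit. The slides do not say which check is used, so the "enough holding left" rule, the tuple layout, and `apply_moves` are all hypothetical.

```python
def apply_moves(holdings, planned_moves):
    """Apply moves planned from stale data, ignoring any that no longer fit.

    holdings: dict mapping (investor, loan) -> pennies currently held.
    planned_moves: list of (investor, from_loan, to_loan, pennies).
    """
    applied, ignored = [], []
    for move in planned_moves:
        investor, source, target, pennies = move
        if holdings.get((investor, source), 0) < pennies:
            # The plan was based on outdated numbers: skip it instead of failing.
            ignored.append(move)
            continue
        holdings[(investor, source)] -= pennies
        holdings[(investor, target)] = holdings.get((investor, target), 0) + pennies
        applied.append(move)
    return applied, ignored
```

An ignored move is harmless because the process runs again the next night on fresher data.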
30. In Case Of Emergency
• Daily snapshots of DB
• Messages can be sent by hand through the RabbitMQ UI
• We have a full audit of our messages stored in Elasticsearch (raw messages that can be replayed)
• Each financial transaction represents a change, so we can always modify them (even though it is very risky)
31. Conclusion
• Lightweight processes to ensure quality
• Transactions over messaging
• Sometimes we can bubble up the error to the user using HTTP, but most of the time we work in the background
• It is useful to be flexible (ignoring some errors, …) in order to achieve a simpler design