Living in a Distributed World
An intro to design considerations for distributed systems
Hello!
Vishal Bardoloi
@bardoloi
Lead Dev at ThoughtWorks
medium.com/@v.bardoloi
Overview
1. A real world scenario
2. Distributed data
3. Distributed transactions
4. Client-side considerations
5. Feedback / Next Topics
1. A Real World Scenario
<Disclaimer>
This scenario is a work of fiction. Names, characters, businesses, places, events,
locales, and incidents are either the products of the author’s imagination or used in
a fictitious manner. Any resemblance to actual persons, living or dead, or actual
events is purely coincidental.
Case study: TicketMeister.com
About TicketMeister.com
Online ticket sales company based in NY, with operations in many countries
Clients: event organizers - music festivals, sports stadiums, halls, theatres etc.
TicketMeister acts as an agent, selling tickets that the clients make available, and
charging a service fee.
Clients manage their available inventory via a Client Portal
Users (ticket buyers) purchase tickets through an e-commerce portal
To prevent ticket scalping: customers must have an account with a verified email
address
TicketMeister.com: “We need a new e-commerce portal!”
Why?
● Customers complain that the current
portal is too slow, especially during
big sales events
● We’ve had incidents of inadvertent
overselling (bad customer
experience; plus, refunds need to be
manually handled)
● We’re launching in EU, Brazil, China
and South Africa soon, and are
expecting 10x traffic after go-live
● Our existing data center (running
MySQL) is barely handling the
current load
Our customers don’t love the buying experience
Our customers really don’t love the buying experience
Our customers really really don’t love the buying experience
Our customers really really really don’t love the buying experience
Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
a. Must be fast, accurate and reliable
b. Must be globally available (i.e. you can look up US events from Australia)
2. Allow customers to purchase tickets for an event
a. Add the order to the Customer’s account in our Customer Information System (CIS)
b. Take payment
c. Send a confirmation email
d. Update inventory database
e. Notify the event organizer’s systems (3rd party)
f. Log the transaction
Easy job for you, SuperDev!
Questions so far?
Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
2. Allow customers to purchase ticket(s) for an event
2. Working with Distributed Data
Request #1: Look up current availability & price
MySQL DB
Problem: our primary data center in NY can’t handle the expected user load
We’d like to scale to multiple global Data Centers to handle the extra load
Define Success:
Priority #   Goal       Target Metric
1            Reliable   ~99.999% (five 9’s) uptime
2            Fast       <800ms response time
3            Accurate   No overselling
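To make the reliability target concrete, here is a small worked calculation of how much downtime each availability level allows per year. It assumes an average year of 365.25 days; the function name is illustrative only. Five 9’s works out to a little over five minutes of downtime per year.

```python
# Yearly downtime budget for a given availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes (assumes an average year)


def downtime_minutes_per_year(availability: float) -> float:
    """Return the allowed downtime, in minutes per year, for an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, target in [("three 9's", 0.999), ("four 9's", 0.9999), ("five 9's", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(target):.1f} minutes/year")
```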
Let’s review the requirements
Reliable (up to five 9’s uptime)
Fast (<500ms response time)
Accurate (no overselling)
Global Scale (multiple replicated data centers)
Can’t Do It
Pick 2 out of 3
Can NoSQL help?
<link: Distributed Databases talk slides>
Early NoSQL systems: you had to make explicit tradeoffs
Modern alternatives give you way more flexibility
Example: Azure Cosmos DB
Cosmos DB provides 5 different consistency model choices
Banking systems, Payment apps, etc.
Product reviews, Social media “wall” posts, etc.
Baseball scores, Blog comments, etc.
Shopping carts, User profile updates, etc.
Flight status tracking, Package tracking, etc.
Questions so far?
3. Working with Distributed Transactions
Request #2: API endpoint to purchase ticket(s)
In a monolith, the database provides the transaction with ACID guarantees
A - atomicity
C - consistency
I - isolation
D - durability
Distributed transactions can’t do that!
3 things to consider
1. What could fail?
2. How important is the failure?
3. If it fails, how should we respond?
What could fail?
[Six diagram slides, one per failure point. In the first, the result is a consistent system; in the other five, the result is an inconsistent system.]
If something fails, what can we do?
Option 1: Ignore
Write off the error in resource B, and proceed as if normal
Good option when:
● B is non-critical (e.g. logging metrics)
● There are decent alternatives to B (e.g. if customer can re-print their confirmation email from a user portal)
Option 2: Retry
If resource B fails, retry a limited number of times
Good option when:
● Retries are safe on B (see “Idempotency” in section 4)
● Actions can be queued (i.e. time constraints on “being done” are not strict)
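A minimal sketch of the Retry option, assuming the action on B is safe to repeat. `TransientError` and `call_with_retries` are hypothetical names introduced for illustration; a real client would also wait between attempts.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(action: Callable[[], T], max_attempts: int = 3) -> T:
    """Run `action`, retrying on TransientError, for at most `max_attempts` attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            return action()
        except TransientError as err:
            last_error = err  # remember the failure and try again
    assert last_error is not None
    raise last_error  # every attempt failed: surface the last error
```

Only errors known to be transient are retried here; anything else propagates immediately, since retrying an unknown failure may not be safe.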
Option 3: Undo
If resource B fails, perform an “undo”
(compensating action) on resource A
Good option when:
● An “undo” or compensating action exists
● There is no penalty for the undo
operation
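The Undo option could be sketched as follows, with an in-memory inventory standing in for resource A and a payment callback standing in for resource B. The `Inventory` class and `purchase` function are assumptions made for this example, not part of any real system.

```python
from typing import Callable


class Inventory:
    """In-memory stand-in for resource A (hypothetical)."""

    def __init__(self, seats: int) -> None:
        self.available = seats

    def reserve(self, count: int) -> int:
        if count > self.available:
            raise ValueError("not enough seats")
        self.available -= count
        return count

    def release(self, count: int) -> None:
        # Compensating action: put the reserved seats back.
        self.available += count


def purchase(inventory: Inventory, count: int, charge: Callable[[int], str]) -> str:
    """Reserve seats on A, then charge on B; undo the reservation if the charge fails."""
    reserved = inventory.reserve(count)
    try:
        return charge(count)
    except Exception:
        inventory.release(reserved)  # undo the action on A
        raise
```

The sketch assumes the release itself cannot fail; in practice the compensating action needs its own failure handling.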
Option 4: Coordinate
Coordinate the 2 actions between A
and B using a separate coordinator.
Prepare, coordinate, then commit (or
rollback on failure)
Good option when:
● A reliable coordinator is available
● Action can be broken into 2
“prepare” and “commit” phases
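The prepare/commit flow of the Coordinate option can be illustrated with a toy two-phase commit. This is a simplified sketch: `Participant` and `two_phase_commit` are hypothetical names, everything runs in one process, and coordinator crashes and timeouts are not handled.

```python
class Participant:
    """Toy resource that can vote in the prepare phase (hypothetical)."""

    def __init__(self, name: str, can_prepare: bool = True) -> None:
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes (and hold the change) or vote no.
        self.state = "prepared" if self.can_prepare else "aborted"
        return self.can_prepare

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Commit on all participants only if every one of them prepares successfully."""
    prepared: list[Participant] = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
        else:
            for q in prepared:  # roll back everyone who already voted yes
                q.rollback()
            return False
    for p in participants:  # Phase 2: all voted yes, so commit everywhere
        p.commit()
    return True
```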
If something fails, what can we do?
Ignore: write off the error in resource B
Retry: if resource B fails, retry on B until it succeeds
Undo: if resource B fails, undo action on A
Coordinate: 2 phases between A and B: prepare, coordinate, then commit (or rollback on failure)
Decision Matrix: options for each point of failure

                          Best option available?
Step           Order ?    Ignore   Retry   Undo   Coordinate
CIS
Payment
Events Ctr
Email
Logging
Inventory DB
General considerations
1. Business model constraints
a. Amazon.com: inventory >> demand, can process things offline later (asynchronously)
b. TicketMeister: limited time-bound inventory, must process everything right now
2. Alternative definitions of success
a. e.g. if emailing the receipt fails: can customer self-serve and print it from a user portal?
b. e.g. if calling the Events Center API fails - is there a batch job to “true-up” failures?
c. e.g. Logging may not be considered “critical” until you realize your Disaster Recovery system
depends on the logs being accurate and complete
3. Easier to undo? Call it first!
4. Research all 3rd party APIs: assume nothing!
4. Client-side Design Considerations
How does the user see a failure?
All these failure scenarios look the same to a client
Retry is safe
Retry may be safe
Retry likely NOT safe
The client doesn’t know the server’s internal state
Retry is safe
Retry may be safe
Retry likely NOT safe
Clients (especially humans) will almost always retry
POST succeeded! +$500
POST succeeded! +$500
Can’t we rely on the client to do the right thing?
● “Or else what?”
● “But what if my Wi-Fi goes down?”
● “Is there another way?”
● “How will I know it’s OK to bail?”
● “Should I call customer care?”
● “Should I just go on Twitter?”
Idempotency: guaranteeing “Exactly Once” semantics
POST succeeded! +$500
Oops! You already did that action!
One option: send an Idempotency Key
The client (front end, or another API) uses a unique identifier on its end for the
“transaction” - so that retries can be safely rejected
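A minimal server-side sketch of the idempotency key idea, assuming an in-memory store and a single process. `PaymentService` and `post_payment` are hypothetical names. Here a repeated key returns the stored response flagged as a replay; a server could equally reject the retry with an error.

```python
import uuid


class PaymentService:
    """Toy service that applies each idempotency key at most once (hypothetical)."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: dict[str, dict] = {}  # idempotency key -> stored response

    def post_payment(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._results:
            # Seen before: do not apply the payment again.
            return {**self._results[idempotency_key], "replayed": True}
        self.balance += amount
        response = {"status": "succeeded", "amount": amount}
        self._results[idempotency_key] = response
        return {**response, "replayed": False}


service = PaymentService()
key = str(uuid.uuid4())  # generated once by the client, reused on every retry
first = service.post_payment(key, 500)
retry = service.post_payment(key, 500)
print(first["replayed"], retry["replayed"], service.balance)  # prints: False True 500
```

A production version would need a durable store and an atomic check-and-set on the key, which this sketch leaves out.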
Other client-side considerations: exponential backoff
Exponential backoff: prevent clients from causing Denial-of-Service on a struggling service
...
Thundering herds: Avoiding resource contention
Solution: exponential back-off with random jitter
Example: Stripe.rb client
Further reading
● Two Generals Problem
● Byzantine Consensus Protocols
○ Achieving quorum
○ 2-phase commit
○ Paxos
○ Blockchain
● “Distributed Systems Observability” - Cindy Sridharan
Thank You!
Questions?
Feedback / Ideas for next time?
● NoSQL deep dive
● Monitoring & Observability in distributed systems
APPENDIX
3 Characteristics of a Distributed System
1. Operate concurrently
2. Can fail independently
3. Don’t share a global clock
Distributed Data: scaling
1. Reads >> Writes?
a. Read replication: 1 master, many replicated slaves
i. Broken: consistency (new: eventual consistency) - even with re
ii. Writes are still a bottleneck and will take over the master
b. Sharding (break up 1 write database into many based on some key)
i. Each one is read-replicated
ii. Broken: data model, completely isolated instances (can’t join tables across shards)
c. Adding indexes?
i. As writes scale up, you’ll lose the benefits of this
ii. May end up denormalizing the relational database (BAD!)
2. Why not go NoSQL anyway?
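The sharding item above (splitting one write database into many based on some key) could be sketched as a simple hash-based router. The function name and the choice of SHA-256 are assumptions for illustration; real systems often use consistent hashing or range-based schemes instead.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key (e.g. an event ID) to a shard index in [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then wrap into range.
    return int.from_bytes(digest[:8], "big") % shard_count


print(shard_for("event-12345", 4))
```

With plain modulo routing, changing the number of shards remaps most keys, which is one reason resharding is costly.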
Further reading
Two Generals’ Problem: https://en.wikipedia.org/wiki/Two_Generals%27_Problem
Byzantine agreement protocols:
https://en.wikipedia.org/wiki/Byzantine_fault_tolerance
https://en.wikipedia.org/wiki/Quantum_Byzantine_agreement
https://medium.com/loom-network/understanding-blockchain-fundamentals-part-1-byzantine-fault-tolerance-245f46fe8419
https://medium.com/all-things-ledger/the-byzantine-generals-problem-168553f31480

2018-05-16 Geeknight Dallas - Distributed Systems Talk
