TicketMeister.com is an online ticket sales company that needs to build a new e-commerce portal to handle increasing traffic from customers globally. The new system must support two API endpoints - one to look up ticket availability and prices, and one to purchase tickets. This involves distributed data across multiple data centers and distributed transactions across several systems. The document discusses challenges of distributed systems including reliability, performance, data consistency, and transaction handling across failures. It also covers client-side considerations like idempotency and backoff strategies to prevent overloading services.
6. <Disclaimer>
This scenario is a work of fiction. Names, characters, businesses, places, events,
locales, and incidents are either the products of the author’s imagination or used in
a fictitious manner. Any resemblance to actual persons, living or dead, or actual
events is purely coincidental.
8. About TicketMeister.com
Online ticket sales company based in NY, with operations in many countries
Clients: event organizers - music festivals, sports stadiums, halls, theatres etc.
TicketMeister acts as an agent, selling tickets that the clients make available, and
charging a service fee.
Clients manage their available inventory via a Client Portal
Users (ticket buyers) purchase tickets through an e-commerce portal
To prevent ticket scalping: customers must have an account with a verified email
address
10. Why?
● Customers complain that the current portal is too slow, especially during big sales events
● We’ve had incidents of inadvertent overselling (bad customer experience; plus, refunds need to be manually handled)
● We’re launching in the EU, Brazil, China and South Africa soon, and are expecting 10x traffic after go-live
● Our existing data center (running MySQL) is barely handling the current load
15. Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
a. Must be fast, accurate and reliable
b. Must be globally available (i.e. you can look up US events from Australia)
2. Allow customers to purchase tickets for an event
a. Add the order to the Customer’s account in our Customer Information System (CIS)
b. Take payment
c. Send a confirmation email
d. Update inventory database
e. Notify the event organizer’s systems (3rd party)
f. Log the transaction
18. Your Scope: only 2 API endpoints
1. Allow customers to look up current ticket prices & availability for an event
2. Allow customers to purchase ticket(s) for an event
22. We’d like to scale to multiple global Data Centers to handle the extra load
23. Define Success:
Priority #   Goal       Target Metric
1            Reliable   ~99.999% (five 9’s) uptime
2            Fast       <800ms response time
3            Accurate   No overselling
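To make the reliability target concrete, the uptime percentage can be turned into a downtime budget. The sketch below is only arithmetic; the function name is hypothetical and a 365-day year is assumed. At five 9’s the budget works out to roughly 5.3 minutes of downtime per year.

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Minutes of allowed downtime over `days` at the given availability.

    `availability` is a fraction, e.g. 0.99999 for five 9's.
    Assumes a 365-day year unless `days` says otherwise.
    """
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    for label, availability in [("three 9's", 0.999), ("five 9's", 0.99999)]:
        minutes = downtime_budget_minutes(availability)
        print(f"{label}: {minutes:.2f} minutes of downtime per year")
```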
24. Let’s review the requirements
Reliable (up to five 9’s uptime)
Fast (<500ms response time)
Accurate (no overselling)
Global Scale (multiple replicated data centers)
31. Cosmos DB provides 5 different consistency model choices
Banking systems, Payment apps, etc.
Product reviews, Social media “wall” posts, etc.
Baseball scores, Blog comments, etc.
Shopping carts, User profile updates, etc.
Flight status tracking, Package tracking, etc.
47. Option 1: Ignore
Write off the error in resource B, and proceed as normal
Good option when:
● B is non-critical (e.g. logging metrics)
● There are decent alternatives to B (e.g. if the customer can re-print their confirmation email from a user portal)
48. Option 2: Retry
If resource B fails, retry a limited number of times
Good option when:
● Retries are safe (idempotent) on B
● Actions can be queued (i.e. time constraints on “being done” are not strict)
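A minimal sketch of the retry option, assuming the call to resource B is safe to repeat. It uses exponential backoff with random jitter so that many clients retrying at once do not overload the service. `TransientError`, `retry_with_backoff` and the default delays are hypothetical names and values chosen for illustration, not part of any TicketMeister system.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying on TransientError with jittered backoff.

    Re-raises the last TransientError once `max_attempts` is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do next
            # Exponential backoff, capped, with "full jitter": wait a random
            # time between 0 and the cap so clients do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

A caller might wrap the confirmation email step this way, for example `retry_with_backoff(lambda: send_email(order))`, where `send_email` is again a hypothetical function. The `sleep` parameter exists only so the waiting can be replaced in tests.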
49. Option 3: Undo
If resource B fails, perform an “undo” (compensating action) on resource A
Good option when:
● An “undo” or compensating action exists
● There is no penalty for the undo operation
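The undo option can be sketched as a list of steps, each paired with its compensating action. If a step fails, the steps that already completed are undone in reverse order. All names below (`run_with_compensation`, the step tuples) are hypothetical, and the sketch assumes the undo actions themselves do not fail.

```python
from typing import Callable, List, Tuple

# Each step is (name, do, undo); `undo` is the compensating action for `do`.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse.

    Returns (succeeded, log) where log records what was done and undone.
    """
    completed: List[Step] = []
    log: List[str] = []
    for step in steps:
        name, do, _ = step
        try:
            do()
        except Exception:
            log.append(f"failed {name}")
            # Compensate in reverse order, newest completed step first.
            for done_name, _, undo in reversed(completed):
                undo()
                log.append(f"undo {done_name}")
            return False, log
        completed.append(step)
        log.append(f"do {name}")
    return True, log
```

For a ticket purchase, resource A could be a seat reservation (undo: release the seats) and resource B the payment call. Putting the step that is cheapest to undo first keeps the compensation path short.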
50. Option 4: Coordinate
Coordinate the 2 actions between A and B using a separate coordinator.
Prepare, coordinate, then commit (or rollback on failure)
Good option when:
● A reliable coordinator is available
● The action can be broken into 2 phases: “prepare” and “commit”
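The prepare/commit flow can be sketched as a simplified two-phase commit. This is an in-memory illustration only: it assumes a coordinator that never crashes and omits the durable logging and timeout handling a real coordinator needs. `Participant`, `InMemoryResource` and `two_phase_commit` are hypothetical names.

```python
from typing import Iterable, List, Protocol


class Participant(Protocol):
    def prepare(self) -> bool: ...
    def commit(self) -> None: ...
    def rollback(self) -> None: ...


class InMemoryResource:
    """Toy participant used to demonstrate the flow."""

    def __init__(self, name: str, can_prepare: bool = True) -> None:
        self.name = name
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self) -> bool:
        if self.can_prepare:
            self.state = "prepared"
        return self.can_prepare

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants: Iterable[Participant]) -> bool:
    """Phase 1: ask everyone to prepare. Phase 2: commit all, or roll back."""
    prepared: List[Participant] = []
    for p in participants:
        try:
            ok = p.prepare()
        except Exception:
            ok = False
        if not ok:
            # A participant that refuses to prepare has nothing to undo;
            # only those that already prepared are rolled back.
            for q in prepared:
                q.rollback()
            return False
        prepared.append(p)
    for p in prepared:
        p.commit()
    return True
```

With two resources that both prepare successfully, both end up committed. If the second one refuses to prepare, the first is rolled back and nothing is committed.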
51. If something fails, what can we do?
Ignore: write off the error in resource B
Retry: if resource B fails, retry on B until it succeeds
Undo: if resource B fails, undo action on A
Coordinate: 2 phases between A and B: prepare, coordinate, then commit (or rollback on failure)
52. Decision Matrix: options for each point of failure
Best option available?
Step           Order ?   Ignore   Retry   Undo   Coordinate
CIS
Payment
Events Ctr
Email
Logging
Inventory DB
53. General considerations
1. Business model constraints
a. Amazon.com: inventory >> demand, can process things offline later (asynchronously)
b. TicketMeister: limited time-bound inventory, must process everything right now
2. Alternative definitions of success
a. e.g. if emailing the receipt fails: can customer self-serve and print it from a user portal?
b. e.g. if calling the Events Center API fails - is there a batch job to “true-up” failures?
c. e.g. Logging may not be considered “critical” until you realize your Disaster Recovery system
depends on the logs being accurate and complete
3. Easier to undo? Call it first!
4. Research all 3rd party APIs: assume nothing!
59. Can’t we rely on the client to do the right thing?
● “Or else what?”
● “But what if my Wi-Fi goes down?”
● “Is there another way?”
● “How will I know it’s OK to bail?”
● “Should I call customer care?”
● “Should I just go on Twitter?”
61. One option: send an Idempotency Key
The client (front end, or another API) uses a unique identifier on its end for the
“transaction” - so that retries can be safely rejected
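A minimal sketch of how a server might use the idempotency key, assuming an in-memory store (a real service would need a shared, durable one). `PurchaseService` and its fields are hypothetical. In this version a repeated key returns the original result instead of charging again; rejecting the duplicate with an error would be an equally valid design.

```python
import uuid
from typing import Dict


class PurchaseService:
    """Toy purchase endpoint that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> stored result
        self.charges = 0  # counts how many times payment was actually taken

    def purchase(self, idempotency_key: str, event_id: str, quantity: int) -> dict:
        # A retry with a known key must not charge the customer again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges += 1
        result = {
            "order_id": f"order-{self.charges}",
            "event_id": event_id,
            "quantity": quantity,
        }
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    service = PurchaseService()
    key = str(uuid.uuid4())  # the client generates one key per purchase attempt
    first = service.purchase(key, "event-123", 2)
    retry = service.purchase(key, "event-123", 2)  # e.g. after a network timeout
    print(first == retry, service.charges)
```

The client generates the key once per logical purchase and reuses it on every retry of that purchase, so a timeout followed by a retry cannot produce two orders.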
71. 3 Characteristics of a Distributed System
1. Operate concurrently
2. Can fail independently
3. Don’t share a global clock
72. Distributed Data: scaling
1. Reads >> Writes?
a. Read replication: 1 master, many replicated slaves
i. Broken: consistency (new: eventual consistency) - even with re
ii. Writes are still a bottleneck and will overwhelm the master
b. Sharding (break up 1 write database into many based on some key)
i. Each one is read-replicated
ii. Broken: data model, completely isolated instances (can’t join tables across shards)
c. Adding indexes?
i. As writes scale up, you’ll lose the benefits of this
ii. May end up denormalizing the relational database (BAD!)
2. Why not go NoSQL anyway?
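Sharding by key can be sketched as a routing function that maps each key to one of N databases. The example below assumes the shard key is an event ID and uses a stable hash modulo the shard count; `shard_for` is a hypothetical helper, not an API of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key (e.g. an event ID) to a shard index in [0, num_shards).

    Uses SHA-256 rather than Python's built-in hash(), which is randomized
    per process for strings and so would not route consistently.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for event_id in ["event-123", "event-456", "event-789"]:
        print(event_id, "-> shard", shard_for(event_id, 4))
```

With simple modulo routing, changing the number of shards remaps most keys, which is one reason consistent hashing is a common alternative. Queries that span several keys still have to visit several shards, which is the join limitation noted above.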