Many web businesses see a spike in traffic at some point in the year. Whether it's Black Friday, NFL draft day, or Mother's Day, your app needs to scale and capture customer value when it matters most. Downtime is not an option.
For a database, that means having enough capacity to keep transaction latency within acceptable limits. For high-capacity apps on MySQL, you may need to deploy triple your normal capacity to sustain traffic for a single day. But what do you do with that hardware for the rest of the year? Leave it idling? That unused capacity costs an arm and a leg, and wasted spend makes CFOs grumpy.
In Part 3 of our Tech Talk series, we discuss the options for scaling down MySQL and explore answers to the following questions:
- How do I figure out the costs of not scaling down?
- How does ClustrixDB scale down differently from MySQL?
- How real is elastic scaling in ClustrixDB? What are the catches?
View the webcast of this Tech Talk on our YouTube channel.
Tech Talk Series, Part 3: Why is your CFO right to demand you scale down MySQL?
1. Flexible transactional scale for the connected world.
Challenges to Scaling MySQL:
Scaling In and Down – The Costs
Dave A. Anselmi @AnselmiDave
Director of Product Management
2. Questions for Today
o Why and when is scaling down MySQL a good idea?
o What options are there to scale down MySQL?
o How do I figure out the costs of not scaling down?
o How does ClustrixDB scale down differently from MySQL?
o How real is elastic scaling in ClustrixDB? What are the catches?
PROPRIETARY & CONFIDENTIAL 2
4. The Typical Path to Scale
[Slide diagram: scale (growth/success) over time on a LAMP stack, whether on AWS, Azure, RAX, GCE, or a private cloud]
o REACH LIMIT: app too slow; lost users → migrate to a bigger machine
o REACH LIMIT (AGAIN): app too slow; lost users → add read slaves, then sharding, etc.: more hardware and DBAs, refactored code/hardwired app
o Result: more expensive, higher risk, lost revenue
o ONGOING: hardware refactoring, data balancing, shard maintenance
o REPEAT: migrate to a bigger machine again
7. Peak or Periodic Workloads Waste Resources
o Many workloads have some periodicity
o Maintaining capacity for peaks while undersubscribed results in wasted resources
o This bugs the CFO… & affects DevOps budgets
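As a rough sketch of how much an undersubscribed peak footprint wastes, the arithmetic is simple. All figures below (server counts, $500/server/month) are hypothetical placeholders; substitute your own instance pricing:

```python
def idle_capacity_cost(peak_servers, baseline_servers,
                       peak_months, cost_per_server_month):
    """Annual spend wasted on servers that only earn their keep at peak."""
    idle_servers = peak_servers - baseline_servers
    idle_months = 12 - peak_months
    return idle_servers * idle_months * cost_per_server_month

# Hypothetical example: 3x capacity (9 servers vs. a baseline of 3) held
# year-round for a 2-month peak, at $500 per server per month:
wasted = idle_capacity_cost(peak_servers=9, baseline_servers=3,
                            peak_months=2, cost_per_server_month=500)
print(wasted)  # 6 idle servers * 10 months * $500 = 30000 ($30K/year)
```

This is exactly the kind of pros/cons 'business case' framing the next slide recommends bringing to the finance team.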
8. Why Costs Should Matter to Tech People
o DevOps, DBAs, and data architects focus on product features and technical feasibility. "TCO ain't our TLA."
o However, at some point the 'business side' of your company has to authorize purchase of the actual system(s).
o Whether it's licensing, support, or cloud solutions (AWS, etc.), all of them have a price, and all of them have to be 'justified.'
o Knowing how to frame your implementation recommendations as pros/cons-based 'business cases' greatly improves your resource requests:
– Either with your team lead/department head
– or the guys and gals in finance
9. Peak or Periodic Workloads by Sector
o E-Commerce
– Black Friday/Cyber Monday, Singles' Day, 'Back to School,' flash sales, etc.
– 80% of revenue in 2 months
– Provisioning > 3x capacity for 2 months
o Gaming
– New game released, new update
– Need the ability to quickly scale either out (game servers oversubscribed) or in (fewer gamers than estimated)
– Cannibalization: a new game migrates the subscriber base away from the old game(s)
10. Peak or Periodic Workloads by Sector
o Social Media
– Some events are periodic/predictable (e.g. awards season, movie releases, Hallmark holidays, TV shows)
– Some events much less so (current events, 'hot trends,' politics, social outrage, etc.)
o Sports
– Playoffs, Super Bowl, March Madness, etc.
– Provisioning for these requires quickly available additional resources
– After the main event, sports app utilization can fall severely, leaving server arrays overprovisioned
12–16. Scaling Down/In: The Work Required
(Console: DBaaS with shared storage. EC2: instance, or bare metal.)

o Scale-Up: keep increasing the size of the (single) database server
– Scale up/out:
Console: click for a larger server, until the largest available.
EC2: bring up a larger (redundant) server from backup, use replication to catch up, then point the application at the new DB endpoint.
– Scale down/in:
Console: click for a smaller server; works well if the max workload fits in a DBaaS offering.
EC2: bring up a smaller (redundant) server from backup, use replication to catch up, then point the application at the new DB endpoint.
o Read Slaves: add 'slave' read server(s) to the 'master' database server
– Scale up/out:
Console: click to add 'read replicas.'
EC2: bring up redundant server(s) from backup and turn on replication; set up read/write fan-out in the app or at the proxy level.
– Scale down/in:
Console: click to remove 'read replicas.'
EC2: bring down the read slave(s); change the read/write fan-out in the app or at the proxy level.
o Master-Master: add additional 'master'(s) to the 'master' database server
– Scale up/out:
Console: no native support; you can deploy larger instances, but must set up master/master yourself.
EC2: provision 2 new larger masters via backup, use replication to catch up, then point the application at the new DBs' endpoints.
– Scale down/in:
Console: no native support; you can deploy smaller instances, but must set up master/master yourself.
EC2: provision 2 new smaller masters via backup, use replication to catch up, then point the application at the new DBs' endpoints.
o Vertical Sharding: separate tables across separate database servers
– Scale up/out:
Console: no native support; you can deploy additional instances, but must set up the table distribution yourself.
EC2: provision additional instances via backup, (manually) redistribute tables across the shards, then change the application to include the new shards.
– Scale down/in:
Console: no native support; you can deprovision instances, but must consolidate the tables yourself.
EC2: consolidate tables from redundant shards, deprovision those shards, and change the application/table mapping to match the new data distribution.
o Horizontal Sharding: partition tables across separate database servers
– Scale up/out:
Console: no native support; you can deploy additional instances, but must set up the partition distribution yourself.
EC2: provision additional instances via backup, (manually) redistribute partitions across the shards, then change the application to include the new shards.
– Scale down/in:
Console: no native support; you can deploy smaller instances, but must consolidate the partitions yourself.
EC2: consolidate partitions from redundant shards, deprovision those shards, and change the application/table mapping to match the new data distribution.
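The "read/write fan-out in the app" that the read-slave rows describe can be sketched as a tiny endpoint router. This is an illustrative sketch only; the class name, endpoints, and SELECT-prefix heuristic are all hypothetical (a real deployment would use a proxy such as ProxySQL or driver-level routing):

```python
import random

class ReadWriteRouter:
    """Minimal sketch of app-level read/write fan-out for a
    master + read-slaves MySQL topology (endpoints are hypothetical)."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = list(replicas)

    def endpoint_for(self, sql):
        # Writes (and anything ambiguous) go to the master;
        # plain SELECTs fan out across the read slaves, if any remain.
        if sql.lstrip().upper().startswith("SELECT") and self.replicas:
            return random.choice(self.replicas)
        return self.master

    def remove_replica(self, endpoint):
        # Scale-in: drop a read slave from the rotation -- the change
        # the table above describes making "in the app."
        self.replicas.remove(endpoint)

router = ReadWriteRouter("db-master:3306", ["db-ro-1:3306", "db-ro-2:3306"])
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))  # db-master:3306
router.remove_replica("db-ro-2:3306")
print(router.endpoint_for("SELECT * FROM orders"))  # db-ro-1:3306
```

The point of the sketch is the scale-down cost: every replica you add or remove is a change the application (or proxy) has to track, which is exactly the operational work the table charges against the read-slave approach.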
18. DevOps Impact #1 from Overprovisioning
o Idle server/overcapacity cost
– CAPEX budget wasted on unused resources
– OPEX budget probably OK: idle servers need less DevOps
o Low overall impact to DevOps infra:
– "Everything's working" / "Not broken; don't fix it"
– A low CAPEX budget means low budgets for replacements, so underutilized infra gets cannibalized instead. "No problem"
19. DevOps Impact #2 from Overprovisioning
1. One-way scaling to handle peaks => idle resources at non-peak, often most of the time
2. Idle resources => blown/shrunk DevOps budgets
– Both CAPEX and OPEX
– The finance team pays attention!
3. Blown/shrunk DevOps budgets => hard to get approval for further capacity
4. No budget => can't scale for growing peaks
5. Higher risk of site slowdowns or outages at the next peak(s)
20. Black Friday/Cyber Monday Outage Highlights
o 2011: PC Mall, Newegg, Toys R'Us, Avon: 30+ minute outages. Walmart: 3-hour outage
o 2012: Kohl's: repeated multi-hour outages
o 2013: Urban Outfitters, Motorola: offline most of Cyber Monday
o 2014: Best Buy: 2+ hours of total outages. HP, Nike: site crashes
o 2015: Neiman Marcus: 4+ hour outage
o 2016: Old Navy, Macy's: multi-hour outages
2016 Black Friday/Cyber Monday total online sales: $5.27B, a 21.6% increase over 2015
21. Even Larger Business Impact of Outages
o Opportunity cost
– Each missed visitor was potentially a customer or a referral
o Single-sale cost
– Each missed sale is a tangible missed dollar value
o Customer-lifetime cost
– Unhappy customers find sites they like better and won't return
o Market/brand cost
– All customers use social media: a communication 'force multiplier'
– "If you make customers unhappy in the physical world, they might each tell six friends. If you make customers unhappy on the internet, they can each tell 6,000." – Jeff Bezos
– W. Edwards Deming said "5" and "20"…
– Call it "customer satisfaction at web-scale"
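The single-sale cost above lends itself to a back-of-the-envelope estimate. Every number in this sketch is hypothetical (the $10M weekend figure, the 2x peak multiplier); the formula is just average hourly revenue scaled by outage length and how busy the outage window was:

```python
def outage_cost(weekend_revenue, weekend_hours, outage_hours,
                peak_multiplier=1.0):
    """Lost sales = average hourly revenue * outage length *
    how much busier the outage window was than average."""
    hourly = weekend_revenue / weekend_hours
    return hourly * outage_hours * peak_multiplier

# Hypothetical retailer: $10M over a 96-hour Black Friday/Cyber Monday
# window, hit by a 3-hour outage during a 2x-average traffic spike:
print(outage_cost(10_000_000, 96, 3, peak_multiplier=2.0))  # roughly $625,000
```

And that is only the single-sale line item; the customer-lifetime and brand costs above compound it.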
23. ClustrixDB
o ACID Compliant
o Transactions & Joins
o Optimized for OLTP
o Built-In Fault Tolerance
o Flex-Up and Flex-Down
o Minimal DB Admin
• Write + Read Linear Scale-Out
• Click to Elastically Add/Remove Servers
• MySQL-Compatible
24. Adding + Removing Nodes: Scaling Out + In
o Easy and simple Flex Up (or Flex Down)
– Single minimal 'database pause'
o All servers handle writes and reads
– Workload is spread across more servers after Flex Up
o Data is automatically rebalanced across the cluster
– Tables stay online for reads and writes
– MVCC for lockless reads while writing
[Slide diagram: a ClustrixDB cluster rebalancing data slices S1–S5 across nodes]
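ClustrixDB's rebalancer is internal to the product, but the general idea of moving only a fraction of the data when a node joins or leaves can be illustrated with consistent hashing. This is a generic sketch, not ClustrixDB's actual algorithm; node names and slice counts are made up:

```python
import bisect
import hashlib

def _h(key):
    # Stable hash so placement is deterministic across runs.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes: each physical node
    owns many points on the ring, smoothing the distribution."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))

    def node_for(self, key):
        points = [p for p, _ in self.ring]
        # First ring point clockwise of the key's hash (with wraparound).
        i = bisect.bisect(points, _h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"slice-{i}" for i in range(1000)]
before = Ring(["node1", "node2", "node3"])
after = Ring(["node1", "node2", "node3", "node4"])  # flex up: add a node
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
print(f"{moved} of 1000 slices moved")  # roughly a quarter, not all 1000
```

That "only a fraction moves" property is what makes an online flex-up or flex-down practical: most data stays put and stays serveable while the rest rebalances.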
25. Review: Questions for Today
o Why and when is scaling down MySQL a good idea?
– Periodic workloads, flash sales, new releases, etc.
o What options are there to scale down MySQL?
– Single Node: Shrink single node
– Master/Slave: Remove read slaves, shrink master
– Master/Master: Drop and/or shrink a master
– Sharding: Drop and combine shards
o How do I figure out the costs of not scaling down?
– Cost 1: undersubscribed resources
– Cost 2: budget impact on the ability to scale for future peaks
26. Review: Questions for Today
o How does ClustrixDB scale down differently from MySQL?
– Shared-nothing, scale-out, clustered RDBMS
– Simply add or drop nodes to scale out or scale in
o How real is elastic scaling in ClustrixDB? What are the catches?
– Add a node via its IP; add that IP to the load balancer. No app changes.
– Remove a node via its IP; remove that IP from the load balancer. No app changes.
– Minor 'database pause' during the multi-node 'group change'