9. over 40% increase from last year in QPS (25K last year)
additional 30K QPS moving over from Postgres
10. 4TB InnoDB Buffer Pool
1/3 of RAM is not dedicated to the pool (OS, disk, network buffers, etc.)
17. DATA
sync prod to dev, until prod data gets too big
http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/
18. Some Approaches
subsets of data
subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc.)
generated data can be time-consuming to fake
22. Edge Cases
what about testing edge cases and difficult-to-diagnose bugs?
it's hard to model the same data set that produced a user-facing bug
http://www.flickr.com/photos/sovietuk/141381675/sizes/l/in/photostream/
23. Perspective
another issue is testing problems at scale: complex and large gobs of data
a real social network ecosystem can be difficult to generate (favorites, follows)
(activity feed and “similar items” search give better results)
http://www.flickr.com/photos/donsolo/2136923757/sizes/l/in/photostream/
24. Prod Dev?
what most people do before data gets too big
almost 2 days to sync 20TB over a 1Gbps link, about 5 hrs over 10Gbps
bringing the prod dataset to dev meant expensive hardware and maintenance, keeping parity with prod, and applying schema changes would take at least as long
25. Use Production
(sometimes)
so we did what we saw as the last resort: we used production
not for greenfield development, more for mature features and diagnosing bugs
we still have a dev database, but the data is sparse and unreliable
27. goes without saying this can be dangerous
it's also difficult to do right; we've been working on this for a year
http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/
53. dangerous/unnecessary queries
(DEV) etsy_rw@jgoulah [test]> select * from fred_test;
ERROR 9001 (E9001): Selects from tables must have where clauses
-- filter dangerous queries (queries without a WHERE)
-- remove unnecessary queries (instead of DELETE, use a flag; ALTER statements don't run from dev)
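The filter described above could be sketched roughly like this. This is an illustrative approximation, not Etsy's actual implementation; the function name, regexes, and error strings are assumptions modeled on the error shown on the slide.

```python
import re

# Dev-side query filter sketch: reject SELECT/DELETE without a WHERE
# clause, refuse ALTER entirely. Hypothetical names and messages.
DANGEROUS = re.compile(r"^\s*(SELECT|DELETE)\b", re.IGNORECASE)
HAS_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)
BLOCKED = re.compile(r"^\s*ALTER\b", re.IGNORECASE)

def check_query(sql):
    """Return None if the query may run from dev, else an error string."""
    if BLOCKED.match(sql):
        return "ERROR: ALTER statements do not run from dev"
    m = DANGEROUS.match(sql)
    if m and not HAS_WHERE.search(sql):
        return "ERROR 9001 (E9001): %ss from tables must have where clauses" % m.group(1).capitalize()
    return None
```

In a real setup this check would sit in the database access layer, in front of every statement leaving a dev host.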
54. known in/egress funnel
we know where all of the queries from dev originate
http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/
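One way to make query origins knowable is to tag every statement with a tracing comment identifying the request, shard, and calling script; the format below mirrors the comment visible in the relay-log example on slide 100, but the function and its signature are assumptions, not Etsy's code.

```python
# Sketch: prepend an origin-tracing comment to every query so statements
# reaching prod from dev can always be traced back to their source.
# Format modeled on the /* [request] [shard] [script] */ comment seen
# in the relay-log example; the helper itself is hypothetical.
def tag_query(sql, request_id, shard, script):
    return "/* [%s] [%s] [%s] */ %s" % (request_id, shard, script, sql)
```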
60. stealth data
hiding data from users
(favorites go on the dev and prod shard; making sure test users/shops don't show up)
http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/
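The read-side half of the stealth-data idea might look roughly like this: test accounts living in prod are known, and display paths filter them out. The flag set and function are entirely hypothetical.

```python
# Sketch: keep a set of known test-account ids and drop their rows
# from anything user-facing, so stealth data never shows up in prod.
# Ids and names here are made up for illustration.
STEALTH_USER_IDS = {42, 1001}

def visible_favorites(favorites):
    """Filter out favorites belonging to stealth (test) users."""
    return [f for f in favorites if f["user_id"] not in STEALTH_USER_IDS]
```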
84. Delayed Slaves
pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it behind the master
http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/
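A typical invocation might look like the following; the hostname is a placeholder, and the delay/interval values are examples rather than the ones used in this deployment.

```shell
# Hold the slave's SQL thread so it stays 4 hours behind the master,
# checking once per minute (hostname is a placeholder).
pt-slave-delay --delay 4h --interval 1m h=dbslave-delayed
```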
85. Delayed Slaves
role of the delayed slave
also a source for BCP
(business continuity planning: prevention of and recovery from threats)
99. > SHOW SLAVE STATUS
Relay_Log_File: dbslave-relay.007178
Relay_Log_Pos: 8666654
on delayed slave
get the relay position
100. mysql> show relaylog events in "dbslave-relay.007178" from 8666654 limit 1\G
*************************** 1. row *******************
Log_name: dbslave-relay.007178
Pos: 8666654
Event_type: Query
Server_id: 1016572
End_log_pos: 8666565
Info: use `etsy_shard`; /*
[CVmkWxhD7gsatX8hLbkDoHk29iKo] [etsy_shard_001_B] [/
your/activity/index.php] */ UPDATE `news_feed_stats`
SET `time_last_viewed` = 1366406780, `update_time` =
1366406780 WHERE `owner_id` = 30793071 AND
`owner_type_id` = 2 AND `feed_type` = 'owner'
2 rows in set (0.00 sec)
on delayed slave
show relaylog events shows statements from the relay log
pass the relay log and position to start from
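The Info field above carries the query plus its tracing comment; pulling the metadata back out could look like this. The regex is an assumption about the comment format, based only on the example shown on this slide.

```python
import re

# Sketch: extract the /* [request_id] [shard] [script] */ tracing
# metadata from the Info field of a relay-log Query event.
INFO_COMMENT = re.compile(
    r"/\*\s*\[([^\]]*)\]\s*\[([^\]]*)\]\s*\[([^\]]*)\]\s*\*/\s*(.*)",
    re.DOTALL,
)

def parse_info(info):
    """Return a dict of tracing fields plus the SQL, or None if untagged."""
    m = INFO_COMMENT.search(info)
    if not m:
        return None
    request_id, shard, script, sql = m.groups()
    return {"request_id": request_id, "shard": shard,
            "script": script, "sql": sql}
```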
101. filter bad queries
cycle through all the logs, analyzing Query events
Rotate events point to the next log file
the last relay log points to the master's binlog (its server_id is the master's, and the binlog coordinates match master_log_file/pos)
http://www.flickr.com/photos/chriswaits/6607823843/sizes/l/in/photostream/
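The walk the slide describes can be sketched in pure Python over mocked event rows; in reality the events would come from repeated SHOW RELAYLOG EVENTS calls, and all names here are illustrative.

```python
# Sketch: cycle through relay logs, collecting Query events and
# following Rotate events to the next log file. When the final Rotate
# points at the master's binlog (a name we don't hold locally), the
# walk naturally stops -- matching "last relay log points to binlog
# master". Events are mocked as (event_type, server_id, info) tuples.
def collect_queries(logs, start_log):
    """logs: dict mapping relay-log name -> list of (type, server_id, info)."""
    queries, log = [], start_log
    while log in logs:
        next_log = None
        for ev_type, server_id, info in logs[log]:
            if ev_type == "Query":
                queries.append(info)
            elif ev_type == "Rotate":
                next_log = info  # Rotate's info names the next log file
        if next_log == log:      # guard against self-reference
            break
        log = next_log
    return queries
```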
110. testing "dry" writes
testing how the application runs a "dry" write --
in read-only mode, an exception is thrown with the exact query it would have attempted to run, the values it tried to use, etc.
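The mechanism above might be approximated like this; the class and method names are assumptions, not Etsy's actual DB layer.

```python
# Sketch of a "dry" write: in read-only mode the connection, instead of
# executing a write, raises an exception carrying the exact query and
# bind values it would have run.
class DryWriteAttempt(Exception):
    def __init__(self, sql, params):
        super().__init__("dry write: %s with %r" % (sql, params))
        self.sql, self.params = sql, params

class Connection:
    def __init__(self, read_only=True):
        self.read_only = read_only

    def execute(self, sql, params=()):
        if self.read_only and not sql.lstrip().upper().startswith("SELECT"):
            raise DryWriteAttempt(sql, params)
        # ...real execution would happen here...
```

Catching DryWriteAttempt in dev shows exactly what the application would have written, without touching the data.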
111. search ads campaign consistency
starting campaigns and maintaining consistency for the entire ad system is nearly impossible in dev
search ads data is stored in more than a dozen DB tables, and state changes are driven by a combination of browsers triggering ads, sellers managing their campaigns, and a slew of crons running anywhere from once per 5 minutes to once a month
e.g. to test pausing campaigns that run out of money mid-day, we can pull large numbers of campaigns from prod and operate on those to verify that the data will still be consistent
112. google product listing ads
GPLA is where we syndicate our listings to google to be used in google product search ads
we can test edge cases in GPLA syndication where it would be difficult to recreate the state in dev
113. testing prototypes
features like similar items search give better results in production because of the amount of data
this allowed us to test the quality of the listings a prototype was displaying
114. performance testing
need a real data set to test pages like treasury search with lots of threads/avatars/etc.
the dev data is too sparse: xhprof traces don't mean anything, and missing avatars change perf characteristics
115. hadoop generated datasets
datasets produced from hadoop (recommendations for users, or statistics about usage)
but since hadoop is prod data, it's for prod users/listings/shops, so we have to check against prod
syncing to dev would fill the dev dbs and the data wouldn't line up (because it's prod data)