MongoDB at Sailthru: Scaling and Schema Design

Sailthru handles all of your website's email delivery needs, ensuring inbox delivery for transactional and mass mail. Sailthru started out as a MySQL-powered transactional-mail service. Starting in 2009, we migrated to the document-oriented NoSQL database MongoDB. Moving entirely to MongoDB has allowed us to build complex user profiles that power behaviorally targeted mass emails and onsite recommendations. This talk covers how and why we made the move, and how we use MongoDB today.

  1. MongoDB at Sailthru: Scaling and Schema Design
      Ian White (@eonwhite), NoSQL Now!, 8/25/11
      Sunday, August 7, 2011
  2. Sailthru
      • API-based transactional email led to...
      • Mass campaign email led to...
      • Intelligence and user behavior
      • Three engineers built the ESP we always wanted to use
      • Some clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Fab, Totsy, New York Observer
  3. How We Got To MongoDB from SQL
      • JSON was part of Sailthru infrastructure from start (SQL columns and S3)
      • Kept a close eye on CouchDB project
      • MongoDB felt like natural fit
      • Used for user profiles and analytics initially
      • Migrated one table at a time (very, very carefully)
  4. Sailthru Architecture
      • User interface to display stats, build campaigns and templates, etc. (PHP/EC2)
      • API, link rewriting, and onsite endpoints (PHP/EC2)
      • Core mailer engine (Java/EC2 and colo)
      • Modified-postfix SMTP servers (colo)
      • 11 database servers on EC2 (for now)
  5. MongoDB Overview
      • 13 instances on EC2 (6 two-member replica sets, 1 backup server)
      • About 40 collections
      • About 1TB
      • Largest single collection is 500m docs
  6. Users are Documents
      • Users aren’t records split among multiple tables
      • End user’s lists, clickstream interests, geolocation, browser, time of day, and purchase history become one ever-growing document
  7. Profiles Accessible Everywhere
      • Put abandoned shopping cart notifications within a mass email
          {if profile.purchase_incomplete}
          <p>This is what’s in your cart:</p>
          {foreach profile.purchase_incomplete.items as item}
          {item.qty} <a href="{item.url}">{item.title}</a><br/>
          {/foreach}
          {/if}
  8. Profiles Accessible Everywhere
      • Show a section of content conditional on the user’s location
          {if profile.geo.city['New York, NY US']}
          <div>Come to the New York Meetup on the 27th!</div>
          {/if}
  9. Profiles Accessible Everywhere
      • Show different content depending on user interests as measured by on-site behavior
          {select}
          {case horizon_interest('black,dark')}
          <img src="http://example.com/dress-image-black.jpg" />
          {/case}
          {case horizon_interest('green')}
          <img src="http://example.com/dress-image-green.jpg" />
          {/case}
          {case horizon_interest('purple,polka_dot,pattern')}
          <img src="http://example.com/dress-image-polkadot.jpg" />
          {/case}
          {/select}
  10. Profiles Accessible Everywhere
      • Pick top content from a data feed based on tags
          {content = horizon_select(content,10)}
          {foreach content as c}
          <a href="{c.url}">{c.title}</a><br/>
          {/foreach}
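To make the "one document per user" idea concrete, here is a minimal sketch in plain JavaScript of what such a profile might look like and how the abandoned-cart and location conditionals read against it. The field names `purchase_incomplete`, `items`, `qty`, `url`, `title`, and `geo.city` come from the template snippets above; the values, the `email` field, and the assumption that `geo.city` maps a city name to a count are illustrative guesses, not Sailthru's actual schema.

```javascript
// Hypothetical profile document. Field names follow the template examples;
// the values and the count stored under geo.city are assumptions.
const profile = {
  email: "user@example.com",
  geo: { city: { "New York, NY US": 3 } },
  purchase_incomplete: {
    items: [
      { qty: 1, url: "http://example.com/dress", title: "Black Dress" },
      { qty: 2, url: "http://example.com/socks", title: "Green Socks" },
    ],
  },
};

// Plain-JavaScript equivalent of the abandoned-cart template block.
function renderCart(profile) {
  const cart = profile.purchase_incomplete;
  if (!cart) return ""; // mirrors a false {if profile.purchase_incomplete}
  const lines = cart.items.map(
    (item) => `${item.qty} <a href="${item.url}">${item.title}</a><br/>`
  );
  return ["<p>This is what's in your cart:</p>", ...lines].join("\n");
}

// Equivalent of {if profile.geo.city['New York, NY US']}.
function isInCity(profile, city) {
  return Boolean(profile.geo && profile.geo.city && profile.geo.city[city]);
}
```

Because everything lives on one document, both checks are simple property lookups with no joins.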
  11. Other Advantages of MongoDB
      • High performance
      • Take any parameters from our clients
      • Really flexible development
      • Great for analytics (internal and external)
      • No more downtime for schema migrations or reindexing
  12. How We Run mongod
          mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal
      • Don’t ever run without replication
      • Don’t ever kill -9
      • Don’t run without writing to a log
      • Run behind a firewall
      • Use journaling now that it’s there
      • Use --rest, it’s handy
  13. Separate DBs By Collections
      • Lower-effort than auto-sharding
      • Separate databases for different usage patterns
      • Consider consequences of database failure/unavailability
      • But make sure your backup and monitoring strategy is prepared for multiple DBs
  14. Our Five Replica Sets
      • main: most of the stuff on the UI, lots of small/medium collections
      • horizon: realtime onsite browsing data
      • profile: user profile data (60m user docs)
      • message: last three months of emails
      • archive: emails older than three months
  15. Monitoring
      • Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average
      • We check basic status once a minute on all database servers (SMS alerts if down), and send email warnings on thresholds every 10 minutes
      • Have been beta-ing 10gen’s MMS product
  16. Backups
      • Used to use mongodump - don’t do that anymore
      • Have single node of each replica set on a backup server
      • Two-hour slave delay
      • fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
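The ordering in the last backup bullet matters: the lock and the freeze have to be released even if the snapshot fails. A minimal sketch of that sequencing in JavaScript follows. Every step is an injected, hypothetical function (none of them is a real MongoDB or AWS call), and the sketch is synchronous only to stay short; real steps would be shell or API calls.

```javascript
// Runs the snapshot steps in the order given on the slide and always
// releases the filesystem freeze and the database lock afterwards.
// `steps` is a hypothetical object of caller-supplied functions.
function snapshotBackup(steps) {
  steps.fsyncLock(); // flush pending writes to disk and block new ones
  try {
    steps.freezeFilesystem(); // e.g. freeze the XFS volume
    try {
      steps.ebsSnapshot(); // take the EBS snapshot
    } finally {
      steps.unfreezeFilesystem(); // runs even if the snapshot throws
    }
  } finally {
    steps.fsyncUnlock(); // runs even if freezing or snapshotting throws
  }
}
```

The nested `try`/`finally` blocks are the design point: a failed snapshot should never leave the backup node frozen or locked.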
  17. The Great EC2 EBS Outage Adventure
      • We survived
      • Most of our nodes unavailable for 2-4 days
      • Were able to spin up new instances from backup server, snapshots, and get operational within hours
      • Wasn’t fun
  18. DESIGN
  19. Develop Your Mental Model of MongoDB
      • You don’t need to look at the internals
      • But try to gain a working understanding of how MongoDB operates, especially RAM and indexes
  20. Big-Picture Design Questions
      • What is the data I want to store?
      • How will I want to use that data later?
      • How big will the data get?
      • If the answers are “I don’t know yet”, guess with your best YAGNI
  21. “But premature optimization is evil”
      • Knuth said that about code, which is flexible and easy to optimize later
      • Data is not as flexible as code
      • So doing some planning for performance is usually good when it comes to your data
  22. Specific MongoDB Design Questions
      • Embed vs top-level collection?
      • Denormalize (double-store data)?
      • How many/which indexes?
      • Arrays vs hashes for embedding?
      • Implicit schema (field names and types)
  23. Short Field Names?
      • Disk space: cheap
      • RAM: not cheap
      • Developer time: expensive
      • Err towards compact, readable field names
      • Might be worth writing a mapper
      • Probably wish we’d used c instead of client_id
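One way to read "might be worth writing a mapper" is a small translation layer that lets application code use readable names while stored documents use short ones. The sketch below assumes that reading. Only the `client_id` to `c` pairing comes from the slide; the other entries and the function names are hypothetical.

```javascript
// Hypothetical long-to-short field map. Only client_id -> c is from the slide.
const SHORT_NAMES = { client_id: "c", email: "e", send_time: "t" };

// Inverse map, built once, for reading stored documents back.
const LONG_NAMES = Object.fromEntries(
  Object.entries(SHORT_NAMES).map(([long, short]) => [short, long])
);

// Renames top-level keys found in `names`; unknown keys pass through unchanged.
function renameKeys(doc, names) {
  return Object.fromEntries(
    Object.entries(doc).map(([key, value]) => [
      Object.prototype.hasOwnProperty.call(names, key) ? names[key] : key,
      value,
    ])
  );
}

const toStored = (doc) => renameKeys(doc, SHORT_NAMES);
const fromStored = (doc) => renameKeys(doc, LONG_NAMES);
```

For example, `toStored({ client_id: 76, email: "a@example.com" })` yields a document with keys `c` and `e`, and `fromStored` reverses it. This sketch handles top-level fields only; nested documents would need a recursive version.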
  24. Favor Human-Readable Foreign Keys
      • DBRefs are a bit cumbersome
      • Referencing by MongoId often means doing extra lookups
      • Build human-readable references to save you doing lookups and manual joins
  25. Example
      • Store the Template and the Email as strings on the message object
      • { template: "Internal - Blast Notify", email: "support-alerts@sailthru.com" }
      • No external reference lookups required
      • The tradeoff is basically just disk space
  26. Embed vs Top-Level Collections?
      • Major question of MongoDB schema design
      • If you can ask the question at all, you might want to err on the side of embedding
      • Don’t embed if the embedding could get huge
      • Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection
  27. Typical Properties of Top-Level Collections
      • Independence: They don’t “belong” conceptually to another collection
      • Nouns: the building blocks of your system
      • Easily referenceable and updatable
  28. Embedding Pros
      • Super-fast retrieval of document with related data
      • Atomic updates
      • “Ownership” of embedded document is obvious
      • Usually maps well to code structures
  29. Embedding Cons
      • Harder to get at, do mass queries
      • Does not scale up infinitely, will hit 16MB limit
      • Hard to create references to embedded object
      • Limited ability to indexed-sort the embedded objects
  30. If You Think You Can Embed
      • You probably should
      • I take advantage of embedding in my designs more often now than I did three years ago
      • It’s a gift MongoDB gives you in exchange for giving up your joins
  31. Design Example: User Permissions
      • Users can have various broad permission levels for any number of clients
      • For example, user ‘ploki’ might have permission level ‘admin’ for client 76 and permission level ‘reports_only’ for client 450
  32. How Will We Use This Data?
      • Retrieve all clients for a given user
      • Retrieve all users for a given client
      • Retrieve a permission level for a given client for a given user
  33. How Will This Data Grow?
      • In the medium term, it will stay small
      • Number of clients and number of users can both grow infinitely
  34. Back in SQL-land
      • There’s a fairly standard way to do it
      • It’s a many-many relationship, so
      • Use a join table (client_user)
  35. Should We Use a New Top-Level Collection?
          db.client.user.save({ client_id: 76, username: 'ploki', permission: 'admin' });
          db.client.user.save({ client_id: 450, username: 'ploki', permission: 'reports_only' });
          db.client.user.ensureIndex({ client_id: 1 });
          db.client.user.ensureIndex({ username: 1 });
          // get all users belonging to a client
          db.client.user.find({ client_id: 76 });
          // get all clients a user has access to
          db.client.user.find({ username: 'ploki' });
          // get the permission level for our current user on client 76
          db.client.user.findOne({ client_id: 76, username: user.name });
  36. Probably Not
      • Only needed if we have lots of clients per user AND lots of users per client
      • This is a case where we can embed, so let’s do so
  37. Three Ways to Embed
      • Not good: Object
          'clients': { '76': 'admin', '450': 'reports_only' }
          index: ???
        Can’t do a multikey index on the keys of a hash
      • Okay: Array of objects
          'clients': [ {'_id': 76, 'access': 'admin'}, {'_id': 450, 'access': 'reports_only'} ]
          index: { 'clients._id': 1 }
        But have to search through the array on the retrieved doc to find by _id
      • Our approach: Array and object
          'clients': [ 76, 450 ],
          'clients_access': { '76': 'admin', '450': 'reports_only' }
          index: { clients: 1 }
        Fields next to each other alphabetically
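The third layout ("Our approach") answers all three access patterns from the earlier "How Will We Use This Data?" slide. The sketch below shows this against in-memory stand-ins for user documents in plain JavaScript. The second user's permissions and the helper function names are hypothetical. In MongoDB the "users for a client" lookup would be a query on `clients` served by the `{ clients: 1 }` index, whereas here it is a linear scan.

```javascript
// In-memory stand-ins for user documents using the array-plus-object layout.
// The second user and its permissions are made up for illustration.
const users = [
  {
    username: "ploki",
    clients: [76, 450],
    clients_access: { "76": "admin", "450": "reports_only" },
  },
  { username: "ibwhite", clients: [76], clients_access: { "76": "admin" } },
];

// All clients for a user: read the array straight off the document.
function clientsForUser(user) {
  return user.clients;
}

// All users for a client: the array is what a multikey index would cover.
function usersForClient(allUsers, clientId) {
  return allUsers
    .filter((user) => user.clients.includes(clientId))
    .map((user) => user.username);
}

// Permission for one client: a direct key lookup, no array search needed.
function permissionFor(user, clientId) {
  const level = user.clients_access[String(clientId)];
  return level === undefined ? null : level;
}
```

The array exists for indexing and the object for constant-time lookup, at the cost of keeping the two fields in sync on every write.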
  38. Indexes
      • Index all highly frequent queries
      • Do less-indexed queries only on secondaries
      • Reduce the size of indexes wherever you can on big collections
      • Don’t sweat the medium-sized collections, focus on the big wins
  39. Take Advantage of Multiple-Field Indexes
      • Order matters
      • If you have an index on { client_id: 1, email: 1 }
      • Then you also have the { client_id: 1 } index “for free”
      • But not { email: 1 }
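The "for free" rule is a prefix rule: a compound index can stand in for any index on its leading fields. A minimal sketch of that check in JavaScript is below. It is a simplified model for equality queries only, with a hypothetical function name, and it says nothing about how MongoDB's query planner handles sorts, ranges, or queries with extra fields.

```javascript
// Returns true when the queried fields are exactly a leading prefix of the
// index's fields (in any order within the query). Simplified model only.
function isIndexPrefix(indexFields, queryFields) {
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = indexFields.slice(0, queryFields.length);
  return queryFields.every((field) => prefix.includes(field));
}

const compoundIndex = ["client_id", "email"];
const byClient = isIndexPrefix(compoundIndex, ["client_id"]);
const byClientAndEmail = isIndexPrefix(compoundIndex, ["email", "client_id"]);
const byEmailOnly = isIndexPrefix(compoundIndex, ["email"]);
```

Here `byClient` and `byClientAndEmail` are `true`, while `byEmailOnly` is `false`, which is why a separate `{ email: 1 }` index would still be needed for email-only lookups.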
  40. Use Your _id
      • Every document must have an _id, and every collection indexes it, which will cost you index size
      • So do something useful with _id
  41. Take Advantage of Fast ^ Prefix Queries on Indexes
      • Messages have _ids like: 32423.00000341
      • Need all messages in blast 32423:
      • db.message.blast.find( { _id: /^32423\./ } );
      • (Yeah, I know the . is ugly. Don’t use a dot if you do this.)
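A regex anchored with `^` on a literal prefix is fast because it is equivalent to a range over sorted strings, which is what an index scan provides. The sketch below demonstrates that equivalence on an in-memory list of `_id` strings. The sample ids other than `32423.00000341` and the function names are made up. It also shows why the dot needs escaping: an unescaped `.` matches any character, so it would also pick up blast 324231.

```javascript
// Sample _id values; only 32423.00000341 appears in the slide.
const ids = [
  "32422.00000009",
  "32423.00000001",
  "32423.00000341",
  "324231.00000001",
  "32424.00000001",
];

// Escapes regex metacharacters so the prefix is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Prefix match via an anchored regex, as in the find() on the slide.
function idsWithPrefixRegex(allIds, blastId) {
  const pattern = new RegExp("^" + escapeRegExp(blastId + "."));
  return allIds.filter((id) => pattern.test(id));
}

// Equivalent range: strings starting with "32423." sort at or after "32423."
// and before "32423/" ("/" is the next character after "." in ASCII).
function idsInPrefixRange(allIds, blastId) {
  const lower = blastId + ".";
  const upper = blastId + "/";
  return allIds.filter((id) => id >= lower && id < upper);
}

// An unescaped dot matches any character, so this wrongly matches blast 324231.
const unescapedMatchesWrongBlast = /^32423./.test("324231.00000001");
```

Both functions return the same two ids for blast `"32423"`, and `unescapedMatchesWrongBlast` is `true`, which illustrates the slide's advice to pick a separator that is not a regex metacharacter.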
  42. Manual Range Partitioning
      • We moved a big message.blast collection into per-day collections:
      • message.blast.20110605, message.blast.20110606, message.blast.20110607, etc.
      • Keeps working set indexes smaller
      • When we move data into the archive, drop() is much faster than remove()
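Per-day collections need two small helpers: one to derive the collection name for a date, and one to decide which collections are old enough to archive and drop. The sketch below is one way to write them in JavaScript. The `message.blast.YYYYMMDD` naming is from the slide; the function names, the use of UTC day boundaries, and the `keepDays` parameter are assumptions.

```javascript
// Builds the per-day collection name, e.g. message.blast.20110605.
// Assumes days are cut at UTC midnight.
function blastCollectionName(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `message.blast.${year}${month}${day}`;
}

// Returns the per-day collections strictly older than `keepDays` days before
// `today`. YYYYMMDD suffixes sort lexicographically in date order, so a plain
// string comparison against the cutoff name is enough.
function collectionsToArchive(names, today, keepDays) {
  const cutoff = new Date(
    Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate() - keepDays)
  );
  const cutoffName = blastCollectionName(cutoff);
  return names.filter(
    (name) => name.startsWith("message.blast.") && name < cutoffName
  );
}
```

Once a day's data has been copied to the archive, the whole collection can be dropped in one operation instead of removing documents one by one.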
  43. Questions?
      Looking for a job? ian@sailthru.com
      twitter.com/eonwhite
