Lessons Learned from Migrating 2+ Billion Documents at Craigslist
Jeremy Zawodny
jzawodn@craigslist.org
Jeremy@Zawodny.com
http://blog.zawodny.com/

Outline
  • Recap of last year’s MongoSV talk: The Archive, Why MongoDB, etc.
    http://www.10gen.com/video/mongosv2010/craigslist
  • The Infrastructure
  • The Lessons
  • Wishlist
  • Q&A

Craigslist Numbers
  • 2 data centers
  • ~500 servers
  • ~100 MySQL servers
  • ~700 cities, worldwide
  • ~1 billion hits/day
  • ~1.5 million posts/day

Archive: Where Data Goes To Die
  • Live numbers: ~1.75M posts/day, ~14 day avg. lifetime, ~60 day retention, ~100M posts
  • We keep all postings
  • Users reuse postings
  • Daily archive migration
  • Internal query tools

Archive Pain
  • Coupled Schemas
  • Big Indexes
  • Hardware Failures
  • Replication Lag
  • Poor Search
  • Human Time Costs

MongoDB Wins
  • Scalable
  • Fast
  • Friendly
  • Proven
  • Pragmatic
  • Approachable

MongoDB Details
  • Plan for 5 billion documents
  • Average size: 2KB
  • 3 replica sets, 3 servers each
  • Deploy to 2 datacenters
  • Same deployment in each datacenter
  • Posting ID is the sharding key

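As a rough sanity check on the plan above, the sizing arithmetic can be sketched in Python. This is only a back-of-the-envelope estimate: it assumes "2KB" means 2,048 bytes and that documents spread evenly across shards, and it ignores indexes, padding, and the oplog, none of which the slides quantify.

```python
# Back-of-the-envelope sizing from the slide's figures.
DOCS = 5_000_000_000        # planned document count (from the slide)
AVG_DOC_BYTES = 2 * 1024    # "2KB" average; 1 KB = 1024 bytes is an assumption
SHARDS = 3                  # 3 replica sets, one per shard
REPLICAS = 3                # 3 servers per replica set


def sizing(docs=DOCS, avg=AVG_DOC_BYTES, shards=SHARDS, replicas=REPLICAS):
    """Return (logical bytes, bytes per shard, raw bytes per datacenter)."""
    total = docs * avg                  # one copy of every document
    per_shard = total // shards         # assumes an even spread across shards
    raw_per_datacenter = total * replicas
    return total, per_shard, raw_per_datacenter


if __name__ == "__main__":
    labels = ("logical data", "per shard", "raw per datacenter")
    for label, value in zip(labels, sizing()):
        print(f"{label}: {value / 1e12:.2f} TB")
```

With the slide's numbers this works out to roughly 10 TB of documents, or about 3.4 TB per shard, which is far more than fits in RAM on any single server.
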
MongoDB Architecture
  • Typical sharding with replica sets (external Sphinx full-text indexers not pictured)
  • [Diagram: 4 clients, 3 mongos routers, 3 config servers, and 3 shards (shard001, shard002, shard003), each shard a replica set]

Lesson: Know Your Hardware
  • MongoDB on blades really sucks
  • Single 10k RPM disks can’t take it when data is noticeably larger than RAM
  • Mongo operations can hit the client timeout (30 sec default)
  • Even minutely cron jobs start to spew
  • Lots of time wasted in the development environment, trying different kernels, tuning, etc.
  • Most noticeable during heavy writes, but can happen if pages fall out of RAM for other reasons

Lesson: Replica Sets Rock
  • Lots of reboots happened during dev environment troubleshooting
  • Each time, one of the remaining nodes took over
  • No “reclone”, no config file or DNS changes
  • Stuff “just worked” while nodes bounced up and down

Lesson: Know Your Data
  • MongoDB is UTF-8
  • Some of our older data is decidedly NOT UTF-8
  • We had lots of sloppy encoding issues, and we had to clean them all up.
  • Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh.
  • This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you. Get your encoding house in order!

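One way to avoid the 12-36 hour fail-and-restart cycle described above is to scan for invalid UTF-8 before starting the load. The following is a minimal Python sketch, not the code used at Craigslist. The function names are made up for illustration, and the `cp1252` fallback is purely an assumption, since the slides do not say which encodings the old data actually used.

```python
def to_utf8_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes as UTF-8, falling back to a legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: rows that are not UTF-8 share one known legacy encoding.
        # Undecodable bytes become U+FFFD instead of aborting the whole load.
        return raw.decode(fallback, errors="replace")


def find_bad_rows(rows):
    """Return the indexes of rows that are not valid UTF-8 (a cheap dry run)."""
    bad = []
    for i, raw in enumerate(rows):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(i)
    return bad
```

Running `find_bad_rows` over an export first surfaces every offending record in one pass, rather than one failure per load attempt.
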
Lesson: Know Your Data Size
  • MongoDB has a document size limit: 4MB in 1.6.x, 16MB in 1.8.x
  • What to do with outliers? In our case, trim off some useless data.
  • But going from relational to document means this sort of problem is easy to have. One parent, many children.
  • It’d be nice if this was easier to change, but clients have it hard-coded too.
  • Compression would help, of course.

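Outliers are easier to deal with when they are found before the insert fails. The sketch below partitions documents by approximate size. It uses the JSON-encoded length as a stand-in, which is an assumption: the real limit applies to the BSON encoding, so treat the result as an estimate and leave some headroom. The function names are hypothetical.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024   # the 1.8.x limit quoted on the slide


def approx_size(doc: dict) -> int:
    """Approximate a document's size via its UTF-8 JSON encoding (not BSON)."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def split_outliers(docs, limit: int = MAX_DOC_BYTES):
    """Partition docs into (fits, too_big) lists by approximate size."""
    fits, too_big = [], []
    for doc in docs:
        (too_big if approx_size(doc) > limit else fits).append(doc)
    return fits, too_big
```

The `too_big` list is where a trimming step, like the one mentioned on the slide, would be applied before retrying.
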
Lesson: Know Your Data Types
  • Field type conversions can be expensive to do after the fact!
  • MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious
  • This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789”
  • http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod

Data Types, continued
  • “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.”
  • Do you know how to do that in your language of choice?
  • Some drivers may make a “guess” that gets it right most of the time.

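The advice quoted above, to convert data to the expected types before sending, might look like this in Python. This is a sketch only: the field names in `SCHEMA` are hypothetical, because the slides do not show the real posting schema, and the talk's own stack was Perl.

```python
# Hypothetical field names; the real posting schema is not shown in the slides.
SCHEMA = {"posting_id": int, "title": str, "price": float}


def coerce(doc: dict, schema: dict = SCHEMA) -> dict:
    """Return a copy of doc with known fields converted to their expected types."""
    out = {}
    for key, value in doc.items():
        want = schema.get(key)
        if want is not None and value is not None and not isinstance(value, want):
            value = want(value)   # raises ValueError on junk such as "12a"
        out[key] = value
    return out
```

For example, `coerce({"posting_id": "123456789"})` yields an integer `posting_id`, so a later lookup for the number 123456789 matches what was stored.
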
Lesson: Know Some Sharding
  • The Balancer can be your frenemy
  • Initial insert rate: 8,000/sec
  • Later drops to 200/sec
  • Too much time spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again
  • Pre-split your data if possible
  • http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/

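The planning half of pre-splitting can be sketched as follows: pick chunk boundaries over the posting ID range up front and decide which shard each chunk should live on, so the balancer has little to move during the load. This is an illustration and not the procedure from the linked blog post. It assumes posting IDs are integers spread roughly evenly over the range, and it only produces a plan; applying it with MongoDB's chunk split and move commands is not shown.

```python
def split_points(min_id: int, max_id: int, chunks: int) -> list:
    """Evenly spaced boundaries that cut [min_id, max_id) into `chunks` ranges."""
    if chunks < 1 or max_id <= min_id:
        raise ValueError("need chunks >= 1 and max_id > min_id")
    span = max_id - min_id
    return [min_id + span * i // chunks for i in range(1, chunks)]


def placement_plan(min_id: int, max_id: int, chunks: int, shards: list) -> list:
    """Pair each chunk's lower bound with a shard, assigned round-robin."""
    bounds = [min_id] + split_points(min_id, max_id, chunks)
    return [(low, shards[i % len(shards)]) for i, low in enumerate(bounds)]
```

Calling `placement_plan(0, 90, 3, ["shard001", "shard002", "shard003"])` assigns the chunks starting at 0, 30, and 60 to the three shards in order.
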
Lesson: Know Some Replica Sets
  • Replica set re-sync requires index rebuilds on the secondary
  • Most painful when a slave is down too long and can’t catch up using the oplog
  • Typically during high write volumes
  • In a large data set, the index rebuilding can take a couple of days, even without many indexes
  • What if you lose another while that is happening?

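How long a node can be down before it falls off the oplog can be estimated from the oplog size and the write rate. The sketch below is simplified arithmetic under stated assumptions: it treats each oplog entry as being about the size of the document written and ignores per-entry overhead, and the 50 GiB oplog in the usage example is a made-up figure that does not come from the talk.

```python
def oplog_window_hours(oplog_bytes: int, write_bytes_per_sec: float) -> float:
    """Hours of writes an oplog of the given size can hold at a steady rate."""
    if write_bytes_per_sec <= 0:
        raise ValueError("write rate must be positive")
    return oplog_bytes / write_bytes_per_sec / 3600


if __name__ == "__main__":
    oplog = 50 * 1024**3               # hypothetical 50 GiB oplog
    per_shard = 8_000 * 2 * 1024 / 3   # 8,000 inserts/sec of 2KB docs over 3 shards
    print(f"{oplog_window_hours(oplog, per_shard):.1f} hours")
```

Under these illustrative inputs the window is roughly 2.7 hours, which shows why high write volumes are when a lagging node is most likely to need a full re-sync.
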
MongoDB Wishlist
  • Replica set node re-sync without index rebuilding
  • Record (or field) compression (not everyone uses a filesystem that offers compression)
  • Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.)
  • Hash-based sharding (coming soon?)
  • Cluster snapshot/backup tool

craigslist is hiring!
send resumes to: z@craigslist.org
Plain Text or PDF, no Word Docs!
  • Front-end Engineering: HTML, CSS, JavaScript, jQuery (Mobile too)
  • Network Administration: Routers, switches, load balancers, etc.
  • Back-end Engineering: Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc.
  • Systems Administration: Help keep all those systems running.

craigslist is hiring!
send resumes to: z@craigslist.org
Plain Text or PDF, no Word Docs!
  • Laid back, non-corporate environment
  • Engineering-driven culture
  • Lots of interesting technical challenges
  • Easy SF commute
  • Excellent benefits and pay
  • High-impact work
  • Millions use craigslist daily
