Docstoc.com (founded in 2007, acquired by Intuit in 2013) is one of the largest online repositories of documents. A critical component of our product is our text file service, which delivers text documents to both humans and crawlers. In early 2013 this service, which was file system based, became a prohibitive bottleneck. To meet our scaling needs, we replaced it with one backed by a sharded MongoDB cluster. This talk will cover:
• Our traffic load (5:1 bots:humans ratio)
• How we implemented the system in our SOA environment
• How MongoDB fit our use case out of the box
• How we load tested peak time traffic before hardware purchase
• How we loaded the system and how we rolled it out live
• Performance metrics and gains in stability and reliability
Scalable Text File Service with MongoDB (Intuit)
1. Scalable File System
In 14 Days
Jeff Hoffer, Software Architect
Alex Zherdev, Sr. Software Engineer
2. Our Background
In the beginning...
“YouTube” for Documents
Today
“Make every small business better”
Professional Documents
Custom Documents
Business Licenses
Jason Nazar
Alon Shwartz
The Team
4. Initial Approach
Pros:
• Existing libraries used
• Reliable storage
• Replication
Cons:
• Hard to scale out
• Replication can’t keep up
• Taxed all data
SELECT `text_data` FROM `documents` WHERE `doc_id` = 8675309;
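The query above is a single-key lookup: one `doc_id` in, one blob of text out. A minimal sketch of that access pattern follows. It uses Python's built-in `sqlite3` purely as a stand-in, since the slides do not name the database engine; the table layout and the `fetch_text` helper are assumptions made for illustration.

```python
import sqlite3


def fetch_text(conn, doc_id):
    """Return the stored text for doc_id, or None if no such document exists."""
    row = conn.execute(
        "SELECT text_data FROM documents WHERE doc_id = ?",  # parameterized lookup
        (doc_id,),
    ).fetchone()
    return row[0] if row else None


# Hypothetical in-memory table mirroring the column names used on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, text_data TEXT)")
conn.execute("INSERT INTO documents VALUES (?, ?)", (8675309, "sample text"))
conn.commit()

print(fetch_text(conn, 8675309))
```

Because every read is keyed on the document id alone, the workload needs no joins, which is the kind of pattern that maps naturally onto a key-addressed document store.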
5. IIS HTTP Based Solution
Pros:
• HTTP GET
• IIS Static Content Cache
• 5TB = Years of Growth
• Easy Setup & Deploy
Cons:
• Not scalable
• NTFS & 30M small files
• Replication In-House
HTTP GET http://docs.api/text/160717/8675309.txt
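The URL above suggests a simple path scheme: a base address, a `text` prefix, a numeric folder segment, and the document id with a `.txt` suffix. The sketch below builds such URLs. The meaning of the `160717` segment is not stated in the slides, so the `folder` parameter name is an assumption, and the second base URL is purely hypothetical.

```python
def text_url(base_url, folder, doc_id):
    """Build a text-file URL following the path layout shown on the slide."""
    return f"{base_url.rstrip('/')}/text/{folder}/{doc_id}.txt"


# Base URL taken from the slide.
iis_url = text_url("http://docs.api", 160717, 8675309)

# Hypothetical replacement host: only the base URL differs, the path is unchanged.
new_url = text_url("http://new-docs.api.example", 160717, 8675309)

print(iis_url)
print(new_url)
```

Keeping the path layout fixed and varying only the base address is one way a client could be pointed at a different backend without further code changes.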
6. Importance of Performance
• IIS source failed early 2013
• Page speed heavily influenced our traffic and SEO
• MongoDB solution implemented within 2 weeks and results immediately felt
[Chart: Speed and Views]
7. Requirements
Sharded – horizontal scale out of reads and writes
Replication – no single point of failure for core business data
Doc Page Peak Read Load of 200 / second < 4s
REST Interface – switch only requires changing URL
Easy to Maintain – maintenance cost of no more than 1 FTE day / month
99.9% uptime
Can handle our current set of text files (43 M)
Production Rollout within 3 weeks
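To make two of the targets above concrete, the following sketch converts the 99.9% uptime requirement into an allowed-downtime budget and the 200 reads/second peak into an hourly volume. The 30-day month and 365-day year are assumptions chosen for the arithmetic, not figures from the slides.

```python
UPTIME_TARGET = 0.999          # 99.9% uptime requirement
PEAK_READS_PER_SECOND = 200    # doc page peak read load


def allowed_downtime_minutes(uptime, days):
    """Minutes of downtime permitted over `days` days at the given uptime fraction."""
    return (1 - uptime) * days * 24 * 60


# Assumed 30-day month: roughly 43.2 minutes of downtime allowed.
monthly_minutes = allowed_downtime_minutes(UPTIME_TARGET, 30)

# Assumed 365-day year: roughly 8.76 hours of downtime allowed.
yearly_hours = allowed_downtime_minutes(UPTIME_TARGET, 365) / 60

# Reads served in one hour if the peak rate were sustained: 720,000.
reads_per_peak_hour = PEAK_READS_PER_SECOND * 3600

print(round(monthly_minutes, 1), round(yearly_hours, 2), reads_per_peak_hour)
```

Under these assumptions, the uptime target leaves well under an hour of downtime per month.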
23. In Conclusion…
It’s Good Enough, It’s Fast Enough, and Doggone It, Developers Like It!
• Fast Prototype
• Low Maintenance
• Quick Deployment
• Scale Out
• Stable
• Linux, Windows, Mac
• Excellent Support