- Coursera is an ed-tech startup providing massive open online courses from top universities to over 2.5 million users, with around 9 million course enrollments.
- They needed a search solution for their forums due to the limitations of MySQL full text search in handling natural language queries and relevance at scale.
- CloudSearch was selected because it provided fast, relevant search with low maintenance; compared with alternatives such as Solr, it was easier to use and already part of AWS. It currently indexes around 1.5 million documents to power forum search.
2. About
• Ed-Tech startup providing MOOCs
o Massive Open Online Courses
• New company -- launched 4/18/12
o Less than a year old.
• 215 free courses from 33 top universities
o Princeton, Stanford, Penn, Duke, etc...
o From Cryptography to Modern and Contemporary American Poetry
• 2.5+ million users
o We reached a million users faster than Facebook and Pinterest.
• ~9 million course enrollments
3. Platform Scale
• Moderate-sized (>10,000 concurrent users)
• 65 concurrent courses running now, each with tens of thousands of enrollments
• >600 "pretty heavy" PHP/Python dynamic pages served per second, sustained
o Pages might make backend calls to services (e.g. CloudSearch or SES), so we want low latencies
• Various other services (70+ instances running on EC2 at the moment)
• Spiky traffic
o People procrastinate on deadlines - spiky on the weekends
4. Stack
• PHP / Python / Scala backed by MySQL
• Runs on AWS completely
• Utilizes lots of AWS services
o EC2 / ELB for servers
o MySQL RDS for databases
o S3 for video and static hosting
o CloudFront for video / asset hosting
o SES for emails (>1 million emails every day)
o SQS for long-running tasks (video encoding, gradebook generation, etc.)
o SNS for notification services
o Route 53 for DNS
o CloudSearch for forum search
5. Why CloudSearch?
• Forum search was a big issue for us back in March / April. The solution we had then didn't work
o MySQL Full Text Search
§ LIKE '%x%' as "natural language" search?
§ Really terrible results
§ MyISAM (eww...)
• Requirements:
o Fast searches (we call backend APIs - don't want to keep the users waiting too long)
o Good results (need to be relevant - don't waste the students' time)
o Low/no maintenance (we have enough instances to manage as is)
6. Why CloudSearch?
• Alternatives we looked at:
o Apache Solr, Sphinx, fiddling with MySQL
• Then CloudSearch was announced...
• Early public adopter - we started using CloudSearch ~10 days after the announcement
o We didn't get any heads-up about CloudSearch before the public announcement
o Wrote the code to use CloudSearch and imported our existing forum posts / comments in 2 or 3 days.
§ From decision to production!
§ Easy to use and great documentation
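As an illustration of what importing forum posts into CloudSearch can involve, here is a minimal Python sketch that turns posts into JSON document batches. It is not Coursera's import code. The field names (`post_id`, `course_id`, `author_id`, `body`) and the `post_` id prefix are hypothetical, and the batch shape assumes the JSON "add" operation format of the later 2013-01-01 CloudSearch API; the API version available at the time of this talk also expected `version` and `lang` on each document. The 5 MB batch limit is likewise an assumption to check against the current documentation.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumed per-batch upload limit


def to_add_operation(post):
    """Turn one forum post (a dict) into an "add" operation."""
    return {
        "type": "add",
        "id": "post_{}".format(post["post_id"]),  # hypothetical id scheme
        "fields": {
            "course_id": post["course_id"],
            "author_id": post["author_id"],
            "body": post["body"],
        },
    }


def build_batches(posts, max_bytes=MAX_BATCH_BYTES):
    """Yield JSON strings, each a list of operations kept under max_bytes."""
    batch, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for post in posts:
        encoded = json.dumps(to_add_operation(post))
        # +1 for the separating comma (slightly conservative estimate)
        cost = len(encoded.encode("utf-8")) + 1
        if batch and size + cost > max_bytes:
            yield "[" + ",".join(batch) + "]"
            batch, size = [], 2
        batch.append(encoded)
        size += cost
    if batch:
        yield "[" + ",".join(batch) + "]"
```

Each yielded string would then be posted to the search domain's document endpoint. Generating batches lazily keeps memory use flat even when importing every existing post and comment in one run.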
8. CloudSearch Uses
• Analytics
o Most frequent searches and other statistics about their courses
§ Informing instructors about this so they can clarify information
o Finding posts across forums
§ Easy for CloudSearch, normally hard because the data is sharded (a scatter-gather problem)
• Old way: Querying 600 databases on 4 RDS servers? Not fun
§ Usage analysis
§ Unexpected use: Instructors often want to find all their own posts so they can save / archive common answers
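The scatter-gather problem mentioned above can be sketched in a few lines of Python. This is only an illustration: the shards here are in-memory lists standing in for the per-course databases, the relevance score is a naive occurrence count, and the function names are made up for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term, limit):
    """Return the top `limit` (score, post_id) pairs from one shard."""
    term = term.lower()
    hits = []
    for post in shard:
        score = post["body"].lower().count(term)  # naive relevance: occurrences
        if score:
            hits.append((score, post["post_id"]))
    return heapq.nlargest(limit, hits)


def scatter_gather(shards, term, limit=10):
    """Query every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Scatter: one query per shard.
        partials = pool.map(lambda s: search_shard(s, term, limit), shards)
        # Gather: flatten the per-shard results into one candidate list.
        merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(limit, merged)
```

Every search has to touch every shard and wait for the slowest one before it can merge, which is why a single index covering all forums is so much simpler for cross-forum queries.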
9. CloudSearch Scale
• Moderate scale
• ~1.5 million documents indexed
o All forum posts and comments
• 50,000+ searches a day
o Spiky! Depends on when homework is due.
11. We Want...
• "Did you mean..."
o Lots of typos from non-native speakers
• Multilingual Tokenization / Search
o We are starting to run courses in other languages...
• Find Similar Documents
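To make the "Did you mean..." request concrete, here is a rough application-side sketch in Python using the standard library's `difflib`. It is not a CloudSearch feature and not something the slides describe building; the vocabulary list, the function name, and the 0.75 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib


def did_you_mean(query, vocabulary, cutoff=0.75):
    """Suggest a corrected query, or None if no word needs changing."""
    known = set(vocabulary)
    words = query.lower().split()
    corrected = []
    for word in words:
        if word in known:
            corrected.append(word)
            continue
        # Closest known word by difflib's similarity ratio, if any.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected) if corrected != words else None
```

In practice the vocabulary could be drawn from terms that already appear in indexed posts, so suggestions only ever point at queries that return results.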