2. Thursday, July 7, 2011 2
APIdock.com is one of the services we’ve created for the Ruby community: a social
documentation site.
3.
- We did some “research” on the real-time web back in 2008.
- At the same time, we did software consulting for large companies.
- Flowdock is a product spin-off from our consulting company. It’s Google Wave done right, with a focus on technical teams.
4.
Flowdock combines a group chat (on the right) with a shared team inbox (on the left).
Our promise: Teams stay up-to-date, react in seconds instead of hours, and never forget
anything.
5.
Flowdock gets messages from various external sources (like JIRA, Twitter, Github, Pivotal
Tracker, emails, RSS feeds) and from the Flowdock users themselves.
6.
All of the highlighted areas are objects in the “messages” collection. MongoDB’s document
model is perfect for our use case, where various data formats (tweets, emails, ...) are stored
inside the same collection.
11. {
"_id":ObjectId("4de92cd0097580e29ca5b6c2"),
"id":NumberLong(45967),
"app":"chat",
"flow":"demo:demoflow",
"event":"comment",
"sent":NumberLong("1307126992832"),
"attachments":[
],
"_keywords":[
"good",
"point", ...
],
"uuid":"hC4-09hFcULvCyiU",
"user":"1",
"content":{
"text":"Good point, I'll mark it as deprecated.",
"title":"Updated JIRA integration API"
},
"tags":[
"influx:45958"
]
}
This is what a typical message looks like.
12. [Architecture diagram]
Browser: jQuery (+UI), Comet impl., MVC impl.
Rails app: Website, Admin, Payments, Account mgmt
Scala backend: Messages, Who’s online, API, RSS feeds, SMTP server, Twitter feed
Databases: PostgreSQL, MongoDB
An overview of the Flowdock architecture: most of the code is JavaScript and runs inside the
browser.
The Scala (+Akka) backend does all the heavy lifting (mostly related to messages and online
presence), and the Ruby on Rails application handles all the easy stuff (public website,
account management, administration, payments etc).
We used PostgreSQL in the beginning, and migrated messages to MongoDB. Otherwise there
is no particular reason why we couldn’t use MongoDB for everything.
13.
One of the key features in Flowdock is tagging. For example, if I’m doing some changes to
our production environment, I mention it in the chat and tag it as #production. Production
deployments are automatically tagged with the same tag, so I can easily get a log of
everything that’s happened.
It’s very easy to implement with MongoDB, since we just index the “tags” array and search
using it.
14. db.messages.ensureIndex({flow: 1, tags: 1, id: -1});
15. db.messages.ensureIndex({flow: 1, tags: 1, id: -1});
db.messages.find({flow: 123,
                  tags: {$all: ["production"]}})
  .sort({id: -1});
17. Library support
• Stemming
• Ranked probabilistic search
• Synonyms
• Spelling corrections
• Boolean, phrase, word proximity queries
These are some of the features you might see in an advanced full-text search
implementation. There are libraries to do some parts of this (like libraries specific to
stemming), and more advanced search libraries like Lucene and Xapian.
Lucene is a Java library (also ported to C++ etc.), and Xapian is a C++ library.
Many of these features can be approximated with MongoDB by expanding the query.
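As a sketch of what expanding the query can look like: the helpers below (hypothetical names, toy suffix list, not a real stemmer) turn each search word into a set of variants and AND the per-word $in clauses together.

```javascript
// Hypothetical sketch of query expansion: approximate stemming by
// expanding each search term into likely variants, then requiring a
// match for every term. The suffix list is a toy; a real stemmer
// (e.g. Porter) would be used in practice.
function expandTerm(term) {
  return ["", "s", "ed", "ing"].map(function (suffix) {
    return term + suffix;
  });
}

// Build a filter document for db.messages.find(): each word must match
// at least one of its variants in the _keywords array.
function buildFilter(flow, words) {
  return {
    flow: flow,
    $and: words.map(function (word) {
      return { _keywords: { $in: expandTerm(word) } };
    })
  };
}

var filter = buildFilter(123, ["deploy"]);
// filter.$and[0]._keywords.$in is ["deploy", "deploys", "deployed", "deploying"]
```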
18. Standalone servers compared:
• Lucene based · rich document support · result highlighting · distributed
• Lucene queries · REST/JSON API · real-time indexing · distributed searching
• MySQL integration · real-time indexing · distributed
You can use the libraries directly, but they don’t do anything to guarantee replication &
scaling.
Standalone implementations usually handle clustering, query processing and some more
advanced features.
19. Things to consider
• Data access patterns
• Technology stack
• Data duplication
• Use cases: need to search Word
documents? Need to support boolean
queries? ...
When choosing your solution, you’ll want to keep it simple: consider how write-heavy your app is, which special features you need, and whether you can afford to store the data three times in a MongoDB replica set plus twice in a search server.
20. Real-time search vs. performance
There are tons of use cases where search doesn’t need to be real-time. It’s a requirement
that will heavily impact your application.
21. KISS
As a lean startup, we can’t afford to spend a lot of time on technology adventures. We need to measure what customers want.
Many of the features are possible to achieve with MongoDB.
Facebook messages search also searches exact word matches (=it sucks), and people don’t
complain.
So we built a minimal implementation with MongoDB. No stemming or anything, just a
keyword search, but it needs to be real-time.
22. KISS
Even Facebook does.
23. “Good point. I’ll mark it as deprecated.”
_keywords: ["good", "point", "mark", "deprecated"]
You need client-side code for this transformation.
What’s possible: stemming, search by beginning of the word
What’s not possible: intelligent ranking on the DB side (which is ok for us, since we want to
sort results by time anyway)
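A minimal sketch of that client-side transformation (the stop word list here is illustrative, not our production list):

```javascript
// Sketch: turn message text into a _keywords array by lowercasing,
// stripping punctuation, and dropping stop words and duplicates.
// The stop word list is illustrative only.
var STOP_WORDS = { a: true, as: true, i: true, ill: true, it: true, the: true };

function extractKeywords(text) {
  var seen = {};
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")  // strip punctuation ("i'll" becomes "ill")
    .split(/\s+/)
    .filter(function (word) {
      if (!word || STOP_WORDS[word] || seen[word]) return false;
      seen[word] = true;
      return true;
    });
}

extractKeywords("Good point, I'll mark it as deprecated.");
// → ["good", "point", "mark", "deprecated"]
```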
24. db.messages.ensureIndex({
flow: 1,
_keywords: 1,
id: -1});
We simply build the _keywords index the same way we had already indexed our tags.
25. db.messages.find({
flow: 123,
_keywords: {
$all: ["hello", "world"]}
}).sort({id: -1});
Search is also trivial to implement. As said, our users want the messages in chronological
order, which makes this a lot easier.
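For example, paging further back through results can reuse the chronological id sort as a cursor instead of an increasingly expensive skip(). A sketch, with field names as in the earlier slides:

```javascript
// Sketch: page through search results chronologically by using the last
// seen message id as a cursor, instead of skip().
function searchPage(flow, words, beforeId, pageSize) {
  var filter = { flow: flow, _keywords: { $all: words } };
  if (beforeId !== null) {
    filter.id = { $lt: beforeId };  // only messages older than the last one shown
  }
  // Corresponds to: db.messages.find(filter).sort({id: -1}).limit(pageSize)
  return { filter: filter, sort: { id: -1 }, limit: pageSize };
}
```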
26. That’s it! Let’s take it to production.
A minimal search implementation is the easy part. We faced quite a few operational issues
when deploying it to production.
27. Index size:
2500 MB per 1M messages
As it turns out, the _keywords index is pretty big: roughly 2.5 KB of index per message.
28. [Chart: 10M messages, sizes in gigabytes: messages data, keywords index, tags index, other indices]
It would be great to fit the indices in memory; at this size they obviously don’t fit. Stemming would reduce the index size. This has implications for insert/update performance, among other things.
30. Option #1:
Just generate _keywords and build
the index in background.
The naive solution: try to do it with no downtime. Didn’t work, site slowed down too much.
31. Option #2:
Try to do it during a 6 hour
service break.
It worked much faster when our users weren’t constantly accessing the data. But 6 hours
during a weekend wasn’t enough, and we had to cancel the migration.
32. Option #3:
Delete _keywords, build the index
and re-generate keywords in the background.
Generating an index is much faster when there is no data to index. Building the index was
fine, but generating keywords was very slow and took the site down.
33. Option #4:
As previously, but add sleep(5).
You know you’re a great programmer when you’re adding sleep()s to your production code.
34. Option #5:
As previously, but add Write Concerns.
Let the queries block, so that if MongoDB slows down, the migration script doesn’t flood the
server.
Yeah, it would’ve taken a month, or it would’ve slowed down the service.
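The idea can be sketched as a backfill loop where each write waits for the server’s acknowledgment, so a slow server throttles the migration instead of being flooded. Hypothetical helper names: `toKeywords` and `saveAcknowledged` stand in for the keyword extractor and a driver save with a write concern set.

```javascript
// Sketch of a self-throttling backfill: regenerate _keywords in batches,
// waiting for each write to be acknowledged (a write concern) before
// moving on. toKeywords and saveAcknowledged are hypothetical stand-ins.
function backfillKeywords(messages, toKeywords, saveAcknowledged, batchSize) {
  var migrated = 0;
  for (var i = 0; i < messages.length; i += batchSize) {
    messages.slice(i, i + batchSize).forEach(function (msg) {
      msg._keywords = toKeywords(msg.content.text);
      saveAcknowledged(msg);  // blocks until the server confirms the write
      migrated += 1;
    });
  }
  return migrated;
}
```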
35. Option #6:
Shard.
Would have been a solution, but we didn’t want to host all that data in-memory, since it’s not
accessed that often.
36. Option #7:
SSD!
We had the opportunity to try it on an SSD.
This is not a viable alternative for AWS users, but they could temporarily shard their data across a large number of high-memory instances.
39.
My reaction to using an SSD: I decided to benchmark it.
40. Test data set: 10M messages in 100 “flows”, 100k messages each; total size 19.67 GB.
Indices:
• _id: 1
• flow: 1, app: 1, id: -1
• flow: 1, event: 1, id: -1
• flow: 1, id: -1
• flow: 1, tags: 1, id: -1
• flow: 1, _keywords: 1, id: -1
Total index size 22.03 GB.
This is the starting point for my next benchmark. Wanted to test it with a real-size database,
instead of starting from scratch.
41. [Chart: mongorestore time in minutes, SSD vs. SATA]
First, I used mongorestore to populate the test database: 133 minutes on SSD vs. 235 minutes on SATA.
Index generation is mostly CPU-bound, so it doesn’t really benefit from the faster seek times.
42. Insert performance test
A total of 100 workspaces
And 3 workers each accessing 30 workspaces
Performing 1000 inserts to each
= 90 000 inserts, as quickly as possible
43. [Chart: insert benchmark time in minutes, SSD vs. SATA]
4.25 minutes vs. 155 minutes. That’s about 4 minutes vs. 2.5 hours.
44. 9.67 inserts/sec
vs.
352.94 inserts/sec
The same numbers expressed as inserts per second.
45. 36x
36x performance improvement with SSD. So we ended up using it in production.
46.
It works well, searches across all kinds of content (here, Git commit messages and deployment emails), and queries typically take at most tens of milliseconds.
47. Questions / Comments?
@flowdock / otto@flowdock.com
This was a very specific full-text search implementation. The fact that we didn’t need to rank
search results made it trivial.
I’m happy to discuss other use cases. Please share your thoughts and experiences.