How do you determine whether your MongoDB Atlas cluster is over-provisioned, whether the new feature in your next application release will crush your cluster, or when to increase cluster size based upon planned usage growth? MongoDB Atlas provides over a hundred metrics enabling visibility into the inner workings of MongoDB performance, but how do you apply all this information to make capacity planning decisions? This presentation will enable you to effectively analyze your MongoDB performance to optimize your MongoDB Atlas spend and ensure smooth application operation into the future.
4. Braze empowers you to humanize your brand–customer relationships at scale.
• Tens of billions of messages sent monthly
• Global customer presence on six continents
• More than 1 billion MAU
7. How Does It All Work?
•Push, email, in-app messaging, and more for our customers
•Integration via an SDK and REST API
•Real-time audience segmentation
•Batch, event-driven, and transactional API messaging
8. What does this look like at scale?
•Nearly 11 billion user profiles
  •Our customers’ end users
•Over 8 billion Sidekiq jobs per day
  •Segmentation, messaging, analytics, data processing
•Over 6 billion API calls per day
  •User activity, transactional messaging
•Over 350k MongoDB IOPS across clusters
  •Powered by over 1,200 MongoDB shards, 65 different MongoDB clusters
9. Table of Contents
Frequency Capping
What is it? How does it work at Braze?
The Original Design
How did it originally work? What were the issues?
Redesign using the Aggregation Pipeline
What does the new solution look like? Why is it better?
Looking at the Results
Did it really improve performance? What’s next?
30. [Diagram: the original Frequency Capping Algorithm. A MongoDB query on “Users” transfers eligible users (in batches) from MongoDB to the Sidekiq worker; for each rule (for each user), the worker counts campaigns, checks the rule, and removes ineligible campaigns.]
31. [Diagram: the same flow as slide 30, with its output highlighted — the non-frequency-capped users.]
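The original flow can be sketched in Ruby as follows. This is a simplified reconstruction, not Braze's actual code; the `Rule` struct and the profile shape are illustrative assumptions.

```ruby
# A simplified reconstruction of the ORIGINAL algorithm (v1): load user
# profiles, then for each user and each frequency-capping rule, count
# campaign sends inside the rule's time window and drop users that exceed
# the cap. Rule/profile shapes here are illustrative.
Rule = Struct.new(:channel, :window_seconds, :max_sends)

def frequency_capped?(user_profile, rules, now: Time.now.utc)
  summaries = user_profile.fetch(:campaign_summaries, {})
  rules.any? do |rule|
    cutoff = now - rule.window_seconds
    recent_sends = summaries.values.count do |summary|
      ts = summary[:last_received]
      ts && ts >= cutoff
    end
    recent_sends >= rule.max_sends
  end
end

def eligible_users(batch, rules)
  batch.reject { |user| frequency_capped?(user, rules) }
end
```

Note that this per-user, in-worker counting is exactly what forces the full profiles over the network in the first place.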
33. Frequency Capping Problems
• User profiles can be HUGE
• 16 MB max doc size + batch processing
• Network IO & RAM usage
• Not particularly fast…
[Flame graph of the Sidekiq frequency-capping job — time is mostly spent waiting on queries!]
35. Frequency Capping Problems
• User profiles can be HUGE
• 16 MB max doc size + batch processing
• Network IO & RAM usage
• Not particularly fast…
• What about the same campaign sent twice?
• “Last received” timestamps alone aren’t enough data
campaign_summaries: {
  "Coffee Addict Promo": {
    last_received: Date('2019-06-01T12:00:03Z'),
    last_opened_email: Date('2019-06-01T12:03:19Z')
  }
}
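The "same campaign sent twice" problem can be shown in a few lines of Ruby. This is an illustration of the data-model limitation, not production code:

```ruby
# Why "last_received" alone is not enough: a hash keyed by campaign keeps
# only ONE timestamp per campaign, so a second send of the same campaign
# overwrites the first and the in-window count is under-reported.
summaries = {}
2.times do |i|
  summaries["Coffee Addict Promo"] = { last_received: Time.utc(2019, 6, 1, 12, i, 0) }
end
counted_from_summaries = summaries.size  # only 1, though 2 messages were sent

# Keeping one record per send (as the redesign does) preserves the true
# count for any window.
received = Array.new(2) do |i|
  { campaign: "Coffee Addict Promo", date: Time.utc(2019, 6, 1, 12, i, 0) }
end
counted_from_records = received.size     # 2
```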
37. Micro-optimizations
• What if we limit what parts of the user profile document we bring back?
• We have aggregate stats, so we know when certain campaigns were sent
Optimization Attempt #1
38. Micro-optimizations
• What if we limit what parts of the user profile document we bring back?
• We have aggregate stats, so we know when certain campaigns were sent
• However…
  • What if the frequency capping window is fairly large?
  • What if the customer has hundreds of millions of users?
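One way to sketch this micro-optimization is a projection document that asks MongoDB to return only the summary fields the capping check needs, rather than the full profile. The field names are illustrative assumptions, not Braze's schema:

```ruby
# Build a projection limiting the returned document to the per-campaign
# "last_received" fields (plus _id), instead of the full user profile.
# Field names are illustrative.
def capping_projection(campaign_names)
  projection = { "_id" => 1 }
  campaign_names.each do |name|
    projection["campaign_summaries.#{name}.last_received"] = 1
  end
  projection
end

# With the Ruby driver this would be applied as (not executed here):
#   users.find(query).projection(capping_projection(campaign_names))
```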
43. Redesign Goals
• Less network IO
  • Expensive!
• Less RAM usage
  • For huge campaigns, occasional OOM errors
[Screenshot: OOM errors in the server logs]
44. Redesign Goals
• Less network IO
  • Expensive!
• Less RAM usage
  • For huge campaigns, occasional OOM errors
• Much faster execution
  • Micro-optimizations are only going to go so far
46. User Collection Example
{
_id: 123,
first_name: "Zach",
last_name: "McCormick",
email: "zach.mccormick@braze.com",
custom: {
twitter_handle: "zachmccormick",
favorite_food: "Greek",
loves_coffee: true
},
campaign_summaries: {
"Coffee Addict Promo": {
last_received: Date('2019-06-01T12:00:03Z'),
last_opened_email: Date('2019-06-01T12:03:19Z')
}
}
}
Campaign Summaries use a hash, not an array
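The hash layout matters for writes: recording a send is a single dot-notation `$set`, with no array scan or positional operator. A sketch, with illustrative field names:

```ruby
# Because campaign_summaries is a hash keyed by campaign name, recording a
# send updates exactly one nested field via dot notation.
def record_send_update(campaign_name, sent_at)
  { "$set" => { "campaign_summaries.#{campaign_name}.last_received" => sent_at } }
end

# Driver usage (not executed here):
#   users.update_one({ _id: user_id },
#                    record_send_update("Coffee Addict Promo", Time.now.utc))
```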
47. What about a new supplementary document?
• We don’t want to store more data on User profiles – already too big in some cases
48. What about a new supplementary document?
• We don’t want to store more data on User profiles – already too big in some cases
• What if this new collection holds arrays of received campaigns
• We can use $slice to keep the arrays reasonably sized
• We can use the same IDs as User profiles to shard efficiently
49. What about a new supplementary document?
• We don’t want to store more data on User profiles – already too big in some cases
• What if this new collection holds arrays of received campaigns
• We can use $slice to keep the arrays reasonably sized
• We can use the same IDs as User profiles to shard efficiently
• What would that look like?
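The `$slice` idea above can be sketched as an update document using MongoDB's `$push` with `$each` and `$slice`. The field name and cap are illustrative assumptions:

```ruby
# Append send records to the supplementary document's per-channel array,
# keeping only the newest MAX_EVENTS entries so the array stays bounded.
MAX_EVENTS = 100

def append_sends_update(channel_field, records)
  { "$push" => {
      channel_field => {
        "$each"  => records,
        "$slice" => -MAX_EVENTS  # negative slice keeps the last N elements
      } } }
end
```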
53. NEW Frequency Capping Algorithm
1. Match stage
2. First projection using $filter
1. Only look at the relevant time window
2. Don’t include the current dispatch (for
multi-channel sends)
3. Exclude campaigns that don’t count
toward frequency capping
Resulting document:
{
  "Zach": {
    "email_86400": [
      {
        "dispatch_id": …,
        "date": …,
        "campaign": …
      },
      …
    ]
  }
}
54. NEW Frequency Capping Algorithm
1. Match stage
2. First projection using $filter
1. Only look at the relevant time window
2. Don’t include the current dispatch (for
multi-channel sends)
3. Exclude campaigns that don’t count
toward frequency capping
3. Second projection
1. Only bring back dispatch IDs
Resulting document:
{
  "Zach": {
    "email_86400": [
      "campaign-a-dispatch-id",
      "campaign-b-dispatch-id"
    ]
  }
}
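The three stages above can be assembled into an aggregation pipeline roughly like this (string-keyed form for the Ruby driver). This is a sketch following the field names on the slides, not Braze's production pipeline:

```ruby
def capping_pipeline(user_ids, window_start, excluded_campaigns)
  # Stage 1: match only the users in this batch.
  match = { "$match" => { "_id" => { "$in" => user_ids } } }

  # Stage 2: $filter each event array down to the relevant time window,
  # excluding campaigns that don't count toward frequency capping.
  filtered = { "$project" => {
    "email_86400" => { "$filter" => {
      "input" => "$email_received",
      "cond"  => { "$and" => [
        { "$gte" => ["$$this.date", window_start] },
        { "$not" => [{ "$in" => ["$$this.campaign", excluded_campaigns] }] }
      ] }
    } }
  } }

  # Stage 3: reduce each surviving event to just its dispatch ID.
  ids_only = { "$project" => { "email_86400" => "$email_86400.dispatch_id" } }

  [match, filtered, ids_only]
end

# Driver usage (not executed here):
#   user_campaign_interaction_data.aggregate(
#     capping_pipeline(ids, Time.utc(2019, 6, 9, 12), txn_campaigns))
```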
55. UserCampaignInteractionData Query Example
first_projection["email_86400"] = {
  :$filter => {
    :input => "$email_received",
    :cond => {
      :$and => [
        # first make sure the tuple we care about is within the rule's time window
        {:$gte => [
          "$$this.date", Time.utc(2019, 6, 9, 12, 0, 0)
        ]},
        # next make sure we don't include transactional messages
        {:$not => [{:$in => [
          "$$this.campaign", ["Txn Message One", "Txn Message Two"]
        ]}]}
      ]
    }
  }
}
58. Frequency Capping – Network Bandwidth
[Diagram: Version 1 vs. Version 2 — version 1 transfers full user profiles from MongoDB to Sidekiq; version 2 transfers only dispatch IDs.]
63. Deployment Strategies
• All functionality behind a feature flipper
• Easy to turn on/off by customer
• Lots of excess code
• Feature flipper logic is simple – use class X or class Y
64. Deployment Strategies
• All functionality behind a feature flipper
• Easy to turn on/off by customer
• Lots of excess code
• Feature flipper logic is simple – use class X or class Y
• Feature flipped on slowly
• Hourly and daily check-ins on Datadog
• Minimize impact if something goes wrong
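A "class X or class Y" flipper can be as simple as the sketch below. Class names and the customer allowlist are illustrative, not Braze's implementation:

```ruby
require "set"

# Minimal per-customer feature flipper: pick the old or the new
# implementation at runtime so the rollout can be widened (or reverted)
# without a deploy.
class LegacyFrequencyCapper
  def name; "v1"; end
end

class AggregationFrequencyCapper
  def name; "v2"; end
end

FLIPPED_ON = Set.new(["customer-123"])

def capper_for(customer_id)
  if FLIPPED_ON.include?(customer_id)
    AggregationFrequencyCapper.new
  else
    LegacyFrequencyCapper.new
  end
end
```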
66. Frequency Capping by Tag
first_projection["email_marketing_86400"] = {
  :$filter => {
    :input => "$email_received",
    :cond => {
      :$and => [ …,
        # only include campaigns tagged "marketing"
        {:$in => [
          "$$this.campaign", ["July 4 Promo", "Memorial Day Sale", …]
        ]}
      ]
    }
  }
}
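Since each tag gets its own filtered array, the per-tag projection keys can be generated rather than hand-written. A sketch; the tag-to-campaign mapping and the field naming scheme are illustrative assumptions:

```ruby
# Build one $filter projection per tag (e.g. "email_marketing_86400"),
# so each tag's sends can be counted against its own cap.
def tag_projections(channel, window_seconds, campaigns_by_tag)
  campaigns_by_tag.each_with_object({}) do |(tag, campaigns), proj|
    proj["#{channel}_#{tag}_#{window_seconds}"] = { "$filter" => {
      "input" => "$#{channel}_received",
      "cond"  => { "$in" => ["$$this.campaign", campaigns] }
    } }
  end
end
```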
67. What else?
• Set the foundation for future expectations
• Customers are always going to want to send messages
  • Faster and faster
  • With more detailed segmentation
  • With more complex inclusion/exclusion rules