The document describes the development of Triage, an open source tool for aggregating and analyzing errors from production systems. It discusses how the initial MapReduce approach did not scale and was replaced with an asynchronous API using message queues and MongoDB for aggregation. The outcomes achieved scalability but challenges remained around determining the importance of persistent errors. Future plans include easier installation, better language support, and integration with notification services.
11. Triage
• Improve signal-to-noise ratio by aggregating similar errors
• Allow for claiming, resolving and ranking errors in terms of importance
• Integration with GitHub, build tools
• Play with new tools and technology
• Provide an open source alternative to commercial products in this space
Sunday, 19 August 12
13. Round 1(Fight!)
• Errors continue to log directly to mongo
• Aggregation via incremental MapReduce
• Deliver a prototype in one day
15. Scalability Fatality!
• Worked fine during development
• Production load caused the MapReduce to asplode!
• (Not that we have a lot of errors, right?!)
17. (sub)zeroMQ
• Async error API using ZeroMQ pub/sub sockets
• MessagePack as error format (fast, binary)
• Aggregation in python
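The pipeline on this slide could look roughly like the sketch below. It assumes the `pyzmq` and `msgpack` packages; the helper names, the endpoint argument, and the choice to have the aggregator bind while applications connect are illustrative assumptions, not details taken from Triage itself.

```python
import msgpack
import zmq


def encode_error(error):
    """Serialise an error dict to MessagePack bytes."""
    return msgpack.packb(error, use_bin_type=True)


def decode_error(payload):
    """Turn MessagePack bytes back into an error dict."""
    return msgpack.unpackb(payload, raw=False)


def make_publisher(context, endpoint):
    """Application side: a PUB socket that connects to the aggregator."""
    socket = context.socket(zmq.PUB)
    socket.connect(endpoint)
    return socket


def make_subscriber(context, endpoint):
    """Aggregator side: a SUB socket that binds and accepts every message."""
    socket = context.socket(zmq.SUB)
    socket.setsockopt(zmq.SUBSCRIBE, b"")  # empty prefix = subscribe to all
    socket.bind(endpoint)
    return socket


def publish_error(socket, error):
    """Fire and forget: the application never waits on the aggregator."""
    socket.send(encode_error(error))


def consume_errors(socket, handle):
    """Blocking loop that passes each decoded error to `handle`."""
    while True:
        handle(decode_error(socket.recv()))
```

An application would create a `zmq.Context()`, call `make_publisher` once, and then call `publish_error` from its error handler. Pub/sub drops messages when nobody is listening, which keeps the application from ever blocking on the aggregator.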
18. Aggregation Method
• Generate hash in Python based on error document
• Query Mongo for error hash
• Create or update error document based on outcome of query, incrementing counters etc. where appropriate
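The three steps above can be sketched as follows. This is a minimal illustration, not Triage's actual code: the grouping fields, the document layout, and the use of a plain dict in place of a Mongo collection are all assumptions made so the example runs on its own.

```python
import hashlib
import json

# Hypothetical choice of fields that decide whether two errors are "the same".
GROUPING_FIELDS = ("type", "message", "file", "line")


def error_hash(error):
    """Return a stable hex digest of the fields that define an error group."""
    key = {field: error.get(field) for field in GROUPING_FIELDS}
    payload = json.dumps(key, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def aggregate(store, error):
    """Create or update the aggregated document for one incoming error.

    `store` is a dict standing in for the Mongo collection, keyed by hash.
    """
    digest = error_hash(error)
    existing = store.get(digest)  # stands in for "query Mongo for error hash"
    if existing is None:
        # First sighting: create the aggregated document.
        store[digest] = {"hash": digest, "count": 1,
                         "first": error, "latest": error}
    else:
        # Seen before: bump the counter and keep the newest instance.
        existing["count"] += 1
        existing["latest"] = error
    return store[digest]
```

Which fields go into the hash decides how coarse the grouping is: leaving the timestamp out groups repeats together, while including the line number splits errors raised from different places.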
22. Scalability Fatality 2
• Multithreaded experiments
• Mongo optimisations
• There is no schema
• The cake is a lie
• Mongo ‘upsert’ rocks!
23. Updating like a boss
collection.update(criteria, document, upsert=False)
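With `upsert=True`, the separate query and create-or-update steps collapse into one call. The sketch below is one way that might look; it uses `update_one`, the current PyMongo replacement for the legacy `update` method shown above, and the field names (`hash`, `count`, `first`, `latest`) are hypothetical.

```python
def record_error(collection, error_hash, error):
    """Create or update an aggregated error document in one round trip.

    `collection` is expected to behave like a pymongo Collection.
    """
    criteria = {"hash": error_hash}
    update = {
        "$inc": {"count": 1},              # bump the occurrence counter
        "$set": {"latest": error},         # always keep the newest instance
        "$setOnInsert": {"first": error},  # written only when the doc is created
    }
    # upsert=True inserts a new document when nothing matches `criteria`.
    return collection.update_one(criteria, update, upsert=True)
```

Because the server performs the match and the write together, there is no window between a query and an insert in which two workers could both decide the error is new.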
30. Outcomes
• Getting the ‘right’ level of grouping is hard
• What to do with errors that just won't go away?
• Error occurrence count - what does this tell us?
31. Future
• Easier installation, package on PyPI
• Better language support (plz halp)
• Drop-in replacement for Airbrake etc.
• Client-side logging (JavaScript)
• Email style filters & actions - ifttt.com
32. Thanks
• 99designs for research and development time
• Contributors:
• Luke Cawood - Project lead
• Josh Benham - Developer
• Jamison Lu - Developer
• Additional assistance
• Lars Yencken - Operations
• 99designs UX team