Kishore Jalleda's presentation on using Nagios in a continuous development environment.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
3. About IMVU
Avatar based Social Entertainment destination
$50+ Million Annual Revenue
100+ Million Registered Users
10+ Million Items in Virtual Catalog
2012 3
4. IMVU Engineering and Continuous Deployment
►Doing the Impossible 50 times a day
►Continuous deployment (CD) is real
►IMVU has been one of the pioneers of CD
►DevOps culture is big
►No approval needed to ship to 1% of customers
Check out our engineering blog
http://engineering.imvu.com/
5. What does this mean ?
►Things change quickly
►New features add up instantly
►Can break frequently
►Failures can cascade rapidly
►Things can fall through the cracks
►Many things change at the same time
►Etc
8. Server Lifecycle Management
Purchase & Asset Management → DHCP, Preseed, DNS → CFEngine → Opspush → Nagios, Cacti, Istatd → CFEngine → Production → Decommission
9. [ Operations ] Continuous Integration and Deployment
10. IMVU Asset Database ( AssetDB )
►Built internally by IMVU
►Simple but powerful concept
►Source of truth for everything asset related
►Has information on
    ►Class ( mysql, standard-http-server, redis )
    ►Role ( customer shard, clientdynweb )
    ►Tag ( available, no-update )
    ►Attributes ( cpu-cores, memory-size, mysql-role )
    ►Much more …
11. Auto generation of Nagios configuration files
#generate_nagios_conf.pl
( most configurations auto generated from AssetDB )
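As a rough illustration of generating Nagios configuration from an asset database, here is a minimal Python sketch. It is not the generate_nagios_conf.pl script from the talk: the record fields, the `generic-host` template name, the mapping of asset class to hostgroup, and the rule of skipping assets not tagged `available` are all assumptions made for the example.

```python
# Hypothetical asset records; the real AssetDB schema is not shown in the talk.
ASSETS = [
    {"name": "db01", "address": "10.0.0.11", "class": "mysql", "tags": ["available"]},
    {"name": "web01", "address": "10.0.0.21", "class": "standard-http-server", "tags": ["available"]},
    {"name": "web02", "address": "10.0.0.22", "class": "standard-http-server", "tags": ["no-update"]},
]


def render_host(asset):
    """Render one Nagios host object definition for an asset record."""
    lines = [
        "define host {",
        "    use         generic-host",  # assumed template name
        f"    host_name   {asset['name']}",
        f"    address     {asset['address']}",
        f"    hostgroups  {asset['class']}",  # assumption: class maps to a hostgroup
        "}",
    ]
    return "\n".join(lines)


def render_config(assets):
    """Render host definitions for every asset tagged 'available' (assumed rule)."""
    active = [a for a in assets if "available" in a["tags"]]
    active.sort(key=lambda a: a["name"])  # stable output keeps diffs small
    return "\n\n".join(render_host(a) for a in active) + "\n"
```

Calling `render_config(ASSETS)` returns the text of a host configuration file; sorting by name keeps the generated file stable between runs, which makes changes easy to review.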
13. Opspush ( Operations Push System )
# opspush --comment "xxxxxx" --role nagios
opspush → check status of "last build"
    green → run "cfagent -v" on the box
    red → --oncall-override ?
        yes → run "cfagent -v" on the box
        No → exit
Also shown on the chart: --use-last-green-rev
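The gate in the flowchart above can be sketched in a few lines of Python. This is only an illustration of the decision, not Opspush itself: the function name and the string statuses are hypothetical, and the `--use-last-green-rev` option is left out because the slide does not show how it fits into the flow.

```python
def should_push(last_build_status, oncall_override=False):
    """Decide whether a push may proceed (hypothetical helper).

    green -> push; red -> push only if the on-call engineer overrides.
    """
    if last_build_status == "green":
        return True
    if last_build_status == "red":
        return oncall_override
    raise ValueError(f"unknown build status: {last_build_status!r}")


def plan_push(last_build_status, oncall_override=False):
    """Return the command to run on the box, or None when the push exits."""
    if should_push(last_build_status, oncall_override):
        return ["cfagent", "-v"]  # the command named on the slide
    return None
```

Keeping the decision separate from the command execution makes the gate easy to test without touching a real machine.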
14. Product Development
Ideation, UI Design, Usability Testing, etc → Tech Design → Monitoring and Alerting Coverage.. ( Nagios ) → Production → Maintenance
17. Big Data / De-Sharding
► Data freshness is critical to help make the right business decisions
► Nagios used for ETL/DW status and error checking
► Nagios and Ops embeds can help empower your Data Infrastructure team
19. How we try to prevent and catch failures
Automated Local Acceptance Tests (CI) → Hypo Builds → Buildbot → Manual QA using roll-out → Cluster Immunity → Nagios → 3rd party like webmetrics, customers, etc
20. Cluster Immune System
Automated push monitoring and rollback!
Push to X% of servers → Monitor Critical Metrics
    Good → Push to rest → Monitor Critical Metrics
        Good → w00t!, my change is Live
        Bad → Auto Rollback
    Bad → Auto Rollback
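The push, monitor, and rollback loop described on this slide might look roughly like the following Python sketch. It is an illustration under stated assumptions, not IMVU's Cluster Immune System: `deploy`, `rollback`, and `metrics_ok` are hypothetical callbacks supplied by the caller, and the canary size is a plain fraction of the server list.

```python
def immune_push(servers, canary_fraction, deploy, rollback, metrics_ok):
    """Push to a canary slice, check metrics, then push to the rest.

    If the critical metrics look bad after either stage, every server
    pushed so far is rolled back in reverse order.
    """
    if not servers:
        return "live"
    canary_count = max(1, int(len(servers) * canary_fraction))
    canary, rest = servers[:canary_count], servers[canary_count:]

    pushed = []
    for stage in (canary, rest):
        for server in stage:
            deploy(server)
            pushed.append(server)
        if not metrics_ok():  # hypothetical check of critical metrics
            for server in reversed(pushed):
                rollback(server)
            return "rolled back"
    return "live"
```

A real system would watch the metrics for some time window after each stage; the single `metrics_ok()` call stands in for that monitoring period.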
22. Demystifying P1s ( Priority 1 )
P1: Priority 1 issue impacting live operations
Phases
► Identification (Nagios )
► Communication and Declaration
► Resolution
► Postmortem / 5 Whys / Root Cause Analysis
► P1 follow up
23. 5 Why / Postmortem (PM) / Root Cause Analysis
► 5 Why process
► Amazing culture of running blameless postmortems
► New Nagios checks are the most common action items.
► A lot of monitoring and alerting on business and application level metrics was originally the outcome of PMs
27. Continuous Monitoring ( Istatd )
► Developed by IMVU
► Sub 10 sec resolution of data
► API to get average, SD, min, max, sample count for each data point in a graph
► Ability to stack multiple graphs on the fly
► Long retention times
► Releasing as open source this week !!!
https://github.com/imvu-open/istatd/wiki
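To make the per-data-point statistics concrete, here is a small Python sketch that groups raw samples into fixed time buckets and computes the average, standard deviation, min, max, and sample count for each bucket. It does not use or reproduce the Istatd API; the function name, the 5-second bucket size, and the use of the population standard deviation are assumptions for illustration.

```python
import math
from collections import defaultdict


def summarize(samples, bucket_seconds=5):
    """Group (timestamp, value) samples into buckets and compute stats."""
    buckets = defaultdict(list)
    for ts, value in samples:
        start = int(ts // bucket_seconds) * bucket_seconds
        buckets[start].append(value)

    summary = {}
    for start, values in sorted(buckets.items()):
        count = len(values)
        avg = sum(values) / count
        # Population variance: divide by count, not count - 1.
        variance = sum((v - avg) ** 2 for v in values) / count
        summary[start] = {
            "count": count,
            "avg": avg,
            "sd": math.sqrt(variance),
            "min": min(values),
            "max": max(values),
        }
    return summary
```

Storing these five numbers per bucket, rather than only an average, lets a graph show spread and outliers even after the raw samples are gone.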
31. Our (Nagios) Strategy
► Human element of Monitoring and Alerting ( Nagios )
► Nagios & Test Driven Development ( TDD )
► Decouple ( Nagios )
► Aggregated Checks
32. Human Element of Monitoring and Alerting
► Have zero tolerance towards false positives. You do not want your ops staff to walk into the office the next morning looking like zombies ;)
► Do not let people develop immunity to pages, or real issues will soon be ignored
► "All pages are actionable" policy: if there is no action to take, it should not page
► Automatically re-enable alerting/notifications that were improperly silenced
► Ownership and accountability of issues/alerts
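One way to approach the re-enabling of improperly silenced alerts is to audit the list of silences on a schedule. The Python sketch below is an assumption-heavy illustration: the talk does not define "improperly silenced", so the rule used here (no owner, no expiry, or an expiry that has already passed) and the record fields are hypothetical.

```python
def improperly_silenced(silences, now):
    """Return hosts whose silence should be lifted (assumed policy).

    A silence is flagged when it has no owner, no expiry time,
    or an expiry time that is already in the past.
    """
    flagged = []
    for silence in silences:
        missing_owner = not silence.get("owner")
        missing_expiry = silence.get("expires_at") is None
        expired = not missing_expiry and silence["expires_at"] <= now
        if missing_owner or missing_expiry or expired:
            flagged.append(silence["host"])
    return flagged
```

A scheduled job could feed the returned hosts to whatever mechanism turns notifications back on, and report the owner so the accountability bullet above is covered as well.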
34. Nagios & Test Driven Development (TDD)
► Write tests for your Nagios Infrastructure
► Adopted heavily by Ops ( important to keep pace with eng, DevOps culture is awesome )
► High degree of confidence in pushing changes
► Things will eventually change ( OS, libraries, logic, people, Nagios version, etc ). Tests will make the change much smoother.
► Functional testing can still be a challenge
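As a small example of writing tests for Nagios infrastructure, the Python sketch below unit-tests the threshold logic of a check. The check itself (`check_replication_lag`) and its thresholds are hypothetical and not from the talk; the numeric return codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import unittest

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_replication_lag(lag_seconds, warn=30, crit=120):
    """Hypothetical check: map a replication lag to a Nagios status."""
    if lag_seconds is None:
        return UNKNOWN, "UNKNOWN - replication lag not available"
    if lag_seconds >= crit:
        return CRITICAL, f"CRITICAL - lag is {lag_seconds}s"
    if lag_seconds >= warn:
        return WARNING, f"WARNING - lag is {lag_seconds}s"
    return OK, f"OK - lag is {lag_seconds}s"


class CheckReplicationLagTest(unittest.TestCase):
    def test_ok_below_warning_threshold(self):
        self.assertEqual(check_replication_lag(5)[0], OK)

    def test_warning_at_threshold(self):
        self.assertEqual(check_replication_lag(30)[0], WARNING)

    def test_critical_at_threshold(self):
        self.assertEqual(check_replication_lag(120)[0], CRITICAL)

    def test_unknown_when_no_data(self):
        code, message = check_replication_lag(None)
        self.assertEqual(code, UNKNOWN)
        self.assertTrue(message.startswith("UNKNOWN"))
```

Saved as a module, the tests run with `python -m unittest`. Keeping the threshold logic in a pure function is what makes it testable; exercising the check against a live service is the harder functional-testing part.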
36. Decouple Nagios
We do it using “Fact, Worker, Reporter & Aggregator” Model
[Diagram: Worker, Redis, Reporter and Aggregator, connected by flows labeled "fact" and "fact status"]
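The slide gives only the component names, so the following Python sketch is one possible reading of the model, not IMVU's implementation: a worker publishes raw facts to a shared store, and a reporter turns a fact into a status using thresholds kept in one place. A plain dict stands in for Redis, and all function names, the `disk_used_pct` fact, and the threshold values are hypothetical.

```python
# A plain dict stands in for Redis in this sketch.
FACT_STORE = {}

# Thresholds live in one central place, not on each monitored machine.
THRESHOLDS = {"disk_used_pct": {"warn": 80, "crit": 90}}


def worker_publish(store, host, fact, value):
    """Worker: record a raw fact for a host, with no judgment attached."""
    store[(host, fact)] = value


def reporter_status(store, host, fact, thresholds):
    """Reporter: turn a stored fact into a fact status."""
    value = store.get((host, fact))
    if value is None:
        return "UNKNOWN"
    limits = thresholds[fact]
    if value >= limits["crit"]:
        return "CRITICAL"
    if value >= limits["warn"]:
        return "WARNING"
    return "OK"
```

For example, after `worker_publish(FACT_STORE, "web01", "disk_used_pct", 85)`, calling `reporter_status(FACT_STORE, "web01", "disk_used_pct", THRESHOLDS)` yields `"WARNING"`. Changing a threshold means editing `THRESHOLDS` once; the workers are untouched.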
37. Why Decouple ?
► For scalability and efficiency
► Our model performed better than NRPE
► Lets you make changes ( like thresholds ) in one place instead of on 1000 machines ( if using NRPE )
► Lets you do aggregated checks, which is again a very simple but powerful concept to reduce paging levels by a ton
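An aggregated check can be sketched as a single status computed over many per-host statuses, so that one failing machine in a large pool does not page anyone. The Python below is a minimal illustration; the function name and the 10% and 25% fractions are assumptions, not values from the talk.

```python
def aggregate_status(statuses, warn_fraction=0.10, crit_fraction=0.25):
    """Collapse per-host statuses into one status for the whole pool.

    statuses maps host name -> "OK" / "WARNING" / "CRITICAL" / "UNKNOWN".
    Any host that is not "OK" counts as failing.
    """
    if not statuses:
        return "UNKNOWN"
    failing = sum(1 for status in statuses.values() if status != "OK")
    fraction = failing / len(statuses)
    if fraction >= crit_fraction:
        return "CRITICAL"
    if fraction >= warn_fraction:
        return "WARNING"
    return "OK"
```

With this shape, only the aggregate is wired to paging, while the per-host statuses remain available for dashboards and debugging.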
39. Closing Remarks
► Monitoring and Alerting (M&A) is mission critical for any business, invest properly and smartly in it
► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless
► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance
► Build some form of predictive monitoring and alerting to catch and alert on change in trends
► Invest in configuration automation, validation and compliance
► Finally, Nagios has been like a Honda, very reliable !!!