6. •
Check the dashboard; everything looks good.
•
Start work: write scripts or configurations.
•
Suddenly, an alert arrives by SMS/email, or CS
reports a problem.
•
Start working on the event/problem/outage.
7. You are the Fireman
http://www.flickr.com/photos/40699207@N05/3838012090/
8. Find the problem
•
Take a look at the dashboard, Nagios, and the monitors.
•
Grep logs from hundreds of hosts.
•
Study the network diagram.
•
Guess what is going wrong.
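A minimal sketch of the "grep logs from hundreds of hosts" step, assuming the log lines have already been collected into memory per host (in practice they would be fetched over SSH or from a log shipper). The names `grep_host`, `grep_all`, and the sample host names are hypothetical, used only for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def grep_host(host, lines, pattern):
    """Return (host, matching lines) for one host's log lines."""
    rx = re.compile(pattern)
    return host, [ln for ln in lines if rx.search(ln)]


def grep_all(logs_by_host, pattern, workers=16):
    """Search every host's log in parallel; keep only hosts with matches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda item: grep_host(item[0], item[1], pattern),
            logs_by_host.items(),
        )
        return {host: hits for host, hits in results if hits}


# Hypothetical sample data standing in for logs pulled from real hosts.
sample = {
    "web01": ["GET / 200", "GET /api 500"],
    "web02": ["GET / 200"],
    "web03": ["POST /login 503"],
}
matches = grep_all(sample, r" 5\d\d$")
for host, hits in matches.items():
    print(host, hits)
```

Hosts with no matching lines are dropped from the result, so the output points straight at the machines worth looking at.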
15. Process logs
•
Real-time or near-real-time processing brings a big benefit.
•
You can't afford to waste an hour when a problem really happens.
•
You have to notice the problem before too many users
complain.
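One way to "notice the problem" in near real time is a sliding-window error counter fed line by line as logs arrive. The sketch below is only an illustration: the class name `ErrorRateWindow`, the 60-second window, and the threshold of 3 are assumptions, not values from the talk.

```python
from collections import deque


class ErrorRateWindow:
    """Count errors seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds=60, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.errors = deque()  # timestamps of recent errors, oldest first

    def feed(self, timestamp, is_error):
        """Record one event; return True if the alert threshold is reached."""
        if is_error:
            self.errors.append(timestamp)
        # Drop errors that have fallen out of the window.
        while self.errors and self.errors[0] <= timestamp - self.window_seconds:
            self.errors.popleft()
        return len(self.errors) >= self.threshold


# Hypothetical event stream: (seconds since start, was it an error?)
events = [(0, True), (10, False), (20, True), (30, True), (100, False)]
window = ErrorRateWindow(window_seconds=60, threshold=3)
alerts = [window.feed(ts, err) for ts, err in events]
print(alerts)
```

Because each event is handled as it arrives, the alert can fire within seconds of the third error instead of waiting for an hourly batch job.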
19. Performance Measurement
•
How fast is our website for end users?
•
Where do they come from?
•
Which datacenter do they reach?
•
What is the ratio of slow to fast users?
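The questions above can be answered from per-request timing samples. This is a rough sketch, assuming each sample is a `(datacenter, country, load_time_ms)` tuple and that "slow" means more than 3000 ms; the threshold, the function name `slow_fast_ratio`, and the sample data are all hypothetical.

```python
from collections import defaultdict

SLOW_THRESHOLD_MS = 3000  # assumed cut-off between "fast" and "slow"


def slow_fast_ratio(samples, threshold_ms=SLOW_THRESHOLD_MS):
    """Return {datacenter: (slow_count, fast_count, slow_fraction)}."""
    counts = defaultdict(lambda: [0, 0])  # [slow, fast] per datacenter
    for datacenter, _country, load_ms in samples:
        index = 0 if load_ms > threshold_ms else 1
        counts[datacenter][index] += 1
    return {
        dc: (slow, fast, slow / (slow + fast))
        for dc, (slow, fast) in counts.items()
    }


# Hypothetical measurements: (datacenter, visitor country, page load in ms)
samples = [
    ("tw1", "TW", 800),
    ("tw1", "TW", 3500),
    ("us1", "US", 1200),
    ("us1", "JP", 4200),
    ("us1", "US", 900),
    ("tw1", "HK", 1000),
]
report = slow_fast_ratio(samples)
for dc, (slow, fast, fraction) in report.items():
    print(f"{dc}: slow={slow} fast={fast} slow_fraction={fraction:.2f}")
```

Grouping by the country field instead of the datacenter field would answer "where do they come from?" with the same code.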
21. Change/Release log
•
Many problems come with a change or release.
•
You have to watch the data after you make a change
or release.
•
The change/release log has to be visible on the dashboard.
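To illustrate why the change/release log belongs next to the alerts, here is a small sketch that lists the changes made shortly before an alert. The record layout, the one-hour lookback, the function name `recent_changes`, and every entry in the sample log are assumptions made up for this example.

```python
from datetime import datetime, timedelta


def recent_changes(change_log, alert_time, lookback=timedelta(hours=1)):
    """Return changes made within `lookback` before the alert, newest first."""
    hits = [
        change for change in change_log
        if alert_time - lookback <= change["time"] <= alert_time
    ]
    return sorted(hits, key=lambda change: change["time"], reverse=True)


# Hypothetical change/release log entries.
change_log = [
    {"time": datetime(2012, 1, 1, 9, 0), "what": "deploy web v1.2"},
    {"time": datetime(2012, 1, 1, 13, 40), "what": "update nginx config"},
    {"time": datetime(2012, 1, 1, 13, 55), "what": "deploy api v3.1"},
]
alert_time = datetime(2012, 1, 1, 14, 5)
suspects = recent_changes(change_log, alert_time)
for change in suspects:
    print(change["time"].isoformat(), change["what"])
```

Showing this list beside the alert gives the person on call the most likely suspects first.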