Hello. I'm @zembutsu. I work at a server hosting company in Japan. I am a solutions engineer, mainly in charge of server and network operations.
My presentation is about Munin, a resource monitoring tool.
The original version (in Japanese) is here:
http://www.slideshare.net/zembutsu/practical-resource-monitoring-with-munin
Munin User Group Japan http://munin.jp/
Masahito Zembutsu @zembutsu
September 8, 2012 OpenSource Conference 2012 Tokyo/Fall, Japan (#osc12tk)
Practical resource monitoring with munin (English edition)
2. Nice to meet you. I’m @zembutsu.
Thank you for giving me the opportunity to present!
They are characters from the Touhou Project, and "Please take it easy!!" (yukkuri shite itte ne!) is a famous slang phrase in Japan.
http://en.wikipedia.org/wiki/Touhou_Project
7. Why Am I Here? this is me
• Masahito ZEMBUTSU @zembutsu
– Solutions Engineer ( fiery zeal Otaku mind engineer )
• Working as a server infrastructure engineer.
• I want to provide relaxation and rest for engineers. (Operation/Monitoring/Automation)
– Open source and cloud computing communities
• My website http://pocketstudio.jp/
– Experience http://opencloud.jp/ http://jaws-ug.jp/
• April 2000 - Support engineer of server hosting and the ISP
• May 2008 - Company internal network management and support
• November 2010 - Service development and upper escalation operation
Don’t mind the careful thing!
• July 2012 – Operation, Development, Research
at datacenter somewhere.
9. ―Don't forget. always, somewhere,
someone is fighting for you.
―As long as you remember her.
you are not alone.
Operation (Reference: “Puella Magi Madoka Magica” Episode 12 “My Very Best Friend” )
Monitoring
10. This is an illustrative photo of the kind of data center I work in.
This photo is by torkildr, under a Creative Commons license.
http://www.flickr.com/photos/torkildr/3462606643/sizes/l/in/photostream/
11. A Dedicated Hosting Services
A HUMAN WORK
Shutdown Attack, An Unfamiliar Specifications,
Cloudcomputing’s Arrival in Japan, Shape of Server,
A Business That’s Changing, My Purest Heart for Our Customers.
Troubleshooting
DECISIVE BATTLE
The Phone That Never Stop Ringing, The Day a Datacenter Stood Still,
The Choice of Priority, In sickness unto shutdown, and…, Sales
Representative’s Invasion, Customer’s office the Throne of Souls, Tears.
You’re a loser only when you fail to try
We can (not) advance.
The Birth of Special task force, The Value of Miracles, At Least, Be Human.
12. If you are a server administrator,
you have probably thought this at least once.
Perhaps…
13. My Little Servers Can't Be This Heavy...
But Munin may help
find a solution to the problem.
14. This is what I want to share today.
• I think it is necessary to build resource
monitoring into the operational workflow.
• As a result, it may reduce the burden on
administrators. I'm extremely happy. XD
• What we need is a culture of leaving the office on time!!
(Is it only me?)
15. Agenda
1. What is Munin?
2. Munin’s Architecture
3. How to use Munin
4. Practical troubleshooting!
5. MY VERY BEST MONITORING TOOL
16. I hope…
1. Let's obtain a weapon
called “resource monitoring” for us.
Wille zur Macht
2. Let's improve the efficiency of
our work (server and network operations).
“Let's find happiness together.”
(Reference: Kiichi Goto, Patlabor: The Movie, 1989) I guess everybody's happy, that's fine.
18. Munin.jp
• Munin User Group Japan
– http://munin.jp/
• Wiki
– http://munin.jp/wiki/
• Demo
– http://demo.munin.jp/
• How to join us
– http://munin.jp/mailing-list/
26. By the Time We Realized It, It Had Already Begun.
• Troubles that alert systems can’t detect (increasing)
– Mainly among customers running social networking services
– By the time an alert threshold is exceeded, it is already too late.
• What customers demand: rapid response
– Because the loss per second is orders of magnitude larger
than before.
– A loss of several hundred dollars per minute :(
27. “Something is weird, will you check the servers? :)”
A request from one of our customers
• A very difficult request...
– Clearly identifying the cause often takes time.
• I want to do even better!
– Yes!! I rouse myself and get to work.
Administrators got exhausted…
– I want to improve the service, but this way of
thinking is bad. Why? Let’s see the next slide.
28. An old network configuration.
One web server and one database server.
It’s very simple!
29. An old network configuration.
For a typical web server, this was about as
complex as the configuration ever got.
BIND
One web server and one database server.
It’s very simple!
31. This Just Can't Be Right!!
BIND
The number of servers and the number of things to manage have increased compared with the past.
As a result, support takes more time and has become more difficult.
32. Why did this happen?
• On the changing environment
– Network
– Server
– Software
– Middleware
– Application
– etc
33. Be freed from CONSOLE
Ace Console: Fires of Liberation
34. The most important thing in troubleshooting
• Investigating the cause has top priority.
“When we act, noticing is the first precondition.
Having technique does not mean that anything can be settled.
Before technique, it is necessary to notice. There are any number
of technical experts in Japan, yet problems are not readily
settled. The reason is that they do not notice.”
Soichiro Honda (2008) "akku baran” (candidness ) PHP inc, 10pp.
http://en.wikipedia.org/wiki/Soichiro_Honda
35. You sure that’s enough armor(tools)?
• “No problem. Everything’s fine.”
– ps
– top
– vmstat
– iostat
– free
– sar (sysstat) …etc
Really?
37. Situation has changed
Past
• One or several servers
• Apache, Sendmail, Perl
• PostgreSQL, MySQL
• Network appliance (sometimes)
• No scale
• Upgrading is effective
Now (present day present time, hahaha!!)
• Plural servers in the same network (we assume)
• Conventional software + nginx, Tomcat, ruby, PHP, Python, memcached, Key-Value Store, Hadoop, Cassandra, MongoDB…etc
• The need for scalability
• Upgrading is not effective
I think that one of the answers to this problem
is resource monitoring using Munin.
38. The essence of Munin is
the visualization of many resources
I Know What Your Server Did Last Summer
39. MRTG has declined
Is This MRTG? No, This Is Munin.
We have lost a hero to our glorious and noble cause, but does this foreshadow our defeat?
No. It is a new beginning. Compared to Cloud Computing Federation the national resources
of Dedicated Server are less than one thirtieth of theirs. Despite this major difference,
how is it that we have been able to fight the fight for so long? It is because our goal in
this war is a righteous one. It’s been over fifty years since the elite of Cloud Computing,
consumed by greed took control of the Cloud Computing Federation. We want our freedom.
Never forget the times when the Federation has trampled us! We, the Principality of
Dedicated Server, have had a long and arduous struggle to achieve freedom for all
engineers of our great network. Our fight is sacred, our cause divine.
My beloved brother, MRTG, was sacrificed. Why? The war is at a stalemate.
42. Comparative table
Tool name | Type | Datastore | Config | Web interface | Alerting
Munin | Resource monitoring | RRDTool | CUI | Reference only |
Cacti | Resource monitoring | RRDTool & MySQL | CUI/GUI | |
MRTG | Resource monitoring | original | CUI | Reference only | ×
Zabbix | IT infrastructure monitoring | MySQL, PostgreSQL, etc | GUI | |
Nagios | IT infrastructure monitoring | MySQL or PostgreSQL | CUI/GUI | |
We are friends all the time...
Each has both good points and bad points.
My team uses Munin and a Nagios-based tool, each where it fits.
44. About Munin Be alert!
• http://munin-monitoring.org/
• Resource monitoring tool
– Munin can analyze resource trends
– “what just happened to kill our performance?”
• Plug and Play architecture
– It can monitor many items by default
Munin is a networked resource monitoring tool that can help analyze resource trends and
"what just happened to kill our performance?" problems. It is designed to be very plug and
play. A default installation provides a lot of graphs with almost no work.
46. Progress in development
• Community based
– Github
• https://github.com/munin-monitoring
– Mailing list
• https://lists.sourceforge.net/lists/listinfo/munin-users
– IRC
• irc://irc.oftc.net/#munin
• License
– GNU General Public License version 2
– There is no commercial support
47. History
• 2002 - project began
– The original name is “LRRD”
• 2004 - Munin 1.0 released
– “munin-eye” name was changed to “munin-node”
– It took a long time, and improvement continued day by day
• 2009 - Munin 1.4 released
– I think it is probably the most widespread version of the 1.x series.
• May 30, 2012 - Munin 2.0 (stable) released
48. Where is the Japanese information?
• NOT YET!
• Let’s make it together now!
– How about writing something on the wiki first?
• http://munin.jp/wiki/
“Is the number of invitations to the Munin user group ZERO again this week?
Hmm? Do you even feel like doing it?”
“I’m sorry, my apologies…”
50. This Photo is under creative commons license
http://en.wikipedia.org/wiki/File:Odin_hrafnar.jpg
51. What obstacle
factors there are!!
Are you getting
wise with me?
Speaking munin-eye’s mind ( now, munin-node )
52. Summarize the points
• Munin is a resource monitoring tool. (GPL v2)
• Simple and powerful architecture.
• Munin frees us from a console. (effectiveness)
• “Munin” means “memory”.
You are never alone!
Munin is always here for you 24x7x365
59. This is the material that we
referred to a little while ago.
60. This is the main work of the Munin master; its programs are
executed by cron.
One by one, it collects the data, checks the thresholds,
and generates the HTML files and graphs.
63. Plugins are executed by munin-node; each plugin is a
script that acquires some kind of data. munin-update stores the
acquired data in RRDtool.
Then munin-limits checks the thresholds.
64. Then munin-graph and munin-html
generate the graphs and HTML from
the data (.rrd) stored by
RRDtool.
65. This flow is the basic operation of Munin.
I think that it is really simple and cool!
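The flow above can be sketched as a small dry-run script. It is only an illustration of the order of the stages; the function name munin_master_stages is made up for this example, and the cron line in the comment is an assumption about a typical packaged install, not something taken from these slides.

```shell
#!/bin/sh
# Dry-run sketch: list the master-side stages in the order described above.
# munin_master_stages is a hypothetical helper, not part of Munin.
munin_master_stages() {
    echo "munin-update: fetch values from every munin-node and store them in .rrd files"
    echo "munin-limits: compare the stored values with the configured thresholds"
    echo "munin-graph: draw graphs from the .rrd files"
    echo "munin-html: generate the HTML pages that show the graphs"
}

# Assumption: on a typical packaged install, cron starts the whole cycle
# every five minutes through a wrapper script, for example:
#   */5 * * * * munin /usr/bin/munin-cron
munin_master_stages
```

Nothing here touches a real installation; it only prints the four stages so the order is easy to remember.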
68. About data collection
• munin-node collects various data.
• Port 4949 (TCP)
– Munin protocol
• LIST
• CONFIG
• FETCH
• VERSION
• QUIT
(T_T)4949
“4949” sounds like a Japanese onomatopoeia for a "tearful face".
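As a rough illustration of the protocol, the following sketch builds a short session and shows how it could be sent by hand. It assumes a munin-node listening on localhost:4949, that the nc command is installed, and that the load plugin is enabled; the helper name munin_session is made up for this example. The commands are written in lowercase, as they usually are when typed manually.

```shell
#!/bin/sh
# Build the text of a short munin-node session for one plugin.
# munin_session is a hypothetical helper name used only in this sketch.
munin_session() {
    plugin=$1
    printf 'list\n'                 # LIST: names of the enabled plugins
    printf 'config %s\n' "$plugin"  # CONFIG: graph definition for the plugin
    printf 'fetch %s\n' "$plugin"   # FETCH: current values for the plugin
    printf 'quit\n'                 # QUIT: close the connection
}

# Print the session instead of sending it, so this runs without a node.
munin_session load

# To talk to a real node (assumes nc and a node on localhost:4949):
#   munin_session load | nc localhost 4949
```

Because the protocol is plain text over TCP, the same session can also be typed interactively with telnet.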
69. Data storage and graph generation are
the work of RRDtool
• Data format is RRD (round robin database)
– /var/lib/munin/<hostname>/<plugin’s name>.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-idle-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-iowait-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-irq-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-nice-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-softirq-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-steal-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-system-d.rrd
-rw-r--r-- 1 munin munin 50612 Oct 18 2010 localhost-cpu-user-d.rrd
• About 50 KB per RRD file
– More than 200 KB per plugin (MUST)
– 150 to 250 files per munin-node (about 8 to 15 MB per node in total)
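The sizing figures can be turned into a quick estimate. This is back-of-the-envelope arithmetic only: it assumes about 50 KB per .rrd file, as in the listing, and the function name rrd_disk_mb is made up for this example. Integer division rounds down, so the results land a little under the range quoted on the slide.

```shell
#!/bin/sh
# Estimate disk usage in MB for one node.
# $1 = number of .rrd files, $2 = KB per file (default 50, an assumption
# based on the 50612-byte files in the listing).
rrd_disk_mb() {
    files=$1
    kb_per_file=${2:-50}
    echo $(( files * kb_per_file / 1024 ))
}

rrd_disk_mb 150   # low end of the range, prints 7
rrd_disk_mb 250   # high end of the range, prints 12
```

Multiplying the result by the number of nodes gives a first guess at the disk space the Munin master needs.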
72. Ex) Load Average plugin
• /etc/munin/plugins/load
– The “load average” it reports is the five-minute average
– It’s a symbolic link
• The original is /usr/share/munin/plugins/load
– Simple shell script
echo -n "load.value "
cut -f2 -d' ' < /proc/loadavg
load.value 3.22
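To show how those two lines fit into a complete plugin, here is a minimal sketch. It is not the load plugin shipped with Munin; the function name load_plugin and the LOADAVG_FILE variable are made up for this example (the variable exists only so the script can be tried on a machine without /proc/loadavg). With the argument config it prints the graph definition, and with no argument it prints the value.

```shell
#!/bin/sh
# Where to read the load average from (assumption: overridable for testing).
LOADAVG_FILE=${LOADAVG_FILE:-/proc/loadavg}

# load_plugin is a hypothetical name for this sketch.
load_plugin() {
    if [ "$1" = "config" ]; then
        # Graph definition, printed when munin-node asks for "config".
        echo "graph_title Load average"
        echo "graph_vlabel load"
        echo "load.label load"
        return 0
    fi
    # The second field of /proc/loadavg is the five-minute average.
    printf 'load.value '
    cut -f2 -d' ' < "$LOADAVG_FILE"
}

# In a real plugin file the last line would be:  load_plugin "$@"
load_plugin config
```

The same two-mode shape (config, then values) is all that munin-node expects from any plugin.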
76. Environment
• Perl5
• OS
– Linux
• Source code ( version 2.0.6 )
• Binary Package
– Red Hat Enterprise Linux family (EPEL)
– Debian
– openSUSE
– MacOS X
– Windows
77. Setting up flow
• Install Munin and Perl Libraries
• Change a config file ( munin.conf )
• Setting up munin-node ( munin-node.conf )
• Check its graphs
78. Case) Red Hat Enterprise Linux
• Use EPEL*1(testing repository) package or source
• procedure
– 1. enabling EPEL
– 2. “yum install munin”
– 3. configure munin.conf
– 4. turn on munin-node and setup
– 5. check
*1 Extra Packages for Enterprise Linux(EPEL) https://fedoraproject.org/wiki/EPEL
79. Case) Debian / Ubuntu
• Use apt (Debian PTS is testing) or Source
• Procedure
– 1. setting up Perl libraries (via apt-get)
– 2. install munin
– 3. configure munin.conf
– 4. turn on munin-node and setup
– 5. check
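The two procedures above boil down to a few commands. The sketch below only prints them (a dry run), because the real commands need root and a configured repository. The package names munin and munin-node are the ones used by EPEL and Debian; the service and chkconfig lines are an assumption about the init scripts of that era, and install_steps is a name made up for this example.

```shell
#!/bin/sh
# Print (not run) the install commands for a distribution family.
# install_steps is a hypothetical helper for this sketch.
install_steps() {
    case "$1" in
        rhel)
            echo "yum install munin munin-node"
            echo "chkconfig munin-node on"      # assumption: SysV-style init
            echo "service munin-node start"
            ;;
        debian)
            echo "apt-get install munin munin-node"
            echo "service munin-node restart"   # assumption: service wrapper present
            ;;
        *)
            echo "unknown distribution: $1" >&2
            return 1
            ;;
    esac
}

install_steps rhel
install_steps debian
```

After installing, munin.conf and munin-node.conf still have to be edited as described in the steps above.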
82. [munin.conf] set target node
[GroupName;node1.pocketstudio.net]
address 127.0.0.1
use_node_name yes
83. [munin-node.conf] Access control
• allow ^127\.0\.0\.1$
– Regular expression (escape the dots so that they match only a literal “.”)
• cidr_allow 192.0.2.0/24
– Not a regular expression
• If you change the file, you must restart
munin-node!
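Because allow takes a regular expression, an unescaped dot matches any character. The check below uses grep -E as a stand-in for the matching that munin-node does (an approximation, since munin-node uses Perl regular expressions), and matches is a helper name made up for this example.

```shell
#!/bin/sh
# Return success if the string in $2 matches the pattern in $1.
# matches is a hypothetical helper; grep -E only approximates Perl regexes.
matches() {
    printf '%s\n' "$2" | grep -Eq "$1"
}

for pattern in '^127.0.0.1$' '^127\.0\.0\.1$'; do
    for addr in 127.0.0.1 127a0b0c1; do
        if matches "$pattern" "$addr"; then
            echo "$pattern  matches  $addr"
        else
            echo "$pattern  rejects  $addr"
        fi
    done
done
```

The unescaped pattern accepts the nonsense string as well, which is why the escaped form is the safer habit.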
85. Basic knowledge of Munin plugin
• The original files are here (shell or Perl scripts)
– /usr/share/munin/plugins/
• How to use
– Make a symbolic link in /etc/munin/plugins
– Configure munin-node.conf
– Restart munin-node (MUST)
– Check the graph and HTML
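The steps above can be sketched as a small helper. The directory defaults follow the paths on this slide; the function name enable_plugin and the two override variables are made up for this example so that it can be tried in a scratch directory. The restart command in the comment is an assumption and depends on your init system.

```shell
#!/bin/sh
# Directories (defaults from the slide; overridable for a dry run).
PLUGIN_SRC=${PLUGIN_SRC:-/usr/share/munin/plugins}
PLUGIN_DST=${PLUGIN_DST:-/etc/munin/plugins}

# enable_plugin is a hypothetical helper: symlink one plugin into place.
enable_plugin() {
    name=$1
    if [ ! -e "$PLUGIN_SRC/$name" ]; then
        echo "no such plugin: $PLUGIN_SRC/$name" >&2
        return 1
    fi
    ln -s "$PLUGIN_SRC/$name" "$PLUGIN_DST/$name"
}

# Typical use (as root), followed by the mandatory restart:
#   enable_plugin load && service munin-node restart
```

Removing the symbolic link and restarting munin-node disables the plugin again.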
86. How to debug plugin
• /usr/sbin/munin-run <plugin-name>
– “--debug” shows more detail
– behavior is same as munin-node
– useful
• Command-line tools (I made these)
– muninwalk & muninget; Perl scripts
https://github.com/zembutsu/muninwalk
96. Sample case; httping plugin
• http://www.vanheusden.com/httping/
• "httping" is a command-line tool that checks the
response time of a web server, like the “ping”
command.
• If you set the -S option, you can see the connect
time and the processing time separately.
$ httping -S http://210.239.46.254/
PING 210.239.46.254:80 (http://210.239.46.254/):
connected to 210.239.46.254:80 (380 bytes), seq=0 time=0.10+0.69=0.79 ms
connected to 210.239.46.254:80 (380 bytes), seq=1 time=0.08+0.47=0.55 ms
connected to 210.239.46.254:80 (380 bytes), seq=2 time=0.07+0.68=0.75 ms
connected to 210.239.46.254:80 (380 bytes), seq=3 time=0.12+0.66=0.77 ms
Got signal 2
--- http://210.239.46.254/ ping statistics ---
4 connects, 4 ok, 0.00% failed
97. Plugin: httping_
#!/bin/sh
#
# Plugin to monitor HTTP response (httping)
#%# family=auto
#%# capabilities=autoconf
URL=${URL:-"http://localhost/"}
COUNT=${COUNT:-"5"}
httping_bin=$(which httping)
if [ "$1" = "autoconf" ]; then
    echo yes
    exit 0
fi
# Define the graph
if [ "$1" = "config" ] ; then
    echo "graph_args -r --lower-limit 0 ";
    echo "graph_title http response $URL";
    echo "graph_category httping";
    echo "graph_info httping response time: $URL";
    echo 'graph_vlabel msec'
    echo "connect.label connect time"
    echo "connect.draw AREA"
    echo "connect.type GAUGE"
    echo "processing.label processing time"
    echo "processing.draw STACK"
    echo "processing.type GAUGE"
    exit
fi
# format for httping 1.5.3 http://www.vanheusden.com/httping/
$httping_bin -c $COUNT -G -S $URL | tr '+|=' ' ' | awk '{connect+=$9; processing+=$10} END{print "connect.value",connect/'$COUNT'"\n""processing.value",processing/'$COUNT'}'
This is the substance of the httping plugin; the file itself is a simple shell script.
Its contents are the definition of the graph and the commands that actually acquire the values.
The point is only to acquire data, so a plugin can be written in any language, including Perl and PHP.
The output format is “xxx.value ***”.
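To see what the tr and awk stage does, the sketch below feeds it the sample lines from the httping run shown earlier instead of calling httping. The function name parse_httping is made up for this example, and unlike the plugin it divides by the number of "connected" lines it actually saw rather than by COUNT, which is a small variation and not the plugin's own logic.

```shell
#!/bin/sh
# Read httping -S output (httping 1.5.3 format) on stdin and print
# averaged values in Munin's "name.value number" format.
# parse_httping is a hypothetical helper for this sketch.
parse_httping() {
    tr '+|=' ' ' | awk '
        $1 == "connected" { connect += $9; processing += $10; n++ }
        END {
            if (n > 0) {
                print "connect.value", connect / n
                print "processing.value", processing / n
            }
        }'
}

# Sample lines copied from the httping run shown earlier.
sample='connected to 210.239.46.254:80 (380 bytes), seq=0 time=0.10+0.69=0.79 ms
connected to 210.239.46.254:80 (380 bytes), seq=1 time=0.08+0.47=0.55 ms
connected to 210.239.46.254:80 (380 bytes), seq=2 time=0.07+0.68=0.75 ms
connected to 210.239.46.254:80 (380 bytes), seq=3 time=0.12+0.66=0.77 ms'

printf '%s\n' "$sample" | parse_httping
# prints:
#   connect.value 0.0925
#   processing.value 0.625
```

After tr replaces "+" and "=" with spaces, the connect time is field 9 and the processing time is field 10, which is why the plugin sums those two columns.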
99. httping live demo
• http://demo.munin.jp/munin2/httping-day.html
In this case, this server has no problem with either response time or processing time.
For this server group, the blue part (processing time) is large. A certain CMS is taking up the processing time.
On the other hand, I can see that the network is fine.
101. Never say never.
• Agility is the key to the service (in my case)
– Watch closely, then identify and resolve the cause of the trouble
• Hardware, software, or network?
– We need to investigate promptly
• where the problem is happening
102. Live Munin demo
• http://demo.munin.jp/
– Let's observe the resource situation through
this Munin demonstration site.
• Where is the bottleneck? Or where will it be?
• Even without logging in to a server, you can
check many resources.
104. Case) identified unauthorized access
• By the Time We Realized It, It Had Already
Begun.
• Situation
– 1. Error emails began to arrive at postmaster
– 2. The monitoring tool raised no alert
– 3. So I first checked the resources in Munin
– 4. From the situation, I identified that the CMS had a
vulnerability and acted promptly.
Thanks to Munin, I was able to do all of the above in a short time.
105. How to find it.
Sendmail’s queue rose suddenly
Load Average has no problem
106. I confirmed the time when traffic was strange
MySQL’s queries rose suddenly, too
From this situation, I suspected unauthorized access to the CMS. In fact, when I investigated the logs
for that time, I found an attack on a specific URL.
Without Munin, identifying the cause and acting on it would have taken more time.
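Once the graphs point to a time window, the access log for that window can be summarized per URL. This is a sketch under assumptions: it expects the Apache common/combined log format (timestamp in field 4, request path in field 7), the sample lines are invented for illustration and are not from the real incident, and top_urls is a made-up helper name.

```shell
#!/bin/sh
# Count requests per URL within a time window.
# $1 = timestamp prefix such as "08/Sep/2012:10"; the log is read from stdin.
# top_urls is a hypothetical helper for this sketch.
top_urls() {
    awk -v t="$1" '
        index($4, t) == 2 { count[$7]++ }   # field 4 starts with "[" then the date
        END { for (url in count) print count[url], url }
    ' | sort -rn
}

# Invented sample lines in Apache common log format.
sample='192.0.2.10 - - [08/Sep/2012:10:15:01 +0900] "POST /cms/comment.php HTTP/1.1" 200 512
192.0.2.11 - - [08/Sep/2012:10:15:02 +0900] "POST /cms/comment.php HTTP/1.1" 200 512
192.0.2.12 - - [08/Sep/2012:10:16:40 +0900] "GET /index.html HTTP/1.1" 200 2048
192.0.2.10 - - [08/Sep/2012:11:00:00 +0900] "GET /index.html HTTP/1.1" 200 2048'

printf '%s\n' "$sample" | top_urls "08/Sep/2012:10"
```

A URL that suddenly dominates the window in which the graphs spiked is a good first suspect.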
111. No munin, No Troubleshoot.
I'm Not Afraid of Anything Anymore
112. Munin changed support flow (my case)
• If I don’t use tools
– Troubleshooting means running various commands (sysstat) and
investigating the log files.
– But this method takes a long time and many people,
and is bad for the service.
• If I use Munin (now)
– Even if I do not log in, I can understand the situation.
– I can spot abnormalities visually
• “I see the ending of this troubleshooting!”
– Agile Support
• Troubleshooting that follows Plan-Do-Check-Action (PDCA) cycles.
113. In my dedicated server hosting work
• I really depend on Munin
– I always set up Munin.
– Munin runs on most of the several
hundred servers that I manage directly.
– I think that Munin is indispensable for
improving our service quality.
Neat
I cannot part with Munin for my work.
You believe it!
BAM
BAM!
115-118. Troubleshoot PDCA: Law of Cycles (diagram built up over four slides)
• Plan: Detecting problem and situation
• Do: Suppose a cause
• Check: To check resources remotely
• Action: Log in and execute commands
Speech bubbles: “What are these alerts? For real?” / “Presage!!” / “OK, Munin. Please tell me that trouble lies hidden in wherever?” / “Fire!” / “Please stop!!” / “I just talk about what I just looked in Munin!!” / “Wow!” / “click-clack click-clack”
119. You are never alone!
Munin is always here for you
24 x 7 x 365
The Only Thing I Have Left To Guide Me
120. Munin’s overview
・Munin is a resource monitoring tool that
specializes in helping you notice problems through visualization.
・Simple architecture, and many plug-ins.
・It is best suited to systems that
need quick support in a short time.
121. Conclusion * This is my personal impression.
No munin, No Operation.
While there’s Munin, there’s hope.
MY VERY BEST MONITORING TOOL.
Thank you for MUNIN. Good-bye to MRTG.
122. I wish…
• If my presentation has made you interested in
Munin, I would be glad if you tried it.
• Tomorrow is another day. It's up to you.
Squidn’t you use Munin?
(Shouldn’t)
123. Questions?
• Do you have any questions about Munin?
I'm glad you asked.
As a reward, I'll give you the right to buy Opoona.
(But it's in the bargain bin...)
124. References
• Munin
– http://munin-monitoring.org/
• Munin User Group Japan
– http://munin.jp/
– http://munin.jp/wiki/
• Website
– Waiting for Munin 2.0 – Introduction – Personal Workflow Blog
• http://blog.pwkf.org/post/2010/06/Waiting-for-Munin-2.0-Introduction
– /tags/2.0.0/ChangeLog – Munin – Trac
• http://munin-monitoring.org/browser/tags/2.0.0/ChangeLog
Please send feedback to zem@pocketstudio.jp or @zembutsu (Twitter)
Thank you for reading!