The System
I run Munin 2.0.16 on CentOS 6 to monitor about 30 hosts. I mostly monitor 25-30 parameters per host (this depends on the running services).
About half of the hosts are nearby (LAN latency); the other half are far away (about 250 ms network latency).
The facts
I noticed that "munin-graph" and "munin-html" took a long time, long enough for successive runs to overlap if I kept the default interval (5 minutes). I also noticed a very high load average.
One of the most annoying things is the Munin cron job exiting with an error (which sends a mail to the administrators) because the lock file from the previous run still exists. This generates a lot of noise and ends up lowering the importance given to messages sent by Munin.
Some solutions
I was helped by this detailed post about Munin performance.
I decided to put both /var/lib/munin and /var/www/html/munin on tmpfs.
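The mounts themselves would normally be declared in /etc/fstab. Once they are in place, it is worth checking that both directories really are tmpfs mount points. Here is a minimal sketch in Python 3 (an assumption: it is not what ships with CentOS 6) that reads /proc/mounts. The function names are made up for illustration.

```python
from typing import Set

def tmpfs_mount_points(mounts_text: str) -> Set[str]:
    """Return the mount points whose file system type is tmpfs.

    Each line of /proc/mounts reads: device mountpoint fstype options dump pass.
    """
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "tmpfs":
            points.add(fields[1])
    return points

def on_tmpfs(path: str, mounts_text: str) -> bool:
    """True if path is itself a tmpfs mount point (not merely below one)."""
    return path in tmpfs_mount_points(mounts_text)

if __name__ == "__main__":
    with open("/proc/mounts") as f:
        mounts = f.read()
    # The two directories named in this post.
    for directory in ("/var/lib/munin", "/var/www/html/munin"):
        print(directory, "tmpfs" if on_tmpfs(directory, mounts) else "NOT tmpfs")
```

This only checks exact mount points, which is enough here because each directory is mounted on its own.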
To limit data loss on reboot, I dump both directories to the on-disk file system every hour. This is not expensive.
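The hourly dump can be sketched as follows. This is an illustration, not the exact script I run: it assumes Python 3.8 or later (for `dirs_exist_ok`), and the backup location in the usage note is a hypothetical path. The same function, with the arguments swapped, can restore the data into the empty tmpfs at boot.

```python
import shutil
import sys
from pathlib import Path

def dump_tree(src: Path, dst: Path) -> int:
    """Copy everything under src into dst, overwriting files that already exist.

    Files deleted from src since the last dump are NOT removed from dst.
    Returns the number of regular files present in dst afterwards.
    """
    # copytree creates dst (and its parents) when it is missing;
    # dirs_exist_ok=True lets later runs reuse the same destination.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())

if __name__ == "__main__":
    # Usage: dump.py SOURCE_DIR DEST_DIR
    source, destination = Path(sys.argv[1]), Path(sys.argv[2])
    print(dump_tree(source, destination), "files in", destination)
```

An hourly cron entry would call it once per directory, for example with /var/lib/munin as the source and something like /var/backups/munin/lib (hypothetical) as the destination.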
Results
I did it step by step:
- /var/www/html/munin first, then watched for a while
- then /var/lib/munin, and watched again
About disk usage, this is what happened:
About load average, this is what happened:
Conclusion
I only saved some disk I/O, mostly writes. Nothing more.