Munin CPU Disk Throughput reduction load average

The System

I run Munin 2.0.16 on CentOS 6 to monitor about 30 hosts.
I monitor roughly 25-30 parameters per host (the exact number depends on the running services).
About half of the hosts are nearby (LAN latency); the other half are far away (about 250 ms network latency).

The facts

I noticed that "munin-graph" and "munin-html" took a long time: long enough for successive runs to overlap if I kept the default interval (5 minutes).
I also noticed a very high load average.
One of the most annoying things was the Munin cron job exiting with an error (which sends a mail to the administrators) because the lock file from the previous run still existed. This generates a lot of noise and lowers the perceived importance of the messages Munin sends.

Some solutions

I was helped by this detailed post about Munin performance.

I decided to put both /var/lib/munin and /var/www/html/munin on tmpfs.
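
After such a change it is worth confirming that both directories really sit on tmpfs and not on the underlying disk. Here is a minimal Python sketch of that check, assuming a Linux mount table in the /proc/mounts format. The function name fs_type_for is my own, not part of Munin.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that holds `path`.

    `mounts_text` is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    Returns None if no mount point matches.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = os.path.normpath(fields[1]), fields[2]
        # The path belongs to this mount if it is the mount point itself
        # or lies somewhere below it.
        below = path.startswith(mount_point.rstrip("/") + "/")
        if path == mount_point or below:
            # Keep the deepest mount point; on a tie, the later line wins,
            # which matches a mount stacked over an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type
```

On a live host, calling fs_type_for("/var/lib/munin", open("/proc/mounts").read()) should return "tmpfs" once the mount is in place. The same call works for /var/www/html/munin.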

To limit data loss on reboot, I make an hourly dump to the on-disk file system. This is not expensive.
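
The hourly dump can be sketched in Python as follows. This is an illustration under my own assumptions and not necessarily the exact method used here: the function name dump_tree and the ".new"/".old" suffixes are hypothetical. The copy goes to a temporary sibling directory first, so an interrupted run never leaves a half-written dump in place of the previous good one.

```python
import os
import shutil


def dump_tree(src, dst):
    """Copy the directory `src` (on tmpfs) to `dst` (on disk).

    The copy is written to `dst + ".new"` and then renamed into place,
    so `dst` always holds a complete dump, either the old or the new one.
    """
    tmp = dst + ".new"   # hypothetical suffix for the copy in progress
    old = dst + ".old"   # hypothetical suffix for the dump being replaced

    # Clean up leftovers from a run that was interrupted.
    shutil.rmtree(tmp, ignore_errors=True)
    shutil.rmtree(old, ignore_errors=True)

    # copytree uses shutil.copy2 by default, so timestamps and permission
    # bits are kept. File ownership is NOT preserved.
    shutil.copytree(src, tmp, symlinks=True)

    # Swap the new dump into place, then drop the previous one.
    if os.path.exists(dst):
        os.rename(dst, old)
    os.rename(tmp, dst)
    shutil.rmtree(old, ignore_errors=True)
```

An hourly cron job could call dump_tree("/var/lib/munin", "/var/backups/munin-lib"), where the destination path is only an example. Because ownership is not preserved, the job is best run as the user that owns the Munin files. A boot script can use the same function in the opposite direction to repopulate the tmpfs before Munin starts.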

Results

I did it step by step:

First /var/www/html/munin, then I watched for a while
Then /var/lib/munin, and I watched again

About disk usage, this is what happened:

About load average, this is what happened:

Conclusion

I just saved some disk I/O, mostly writes. Nothing more.

Mihamina Rakotomandimby
