The moral of the story – do some monitoring

We have been battling with one of 6 Domino XPage servers that was running a consistently higher load average than the others. Performance was OK from a customer point of view, but differences between systems are always a worry.


We tried everything I could think of, probably spending 2-3 days on this overall. We checked the XPage application, we checked the Domino configuration, we even moved it to a different host server, and we even rebuilt it from a (new) standard image. Anyhow, the problem has been solved 🙂

We recently did a routine update to the Linux OS and the issue went away. It seems that there was a bug in the Linux release we were using. From the Red Hat advisory:

“Due to prematurely decremented calc_load_task, the calculated load
average was off by up to the number of CPUs in the machine. As a
consequence, job scheduling worked improperly causing a drop in the system
performance.” – https://rhn.redhat.com/errata/RHSA-2016-0494.html

So the moral of this story:

  1. Do some monitoring, so that you can see how updates to the OS and other software affect performance. When you roll out an update, look for improved or decreased performance; always being on the latest release is not necessarily the solution.
  2. Never build a standard recovery image and assume that it will run correctly unless you have actually run it in production. This server was first based on our latest build template (including a new OS version, which was broken), and when we rebuilt it we used that image again (still with the broken OS).
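To make the first point concrete, here is a minimal sketch of comparing load average samples taken before and after an update. The function name `compare_load`, the sample lists and the 10% tolerance are all assumptions made up for illustration; in practice the samples would come from whatever your monitoring tool records.

```python
from statistics import mean


def compare_load(before, after, tolerance=0.10):
    """Classify the change in mean load average after an update.

    before / after: lists of load average samples (hypothetical data).
    tolerance: assumed relative change below which we call it "unchanged".
    Returns (verdict, relative_change).
    """
    base = mean(before)
    new = mean(after)
    if base == 0:
        # Avoid dividing by zero when the server was completely idle before.
        change = 0.0 if new == 0 else float("inf")
    else:
        change = (new - base) / base

    if change > tolerance:
        verdict = "worse"
    elif change < -tolerance:
        verdict = "better"
    else:
        verdict = "unchanged"
    return verdict, change


# Example with made-up numbers: load roughly halves after the OS update.
verdict, change = compare_load([2.0, 2.2, 2.1], [1.0, 1.1, 0.9])
print(f"{verdict} ({change:+.0%})")
```

Running the same comparison against a newly built image before it becomes the standard template would also have helped with the second point.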

We are using the Opsview Atom monitoring suite, which is based on the Nagios open source platform. It is fairly good (and well priced). I will be posting some articles on it soon.
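For anyone who wants to roll their own check, here is a rough sketch of a Nagios-style load check. It follows the usual Nagios plugin convention of signalling state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not what Opsview ships; the function names and the per-CPU thresholds are assumptions chosen for illustration. Dividing by the CPU count makes servers of different sizes comparable, which is relevant here since the bug skewed the load by up to the number of CPUs.

```python
import os

# Nagios plugin convention: the exit code carries the state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_load(load1, cpus, warn=1.0, crit=2.0):
    """Return (exit_code, message) for a 1-minute load average.

    warn and crit are assumed per-CPU thresholds, not recommended values.
    """
    if cpus < 1:
        return UNKNOWN, "LOAD UNKNOWN - could not determine CPU count"
    per_cpu = load1 / cpus
    if per_cpu >= crit:
        state, code = "CRITICAL", CRITICAL
    elif per_cpu >= warn:
        state, code = "WARNING", WARNING
    else:
        state, code = "OK", OK
    return code, f"LOAD {state} - load1={load1:.2f} cpus={cpus} per_cpu={per_cpu:.2f}"


def main():
    """Read the local load average (Unix only) and print a one-line status."""
    load1, _load5, _load15 = os.getloadavg()
    code, message = check_load(load1, os.cpu_count() or 0)
    print(message)
    return code
```

A wrapper script would finish with `raise SystemExit(main())` so that the monitoring system sees the exit code.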

[Image: load average]