We have been battling with one of our six Domino XPage servers, which was running a consistently higher load average than the others. The performance was OK from a customer point of view, but differences between systems are always a worry.
We tried everything I could think of, probably spending 2 to 3 days on this overall. We checked the XPage application, we checked the Domino configuration, we moved the server to a different host, and we even rebuilt it from a (new) standard image. Anyhow, the problem has been solved 🙂
We recently did a routine update to the Linux OS and the issue went away. It seems that there was a bug in the Linux release we were using, which Red Hat describes as follows:
“Due to prematurely decremented calc_load_task, the calculated load average was off by up to the number of CPUs in the machine. As a consequence, job scheduling worked improperly causing a drop in the system performance” (https://rhn.redhat.com/errata/RHSA-2016-0494.html)
So the moral of this story:
- Do some monitoring, so that you can see how updates to the OS and other software affect performance. When you roll out an update, look for improved or decreased performance. Always being on the latest release is not necessarily the solution.
- Never build a standard recovery image and assume that it will run correctly unless you have actually run it in production. This server was first built from our latest build template (which included a new OS version, and that version was broken), and when we rebuilt it we used the same image again (still with the broken OS).
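As a rough illustration of the monitoring point, here is a minimal sketch of a Nagios-style check that reports the 1-minute load average per CPU, which makes servers with different core counts comparable. It is not the check we actually run: the function name `classify_load` and the thresholds of 0.7 and 1.0 per CPU are assumptions chosen for the example. The only convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_load(load_1min, cpu_count, warn_per_cpu=0.7, crit_per_cpu=1.0):
    """Return (exit_code, message) for a load average normalised per CPU.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    per_cpu = load_1min / cpu_count
    if per_cpu >= crit_per_cpu:
        status, label = CRITICAL, "CRITICAL"
    elif per_cpu >= warn_per_cpu:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # Text after the "|" is performance data that Nagios-based tools can graph.
    message = f"LOAD {label} - {per_cpu:.2f} per CPU | load_per_cpu={per_cpu:.2f}"
    return status, message


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    status, message = classify_load(load_1min, cpus)
    print(message)
    sys.exit(status)
```

Graphing a value like this across all servers before and after an OS update is what makes a difference between supposedly identical machines stand out.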
We are using the Opsview Atom monitoring suite, which is based on the Nagios open source platform. It is fairly good (and well priced). I will be posting some articles on it soon.