CPU load over 70 means I can't even ssh into my server
edit: you are right, it's the I/O WAIT that is destroying my performance:
%Cpu(s): 0,3 us, 0,5 sy, 0,0 ni, 50,1 id, 49,0 wa, 0,0 hi, 0,1 si, 0,0 st
I could clearly see it using nmon > d > l > -, as was suggested by @SayCyberOnceMore. Not quite sure what to do about it, as it's simply my sdb1 drive, which is a Samsung 1TB 2.5" HDD. I have now ordered a 2TB SSD and maybe I am going to reinstall from scratch on that new drive as sda1. I realize that's just treating the symptom and not the root cause, so I should probably also look for that root cause. But that's for another Lemmy thread!
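For anyone who wants to watch that wa figure without top, here is a minimal Python sketch. It assumes a Linux box where /proc/stat is readable, samples the aggregate cpu line twice, and reports what share of the elapsed ticks went to iowait. The function names are made up for illustration.

```python
import time


def read_cpu_ticks(path="/proc/stat"):
    """Return the aggregate 'cpu' counters from /proc/stat as a list of ints."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(x) for x in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples.

    Field order in /proc/stat: user nice system idle iowait irq softirq steal ...
    """
    deltas = [b - a for a, b in zip(before, after)]
    # Sum user..steal only; guest time is already counted inside user/nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


if __name__ == "__main__":
    try:
        first = read_cpu_ticks()
        time.sleep(1)
        print(f"iowait: {iowait_percent(first, read_cpu_ticks()):.1f}%")
    except FileNotFoundError:
        print("no /proc/stat here; this sketch only works on Linux")
```

Run it while one of the heavier containers is starting. A value that stays in the tens of percent would be consistent with what top reports above.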
I really don't understand what is causing this. I run a few very small containers, and everything is fine - but when I start something bigger like Photoprism, Immich, or even MariaDB or PostgreSQL, then something causes the CPU load to rise indefinitely.
Notably, the top command doesn't show anything special, nothing eats RAM, nothing uses 100% CPU. And yet, the load is rising fast. If I leave it be, my ssh session loses connection. Hopping onto the host itself shows a load of over 50, or even over 70. I don't grok how a system can even get that high at all.
My server is an older Intel i7 with 16GB RAM running Ubuntu 22.04 LTS.
How can I troubleshoot this, when 'top' doesn't show any culprit and it does not seem to be caused by any one specific container?
(this makes me wonder how people can run anything at all off of a Raspberry Pi. My machine isn't "beefy" but a Pi would be so much less.)
It sounds like it could be an IO wait issue; system load will climb a ton without showing much CPU usage.
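Some background on why the load can climb like that: on Linux the load average counts tasks in uninterruptible sleep (state D, typically waiting on disk) as well as runnable ones, so blocked processes push the number up while the CPU sits mostly idle. Below is a rough Python sketch for listing them, assuming /proc is available. The helper names are hypothetical, and it only looks at each process's main thread, so it can undercount.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm in blocked_processes():
            print(pid, comm)
    else:
        print("no /proc here; this sketch only works on Linux")
```

Running it a few times while the load is rising should show which processes keep turning up in state D, even when none of them uses much CPU.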
Make sure you're not running out of RAM and going into swap space, though it doesn't sound like it.
iotop might show something useful. And in htop you can add the 'PERCENT_IO_DELAY' column, which can be useful.
My money is also on IO. Outside of CPU and RAM, it's the most likely resource to get saturated (especially if using rotational magnetic disks rather than an SSD; magnetic disks are going to be the performance limiter by a lot for many workloads), and also the one that OP said nothing about, suggesting it's a blind spot for them.
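To check whether one particular disk is the saturated one, a sketch along these lines may help. It assumes Linux's /proc/diskstats, where the tenth statistic after the device name is the number of milliseconds the device spent with I/O in flight. As far as I know, iostat's %util is derived from the same counter. The function names are illustrative only.

```python
import time


def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O (io_ticks)."""
    busy = {}
    for line in text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then the statistics; io_ticks is the 10th.
        if len(fields) >= 13:
            busy[fields[2]] = int(fields[12])
    return busy


def utilisation(before, after, interval_s):
    """Percent of the interval each device had at least one request in flight."""
    return {
        dev: 100.0 * (after[dev] - before[dev]) / (interval_s * 1000.0)
        for dev in after
        if dev in before
    }


def read_diskstats(path="/proc/diskstats"):
    with open(path) as f:
        return parse_diskstats(f.read())


if __name__ == "__main__":
    try:
        first = read_diskstats()
        time.sleep(1)
        for dev, pct in sorted(utilisation(first, read_diskstats(), 1).items()):
            print(f"{dev:<12} {pct:5.1f}% busy")
    except FileNotFoundError:
        print("no /proc/diskstats here; this sketch only works on Linux")
```

A rotational disk sitting near 100% busy while the CPU is idle would fit the IO theory.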
In addition to the excellent command-line approaches suggested above, I recommend installing netdata on the box as it will show you a very comprehensive set of performance metrics without having to learn to collect each one on the CLI. A downside is that it will use RAM proportional to the data retention period, which will be an issue if you're swapping hard. But even a few hours of data can be very useful, and with 16GB of RAM I feel like any swapping is likely to be a gross misconfiguration rather than true memory demand... and once that's sorted, dedicating a gig or two to observability will be a good investment.
And I know OP mentioned not using much RAM, but almost every time I see a server load that high, it's because the server is swapping heavily, causing the iowait.
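A quick way to rule swapping in or out is to watch the swap counters directly. This is a minimal sketch, assuming Linux's /proc/vmstat, where pswpin and pswpout count pages swapped in and out since boot. The function names and the 5-second interval are arbitrary choices for the example.

```python
import time


def parse_vmstat(text):
    """Turn the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            counters[name] = int(value)
    return counters


def swap_rate(before, after, interval_s):
    """Pages swapped in and out per second between two samples."""
    return {
        key: (after.get(key, 0) - before.get(key, 0)) / interval_s
        for key in ("pswpin", "pswpout")
    }


def read_vmstat(path="/proc/vmstat"):
    with open(path) as f:
        return parse_vmstat(f.read())


if __name__ == "__main__":
    try:
        first = read_vmstat()
        time.sleep(5)
        print(swap_rate(first, read_vmstat(), 5))
    except FileNotFoundError:
        print("no /proc/vmstat here; this sketch only works on Linux")
```

If both rates stay at zero while the load climbs, swap is probably not the cause and the disk itself is the more likely suspect.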
Yeah I figured I would mention it since OP does describe symptoms like that.
Yep. IO.
OP, this might be overkill for you, but it might be worth standing up a Grafana/Prometheus stack. You'd be able to see this stuff a lot faster and potentially narrow in on a root cause.
That is definitely an interesting idea! Much, much better than the stupid dashdot container I am running now :-D