Hi, Peter.
The particularly difficult part of this situation is that when the spikes happen, SSH becomes essentially nonfunctional, like the rest of the server. Since the spikes come at totally random times and only every 12-72 hours, it's nearly impossible for us to be watching at the right moment. The few times we have had an SSH session open when it happened, the session froze, and only after the load average calmed back down did it (and the rest of the server) respond again. The only thing we've EVER been able to catch is from the few times top finally responded: it shows kswapd0 immediately jumping to 100% for the duration of the spike. But no access logs or anything else show anything unusual beforehand. All our research on the kswapd0 situation points to changing the swappiness setting, which we've tried many, many times.
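In case it helps to see what we've tried: here's roughly how we've been setting swappiness, plus the kind of always-on background logging we've set up so the spike gets captured even when interactive tools freeze (a sketch; the swappiness value, file names, and time windows are just examples):

```shell
# Persist a swappiness value across reboots (we've tried a range of values):
echo 'vm.swappiness = 10' > /etc/sysctl.d/90-swappiness.conf
sysctl -p /etc/sysctl.d/90-swappiness.conf

# Lightweight always-on logging via sysstat, so the spike is recorded
# even when SSH is unusable during the event.
yum install -y sysstat
systemctl enable sysstat
systemctl start sysstat
# Note: the default cron job in /etc/cron.d/sysstat samples every 10 minutes;
# it can be edited down to 1-minute samples to catch short spikes.

# After a spike, pull the stats for the window in question, e.g.:
sar -B -s 03:00:00 -e 04:00:00   # paging stats (pgscank/s shows kswapd scanning)
sar -r -s 03:00:00 -e 04:00:00   # memory usage over the same window
```

The sar history survives the freeze, so even if we're asleep when it hits, the paging counters for that window are waiting for us afterward.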
We've been running servers for 20+ years and haven't seen an issue like this. So frustrating.
One thing to note: this issue DOES seem somehow tied to WordPress sites. The only new CentOS 7 server that has never had these spikes hosts only Joomla sites. All the others host a mix of WordPress, Joomla, custom code, etc. WordPress is by FAR the greediest and pickiest with resources, but exactly what it might be triggering is a mystery. And the CentOS 6 servers we migrated from never, ever did this. These are the same sites we moved from CentOS 6 to CentOS 7 with the latest Plesk, and we noticed almost immediately (within days) that this issue was happening. The original spikes would hit 10-12 times per day and could last up to 30 minutes, so after months of countless tweaks we're vastly improved now - but we'd SO like to know what's actually causing this and stamp it out for good.