Resolved "usermng --get-users-list" - Checking system users stuck on Google Cloud

Nomadturk

TITLE


usermng

PRODUCT, VERSION, OPERATING SYSTEM, ARCHITECTURE

Debian 11, Bullseye, Plesk Obsidian Web Host Edition
Version 18.0.54 Update #2

PROBLEM DESCRIPTION

On a brand-new Plesk installation on Google Cloud, running
plesk repair all -y

gets stuck most of the time, as shown below:

Reconfiguring the Plesk installation
Reconfiguring the Plesk installation ............................ [OK]
Checking the Plesk database using the native database server tools .. [OK]
Checking the structure of the Plesk database ........................ [OK]
Checking the consistency of the Plesk database ...................... [OK]
Checking system users

After a while we either get:
Checking system users ERR [util_exec] proc_close() failed ['/opt/psa/admin/bin/usermng' '--get-users-list'] with exit code [137]

Or the system freezes completely and you have to restart the server.

I realized that when you run
/opt/psa/admin/bin/usermng --get-users-list

the process never finishes and gradually consumes all the RAM until the system crashes.
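
To reproduce this without freezing the whole server, the command can be run with a memory cap and a time limit. The following is only a sketch: `run_capped` is a hypothetical helper name, and the limits in the usage comment are arbitrary. It relies on the shell's `ulimit -v` and the coreutils `timeout` command. Note that exit code 137 in the error above is 128 + 9, i.e. the process was killed with SIGKILL, which matches the OOM killer entries in syslog.

```shell
#!/bin/sh
# run_capped SECONDS MAX_KB COMMAND [ARGS...]
# (hypothetical helper, not part of Plesk)
# Runs COMMAND in a subshell whose virtual memory is limited to MAX_KB
# and which is terminated after SECONDS.
# Returns the command's exit status; 124 means the timeout fired,
# 125 means the memory limit could not be set.
run_capped() {
    rc_secs=$1
    rc_max_kb=$2
    shift 2
    (
        # The limit only applies inside this subshell.
        ulimit -v "$rc_max_kb" || exit 125
        exec timeout "$rc_secs" "$@"
    )
}

# Example (path taken from the report; limits are arbitrary):
#   run_capped 60 1048576 /opt/psa/admin/bin/usermng --get-users-list
#   echo "exit status: $?"
```

With the cap in place, a runaway allocation should fail inside the process or hit the timeout instead of triggering the kernel's OOM killer.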


I tried this on multiple installations.
I tried this with and without swap.

They all end up being the same.

This is from syslog:

kernel: [ 271.459950] usermng invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
kernel: [ 271.470897] CPU: 2 PID: 45630 Comm: usermng Not tainted 5.10.0-23-cloud-amd64 #1 Debian 5.10.179-3
kernel: [ 272.757602] [ 45630] 0 45630 1675289 1662925 13418496 0 0 usermng
kernel: [ 272.786826] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-3.scope,task=usermng,pid=45630,uid=0
kernel: [ 272.804582] Out of memory: Killed process 45630 (usermng) total-vm:6701156kB, anon-rss:6651700kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:13104kB oom_score_adj:0

I increased the memory of my machines. It didn't work.
I even created a new VM and tested this. It didn't work.
This happens even if Plesk has 1 or 0 users. So it's not like I have 1000000 users either.

Lastly,
I tried this on fresh Debian 11 and Debian 10 installations on Google Cloud, with 64 GB of RAM.
Right after installing Debian, I installed Plesk and ran the command.

Same thing on both instances.
It used all of the available ram without producing anything.


STEPS TO REPRODUCE

Just run:
/opt/psa/admin/bin/usermng --get-users-list

ACTUAL RESULT

System hangs

EXPECTED RESULT

Getting an instant list of users.

ANY ADDITIONAL INFORMATION

This is NOT the Google Cloud image.
This is a fresh Debian 11, Bullseye installation.

I've installed Plesk via:

sh <(curl https://autoinstall.plesk.com/plesk-installer || wget -O - https://autoinstall.plesk.com/plesk-installer)


Code:
 1. [*] Plesk
 2. [*] BIND DNS server
 3. [ ] PostgreSQL server
 4. [ ] Fail2Ban
 5. [*] All language localization for Plesk
 6. [*] Git
 7. [ ] Resource Controller (Cgroups)
 8. [*] Plesk Migrator
 9. [ ] Web Presence Builder
10. [*] MySQL server
11. [.] <+> Webmail services // 1 of 2 components selected
12. [.] <+> Mail hosting // 2 of 6 components selected
13. [.] <+> Web hosting // 8 of 19 components selected
14. [.] <+> Plesk extensions // 8 of 15 components selected


Here are the selected extensions:


Code:
Primary components list / extensions
===============================================================================

Select the components you want to install:
 1. [*] Plesk Web Server Configuration Troubleshooter
 2. [ ] Plesk Firewall
 3. [ ] Watchdog system monitoring
 4. [ ] WP Toolkit
 5. [ ] Advisor
 6. [ ] SEO Toolkit
 7. [ ] ImunifyAV
 8. [*] SSL It!
 9. [*] Let's Encrypt
10. [*] Repair Kit
11. [*] PHP Composer
12. [*] Monitoring
13. [*] Log Browser
14. [ ] SSH Terminal
15. [*] Site Import



Now I'm scared to move my servers until there is a fix for this.
I don't want to wake up to a billion problems because the system tried to fix things and crashed itself.

YOUR EXPECTATIONS FROM PLESK SERVICE TEAM

Help with sorting this out.
 
Hi @Nomadturk, I tested it on Debian 11 with Plesk 18.0.54 #2, but was unable to reproduce the issue. "/opt/psa/admin/bin/usermng --get-users-list" delivers the expected output here.

Is there anything special about a username on your system, e.g. a non-ASCII character in it?
 

And the lines from syslog:


Bash:
# grep -B130 -A10 "usermng" /var/log/syslog

Aug  6 07:46:51 instance-2 kernel: [ 1428.276628] master invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Aug  6 07:46:51 instance-2 kernel: [ 1428.286534] CPU: 7 PID: 41064 Comm: master Not tainted 5.10.0-23-cloud-amd64 #1 Debian 5.10.179-1
Aug  6 07:46:51 instance-2 kernel: [ 1428.295512] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/12/2023
Aug  6 07:46:51 instance-2 kernel: [ 1428.304841] Call Trace:
Aug  6 07:46:51 instance-2 kernel: [ 1428.307395]  dump_stack+0x6b/0x83
Aug  6 07:46:51 instance-2 kernel: [ 1428.310815]  dump_header+0x4a/0x1f4
Aug  6 07:46:51 instance-2 kernel: [ 1428.314413]  oom_kill_process.cold+0xb/0x10
Aug  6 07:46:51 instance-2 kernel: [ 1428.318704]  out_of_memory+0x1bd/0x4e0
Aug  6 07:46:51 instance-2 kernel: [ 1428.322559]  __alloc_pages_slowpath.constprop.0+0xbcc/0xc90
Aug  6 07:46:51 instance-2 kernel: [ 1428.328246]  __alloc_pages_nodemask+0x2de/0x310
Aug  6 07:46:51 instance-2 kernel: [ 1428.332884]  pagecache_get_page+0x175/0x390
Aug  6 07:46:51 instance-2 kernel: [ 1428.337169]  filemap_fault+0x6a2/0x900
Aug  6 07:46:51 instance-2 kernel: [ 1428.341025]  ? xas_load+0x5/0x80
Aug  6 07:46:51 instance-2 kernel: [ 1428.344357]  ext4_filemap_fault+0x2d/0x50
Aug  6 07:46:51 instance-2 kernel: [ 1428.348469]  __do_fault+0x37/0xa0
Aug  6 07:46:51 instance-2 kernel: [ 1428.351888]  handle_mm_fault+0x1254/0x1c60
Aug  6 07:46:51 instance-2 kernel: [ 1428.356108]  ? __hrtimer_init+0xd0/0xd0
Aug  6 07:46:51 instance-2 kernel: [ 1428.360051]  do_user_addr_fault+0x1b8/0x400
Aug  6 07:46:51 instance-2 kernel: [ 1428.364343]  ? switch_fpu_return+0x44/0xc0
Aug  6 07:46:51 instance-2 kernel: [ 1428.368544]  exc_page_fault+0x78/0x160
Aug  6 07:46:51 instance-2 kernel: [ 1428.372397]  ? asm_exc_page_fault+0x8/0x30
Aug  6 07:46:51 instance-2 kernel: [ 1428.376595]  asm_exc_page_fault+0x1e/0x30
Aug  6 07:46:51 instance-2 kernel: [ 1428.380705] RIP: 0033:0x559d27bc6d90
Aug  6 07:46:51 instance-2 kernel: [ 1428.384393] Code: Unable to access opcode bytes at RIP 0x559d27bc6d66.
Aug  6 07:46:51 instance-2 kernel: [ 1428.391028] RSP: 002b:00007ffc771a4f38 EFLAGS: 00010293
Aug  6 07:46:51 instance-2 kernel: [ 1428.396354] RAX: 0000559d281674d0 RBX: 00007ffc771a4f50 RCX: 00007fb29854ed16
Aug  6 07:46:51 instance-2 kernel: [ 1428.403602] RDX: 00007fb2986978c0 RSI: 0000559d28167ac0 RDI: 0000000000000008
Aug  6 07:46:51 instance-2 kernel: [ 1428.410835] RBP: 00007fb2986978c0 R08: 0000000000000000 R09: 000000000000000d
Aug  6 07:46:51 instance-2 kernel: [ 1428.418072] R10: 00000000000003e8 R11: 0000000000000246 R12: 00007fb298697954
Aug  6 07:46:51 instance-2 kernel: [ 1428.425309] R13: 0000559d28167570 R14: 0000559d28167590 R15: 00007fb29868748a
Aug  6 07:46:51 instance-2 kernel: [ 1428.432593] Mem-Info:
Aug  6 07:46:51 instance-2 kernel: [ 1428.434986] active_anon:103 inactive_anon:16260567 isolated_anon:0
Aug  6 07:46:51 instance-2 kernel: [ 1428.434986]  active_file:114 inactive_file:0 isolated_file:0
Aug  6 07:46:51 instance-2 kernel: [ 1428.434986]  unevictable:0 dirty:0 writeback:0
Aug  6 07:46:51 instance-2 kernel: [ 1428.434986]  slab_reclaimable:10083 slab_unreclaimable:16380
Aug  6 07:46:51 instance-2 kernel: [ 1428.434986]  mapped:1127 shmem:1217 pagetables:33033 bounce:0
Aug  6 07:46:51 instance-2 kernel: [ 1428.434986]  free:81030 free_pcp:259 free_cma:0
Aug  6 07:46:51 instance-2 kernel: [ 1428.467258] Node 0 active_anon:412kB inactive_anon:65042268kB active_file:456kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4508kB dirty:0kB writeback:0kB shmem:4868kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 675840kB writeback_tmp:0kB kernel_stack:5840kB all_unreclaimable? yes
Aug  6 07:46:51 instance-2 kernel: [ 1428.495463] Node 0 DMA free:11732kB min:16kB low:28kB high:40kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15920kB managed:15828kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.521933] lowmem_reserve[]: 0 2975 64280 64280 64280
Aug  6 07:46:51 instance-2 kernel: [ 1428.527219] Node 0 DMA32 free:248152kB min:3124kB low:6168kB high:9212kB reserved_highatomic:0KB active_anon:0kB inactive_anon:2802188kB active_file:4kB inactive_file:0kB unevictable:0kB writepending:0kB present:3126072kB managed:3060532kB mlocked:0kB pagetables:5480kB bounce:0kB free_pcp:256kB local_pcp:248kB free_cma:0kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.556019] lowmem_reserve[]: 0 0 61305 61305 61305
Aug  6 07:46:51 instance-2 kernel: [ 1428.561020] Node 0 Normal free:64236kB min:64440kB low:127216kB high:189992kB reserved_highatomic:0KB active_anon:412kB inactive_anon:62240052kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:63963136kB managed:62784532kB mlocked:0kB pagetables:126652kB bounce:0kB free_pcp:780kB local_pcp:484kB free_cma:0kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.590764] lowmem_reserve[]: 0 0 0 0 0
Aug  6 07:46:51 instance-2 kernel: [ 1428.594727] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 2*32kB (U) 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11732kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.608251] Node 0 DMA32: 1*4kB (M) 54*8kB (U) 37*16kB (UM) 37*32kB (UM) 13*64kB (UM) 4*128kB (UM) 2*256kB (UM) 1*512kB (M) 2*1024kB (UM) 2*2048kB (UM) 58*4096kB (M) = 248292kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.624139] Node 0 Normal: 1621*4kB (UME) 936*8kB (UME) 799*16kB (UME) 429*32kB (UME) 177*64kB (UME) 87*128kB (UME) 9*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 65252kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.639842] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.648667] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.657226] 1312 total pagecache pages
Aug  6 07:46:51 instance-2 kernel: [ 1428.661091] 0 pages in swap cache
Aug  6 07:46:51 instance-2 kernel: [ 1428.664542] Swap cache stats: add 0, delete 0, find 0/0
Aug  6 07:46:51 instance-2 kernel: [ 1428.669885] Free swap  = 0kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.672895] Total swap = 0kB
Aug  6 07:46:51 instance-2 kernel: [ 1428.675899] 16776282 pages RAM
Aug  6 07:46:51 instance-2 kernel: [ 1428.679075] 0 pages HighMem/MovableOnly
Aug  6 07:46:51 instance-2 kernel: [ 1428.683026] 311059 pages reserved
Aug  6 07:46:51 instance-2 kernel: [ 1428.686461] Tasks state (memory values in pages):
 
Hi Peter,

There is none. But I am still testing this with new VMs from Google Cloud using their Debian 10 and 11 images. On every instance I install Plesk on, I get the same problem.
I'm running all the commands as root.
The only other user on the system was created by Google and uses the format: user_domain_tld
So, the only special character is the underscore.
 
I tried to reproduce it with a user that contains two underscores in its name, but still was not able to reproduce this.
I have forwarded the case as ID PPS-14796 to developers.
 
I don't think it has anything to do with the username.
I tried adding a user like that to my older, working Debian 10 installation.
The command just worked as-is.

I tried scp'ing the sbin/usermng file but... It didn't change anything either.

Also, I created another VM from Google Cloud Marketplace.
That one uses the dreaded CentOS.

That CentOS installation had the exact same problem though.

If anyone has Google Cloud and can install it there, they should be able to replicate this problem.
 
OK, I have been working on this issue for quite a few hours.
After more digging, I did:

Code:
strace /opt/psa/admin/bin/usermng --get-users-list

And I noticed a loop happening:

Code:
openat(AT_FDCWD, "/etc/oslogin_passwd.cache", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=223, ...}) = 0
read(4, "user_domain_tld:*:667684058:667"..., 4096) = 223
close(4)                                = 0
openat(AT_FDCWD, "/etc/oslogin_passwd.cache", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=223, ...}) = 0
read(4, "user_domain_tld:*:667684058:667"..., 4096) = 223
openat(AT_FDCWD, "/etc/group", O_RDONLY|O_CLOEXEC) = 5
lseek(5, 0, SEEK_CUR)                   = 0
fstat(5, {st_mode=S_IFREG|0644, st_size=1220, ...}) = 0
read(5, "root:x:0:\ndaemon:x:1:\nbin:x:2:\ns"..., 4096) = 1220
lseek(5, 0, SEEK_CUR)                   = 1220
read(5, "", 4096)                       = 0
close(5)                                = 0
close(4)                                = 0
openat(AT_FDCWD, "/etc/oslogin_passwd.cache", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=223, ...}) = 0
read(4, "user_domain_tld:*:667684058:667"..., 4096) = 223
close(4)


This kept going and going till I stopped the script.
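
A loop like this is easier to confirm by writing the trace to a file for a few seconds and counting how often the same path is opened. The sketch below assumes the trace was captured with strace's `-o` option; `count_opens` is a hypothetical helper name and the output path in the usage comment is arbitrary.

```shell
#!/bin/sh
# count_opens TRACEFILE PATH
# (hypothetical helper)
# Prints how many openat() calls in an strace log refer to PATH.
count_opens() {
    # -F treats the pattern as a literal string, -c prints the match count.
    grep -c -F -- "openat(AT_FDCWD, \"$2\"" "$1"
}

# Example: capture about 10 seconds of the run, then count the opens.
#   timeout 10 strace -o /tmp/usermng.strace \
#       /opt/psa/admin/bin/usermng --get-users-list
#   count_opens /tmp/usermng.strace /etc/oslogin_passwd.cache
```

A count that keeps growing with the capture time, for a file that holds only a handful of entries, points to the same file being re-read over and over.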
This oslogin_passwd.cache file is part of Google's templates.


And it includes values such as:

Code:
user_domain_tld:*:667684058:667684058::/home/user_domain_tld:/bin/bash
otheruser_domain2_tld:*:975067857:975067857::/home/otheruser_domain2_tld:/bin/bash
ext_someone_gmail_com:*:3874395100:3874395100::/home/ext_someone_gmail_com:/bin/bash

These are all the users that exist in the Google Cloud project and have access to the virtual machines there.
Maybe that asterisk is throwing usermng off, no idea. But as a test, I renamed /etc/oslogin_passwd.cache and usermng instantly started working as expected.

But when I restarted the machine, Google re-created this file, meaning the issue persists.

That oslogin file belongs to
GitHub - GoogleCloudPlatform/guest-oslogin: OS Login Guest Environment for Google Compute Engine

and is referenced in /etc/nsswitch.conf as

Code:
passwd:         files cache_oslogin oslogin
group:          files cache_oslogin oslogin

To test, I commented it out and rebooted the server.
Google did not try to uncomment those two lines, and even though the /etc/oslogin_passwd.cache file exists, the usermng command ran successfully again.

The second I change

/etc/nsswitch.conf From:
Code:
passwd:         files oslogin
group:          files oslogin

/etc/nsswitch.conf To
Code:
passwd:         files cache_oslogin oslogin
group:          files cache_oslogin oslogin

The issue comes back again.
So, the system is trying to use cache_oslogin here and usermng goes into an endless loop due to that.
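
For anyone who wants to check quickly whether a machine is in the affected configuration, a small test like the following may help. It is a sketch only: `uses_cache_oslogin` is a hypothetical helper name, and it looks only at active (uncommented) `passwd:` and `group:` lines.

```shell
#!/bin/sh
# uses_cache_oslogin [FILE]
# (hypothetical helper)
# Succeeds (exit status 0) if an active passwd: or group: line in the
# given nsswitch file lists the cache_oslogin source.
uses_cache_oslogin() {
    uco_conf=${1:-/etc/nsswitch.conf}
    # Lines starting with '#' do not match the first pattern,
    # so commented-out entries are ignored.
    grep -E '^[[:space:]]*(passwd|group):' "$uco_conf" \
        | grep -qw 'cache_oslogin'
}

# Example:
#   if uses_cache_oslogin; then
#       echo "cache_oslogin is active in /etc/nsswitch.conf"
#   fi
```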

As a test, I tried adding the metadata below to that Google Cloud Compute Engine instance:

Code:
enable-oslogin = FALSE

After that, Google updated /etc/nsswitch.conf to be:

Code:
passwd:         files
group:          files

And usermng works as expected.
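
The thread does not say how the metadata was set. One way, assuming the gcloud CLI is installed and authenticated, is `gcloud compute instances add-metadata`. In the sketch below, `set_oslogin` is a hypothetical wrapper, the instance name and zone in the usage comment are placeholders, and the `GCLOUD` variable is an assumption added only so the command can be previewed without being executed.

```shell
#!/bin/sh
# set_oslogin INSTANCE ZONE VALUE
# (hypothetical wrapper)
# Sets the enable-oslogin metadata key on one Compute Engine instance.
# GCLOUD is an illustrative override: GCLOUD=echo prints the arguments
# that would be passed to gcloud instead of running it.
set_oslogin() {
    so_instance=$1
    so_zone=$2
    so_value=$3
    "${GCLOUD:-gcloud}" compute instances add-metadata "$so_instance" \
        --zone "$so_zone" --metadata "enable-oslogin=$so_value"
}

# Example (placeholder instance name and zone):
#   GCLOUD=echo set_oslogin instance-2 us-central1-a FALSE   # preview only
#   set_oslogin instance-2 us-central1-a FALSE               # apply
```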

So, until there is a fix for this, I will keep the OS Login option disabled, even though enabling it is part of GCP best practices.
But these findings should be helpful to the devs as well as someone who might run into this issue like me.
 
A qualified engineer tried to reproduce the issue in a genuine Google Cloud Debian 11 environment, but could not. Before that, I also failed to reproduce it in other test environments. We have also not yet received reports from other users on this.

So probably this issue is specific to your individual environment. Your best chance to find out why it is happening in your Google account is to create a support ticket with Plesk support and allow SSH access to the server. Please provide PPS-14796 as a reference for support staff so that they can continue to work on that same case.

To sign-in to support please go to https://support.plesk.com

If you experience login issues, please see this KB article:
https://support.plesk.com/hc/en-us/...rt-plesk-com-and-password-reset-does-not-work

If you bought your license from a reseller, your reseller should provide support for you. If the reseller does not provide support, here is an alternative:
https://support.plesk.com/hc/en-us/articles/12388090147095-How-to-get-support-directly-from-Plesk-
 
enable-oslogin = TRUE

This is the key here.
The engineer should test after enforcing the best practices, that is, with this metadata set.
That metadata is not enabled in a vanilla environment.

If you have that metadata enabled project-wide, you can easily replicate it.
 
According to the test engineer, this is exactly what was tested. If you wish to go further, please contact support as asked for above.
 
@Nomadturk, thank you for helping to identify the cause. I see that the ticket was processed. The investigation has shown that the issue is caused by a bug in oslogin, not in Plesk, and that it needs to be fixed by Google in the oslogin algorithm. Plesk has provided the operating-system-level code to you so that the issue can be reproduced independently of Plesk in the Google Cloud environment. In August 2022, Google blamed third-party software for that bug (OSlogin makes chef (ruby) crash · Issue #87 · GoogleCloudPlatform/guest-oslogin), but the provided code sample proves that the bug exists independently of third-party software. It cannot be fixed by third parties.

Please contact Google Cloud support and present the findings including the code snippet by which the error can be reproduced on the operating system level of their environments so that they can provide a fix for their cloud environment.
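
The code sample that Plesk provided is not included in this thread. As a rough first check that involves no Plesk binaries, the passwd and group databases can be enumerated through NSS with `getent`, guarded by a timeout and a line cap. This is a sketch under assumptions: `enum_capped` is a hypothetical helper name, and it is not established here that a plain enumeration triggers the loop, since the strace output earlier in the thread shows /etc/group being read while the passwd cache file is still open.

```shell
#!/bin/sh
# enum_capped DATABASE [SECONDS] [MAX_LINES]
# (hypothetical helper)
# Enumerates an NSS database (e.g. passwd or group) with getent and
# prints the number of entries returned, stopping after SECONDS or
# MAX_LINES, whichever comes first.
enum_capped() {
    ec_db=$1
    ec_secs=${2:-10}
    ec_max=${3:-100000}
    # head closes the pipe at MAX_LINES, which also stops getent.
    timeout "$ec_secs" getent "$ec_db" | head -n "$ec_max" | wc -l | tr -d ' '
}

# Example:
#   enum_capped passwd 10 100000
#   enum_capped group 10 100000
```

A result equal to the line cap, or a run that only ends when the timeout fires, would suggest that the enumeration does not terminate on its own.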
 