
Forwarded to devs Let's Encrypt still causing issues with web server configuration on updating certificates

Bitpalast

TITLE:
Let's Encrypt still causing issues with web server configuration on updating certificates
PRODUCT, VERSION, OPERATING SYSTEM, ARCHITECTURE:
Onyx 17.0, CentOS 7.2, Let's Encrypt 2.0.3 release 31, 64-Bit
PROBLEM DESCRIPTION:
SSL renewal breaks web server configuration. It still does :-(
STEPS TO REPRODUCE:
Have many domains due for SSL certificate renewal at the same time, then run the Let's Encrypt renewal routine.
ACTUAL RESULT:
The web server configuration breaks, requiring manual reconfiguration of a small number of domains and a manual restart of httpd and nginx.
EXPECTED RESULT:
Certificates renew silently, without breaking certificate file names or the Plesk web server configuration.
ANY ADDITIONAL INFORMATION:
The daily renewal task renewed many certificates; the run took approximately 25 minutes. During the run, this error message was mailed to the admin about a dozen times:

"Unable to generate the web server configuration file on the host <HOSTNAME.TLD> because of the following errors:

Template_Exception: nginx: [emerg] BIO_new_file("/usr/local/psa/var/certificates/cert-7CIkBh") failed (SSL: error:02001002:system library:fopen:No such file or directory:fopen('/usr/local/psa/var/certificates/cert-7CIkBh','r') error:2006D080:BIO routines:BIO_new_file:no such file)

nginx: configuration file /etc/nginx/nginx.conf test failed

file: /usr/local/psa/admin/plib/Template/Writer/Webserver/Abstract.php
line: 75
code: 0

Please resolve the errors in web server configuration templates and generate the file again."

and similar from Nginx:

"Template_Exception: nginx: [emerg] BIO_new_file("/usr/local/psa/var/certificates/cert-7CIkBh") failed (SSL: error:02001002:system library:fopen:No such file or directory:fopen('/usr/local/psa/var/certificates/cert-7CIkBh','r') error:2006D080:BIO routines:BIO_new_file:no such file)

nginx: configuration file /etc/nginx/nginx.conf test failed"

httpd configuration tests ("apachectl -t") run during the renewal process list configuration errors. During the last few minutes of the script run, nginx configuration tests ("nginx -t") also show errors. After the renewal script has finished, these configuration errors no longer appear, but in one case a domain configuration was marked as "damaged" in the Plesk GUI although it was not. This was the first domain for which the error mail arrived. The only way to remove the error marker and message from the GUI is to run the Troubleshooter extension and reconfigure all configuration files flagged as erroneous.

But after doing that, we found that Nginx returned a 502 Bad Gateway error, although httpd was "active" according to the service status output. Only a manual restart of httpd and nginx afterwards resolved that issue.

None of this behavior is new; it has been the case for many months with different versions of the Let's Encrypt extension on different hosts that we operate. I know that this will be hard to figure out, because the test scenario is difficult to reproduce. I can only say that we've seen the wrong certificate links before. I read that with 2.0.3 the algorithm was changed to use symbolic links instead of real files for the certificates, so that this could no longer happen, but it does not seem to work yet.
YOUR EXPECTATIONS FROM PLESK SERVICE TEAM:
Confirm bug
 
Peter,

Developers can't reproduce this issue on their test servers.
So they need access to your affected server to continue this investigation.
 
The last time it occurred was on May 7 on one host. If you tell me when the many certificates that were renewed on May 7 will next be renewed, I could enable debug mode beforehand and give the developers access afterwards. Or you can have access right away. Or we can stop the cron job for a few days around the next renewal date and let the developers run it "live". It's a production system with approx. 850 domains on it, so we'll need to find a way to keep services uninterrupted while the issue is being observed. PM for further information?
 
I've contacted Aleksey.

Today, after we sent out a newsletter announcing that webmail subdomains can now be secured by Let's Encrypt, several customers updated their certs around the same time, and again this caused missing certificates (or wrong file names). The issue seems to occur only when several certificates are updated at the same time, e.g. when one update process has not finished while another starts. I think this is caused by the extension not observing the web server restart interval. The extension reloads/restarts right away, but it should actually wait until the interval expires. Further, the web server(s) should never be reloaded while the extension is creating or updating a certificate. Currently, a restart/reload is performed when the interval expires, regardless of an ongoing Let's Encrypt renewal process. This seems to be causing the outages. It only occurs on systems with many accounts or many renewals.
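
To picture the "never reload while a certificate is being written" rule, here is a minimal Python sketch using an advisory file lock that both the renewal step and the reload step would have to take. This is only an illustration of the idea, not how the extension or Plesk actually work; the lock path and the function name are made up.

```python
import fcntl
import os
from contextlib import contextmanager

# Hypothetical lock path, chosen for illustration only.
LOCK_PATH = "/tmp/certificate-renewal.lock"


@contextmanager
def exclusive_lock(path=LOCK_PATH, blocking=True):
    """Hold an exclusive advisory lock on `path` while the block runs."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # With LOCK_NB this raises BlockingIOError if someone else holds the lock.
        fcntl.flock(fd, flags)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

A renewal would wrap "write certificate files, then update the configuration" in `with exclusive_lock():`, and every web server reload would take the same lock first, so a reload requested for customer B simply waits until customer A's certificate is on disk.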
 
Same thing again today, different domain, same transaction. User has added "webmail" checkbox to his cert and clicked "renew" button. Resulting in:

Code:
   Redirecting to /bin/systemctl status  httpd.service
   ● httpd.service - The Apache HTTP Server
      Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
     Drop-In: /usr/lib/systemd/system/httpd.service.d
              └─limit_nofile.conf
      Active: failed (Result: exit-code) since Wed 2017-05-24 09:23:34 CEST; 2min 50s ago
        Docs: man:httpd(8)
              man:apachectl(8)
     Process: 17913 ExecStop=/bin/kill -WINCH ${MAINPID} (code=exited, status=1/FAILURE)
     Process: 4174 ExecReload=/usr/sbin/httpd $OPTIONS -k graceful (code=exited, status=0/SUCCESS)
     Process: 17910 ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND (code=exited, status=1/FAILURE)
    Main PID: 17910 (code=exited, status=1/FAILURE)

   May 24 09:23:33 trebel systemd[1]: Starting The Apache HTTP Server...
   May 24 09:23:34 trebel httpd[17910]: AH00526: Syntax error on line 49 of /etc/httpd/conf/plesk.conf.d/vhosts/DOMAIN.TLD.conf:
   May 24 09:23:34 trebel httpd[17910]: SSLCertificateFile: file '/usr/local/psa/var/certificates/cert-la2pAT' does not exist or is empty
   May 24 09:23:34 trebel systemd[1]: httpd.service: main process exited, code=exited, status=1/FAILURE
   May 24 09:23:34 trebel kill[17913]: kill: cannot find process ""
   May 24 09:23:34 trebel systemd[1]: httpd.service: control process exited, code=exited status=1
   May 24 09:23:34 trebel systemd[1]: Failed to start The Apache HTTP Server.
   May 24 09:23:34 trebel systemd[1]: Unit httpd.service entered failed state.
   May 24 09:23:34 trebel systemd[1]: httpd.service failed.

Three minutes later a manual restart was no longer a problem; the server restarted and was active again.
 
I have more details on it now, because we were able to reproduce the issue in one case: when a certificate is being renewed, the Apache web server is restarted before the certificate file is in place. The httpd service goes into "deactivating" status while the cert file is still missing. Running "nginx -t" at that point shows that Nginx detects a configuration problem. If Nginx were started at that very moment, it would fail because of the missing certificate file. Then things move very fast: shortly before the "deactivating" phase of the httpd restart completes, the certificate file becomes available at the expected path. Nginx can then be restarted, too. The "deactivating" phase of an Apache restart is followed by an "activating" phase, and the certificate becomes available "just in time" for it, so in the majority of cases the restart works.

So this is definitely a timing issue. A web server reload or restart is executed before the certificate file has been written to the location given in the web server configuration files. If the extension simply held off writing the certificate file name into the web server configuration, and held off restarting the web server, until the actual certificate file is available in the certificate directory, things would be all right.
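
As a rough illustration of the kind of check meant here (not the extension's actual code; the function names, timeout and poll interval are made up), a Python sketch that waits until the certificate file exists and is non-empty and only then triggers the reload:

```python
import os
import time


def wait_for_nonempty_file(path, timeout=60.0, poll_interval=0.5):
    """Return True once `path` exists and is non-empty, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # getsize follows symlinks, so a dangling link counts as missing.
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            pass  # file (or link target) is not there yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def reload_when_ready(cert_path, reload, timeout=60.0):
    """Call `reload()` only after the certificate file is on disk."""
    if not wait_for_nonempty_file(cert_path, timeout=timeout):
        return False  # give up without touching the web server
    reload()
    return True
```

Here `reload` stands for whatever actually reloads httpd/nginx. The point is only the ordering: the file check comes first, and on timeout the web server is left alone instead of being restarted into a broken configuration.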

Further, Plesk itself must be modified: the web server restart interval must be extended if a "letsencrypt" entry is present in the process list. A web server restart while a certificate is being renewed (e.g. a restart caused by a configuration change from customer B while the certificate of customer A is still being renewed) will break the web server service, because the restart triggered by customer B cannot complete while the certificate for customer A is still missing. So to really fix this issue, both the extension and the Plesk web server restart behavior must be modified.
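
A sketch of the Plesk-side part, in Python and under the assumption that a running renewal shows up in the process list with "letsencrypt" somewhere in its command line. The function names and the `postpone` value are hypothetical; this is not how Plesk schedules restarts today.

```python
import subprocess


def current_process_lines():
    """Command lines of all running processes (`args=` suppresses the header)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def renewal_in_progress(process_lines, marker="letsencrypt"):
    """True if any process command line contains the marker string."""
    return any(marker in line for line in process_lines)


def restart_delay(interval, process_lines, postpone=30):
    """Seconds to wait before the next restart.

    Returns the normal interval, extended by `postpone` seconds while a
    renewal is still running.
    """
    if renewal_in_progress(process_lines):
        return interval + postpone
    return interval
```

The scheduler would call `restart_delay(interval, current_process_lines())` each time the interval is about to expire and keep postponing as long as a renewal is still visible.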
 
Last time I checked, the extension was still on 2.2.2; I now see that the update to 2.3 came out on August 31st. It looks like at least two issues addressed in it are linked to this case:
  • [-] Under certain circumstances, if the web server restarted during the process of renewing a certificate, it could not access the certificate file, which resulted in failure to restart. (EXTLETSENC-213)
  • [-] The symbolic links to issued certificates were created with Unix-style path separator, which resulted in them being unreadable. (EXTLETSENC-235)
The first (EXTLETSENC-213) is the major one; we have observed dozens of such incidents in the past. If that is fixed, the case is most likely solved. I'll continue to monitor this and report here should the problem surface again.
 
Not resolved. This morning again:
"Unable to generate the web server configuration file on the host <ohre.bitpalast.net> because of the following errors:

Template_Exception: AH00526: Syntax error on line 80 of /etc/httpd/conf/plesk.conf.d/webmails/<domain>_webmail.conf:
SSLCACertificateFile: file '/usr/local/psa/var/certificates/cert-yR9a0p' does not exist or is empty

file: /usr/local/psa/admin/plib/Template/Writer/Webserver/Abstract.php
line: 75
code: 0"

A test a minute later showed no error, the certificate exists. The timing issue with certificate placements on disk obviously still exists. The web server is restarted (or reloaded) before it is verified that the certificate is really physically accessible on disk. I did discuss it with Aleksey, but it is obviously still the same problem.
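
For anyone who wants to check this on their own host, here is a small Python sketch that scans web server configuration text for certificate directives and lists the files that are missing or empty. SSLCertificateFile and SSLCACertificateFile are taken from the error messages in this thread; ssl_certificate and ssl_certificate_key are the standard nginx directives and are only assumed to be the ones behind the nginx errors. The function name is made up.

```python
import os
import re

# Apache directives from the errors above plus the usual nginx ones (assumed).
_CERT_DIRECTIVE = re.compile(
    r"^\s*(SSLCertificateFile|SSLCACertificateFile|"
    r"ssl_certificate|ssl_certificate_key)\s+[\"']?([^\s\"';]+)",
    re.MULTILINE,
)


def missing_certificate_files(config_text):
    """Return (directive, path) pairs whose file is missing or empty."""
    missing = []
    for directive, path in _CERT_DIRECTIVE.findall(config_text):
        try:
            ok = os.path.getsize(path) > 0  # follows symlinks
        except OSError:
            ok = False  # missing file or dangling symlink
        if not ok:
            missing.append((directive, path))
    return missing
```

Running this over the generated vhost files right before a reload, and skipping the reload while the list is non-empty, would be one way to verify that the certificate is physically accessible on disk first.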
 