Resolved: Help troubleshooting Nginx + PHP-FPM setup

Discussion in 'Plesk 12.x for Linux' started by Dre, Apr 14, 2017.

  1. Dre

    Dre New Pleskian

    0
    70%
    Joined:
    Apr 14, 2017
    Messages:
    6
    Likes Received:
    0
    Location:
    San Jose
    I have researched this issue many times and now need help understanding some diagnostics. We switched a server from Apache with regular PHP to Nginx with PHP-FPM. The problem is that after a small increase in users, Nginx starts returning 502 and 503 errors. The main error we see in the log is "Connection reset by peer". I ran iostat and top (sorted by CPU usage) and need help understanding whether the results are normal.
    We stress tested with Apache JMeter, and as you can see in the top screenshot, many child processes get created, each using about 30-40% CPU. I can't imagine that being good. Also, when running iostat, the avg-cpu %user column jumps into the 90% range. I can't imagine that being good either. The stress test was 100 users in 60 seconds, which is the kind of activity this site gets.
    I suspect a PHP script is causing this problem, most likely in WordPress. Does anyone have suggestions on where to begin debugging? So far we have tried increasing the buffer size and increasing the timeouts in PHP and Nginx, but nothing has worked.
    [Screenshot: iostat output]
    [Screenshot: top output]
    php-fpm pool settings attached in text file
     


  2. Peter Debik

    Peter Debik Golden Pleskian Plesk Guru

    34
    30%
    Joined:
    Oct 15, 2015
    Messages:
    1,425
    Likes Received:
    233
    Location:
    Berlin, Germany
    Check out the video "PHP-FPM: Process Management". It gives some good insight into the PHP-FPM settings.

    Personally, I think that pm.max_requests = 50 is too low, because it forces each child to be respawned after it has processed 50 requests. That is unnecessary extra load. Why not set max_requests to 1000, for example?
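
To put a rough number on the respawn overhead: a child is recycled each time it has served pm.max_requests requests, so the number of respawns scales roughly as total requests divided by that setting. A minimal sketch follows; the 2,000-request load is an assumed example figure, not a measurement from this server, and the function name is made up for illustration.

```python
def respawns(total_requests: int, max_requests: int) -> int:
    """Rough estimate of child recycles needed to serve total_requests.

    Assumes that, in aggregate, one child is recycled for every
    max_requests requests served across the pool.
    """
    if max_requests <= 0:
        return 0  # pm.max_requests = 0 disables recycling in PHP-FPM
    return total_requests // max_requests


# Assumed example load of 2,000 requests (hypothetical figure).
for setting in (50, 1000):
    print(f"pm.max_requests = {setting}: about {respawns(2000, setting)} respawns")
```

The estimate only counts process recycling; how expensive each respawn is depends on the PHP configuration and loaded extensions.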

    I am unsure whether the numbers in your screenshot are bad or not. The total load is only around 2. How many CPU cores does the machine have? And what exactly is the test doing? You mention WordPress. Is the test actually running WordPress scripts? WordPress can have extreme overhead and is "slow by nature" compared to websites that were optimized for speed. My advice is to first measure the performance of the WordPress website in question. If the code is bad, it creates long-running script instances, which leads to intensive PHP load (and therefore CPU load).
     
  3. Dre

    I made that adjustment for another issue. What do you mean by "the total load is around 2"? There are 4 cores. The test just views 4 pages on the site and ramps up 500 sessions in 90 seconds. The website has many customized plugins that use the WordPress API. Do you have any suggestions for finding long-running scripts?
     
  4. Peter Debik

    Full load would be 4 if there are 4 cores involved. Please correct me if I am wrong, but I think the 2.05 CPU load means that the server is at roughly 50% load during the test. A 4-core server might simply be too weak to handle 500 sessions in 90 seconds (>5/second). It works, but it will certainly become slower. The system will probably behave correctly until the load reaches about 70% of what the CPU can handle. Testing 4 pages per session is realistic, because at least half of all users normally leave a site after page 2. To sum up, it all looks good to me. There is a high load, yes, but it is probably not too high. It is realistic for the scenario you are describing, taking a 4-core CPU into account.
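
The arithmetic behind those two figures can be sketched as follows. This treats the load average as a rough proxy for CPU utilization, which is a simplification (on Linux the load average also counts processes waiting on I/O). The function names are made up for illustration.

```python
import os


def load_fraction(load_average: float, cores: int) -> float:
    """Load average expressed as a fraction of the available cores."""
    return load_average / cores


def sessions_per_second(sessions: int, ramp_seconds: float) -> float:
    """Average rate at which new sessions start during the ramp-up."""
    return sessions / ramp_seconds


# Figures quoted in the thread: load 2.05, 4 cores, 500 sessions in 90 s.
print(f"{load_fraction(2.05, 4):.0%} of CPU capacity")
print(f"{sessions_per_second(500, 90):.2f} sessions/s")

# On a live Unix-like host the same ratio can be read directly.
try:
    one_minute, _, _ = os.getloadavg()
    print(f"current: {load_fraction(one_minute, os.cpu_count() or 1):.0%}")
except (AttributeError, OSError):
    pass  # load average is not available on this platform
```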

    To find slow functions or plugins, you need to do microtime measurements in the code, which is something one normally would not do in third-party plugins. Many plugins are badly programmed and have numerous speed-optimization issues.
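
The measurement idea itself is simple: record a timestamp before and after the suspect section and compare. In PHP this would be done with microtime(true). The sketch below shows the same pattern in Python purely as an illustration; `timed` and `slow_step` are hypothetical names, and `slow_step` merely stands in for a plugin function under suspicion.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def slow_step():
    # Hypothetical stand-in for a plugin function under suspicion.
    time.sleep(0.05)


timings = {}
with timed("slow_step", timings):
    slow_step()

# List the measured sections, slowest first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Wrapping a handful of suspect sections this way and sorting by duration usually narrows the search faster than reading the plugin code.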

    Let us hear other users' advice on this. My opinion is that there is actually no problem.
     
    Last edited: Apr 16, 2017
  5. Dre

    Thank you, Peter, your input is very helpful. The reason I set the stress test to these values is to try to get the server to break, essentially reproducing the three errors listed below:

    499: In Nginx, HTTP 499 means that the client closed the connection before the server answered the request. In my experience it is usually caused by a client-side timeout. As far as I know, it is an Nginx-specific status code.
    502: To quote the definition of 502 Bad Gateway from Wikipedia: "The server was acting as a gateway or proxy and received an invalid response from the upstream server."
    503: The web server (running the website) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay.

    These errors are essentially the issues we are trying to fix; they occur when visitors come to the site and traffic increases.

    Below is a link to a few hundred pages of access log, which shows a lot of bot activity. This could be a problem, but I am not sure. Based on the descriptions of these errors, I believe that PHP scripts have to be the cause. Do you know if there is any way to prevent 502 or 503 errors from happening, no matter how slow the server is? We are trying to avoid having Nginx go to the "CATCH 50x" error page when one of these occurs, since it is a negative UX.

    http://andraebrowne.com/web/accesslogsample.html
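
A quick way to quantify both the error codes and the bot share in such a log is to tally status codes and user agents. The sketch below assumes the log uses nginx's default "combined" format (the format on a Plesk server may differ). The bot detection is a crude substring heuristic, and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the tail of nginx's default "combined" log format:
# "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
BOT_HINTS = ("bot", "crawl", "spider")  # assumed heuristic, not exhaustive


def summarize(lines):
    """Count responses per status code and how many requests look like bots."""
    statuses, bots = Counter(), 0
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        statuses[match["status"]] += 1
        if any(hint in match["agent"].lower() for hint in BOT_HINTS):
            bots += 1
    return statuses, bots


# Made-up sample lines for illustration only.
sample = [
    '203.0.113.5 - - [14/Apr/2017:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [14/Apr/2017:10:00:01 +0000] "GET /shop HTTP/1.1" 502 166 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [14/Apr/2017:10:00:02 +0000] "GET /cart HTTP/1.1" 499 0 "-" "Mozilla/5.0"',
]
statuses, bots = summarize(sample)
print(dict(statuses), "bot requests:", bots)
```

To run it against a real file, pass an open file object to `summarize` instead of the sample list.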
     
  6. Peter Debik

    499: The user clicked on to another website before the server was able to return the result. This can be, but does not have to be, due to a long wait for the response. It is a client-side issue that cannot be fixed by the server. If it is caused by long response delays, the server is simply too slow to deliver the website in time. The user gets impatient and moves on to another website.

    502: That's a frequent issue with Nginx. The problem is that Nginx is a frontend that has forwarded a request to Apache (the backend) and is now waiting for the response, but Apache is unreachable. This happens either when the httpd service is not running or when the maximum number of allowed httpd instances has been reached. The latter is probably the case on your system, because there are simply too many requests. If all of the Apache instances your system allows are busy delivering pages, the next request cannot be processed, so Apache becomes unresponsive for request n+1. One way to solve this is to increase the maximum number of httpd instances allowed. MaxClients and related settings are normally "enough", so if you encounter this issue, the best approach for a lasting solution is to speed up the PHP scripts. If they execute faster, httpd children are freed sooner and can handle more requests.

    503: This is normally caused by a PHP script that sends no content as a response, a PHP script that does not answer at all, or a script that does not answer within the maximum script runtime set in the PHP configuration. First, you should increase the maximum runtime of PHP scripts in your Plesk configuration. Then you should check the website's error_log for errors that kill scripts before they send any output. And lastly, again, the script may simply be too slow.

    All of these point towards either beefing up the hardware, so that more requests can be answered easily because the CPU and hard disks are faster, or optimizing the website scripts for speed. It is normally possible to double the response speed of a website by optimizing its plugins, because many programmers waste resources, place unnecessary commands inside loops, run too many badly formulated SQL statements, and sometimes even place those inside loops. However, in a WordPress website it might not be a good idea to change plugin code, because the changes will be overwritten by the next auto-update.
     
  7. Dre

    Yes, it would be very labor-intensive to change the plugin code, since it would require changing everything through actions and hooks, assuming the plugin was written correctly.

    One thing you mentioned is very interesting: we are currently not running Nginx in combination with Apache. We are running Nginx as a standalone server, which I assume should not be a problem. Were you assuming that we were using Nginx as a reverse proxy, or is Apache necessary for every Nginx setup? I don't think it is. Since Nginx is currently standalone and we are running at most 8 Nginx processes, could this be the reason for the 502 errors? When traffic picks up, the 8 Nginx worker processes get created first, and then, if 500 PHP-FPM child processes are allowed, those get created next. These are my observations from the stress tests. Could more Nginx child processes be needed to serve the requests? If the settings can't fix the issues, I will have to recommend a hardware upgrade to my clients, but it is possible that 503 errors are also happening during lower-traffic periods. Given that, I would assume a PHP script is causing the issue, but I am also worried that an Nginx setting might be misconfigured. I am leaning towards a PHP issue.
     
  8. Peter Debik

    In that case, PHP-FPM could also be causing the issue. There might not be enough children available to handle all requests, or respawning children could take too long. As always, endless possibilities ...
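
Whether there are "enough children" can be estimated with Little's law: the average number of busy workers equals the request arrival rate multiplied by the average time each request takes. In the sketch below, the 500 sessions, 4 pages and 90 seconds come from the thread; the response times are assumed values, and the calculation assumes the page views are spread evenly over the ramp-up window. The result is what a pool limit such as pm.max_children would need to cover.

```python
import math


def workers_needed(requests_per_second: float, seconds_per_request: float) -> int:
    """Little's law: average concurrency = arrival rate x time in system."""
    return math.ceil(requests_per_second * seconds_per_request)


# 500 sessions x 4 pages over 90 s comes from the thread; the response
# times below are assumed values for illustration.
rate = 500 * 4 / 90
for response_time in (0.2, 1.0, 3.0):
    needed = workers_needed(rate, response_time)
    print(f"{response_time:.1f} s/request -> {needed} busy workers")
```

The same relation shows why speeding up scripts helps: halving the response time halves the number of children that are busy at any moment.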
     
  9. Dre

    Thanks for the direction and advice
     
  10. Peter Debik

    I forgot to mention MySQL. Check your MySQL logs, because it is possible that the database is actually what cannot handle so many parallel connections. The number of connections that can be made to a database is normally limited. If that limit has already been used up by earlier requests, additional script runs might not be able to access the database, and will either wait on it or simply fail. That, too, will lead to a 502 or 503 error.
     
  11. Dre

    Thanks, Peter. I checked the MySQL logs, specifically the slow query log, and it looks like one plugin is continuously trying to execute a query that normally takes hundreds of seconds, for example 600, 700, 600. The plugin is also issuing this query every 20 seconds. I have no idea why the query takes so long to execute, but there is a good chance this is the source of the problem.
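
A rough calculation shows why this pattern is so damaging. If a query that runs for hundreds of seconds is started again every 20 seconds, the runs pile up. The sketch below assumes each invocation starts regardless of whether the previous ones have finished, which has not been confirmed for this plugin; the function name is made up for illustration.

```python
import math


def overlapping_runs(duration_seconds: float, interval_seconds: float) -> int:
    """Copies of a periodic query running at once in steady state."""
    return math.ceil(duration_seconds / interval_seconds)


# Durations as reported from the slow query log; the 20 s interval is the
# observed rate at which the plugin issues the query.
for duration in (600, 700):
    runs = overlapping_runs(duration, 20)
    print(f"{duration} s query every 20 s -> {runs} concurrent runs")
```

Each of those concurrent runs would hold a database connection and, most likely, a PHP-FPM child for its whole duration.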
     