We are urgently hiring an experienced System Administrator to help us troubleshoot a serious issue with one of our main web servers, which has recently been failing due to heavy load.
This is a cloud server hosted at DigitalOcean with 8 CPUs, a 160GB SSD and 16GB of RAM, running nginx 1.4.2 with PHP 5.5.3 (served via php-fpm) and MySQL 5.6.
Although traffic has increased drastically over the past week, the CPU stays over 95% idle and more than 1GB of RAM is free at any given time. However, php-fpm stops responding at random intervals during peak hours, and a manual 'service php-fpm restart' is needed to bring it back online.
We are having a hard time determining whether the bottleneck is MySQL, php-fpm or nginx.
When the issue occurs, the following errors are recorded in the error logs:
WARNING: [pool www] seems busy
PHP Warning: Error while sending QUERY packet. PID=22771 in ********
[error] 9240#0: *581395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: *******, server: *******, request: "GET /l.php?id=124 HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "******", referrer: "****"
[error] 9242#0: *582048 connect() failed (111: Connection refused) while connecting to upstream, client: *******, server: ******, request: "GET /l.php?id=50 HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "*******", referrer: "*******"
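As a starting point for triage, the four messages above could be tallied by the layer that emitted them. The sketch below is only an illustration: the labels and the function names `classify` and `summarize` are our own, and the patterns match just the message shapes quoted in this posting. On Linux, errno 104 is ECONNRESET and errno 111 is ECONNREFUSED.

```python
import re
from collections import Counter

# Hypothetical labels; patterns cover only the message shapes quoted above.
PATTERNS = [
    ("php-fpm: pool busy", re.compile(r"\[pool [^\]]+\] seems busy")),
    ("php: MySQL query send failed", re.compile(r"Error while sending QUERY packet")),
    ("nginx: upstream reset (104)", re.compile(r"recv\(\) failed \(104:")),
    ("nginx: upstream refused (111)", re.compile(r"connect\(\) failed \(111:")),
]


def classify(line):
    """Return the label of the first matching pattern, or None if nothing matches."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall under each label; unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "WARNING: [pool www] seems busy",
        "PHP Warning: Error while sending QUERY packet. PID=22771",
        "[error] 9240#0: *581395 recv() failed (104: Connection reset by peer)",
        "[error] 9242#0: *582048 connect() failed (111: Connection refused)",
    ]
    for label, count in summarize(sample).items():
        print(label, count)
```

Comparing the timestamps of the first message in each category during an outage may help show which layer degrades first.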
Our observations show that at some point MySQL stops responding to requests sent by PHP-FPM. The server then spawns more and more PHP-FPM children until the pm.max_children limit is reached and PHP-FPM stops responding. This happens within seconds, and the entire system goes down.
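A rough back-of-the-envelope sketch shows why the pool can fill within seconds once queries stall. It uses Little's law (busy workers = request rate x time each request holds a worker). Every figure below is an assumption chosen for illustration, not a measurement from our server: the peak factor, the latencies and the pm.max_children value of 50 are all hypothetical.

```python
import math


def workers_needed(requests_per_second, seconds_per_request):
    """Little's law: average busy workers = arrival rate * time a request holds a worker."""
    return math.ceil(requests_per_second * seconds_per_request)


def seconds_until_saturated(requests_per_second, busy_now, max_children):
    """If no request completes (database stalled), each new request pins one more worker."""
    free_workers = max_children - busy_now
    return free_workers / requests_per_second


# All values below are hypothetical, for illustration only.
REQUESTS_PER_DAY = 2_000_000   # "a couple of million requests per day"
PEAK_FACTOR = 3.0              # assumed ratio of peak rate to daily average
MAX_CHILDREN = 50              # assumed pm.max_children
NORMAL_LATENCY = 0.05          # assumed seconds per request when MySQL is healthy
STALLED_LATENCY = 2.0          # assumed seconds per request when MySQL is slow

if __name__ == "__main__":
    peak_rps = REQUESTS_PER_DAY / 86_400 * PEAK_FACTOR
    normal = workers_needed(peak_rps, NORMAL_LATENCY)
    stalled = workers_needed(peak_rps, STALLED_LATENCY)
    fill_time = seconds_until_saturated(peak_rps, normal, MAX_CHILDREN)
    print(f"peak rate: {peak_rps:.1f} req/s")
    print(f"workers needed when healthy: {normal}")
    print(f"workers needed when stalled: {stalled} (limit {MAX_CHILDREN})")
    print(f"time to fill the pool if queries hang: {fill_time:.2f} s")
```

Under these assumed numbers, a handful of workers is enough while MySQL is healthy, but a stall of a second or two would demand more workers than the pool allows, which is consistent with low CPU usage right up to the failure.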
We are looking for a very experienced person who can start fixing this issue ASAP. If you don't have experience troubleshooting servers that process a couple of million requests per day, please do not apply.
The selected candidate will receive full cooperation from our technical rep, as well as a walkthrough of how the system is currently set up and what troubleshooting attempts have been made so far.
We are looking forward to hearing from you.