Tech:Incidents/2015-03-16

On March 16th, 2015, the Linux OOM killer on prod8 killed HHVM, causing downtime on the farm. Not much later, it appeared that the same issue could have affected prod9 and prod11 as well.

Summary
At 12:00 (24-hour clock), HHVM on prod8 was killed by the Linux OOM killer. From that moment, a large part (if not all) of the traffic received 502 errors when trying to access the farm.

When I tried to log in to prod8, it took longer than usual. Once logged in, I killed (SIGKILL) around 6 processes, most of them "php5 /srv/mediawiki/w/maintenance/runJobs.php".

While writing this report, I checked ganglia and saw that prod9 was using 100% of its CPU, and that prod11 was reaching 100% as well (on prod11 the CPU usage kept going up and down). Multiple instances of runJobs.php were running on these servers too, so to lower the load I killed every instance of runJobs.php on them.

Timeline

 * 12:00 Southparkfan: HHVM on prod8 is killed by the Linux OOM killer (per /var/log/syslog); nagios reports the first signs of trouble one minute later
 * 13:08 Southparkfan: during a routine check in nagios (I had not tried to access the farm before logging in to nagios), nagios reports that HHVM on prod8 is unhealthy
 * 13:09 Southparkfan: restart HHVM on prod8, issues solved
 * 13:32 Southparkfan: "sudo killall php5" on prod9 and prod11
 * 13:38 Southparkfan: had to kill more processes manually; the servers are now stable
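
The 12:00 entry was established by reading /var/log/syslog by hand. A minimal sketch of how that check could be scripted is shown here. It assumes the kernel logs a line containing "Killed process <pid> (<name>)" when the OOM killer acts; the exact wording varies between kernel versions, and the function name find_oom_kills is hypothetical, not an existing tool on the farm.

```python
import re

# Assumption: the kernel's OOM killer logs "Killed process <pid> (<name>)".
# The wording differs between kernel versions, so adjust the pattern if needed.
KILLED_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(lines):
    """Return a list of (pid, process_name) for every OOM kill line found."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            kills.append((int(match.group("pid")), match.group("name")))
    return kills
```

It could be used as `find_oom_kills(open("/var/log/syslog", errors="replace"))`, which yields one tuple per killed process in the order the kernel logged them.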

Todo

 * Repair job running (do not start more than one instance of runJobs.php at a time)
 * Job running disabled in d6a20126ced5478078aded1685a4efcc8784ca5b for now.
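
One possible way to enforce the "no more than one instance" rule is an exclusive, non-blocking file lock around the job runner. The sketch below is only an illustration of that idea, not the fix that was deployed: the wrapper functions and the lock path are hypothetical, and it assumes a Linux host where flock() is available.

```python
import fcntl
import subprocess


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    lock_file = open(lock_path, "a")
    try:
        # LOCK_NB makes flock() fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file


def run_exclusively(lock_path, command):
    """Run command unless another holder has the lock; return its exit code.

    Returns None when the lock is already held, so the caller can skip the run.
    """
    lock_file = acquire_single_instance_lock(lock_path)
    if lock_file is None:
        return None
    try:
        return subprocess.run(command).returncode
    finally:
        # Closing the file releases the lock.
        lock_file.close()
```

A cron entry could then call something like `run_exclusively("/tmp/runjobs.lock", ["php5", "/srv/mediawiki/w/maintenance/runJobs.php"])`, where the lock path is an assumption. If a previous run is still going, the new one exits immediately instead of piling up and consuming more memory.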

Meta

 * Staff on hand: Southparkfan
 * Report published by: Southparkfan
 * Timestamp: 13:51, 16 March 2015 (GMT)