Tech:Incidents/2015-02-23

All MediaWiki servers (as of this writing, prod8, prod9, and prod11) became overloaded and intervention was needed to restore them to an operational state.

The initial cause of the overload is unknown, but seems to have started with HHVM and progressed into a cascade failure - the HHVM overload caused cronjobs to back up, causing further increased load, causing HHVM to run even slower, causing more load, causing more cronjobs to back up, ad nauseum.

Event duration is unknown as it is difficult to pin down when the event started. The first indications of trouble occurred at 04:14 UTC, with slow Ansible run warnings from Nagios, and Nagios relaying the first HHVM full-queue warning at 06:03 UTC. The last public edit before the downtime began was processed by Orain at 06:26 UTC. We will assume the event began at this point.

Timeline
All times given in UTC unless otherwise noted.
 * 2015 February 23, 06:26: Event Begins
 * 07:19, FastLizard4: Noticed persistent 504 Gateway Timeout errors on metawiki and internalwiki.
 * 07:28, FastLizard4: Attempted to SSH in to prod8, prod9, and prod11. Servers unresponsive.
 * 07:29, FastLizard4: Interactive SSH session established with prod8. Server presented with 100% CPU usage, high load averages (greater than 9.00), 100% RAM utilization, and near-100% swap utilization.  Also of note was the presence of multiple identical cronjobs running in parallel, as if cron had started the jobs normally but they never had a chance to complete before cron started a new instance of them.
 * 07:30, FastLizard4: Interactive SSH sessions established with prod9 and prod11. Servers presented similarly to prod8, though swap usage was lower on these servers (between 60% and 80%).
 * 07:30, FastLizard4: Attempted to signal graceful restart to HHVM on prod8, unresponsive.
 * 07:34, FastLizard4: SIGKILL issued to HHVM on prod8 and prod11.
 * 07:37, FastLizard4: SIGTERM issued to HHVM on prod9, and was successful.
 * 07:43, FastLizard4: Stop command issued to cron on prod8, prod9, and prod11, to prevent creation of additional cronjobs ad infinitum, causing cron to gracefully stop but not affecting already running cronjobs. At this point, waiting for running cronjobs to run to their natural conclusion and for server loads to stabilize.
 * 07:54, FastLizard4: Servers normalized, cron started on prod8, prod9, and prod11. Loads remained normal.
 * 07:55, FastLizard4: HHVM started on prod8, prod9, and prod11.
 * 08:01: Nagios reports normalization of HTTP/HTTPS and HHVM on prod8, prod9, and prod11.
 * 08:01: Event Ends
 * Event Duration: 1 hour, 35 minutes

Meta

 * Staff on hand: FastLizard4
 * Report published by: FastLizard4
 * Report timestamp: 08:35, 23 February 2015 (GMT)