Tech:Incidents/2014-12-prod3

This incident report briefly covers all of prod3's incidents in December but focuses on the major and final incident on December 22nd.

December Blues
In general, throughout December prod3 suffered several downtimes, ranging from minutes to hours, caused by a lack of memory (ironically, despite prod3 having the most memory in the cluster). The MySQL process never threw any memory-related errors; it simply stopped without logging anything, and only careful analysis of kern.log revealed any evidence of memory issues.

In theory, it was attributed to a high, irregular load; because it occurred so sporadically, it was not considered worth implementing any serious fixes.
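When mysqld dies silently like this, the usual confirmation is an OOM-killer trace in the kernel log. A minimal sketch of that check, assuming standard Linux OOM-killer message formats and the Debian-style kern.log path (both are assumptions, not taken from the actual incident logs):

```shell
# Sketch: scan a kernel log for OOM-killer traces that would explain a
# mysqld process dying with nothing in its own error log.
oom_evidence() {
    # Print lines where the kernel reported memory exhaustion or a kill
    grep -iE 'out of memory|oom-killer|killed process' "$1"
}

# Illustrative usage (log path is an assumption):
#   oom_evidence /var/log/kern.log | tail -n 20
```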

Timeline

 * 09:01 JohnLewis: prod3 is reported down by nagios
 * 09:05 JohnLewis: prod3 is rebooted and an investigation into the cause of the crash begins.
 * 09:30 JohnLewis: cause is linked to the memory issues, which seem to have resolved themselves. Memory usage ~40%
 * 17:49 JohnLewis: code change is deployed on prod3 and prod5, updating MariaDB
 * 17:58 JohnLewis: prod3 doesn't update (MariaDB 'fails to stop') while prod5 has already been updated for 10 minutes
 * 18:01 JohnLewis: manually stop MariaDB via 'mysqladmin shutdown -p'
 * 18:06 JohnLewis: prod3 begins to cycle in nagios between 'OK' and 'CRITICAL' while prod5 is OK
 * 18:07 JohnLewis: MariaDB crashes
 * 18:33 JohnLewis: prod3 goes into a kernel panic with all services affected
 * 18:39 JohnLewis: SSH either won't connect or fails during key authentication
 * 18:42 JohnLewis: 'johnflewis is not in the sudo list' messages occur, fall back to logging in as root manually
 * 19:00 JohnLewis: boot prod3 in a recovery kernel to access MySQL files
 * 19:20 JohnLewis: tar up /var/lib/mysql
 * 19:25 JohnLewis: create prod12 with a bare ansible run, install MariaDB manually
 * 19:40 JohnLewis: untar prod3's data onto prod12; prod12 kernel panics as well
 * 19:45 JohnLewis: fresh prod12 install, get MariaDB running and deploy changes to prod{8|9|11} before importing prod3 to prod12
 * 20:20 JohnLewis: after a lot of work, prod3's data is on prod12 and everything works.
 * 20:37 JohnLewis: wikis confirmed up after a restart of MariaDB on prod12
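The 19:20–19:45 recovery path above amounts to a cold copy of the MariaDB datadir onto the replacement host. A minimal sketch of that sequence, assuming the default /var/lib/mysql datadir and a stopped server (flags and filenames are illustrative, not the exact commands run during the incident):

```shell
# Cold backup of a MariaDB datadir (server must be stopped, as it was
# here via the recovery kernel), then restore onto the replacement host.
backup_datadir() {
    # $1 = datadir, $2 = output tarball
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

restore_datadir() {
    # $1 = tarball, $2 = parent directory to unpack into
    tar -xzf "$1" -C "$2"
}

# Illustrative usage (hostnames from the timeline, paths assumed):
#   on prod3:  backup_datadir /var/lib/mysql /root/prod3-mysql.tar.gz
#   on prod12: restore_datadir /root/prod3-mysql.tar.gz /var/lib
#              chown -R mysql:mysql /var/lib/mysql
```

A cold copy like this is only consistent because the server was stopped; on a running server a tool such as mysqldump or XtraBackup would be needed instead.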

Summary
prod3 had been the victim of quite a few memory failures recently. An upgrade of MariaDB somehow triggered a serious failure on prod3 that took around two and a half hours to resolve.

Lessons

 * Sometimes a single small memory issue can escalate into a major one, whether a month later or a few hours later.
 * Database servers' memory usage needs closer monitoring
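As a starting point for that monitoring, a minimal sketch of a memory check that a nagios plugin could wrap, reading /proc/meminfo (the 80% threshold is an assumption, not an agreed value; MemAvailable requires Linux >= 3.14):

```shell
# Report used-memory percentage from a meminfo file; suitable as the
# core of a nagios-style check.
mem_used_pct() {
    # $1 = meminfo file (defaults to /proc/meminfo)
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END {printf "%d\n", (t - a) * 100 / t}' "${1:-/proc/meminfo}"
}

# Example check: alert above the assumed 80% threshold.
#   [ "$(mem_used_pct)" -lt 80 ] || echo "CRITICAL: memory at $(mem_used_pct)%"
```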

Action

 * Decommissioned prod3 and pooled a replacement, prod12.

Todo

 * Create an 'emergency failover' plan. This is essential not only for services but also for passwords: as prod3's history shows, compromises can occur and we need to be prepared.

Meta

 * Staff on hand: John
 * Report published by: John
 * Date: 23:21, 22 December 2014 (GMT)