Tech:Incidents/2014-06-14

On 14th June 2014, Orain experienced approximately 20 minutes of downtime affecting all wikis, during which the databases appeared to be missing.

Timeline

 * 20:45 BST Addshore: metawiki up, running the get db list script
 * 20:44 BST Addshore: DBlist is corrupt, replacing with "metawiki|meta|"
 * 20:35 BST Addshore: Removed Popups extension from mediawiki and reenabled ansible cron
 * 20:24 BST Addshore: all sites getting DB errors (Reported by pingdom)

Description
I received a notification saying Orain meta was down. On visiting the site, the error being shown was an "Unknown database error". Nothing was appearing in the logs and the SQL database looked fine; I restarted the service for good measure. When trying to run update.php, I decided something was wrong with the db list, and on inspection it was indeed wrong: the list contained HTML and no database entries. The HTML was displaying some sort of error message saying that LocalSettings.php was not readable. Adding "metawiki|meta|" to the dblist would allow metawiki to come back up and thus allow us to see what the dblist should be. Running the db list update script then resolved the issue for all wikis.


In Hindsight

 * The database log files had been moved from "/var/log/mediawiki/database.log" to "/var/log/mediawiki/debuglogs/database.log", which I had not previously noticed or been notified about. This probably added at least 5-7 minutes to the downtime while I tried to work out what was wrong.
 * This whole issue could have been avoided if the db list update script checked that the list it fetches is valid before using it.
 * It would be a good idea for the db list update script to move the previous list into a backup location before overwriting the file. In the event of a failure such as this one, we could then simply restore the previous list.
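
The two script improvements suggested above could look roughly like the following Python sketch. It is not the actual Orain update script: the function names, the ".bak" backup suffix and the assumed line format (a lowercase database name followed by pipe-separated fields, as in "metawiki|meta|") are all illustrative assumptions.

```python
import os
import re
import shutil
import tempfile

# Assumption: a valid line looks like "metawiki|meta|", i.e. a lowercase
# database name followed by pipe-separated fields and no HTML brackets.
DBLIST_LINE = re.compile(r"^[a-z0-9_]+\|[^<>]*$")


def is_valid_dblist(text):
    """Return True if text has at least one line and every line matches."""
    lines = [line for line in text.splitlines() if line.strip()]
    return bool(lines) and all(DBLIST_LINE.match(line) for line in lines)


def replace_dblist(path, new_text):
    """Validate new_text, back up the current list, then swap in the new one."""
    if not is_valid_dblist(new_text):
        return False  # leave the existing list untouched
    if os.path.exists(path):
        # Hypothetical backup location: the same path with ".bak" appended.
        shutil.copy2(path, path + ".bak")
    # Write to a temporary file in the same directory, then rename it into
    # place so readers never see a half-written list.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(new_text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
    os.replace(tmp_path, path)
    return True
```

With a check like this, an HTML error page returned in place of the list would be rejected, and the previous list would stay in place with a copy kept alongside it.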

Actions taken

 * ec2f525e04863c2a5285991705e6d3487e709993 hardcoded the fetching of the dbname for metawiki. As a result, if the dblist is ever corrupt, the script will still be able to fetch a new dblist from metawiki, which should remain operational.
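
The idea behind this change can be illustrated with a short Python sketch. It does not reproduce the contents of the commit; the function name, the constant and the filtering rule are hypothetical, and only the "metawiki|meta|" entry comes from the incident itself.

```python
# Hardcoded entry for the wiki the new dblist is fetched from.
META_ENTRY = "metawiki|meta|"


def entries_with_meta(dblist_text):
    """Return usable dblist lines, always including the metawiki entry."""
    # Assumption: usable lines are pipe-separated and contain no HTML.
    entries = [
        line.strip()
        for line in dblist_text.splitlines()
        if "|" in line and "<" not in line
    ]
    names = {entry.split("|", 1)[0] for entry in entries}
    if "metawiki" not in names:
        # The list is corrupt or incomplete: fall back to the known entry.
        entries.insert(0, META_ENTRY)
    return entries
```

Even if the stored list contains nothing but an HTML error page, the result still includes metawiki, so the update script has somewhere to fetch a fresh list from.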

Meta
Staff on hand in downtimes: Addshore.

Incident report published by: Addshore

Timestamped: 21:14, 14 June 2014 (BST)