Tech:Incidents/2015-02-07

Extension testing that went wrong, combined with a bad DB list, caused about 20 minutes of downtime on 7 February 2015.

Timeline

 * 21:35 Dusti: merges this commit
 * ~21:41 Orain: Orain goes down
 * 21:46 Dusti: reports Orain down on IRC
 * 21:55 Southparkfan: reverts various commits
 * 21:57 Southparkfan: forces Ansible runs on all servers
 * 21:58 Southparkfan: confirms Ansible ran successfully, but instead of blank pages, users now get 404 Wiki Not Found errors everywhere
 * 22:00 Southparkfan: discovers that the dblist is empty, replaces it with "metawiki|Orain|en|" to make Orain Meta accessible again for get_db_list.py, and runs get_db_list.py on all servers
 * 22:03 Southparkfan: all is up again

In Hindsight

 * Extension testing should at all times be done on extloadwiki, and not in production.
 * The bad dblist problem already caused an incident 8 months ago: Tech:Incidents/2014-06-14. The db fetching script should check whether a dblist looks sane before actually fetching it.
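
A minimal sketch of such a sanity check, in Python since get_db_list.py is a Python script. This is not the actual get_db_list.py code. It assumes the dblist holds one wiki per line in the pipe-separated form seen above ("metawiki|Orain|en|"), and that metawiki must always be present. The function names parse_dblist and dblist_looks_sane are hypothetical.

```python
def parse_dblist(text):
    """Parse dblist text into (dbname, sitename, language) tuples.

    Assumed format: one wiki per line, pipe-separated, e.g. "metawiki|Orain|en|".
    Raises ValueError on a line that does not have three non-empty leading fields.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) < 3 or not all(fields[:3]):
            raise ValueError(f"malformed dblist line: {line!r}")
        entries.append(tuple(fields[:3]))
    return entries


def dblist_looks_sane(text, required=("metawiki",)):
    """Return True only if the dblist parses, is non-empty,
    and contains every database name listed in `required`."""
    try:
        entries = parse_dblist(text)
    except ValueError:
        return False
    names = {entry[0] for entry in entries}
    return bool(entries) and all(name in names for name in required)
```

With a check like this, the fetching script could keep the previous dblist whenever the new one is empty or malformed, rather than overwriting it and taking every wiki offline.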

Meta

 * Staff on hand during downtime: Dusti, Kudu, Southparkfan
 * Report published by: Southparkfan
 * Timestamp: 22:20, 7 February 2015 (GMT)