Tech:Incidents/2015-04-04-prod7-resize

An attempt to upgrade prod7 from the 512MB plan to the 1024MB plan caused 15 minutes of downtime, followed by more than 1 hour of FOUC (flash of unstyled content), cookie, login and similar issues.

Details
Why the resize needed to happen:
 * Static (NFS), which is hosted on prod7, had very little space left (~1G).
 * Redis kept being killed by the Linux Out-Of-Memory Killer

What we were expecting to happen during resizing:
 * Increased page loading times, because expensive operations that are normally cached would be executed every time
 * Login/session issues
 * Loading of files/static content/JS/CSS would be impossible

What happened:
 * Site was inaccessible for 15 mins while a snapshot was taken.

Timeline

 * Some testing happened prior to this, during which we saw 2-3 short bursts of pages not loading, each lasting around 30-60 seconds
 * 20:14 - Southparkfan - disable redis for MW
 * 20:32 - Southparkfan - reverted the above
 * 20:49 - Addshore - umount /mnt/mediawiki/private/uploads on prod11
 * 20:53 - Addshore - Switch all mw traffic to hit prod11
 * 20:54 - Addshore - Disable redis on prod7
 * 20:55 - Addshore - Shutdown prod7
 * 20:58 - Addshore - Start snapshot of prod7 before resizing should happen
 * 20:59 - MW pages stop being loaded, 504s!
 * 21:14 - Addshore - Noticed the 20:32 revert (redis still enabled for MW) and manually commented out redis stuff on prod11
 * 21:14 - MW pages start loading again with no styles or JS
 * 21:15 - Snapshot of prod7 done - rebooted automatically
 * 21:18 - Addshore - Shutdown prod7 again for resize
 * 21:21 - Addshore - Resizing prod7 in DO
 * 21:36 - Resize done! - Booting up
 * 21:42 - Addshore - Rebooting mw servers prod8 and prod9
 * 21:53 - Addshore - Switch all MW traffic to prod8 & prod9. All MW pages fully load again
 * 21:53 - Addshore - Running ansible on prod11
 * 21:58 - Addshore - Running ansible on prod10 (restoring original LB settings)
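
A sequence like the one above can be written down as a plan in which every step has a check, and in which the earlier checks are run again before each new step. This is only a rough Python sketch of that idea, not something we actually ran: the `Step` class, `run_plan` and the in-memory `state` dict are all hypothetical stand-ins for real servers and config. The point is that a step reverted by someone else (as happened at 20:32) is caught before the next step runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One stage of a maintenance plan: an action plus a check that it holds."""
    name: str
    action: Callable[[], None]
    verify: Callable[[], bool]


def run_plan(steps: List[Step]) -> List[str]:
    """Run steps in order and return the names of the completed steps.

    Before each step, every earlier step is verified again, so a change
    that was reverted in the meantime stops the plan instead of being
    discovered after the next (possibly disruptive) step.
    """
    done: List[Step] = []
    for step in steps:
        stale = [s.name for s in done if not s.verify()]
        if stale:
            raise RuntimeError(f"before '{step.name}': no longer true: {stale}")
        step.action()
        if not step.verify():
            raise RuntimeError(f"'{step.name}' did not take effect")
        done.append(step)
    return [s.name for s in done]


# Hypothetical state; a real plan would query the servers instead.
state = {"mw_uses_redis": True, "traffic_target": "prod8,prod9"}

plan = [
    Step("disable redis for MW",
         action=lambda: state.update(mw_uses_redis=False),
         verify=lambda: state["mw_uses_redis"] is False),
    Step("switch MW traffic to prod11",
         action=lambda: state.update(traffic_target="prod11"),
         verify=lambda: state["traffic_target"] == "prod11"),
]

completed = run_plan(plan)
print(completed)
```

Raising an exception rather than continuing is deliberate here: the plan should stop at the first stage that cannot be confirmed, so a person can decide what to do next.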

Lessons

 * When performing tasks like this, make sure you have a foolproof plan and stick to every stage of it. Double check each stage!
 * It may be nice to have a CDN in front of static, meaning that if static.orain.org is down, pages still get JS, CSS and files (we did once have this)
 * We need to work out how login sessions can be handled without Redis (so we can touch Redis and people can still log in)
 * NFS for static sucks? Maybe we could use S3? Perhaps? Or something similar? Or a server dedicated for
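
One possible shape for the "sessions without Redis" lesson is a session store that writes to two backends and falls back to the second when the first is unreachable. The Python below is a generic sketch of that pattern only. It is not MediaWiki's session configuration, and `MemoryBackend` and `FallbackSessions` are hypothetical names; in practice the primary would be Redis and the secondary something slower but independent of prod7.

```python
from typing import Dict, Optional


class MemoryBackend:
    """In-memory stand-in for a real session store (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.available = True  # flip to False to simulate an outage

    def _check(self) -> None:
        if not self.available:
            raise ConnectionError("backend unavailable")

    def get(self, key: str) -> Optional[str]:
        self._check()
        return self.data.get(key)

    def set(self, key: str, value: str) -> None:
        self._check()
        self.data[key] = value


class FallbackSessions:
    """Read from the primary store, fall back to the secondary on failure."""

    def __init__(self, primary: MemoryBackend, secondary: MemoryBackend) -> None:
        self.primary = primary
        self.secondary = secondary

    def set(self, key: str, value: str) -> None:
        # Always write to the secondary so sessions survive a primary outage.
        self.secondary.set(key, value)
        try:
            self.primary.set(key, value)
        except ConnectionError:
            pass  # primary is down; the secondary copy is enough for now

    def get(self, key: str) -> Optional[str]:
        try:
            value = self.primary.get(key)
            if value is not None:
                return value
        except ConnectionError:
            pass  # fall through to the secondary
        return self.secondary.get(key)


# Usage: the session is still readable after the primary goes away.
redis_like = MemoryBackend()
backup = MemoryBackend()
sessions = FallbackSessions(redis_like, backup)
sessions.set("session:example", "logged-in-user")
redis_like.available = False
print(sessions.get("session:example"))
```

Writing to both stores on every `set` costs an extra write per request, which is the trade-off for being able to take the primary down without logging everyone out.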

Meta

 * Staff on hand: Addshore, Southparkfan
 * Report published by: Addshore
 * Timestamp: Addshore 22:28, 4 April 2015 (BST); Southparkfan 22:34, 4 April 2015 (BST)