Tech:Incidents/2015-05-ddos

Orain suffered 9 days of persistent issues and downtime for most users due to a UDP DDoS attack, which caused DigitalOcean to null-route our hosts.

(Times are UTC+0)

Timeline
Note: small(er) issues have been reported before.

Note: for unknown reasons, DigitalOcean's automated emails were not sent during the first stages of the attack.


 * 20 May
 * 06:15 Nagios: nagios sends last alerts out
 * 06:56 Southparkfan: discovers Orain is down and sends mail to staff@orain.org to report the downtime
 * 07:06 Southparkfan: realizes that prod6 seems down, sends mail to Dusti's and Addshore's personal email addresses
 * 10:15 Addshore: mails back and says Orain is up. It is unknown whether Addshore accessed Orain with or without IPv6.
 * 10:25 Addshore: mails a picture of the prod10 graphs, which include the inbound and outbound private and public traffic. The public inbound traffic shows spikes, one of them hitting 800mb/s.
 * 17:54 DigitalOcean Incoming DDoS Detected -- prod5.orain.org
 * 17:54 DigitalOcean Incoming DDoS Detected -- prod6.orain.org
 * 17:54 DigitalOcean Incoming DDoS Detected -- prod8.orain.org
 * 19:57 Southparkfan: is told by FastLizard4 that Orain is accessible when using IPv6, but not when using IPv4.
 * 19:58 Southparkfan: tries to SSH into prod10 using either prod10.orain.org or its public IP, but neither works. SSH'ing into prod10 using prod8 as a proxy does work, though.
 * 22:33 Southparkfan: proposes to FastLizard4 (since he was able to do things, and Southparkfan wasn't) to redirect *.orain.org to prod13-temp.orain.org
 * 22:40 Southparkfan: the above setup would break IPv6 support, so Southparkfan proposes to revert the whole DNS repo to b83d1ca08fe6bc728427d061b040d0245078e031
 * 22:45 FastLizard4: tries to push commits to the DNS repo, but gets stuck with permission errors.


 * 21 May
 * 06:46 Southparkfan: revert /config to b83d1ca08fe6bc728427d061b040d0245078e031
 * 06:46 Southparkfan: grant operations full access to DNS repo (it should already have had it, but just add the group in 'Collaborators' too)
 * 07:01 Southparkfan: All The Tropes and TestWiki are both confirmed back online, Meta is still down
 * 10:44 FastLizard4: confirms Orain is up
 * 14:36 Southparkfan: confirms Orain is up
 * 16:37 DigitalOcean Incoming DDoS Detected -- prod8.orain.org
 * 16:37 DigitalOcean Incoming DDoS Detected -- prod9.orain.org
 * 17:24 DigitalOcean Incoming DDoS Detected -- prod7.orain.org
 * 17:24 DigitalOcean Incoming DDoS Detected -- prod11.orain.org
 * 17:24 DigitalOcean Incoming DDoS Detected -- prod10.orain.org
 * 17:25 DigitalOcean Incoming DDoS Detected -- prod12.orain.org


 * 22 May
 * 01:39 DigitalOcean Incoming DDoS Detected -- prod11.orain.org


 * 23 May


 * 24 May
 * 19:58 DigitalOcean Incoming DDoS Detected -- prod10.orain.org


 * 25 May
 * 20:30 DigitalOcean Incoming DDoS Detected -- prod10.orain.org
 * 23:41 DigitalOcean Incoming DDoS Detected -- prod10.orain.org


 * 26 May
 * 01:40 DigitalOcean Incoming DDoS Detected -- prod5.orain.org
 * 01:40 DigitalOcean Incoming DDoS Detected -- prod7.orain.org
 * 01:40 DigitalOcean Incoming DDoS Detected -- prod8.orain.org
 * 01:41 DigitalOcean Incoming DDoS Detected -- prod6.orain.org
 * 01:50 DigitalOcean Incoming DDoS Detected -- prod11.orain.org


 * 27 May
 * 01:11 DigitalOcean Incoming DDoS Detected -- prod10.orain.org


 * 28 May
 * Addshore switch Orain DNS to CloudFlare (and wait for propagation)
 * 18:04 DigitalOcean Incoming DDoS Detected -- prod10.orain.org


 * 29 May
 * 02:54 DigitalOcean Incoming DDoS Detected -- prod10.orain.org
 * 10:00 - Addshore remove all public IPs from ansible playbook
 * 10:33 - Addshore Power everything off


 * 10:34 - Addshore snapshot prod10
 * 10:34 - Addshore snapshot prod7
 * 10:34 - Addshore snapshot prod6
 * 10:35 - Addshore snapshot prod5
 * 10:35 - Addshore snapshot prod8
 * 10:35 - Addshore snapshot prod9
 * 10:36 - Addshore snapshot prod11
 * 10:36 - Addshore snapshot prod12


 * 10:50 - Addshore prod6 done 12 minutes 28 seconds
 * 10:51 - Addshore prod8 done 12 minutes 38 seconds
 * 10:51 - Addshore prod9 done 10 minutes 11 seconds
 * 10:51 - Addshore prod10 done 6 minutes 24 seconds
 * 10:51 - Addshore prod11 done 5 minutes 22 seconds
 * 10:54 - Addshore prod12 done 16 minutes 42 seconds
 * 10:55 - Addshore prod5 done 19 minutes 44 seconds
 * 10:56 - Addshore prod7 done 21 minutes 1 second


 * 10:57 - Addshore Power everything off
 * 11:03 - Addshore All powered off


 * 11:05 - Addshore prod10 create
 * 11:08 - Addshore prod11 create
 * 11:08 - Addshore prod12 create
 * 11:10 - Addshore prod5 create
 * 11:10 - Addshore prod6 create
 * 11:10 - Addshore prod7 create
 * 11:12 - Addshore prod8 create
 * 11:12 - Addshore prod9 create


 * 11:25 - Addshore Push change to ansible rotating all IPs
 * 11:38 - Addshore updating DNS settings
 * 11:55 - Addshore all nginx servers having Permission denied issues
 * 12:17 - Addshore everything up! :)
 * 12:23 - Addshore Delete all old droplets


 * Notes from Addshore
 * prod6 ansible run was failing
 * mail is not working
 * TODO add hack to create wiki to add CNAME to cloudflare
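
The 20 May entries note that Orain was reachable over IPv6 but not over IPv4, which is what one would expect when only the IPv4 addresses are null routed. A minimal sketch of how that could be checked per address family is shown below. The function names `reachable` and `check_both`, the default port 443, and the timeout are assumptions made for illustration; they are not part of Orain's tooling.

```python
import socket


def reachable(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over `family`.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        # Resolve only addresses of the requested family.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except OSError:
        # No address of this family for the host (or resolution failed).
        return False
    for af, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(af, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            # Refused, timed out or unroutable: try the next address.
            continue
    return False


def check_both(host, port=443, timeout=3.0):
    """Hypothetical helper: report reachability separately for IPv4 and IPv6."""
    return {
        "ipv4": reachable(host, port, socket.AF_INET, timeout),
        "ipv6": reachable(host, port, socket.AF_INET6, timeout),
    }
```

Calling `check_both("orain.org")` from a dual-stack machine returns a dictionary with one boolean per address family. A result where only `ipv6` is true would match what FastLizard4 reported. A false value only says the TCP connection failed from the machine running the check, not why it failed.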

Links

 * Write-up by Addshore