Tech:Incidents/2015-01-23

On Friday, January 23rd, 2015, Orain experienced about four hours of downtime because users could not resolve its DNS records.

Description
Around 18:55 (UTC) I (Southparkfan) had trouble accessing Orain. I asked some other sysadmins and stewards on IRC whether they had problems too, but they didn't. Almost an hour later some of them started experiencing issues as well, so I could be sure it wasn't just me. Around 21:50, after a lot of investigation that led nowhere, I discovered in cooperation with John that there were DNS loops on Orain. While the IPv4 address of prod6 (the ns1.orain.org nameserver) is 178.62.52.223, requests were being redirected to 178.62.52.222, an IP that is not assigned to any DNS or Orain server. John suggested that the nameservers configured at Namecheap (the registrar of orain.org) might be wrong. I asked Dusti to check the Orain nameservers, and at my request he set them back to ns1.orain.org and ns2.orain.org. Around 10 minutes after Dusti changed the nameservers, the farm became accessible again.
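
The address mismatch described above could be checked with a small script along these lines. This is only a sketch, not something that was run during the incident: the names EXPECTED, resolve_ipv4 and find_mismatches are hypothetical, and only the ns1.orain.org address comes from this report. It compares what the local resolver returns with the expected address, so it would not inspect the registrar's delegation itself.

```python
import socket

# Expected IPv4 addresses per nameserver. Only ns1's address appears in
# this report; other entries are left out rather than guessed.
EXPECTED = {"ns1.orain.org": {"178.62.52.223"}}


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        # Resolution failed entirely; report it as an empty answer.
        return set()
    # For AF_INET, each entry ends with a (address, port) tuple.
    return {info[4][0] for info in infos}


def find_mismatches(expected, resolve=resolve_ipv4):
    """List hosts whose resolved addresses differ from the expected ones."""
    problems = []
    for host, want in sorted(expected.items()):
        got = resolve(host)
        if got != want:
            problems.append(f"{host}: expected {sorted(want)}, got {sorted(got)}")
    return problems


if __name__ == "__main__":
    for line in find_mismatches(EXPECTED):
        print(line)
```

The resolver is passed in as a parameter so the comparison can be exercised without network access, for example with a stand-in function that returns 178.62.52.222.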

Timeline

 * 18:35 Pingdom: Orain is reported down
 * 18:55 Southparkfan: Orain appears to be down for me; started investigating why
 * 19:35 Southparkfan: other users are reporting issues too now
 * 21:53 Southparkfan: asked Dusti to set Orain's nameservers to ns1.orain.org and ns2.orain.org
 * 21:59 Dusti: nameservers are set to ns1.orain.org and ns2.orain.org
 * ~22:10 Southparkfan: Orain works again

Lessons

 * A plan should be made so we are prepared for issues like this. Southparkfan takes partial responsibility for spending time investigating unlikely causes.
 * Although John left the team three days ago, that does not mean he can't help Orain out anymore. Ask other people experienced with our infrastructure for help when possible.

Meta

 * Staff on hand in downtime: Dusti, GethN7, Southparkfan, Tanner
 * Report published by: Southparkfan
 * Timestamp: 22:45, 23 January 2015 (GMT)