Tech:Incidents/2014-07-prod3Reinstall

3 days of downtime following a forced reinstall of prod3. The server had been compromised through a vulnerability in ElasticSearch.

The same vulnerability also affected other sites.

Timeline

 * July 10th
 * 12:31 Dusti: notifies the council of RamNode suspending Orain's prod3 server
 * 13:21 JohnLewis: receives email and starts to follow the case up with RamNode
 * 14:40 RamNode: requests that we reinstall the server as it has been compromised (see the Compromised section below)
 * 16:40 JohnLewis: confirms prod3 has been reinstalled and all the data has been successfully transferred back
 * 19:25 Addshore: Installed ansible on prod3 and tried a first run (which failed, but did most of the important stuff)
 * 10:27 Addshore: Manually reset the mysql root password
 * 19:30 Addshore: Started mysql import for all data in a screen on prod3, this will likely take some time
 * 19:38 Addshore: Now we are stuck with 502 errors
 * 19:46 Addshore: Manually made this change on prod4 only (this was later overridden once ansible started working)
 * 19:49 Addshore: Run i18n rebuild on prod4 (this is slowly running in a screen)


 * July 13th
 * 20:00~ Addshore: fix the running of ansible on prod3 with this change
 * 20:00~ Addshore: fix the reporting of nagios for prod3 with this change
 * 20:10 Addshore: prod4 still can't connect to prod3 mysql (only)
 * 21:00~ Addshore: fix the mysql issue with this change. Our problem was that the settings we thought we were using were being overridden.
 * 21:00~ Addshore: Again fix the 502s in the same way as before using this change.
 * 21:39 Addshore: All is fixed

Compromised
A few things led to prod3 being compromised. These are:
 * ElasticSearch had a few security issues which we were not aware of but were public
 * We had not bound ElasticSearch to any single interface
 * The ElasticSearch port was open to the public

All of the above unfortunately came together to allow someone to turn our ElasticSearch instance against us. Any user was able to inject code into the instance and, using that code, manipulate prod3 in any way. In our case, it was manipulated to take part in a DoS attack. We can confirm no data was compromised from our database using this method.

As a security measure, ElasticSearch was removed from all active services. prod1 was not subject to the attack as it was using a patched version.
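To illustrate the "port was open to the public" condition described above, here is a minimal Python sketch of an external reachability check. The function name `is_port_reachable` is hypothetical and not part of any Orain tooling. It only tests whether a TCP connection can be opened, so it says nothing about whether the service behind the port is patched.

```python
import socket


def is_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run from a machine outside the network, a call such as `is_port_reachable("203.0.113.10", 9200)` would be expected to return `False` for a service that should only be used internally. The address is a documentation placeholder, and 9200 is assumed here because it is ElasticSearch's default HTTP port.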

Action
John spent some time drafting ways we can prevent this occurring in the future or reduce the impact if worst comes to worst.


 * All changes to the Orain network need to be made on ansible and must work through ansible
 * This change was obviously needed but missed when nagios was set up
 * Someone had obviously changed the overriding my.cnf on the previous prod3, which was lost in the reinstall and thus fixed with this change
 * Someone previously implemented a workaround for this change which was lost
 * Investigate to see if our monitoring can be more expansive (John)
 * It seems as expansive as it needs to be for us -John

Todos

 * Implement a fail-over server, or prod5 (John and Addshore)
 * Get this approved by Dusti ✅
 * Implement a policy for sysadmins covering how to deal with services, basic working standards and emergency actions, plus a public basic contact page for key services and user access (John)
 * Complete a full services review of all Orain servers (John)
 * Ensure only necessary services are enabled
 * Ensure all services have the latest security updates
 * Complete a basic security review of services if necessary
 * Provide a basic notification page for maintenance or extended downtime
 * Could run off prod1 -John
 * Subscribe all sysadmins to security updates / release mailing lists for all software deployed (php, mariadb, mediawiki, debian, elasticsearch, ramnode, etc)
 * https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce ✅ Addshore (talk) 17:50, 15 July 2014 (BST)
 * php Announcements Awaiting email confirmation Addshore (talk) 17:54, 15 July 2014 (BST)
 * ElasticSearch. Is there a mailing list? Addshore (talk) 17:55, 15 July 2014 (BST)
 * Ignore ElasticSearch. We aren't using it anymore and I've removed it from prod1 (well will; the service is not enabled however) John (talk) 17:57, 15 July 2014 (BST)
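
For the services review item above, one possible starting point is to list which services listen on every interface rather than a single one, since that was one of the conditions that led to the compromise. The Python sketch below parses the text output of `ss -tln` and reports listeners bound to a wildcard address. The function name `wildcard_listeners` and the sample output are illustrative assumptions, and the column layout is assumed to match common iproute2 versions.

```python
from typing import List, Tuple

# Addresses that mean "listen on every interface" in ss output.
WILDCARD_ADDRESSES = {"0.0.0.0", "*", "[::]", "::"}


def wildcard_listeners(ss_output: str) -> List[Tuple[str, int]]:
    """Return (address, port) pairs for listeners bound to all interfaces."""
    found = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a listening socket.
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        # The local address is the fourth column, formatted as address:port.
        address, _, port = fields[3].rpartition(":")
        if address in WILDCARD_ADDRESSES and port.isdigit():
            found.append((address, int(port)))
    return found


# Made-up sample output, not taken from any Orain server.
SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
LISTEN  0       128     127.0.0.1:3306      0.0.0.0:*
LISTEN  0       128     [::]:9200           [::]:*
"""

for address, port in wildcard_listeners(SAMPLE):
    print(f"port {port} listens on all interfaces ({address})")
```

A wildcard listener is not necessarily a problem (a web server usually needs one), so the output is a list of things to review, not a list of faults.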