Tech:Incidents/2014-12-hhvm

Several days of persistent page-loading problems with more than one cause. The problems were initially traced to HHVM not fulfilling requests, presumably due to a lack of resources. Further investigation looked for, and found, wider causes.

Lessons

 * Matters should be investigated as major issues until proven otherwise. John takes responsibility for this, as he dismissed the slow load times as 'just a temporary issue' for several days.
 * The HHVM admin server proves essential for dealing with suspected HHVM issues.
 * Ganglia is essential for debugging server-side performance issues.
 * Servers should not be added to solve a problem until all other methods have been tried, tested and failed. If you think a new server will duplicate the problem rather than help, keep saying so until you are proven wrong.

Action

 * prod8 and prod9 were reinstalled by John.
 * GDNSD was reconfigured to use a realistic load balancing system.
 * HHVM was migrated to use the ini syntax as opposed to the deprecated hdf syntax.
 * A load balancer was implemented on December 7th 2014 to share resources more consistently under load.
 * Nagios checks were added for HHVM health. Currently only the 'queued' metric is monitored, but more can be added when needed.
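
For illustration only, a check on the 'queued' metric could look roughly like the sketch below. This is not the check that was deployed. It assumes the HHVM admin server exposes a JSON health endpoint containing a `queued` field at the URL shown (host, port and path are assumptions), and the function names and warning/critical thresholds are hypothetical.

```python
import json
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_queued(payload, warn=10, crit=50):
    """Map a health-check JSON string to an (exit_code, message) pair.

    The thresholds are illustrative defaults, not values from the incident.
    """
    try:
        queued = int(json.loads(payload)["queued"])
    except (ValueError, KeyError, TypeError):
        return UNKNOWN, "UNKNOWN: no usable 'queued' value in response"
    if queued >= crit:
        return CRITICAL, f"CRITICAL: {queued} requests queued"
    if queued >= warn:
        return WARNING, f"WARNING: {queued} requests queued"
    return OK, f"OK: {queued} requests queued"


def main(url="http://127.0.0.1:9001/check-health"):
    # Assumed URL: adjust host, port and path to the local admin server setup.
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            payload = resp.read().decode("utf-8")
    except OSError as exc:
        print(f"UNKNOWN: could not reach admin server: {exc}")
        return UNKNOWN
    code, message = evaluate_queued(payload)
    print(message)
    return code
```

Ending the script with `sys.exit(main())` (after `import sys`) would let Nagios read the exit code. Keeping the threshold logic in `evaluate_queued` separate from the HTTP call means further metrics can be added later without touching the fetch code.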

Meta

 * Operations on hand: John
 * Report published by: John
 * Dated: 20:49, 5 December 2014 (GMT)