For this update we wanted to be as careful as possible, as it is bigger than anything we have done before. Once all the database upgrading was done (several passes over more than 17 GB of data) we started the servers in a way that only we here at Project: Hazel and Customer Support at HARR could connect.
We did this to give the various systems affected by the update a dry run. All was looking good, but about 30 minutes into the checklists the nodes in the cluster started to go offline.
This of course caused us to change plans. We immediately rebooted the cluster with full logging and started running through the checklists again. After exactly 30 minutes the same thing happened.
We pulled all logs over to Dartmouth and started digesting and analysing. After digging around for some time we found the reason for this.
The reason was an old bug that has been in our cluster code since release. The proxies and nodes are configured to health check each other. If they haven’t heard from a sibling in a fixed amount of time, they send it a heartbeat packet. If the node answers then all is fine, but if it doesn’t answer then the sibling node removes her sister node from the node registry. This we call sororicide.
Now when February is in full swing there is little need for the heartbeat packets; there is enough packet flow between the nodes and the proxy that they are aware of each other’s health.
When we were running said checklists, the load was of course a lot lighter than when a load of you guys are all logged in, so the nodes had to rely on the heartbeat packets to keep track of each other.
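For the curious, the health check described above can be sketched roughly like this in Python. This is only an illustration under stated assumptions, not the actual cluster code: the names `NodeRegistry`, `record_traffic`, `check_siblings` and `send_heartbeat` are made up for the example, as is the timeout value, and the sketch shows the intended behaviour rather than the bug.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NodeRegistry:
    """Hypothetical registry of sibling nodes, keyed by node id."""

    # Seconds of silence allowed before a heartbeat is sent (assumed value).
    idle_timeout: float
    # Hypothetical callback: sends a heartbeat, returns True if the node answers.
    send_heartbeat: Callable[[str], bool]
    # Time each sibling was last heard from.
    last_heard: Dict[str, float] = field(default_factory=dict)

    def record_traffic(self, node_id: str, now: float) -> None:
        # Any ordinary packet from a sibling counts as proof of life.
        self.last_heard[node_id] = now

    def check_siblings(self, now: float) -> List[str]:
        """Heartbeat every silent sibling; remove and return those that do not answer."""
        removed = []
        for node_id, seen in list(self.last_heard.items()):
            if now - seen < self.idle_timeout:
                continue  # heard from recently, no heartbeat needed
            if self.send_heartbeat(node_id):
                self.last_heard[node_id] = now  # answered, all is fine
            else:
                del self.last_heard[node_id]  # no answer: "sororicide"
                removed.append(node_id)
        return removed
```

With plenty of traffic, `record_traffic` keeps resetting the clock and `send_heartbeat` is almost never called. On a nearly empty cluster every sibling goes quiet, so the heartbeat branch runs all the time.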
I don’t know if such details interest you, but I feel slightly better after giving you the details of this unfortunate event.