Yesterday, two of my deployment tasks were canceled unexpectedly, apparently because our Octopus Server instance was terminated.
I then checked the audit logs and noticed this:
- What exactly happened?
- How can we better prepare for an incident like this so deployments are not left in an inconsistent state? If this had happened in Production, the impact could have been much larger.
It looks like the Virtual Machine that was running your instance stopped responding and your instance was re-deployed. The timeline is (UTC times):
17:24:45 - Last Server log entry recorded
17:29:27 - An automated process failed to contact the server
17:31:00 - Our monitoring raised an alert
17:33:07 - AWS terminated the EC2 Instance
17:33:52 - Provisioning of a new EC2 Instance started
17:48:00 - Our monitoring indicated the instance is now available again
17:50:47 - Provisioning completed
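To put the timeline in terms of elapsed time, here is a minimal Python sketch that computes how long after the last Server log entry each step occurred. The timestamps are copied from the list above; the helper name `elapsed_seconds` is just for illustration, and the sketch assumes all entries fall on the same UTC day.

```python
from datetime import datetime

# Timeline entries copied from the post (UTC, assumed to be the same day).
timeline = [
    ("17:24:45", "Last Server log entry recorded"),
    ("17:29:27", "An automated process failed to contact the server"),
    ("17:31:00", "Our monitoring raised an alert"),
    ("17:33:07", "AWS terminated the EC2 Instance"),
    ("17:33:52", "Provisioning of a new EC2 Instance started"),
    ("17:48:00", "Our monitoring indicated the instance is now available again"),
    ("17:50:47", "Provisioning completed"),
]


def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS strings on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


first = timeline[0][0]
for time_str, label in timeline:
    secs = elapsed_seconds(first, time_str)
    print(f"{time_str}  +{secs // 60:02d}m{secs % 60:02d}s  {label}")
```

By this calculation, the instance was reported available again 23 minutes 15 seconds after the last log entry, and provisioning completed at 26 minutes 2 seconds.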
The best way I can think of to detect this happening is to create a subscription that listens to the Task Cancelled event and raises an event in a monitoring system somewhere. At the moment though, a dirty shutdown like the one that happened here does not raise the event required to trigger the subscription. I have raised an issue and will look at getting that into the product soon. The issue also shows the subscription configuration you would need.
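As a rough illustration of the receiving side, here is a minimal Python sketch of a webhook endpoint that a subscription could post to, forwarding cancelled-task events to a monitoring system. Several things here are assumptions rather than confirmed details: the payload shape (`Payload.Event.Category`, `Message`, `Id`), the category string `"TaskCanceled"`, and the `send_alert` function, which is a hypothetical placeholder for whatever your monitoring system expects. Check the subscription configuration in the issue for the real values. Note also that, as described above, a dirty shutdown does not currently raise the event, so this would not have caught this particular incident until that is fixed.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the category name the webhook payload uses for cancelled tasks.
CANCELLED_CATEGORY = "TaskCanceled"


def build_alert(body):
    """Return an alert dict if the webhook body describes a cancelled task, else None."""
    # Assumed payload shape: {"Payload": {"Event": {"Category": ..., "Message": ..., "Id": ...}}}
    event = (body.get("Payload") or {}).get("Event") or {}
    if event.get("Category") != CANCELLED_CATEGORY:
        return None
    return {
        "severity": "warning",
        "source": "octopus",
        "summary": event.get("Message", "Task cancelled"),
        "event_id": event.get("Id"),
    }


def send_alert(alert):
    """Hypothetical placeholder: replace with a call to your monitoring system."""
    print(json.dumps(alert))


class SubscriptionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        alert = build_alert(body) if isinstance(body, dict) else None
        if alert is not None:
            send_alert(alert)
        # Acknowledge every well-formed request, whether or not it produced an alert.
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("", port), SubscriptionHandler).serve_forever()
```

Calling `serve(8080)` starts the receiver, and the subscription's webhook URL would then point at that host and port. Keeping the filtering logic in `build_alert` separate from the HTTP handler makes it easy to test against a sample payload before wiring it up.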
Thanks for the information and thanks for raising an issue!