Where would we execute this script? It doesn’t seem like putting it inside the “Custom Deployment Scripts” section of a Deploy step is logical as the timing of each script section doesn’t allow execution while it is deploying:
This script will run after the package is extracted, but before any configuration changes are made.
This script will run after configuration changes and variable substitutions are made, but before any of the core deployment processes are made.
This script will run after core deployment processes are made.
I submitted a ticket recently trying to handle the same situation and pretty much got “you’re out of luck”:
I understand we can work around the issue by adding timeouts, but I’m more interested in why this is now happening? I think we are bumping against the same problem that others are facing described in this forum. We used to run multiple (20) parallel deployments without any issues, but currently Octopus deployments often get “stuck” without any obvious reason. It almost seems like Octopus deadlocks and that was introduced in the past month or two since v.2019.7.8
We have identified the issue causing this one, and the fix likely also resolves a similar issue in earlier versions. We will be releasing 2019.8.5 today with the fix, and we recommend upgrading at the earliest opportunity to correct the problem.
Thank you for the sample script @Justin_Walsh. Looking at it though, it appears to cancel the whole Task, not just the Step that has run too long? So there is no opportunity to run on-fail/recovery/clean-up/notify Steps later in the Task. I guess you could make a complex API script that cancels the Task with the stuck Step, then starts a clean-up Task. Ugh. A configurable Step timeout would be better.
If we have to use API scripts like this to polyfill for timeouts, it is a pity we can’t get Octopus to run scripts like this periodically on Octopus and/or Workers. Octopus already has scheduling for Tasks, and runs things like Tentacle health checks and retention policies periodically. So it would be good to see Tasks or script modules able to be run periodically, even if just on a rudimentary schedule like “add to queue every X minutes”.
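For illustration, one pass of the kind of watchdog described above might look like the sketch below. It is only a sketch under stated assumptions: the task field names (`Id`, `State`, `StartTime`) and the `"Executing"` state value are assumed, and `cancel_task` and `start_cleanup` are hypothetical callbacks standing in for whatever API calls you would actually make. Something outside Octopus (cron, a scheduled job) would have to call it every X minutes.

```python
from datetime import datetime, timedelta


def find_overdue_tasks(tasks, now, max_runtime):
    """Return the ids of tasks still executing for longer than max_runtime.

    `tasks` is a list of dicts. The keys "Id", "State" and "StartTime" are
    assumed names; "StartTime" is assumed to be an ISO 8601 string with a
    UTC offset, and `now` must be a timezone-aware datetime.
    """
    overdue = []
    for task in tasks:
        # Skip anything that is not running or has not started yet.
        if task.get("State") != "Executing" or not task.get("StartTime"):
            continue
        started = datetime.fromisoformat(task["StartTime"])
        if now - started > max_runtime:
            overdue.append(task["Id"])
    return overdue


def watchdog_pass(tasks, now, max_runtime, cancel_task, start_cleanup):
    """Cancel each overdue task, then queue a clean-up for it.

    `cancel_task` and `start_cleanup` are hypothetical callbacks that take
    a task id. Returns the list of task ids that were handled.
    """
    handled = find_overdue_tasks(tasks, now, max_runtime)
    for task_id in handled:
        cancel_task(task_id)
        start_cleanup(task_id)
    return handled
```

Keeping the selection logic separate from the callbacks means the "which tasks are stuck" rule can be tested without touching a real server.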
We actually have an upcoming feature, Runbooks, that will meet the second part of your post, which will be coming out in the near future!
The concept of a per-step timeout has been brought up internally in the past but did not get any traction, mostly due to the fact that we want our deployments to be reliable every time, and having step timeouts counteracts that somewhat. Saying that, it’s not something that’s completely off the table, so it may be something we revisit in the future.
Thanks @Justin_Walsh. Runbooks sound great, especially if they come with recurrent scheduling.
We want our deployments to be reliable every time too; that is why we need Step time-outs.
Two use cases for Step time-outs are (1) retries and (2) detecting wedged/failed deployments.
Shockingly, some things on the Internet are not 100% reliable. If we can’t implement retry logic, we can’t make our deploys reliable. As others have cited, things like Azure FTP connections or CloudFormation updates (using the Octopus Step template) can hang forever or at least for many hours. If we can’t time out then we can’t retry.
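To make the "no timeout, no retry" point concrete, here is a minimal sketch of retry-with-timeout done inside a script step. It assumes the flaky operation can be run as an external command; `run_with_retries` and its parameters are illustrative names, not anything Octopus provides.

```python
import subprocess


def run_with_retries(command, timeout_seconds, attempts):
    """Run `command`, giving up on each attempt after `timeout_seconds`.

    Returns the 1-based number of the attempt that exited with code 0,
    or None if every attempt hung or failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires
            # and then raises TimeoutExpired.
            result = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            continue  # hung: count it as a failed attempt and try again
        if result.returncode == 0:
            return attempt
    return None
```

This only helps when we own the script. It does nothing for a built-in Step that hangs, and it only kills the direct child process, not anything that child has spawned.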
Second, without a timeout, a wedged deployment never fails. The official Octopus CloudFormation Step is a good example: it has no built-in time-out/retry/rollback logic. For some resources CloudFormation will ‘stick’ for 3+ hours. When you schedule a deployment in Octopus, you can set a critical start window, but not a critical duration. So without a timeout you can’t get to the notification Step to tell you a deployment failed/wedged. It never fails because it is wedged (for hours in this case, but potentially forever).
With a Step timeout we could move to the next Step, detect that the CF action has not completed in any reasonable time, cancel it to roll it back, and notify operators to investigate. Or we could enter Guided Failure mode. Without Step timeouts this is a silent failure we can’t detect or address.
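The flow described above (wait up to a deadline, then cancel and notify) can be sketched roughly as follows. Everything here is hypothetical: `is_done`, `cancel` and `notify` are placeholder callbacks for "has the CF action finished", "cancel/roll it back" and "tell the operators", and the `clock` and `sleep` parameters exist only so the logic can be exercised without real waiting.

```python
import time


def run_step_with_deadline(is_done, cancel, notify, deadline_seconds,
                           poll_seconds=5.0,
                           clock=time.monotonic, sleep=time.sleep):
    """Poll is_done() until it returns True or the deadline passes.

    Returns "completed" if is_done() became True in time. Otherwise calls
    cancel(), then notify(message), and returns "timed_out".
    """
    started = clock()
    while True:
        if is_done():
            return "completed"
        if clock() - started >= deadline_seconds:
            cancel()
            notify(f"Step exceeded {deadline_seconds}s and was cancelled")
            return "timed_out"
        sleep(poll_seconds)
```

The point is that "timed_out" is an ordinary, detectable outcome that later Steps can act on, rather than a Step that simply never returns.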
If Octopus can provide CloudFormation and other Step templates that are reliable every time, and never ever wedge, then sure, I’ll stop asking for Step timeouts.