How to set timeout to fail long running deployment step (powershell)?

aca · 16 September 2019 16:33

We use the Library Step Template ‘IIS Application - Create’.
For some reason Octopus hangs on that step indefinitely during deployment, blocking other deployments to the same target.

Is it possible to add a timeout that fails the deployment step, either as an Octopus process setting or in the PowerShell script code?

Justin_Walsh · 16 September 2019 20:14

Hi @aca

While we don’t have a configurable timeout at this stage, you could use a PowerShell script, such as this one, to cancel any tasks that have been running for longer than your configured timeout period.
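The linked script is not reproduced in this thread, so the following is only a minimal sketch of the same idea, written in Python rather than PowerShell. It assumes the Octopus REST API exposes executing tasks at `/api/tasks?states=Executing`, accepts a `POST` to `/api/tasks/{id}/cancel`, authenticates with an `X-Octopus-ApiKey` header, and returns `StartTime` as an ISO 8601 string with a UTC offset. Verify all of that against your own instance. The server URL, API key, and function names are hypothetical placeholders.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

OCTOPUS_URL = "https://octopus.example.com"  # hypothetical server address
API_KEY = "API-XXXXXXXXXXXX"                 # placeholder, not a real key


def overdue_task_ids(tasks, now, max_age):
    """Return the Ids of tasks whose StartTime is more than max_age before now."""
    overdue = []
    for task in tasks:
        started = task.get("StartTime")
        if not started:
            continue  # task has not started yet, nothing to time out
        # Assumes a timestamp such as 2019-09-16T16:00:00.000+00:00
        if now - datetime.fromisoformat(started) > max_age:
            overdue.append(task["Id"])
    return overdue


def api_request(method, path):
    """Call the Octopus API; the endpoints and header name are assumptions."""
    request = urllib.request.Request(
        OCTOPUS_URL + path,
        data=b"" if method == "POST" else None,
        method=method,
        headers={"X-Octopus-ApiKey": API_KEY},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else None


def cancel_overdue_tasks(max_age):
    """Cancel every executing task that has been running longer than max_age."""
    page = api_request("GET", "/api/tasks?states=Executing")
    now = datetime.now(timezone.utc)
    for task_id in overdue_task_ids(page["Items"], now, max_age):
        api_request("POST", f"/api/tasks/{task_id}/cancel")


# Example call (not executed here):
#   cancel_overdue_tasks(timedelta(hours=1))
```

Note that this cancels the whole task, not a single step, and it does not filter by task type, so in practice you would probably restrict it to deployment tasks.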

Please don’t hesitate to let us know if you have any further questions!

DG · 17 September 2019 16:34

Where would we execute this script? Putting it inside the “Custom Deployment Scripts” section of a Deploy step doesn’t seem logical, as the timing of each script section doesn’t allow it to execute while the step is deploying:

**Pre-deployment script**
This script will run after the package is extracted, but before any configuration changes are made.
**Deployment script**
This script will run after configuration changes and variable substitutions are made, but before any of the core deployment processes are made.
**Post-deployment script**
This script will run after core deployment processes are made.

I recently submitted a ticket trying to handle the same situation and pretty much got “you’re out of luck”.

Justin_Walsh · 19 September 2019 15:05

Hi @DG

You would need to run it externally against your Octopus instance, likely via a scheduled task or something similar.

I hope this helps, please don’t hesitate to let me know if you have any further questions on this.

aca · 19 September 2019 16:15

I understand we can work around the issue by adding timeouts, but I’m more interested in why this is happening now. I think we are running into the same problem that others have described in this forum. We used to run multiple (20) parallel deployments without any issues, but currently Octopus deployments often get “stuck” for no obvious reason. It almost seems like Octopus deadlocks, and that this was introduced in the past month or two, since v2019.7.8.

Justin_Walsh · 23 September 2019 00:57

Hi @aca

We have identified the cause of this issue, and the fix likely also resolves a similar issue in earlier versions. We will be releasing 2019.8.5 today with the fix, and we recommend upgrading at the earliest opportunity to correct the problem.

aca · 24 September 2019 16:34

Confirmed.
v2019.8.5 fixes the issue with “Waiting for the script in task ServerTasks-xxxxx to finish…”
After the upgrade we no longer see the issue with blocked/hanging deployments.

Aaron_R · 1 October 2019 21:51

Thank you for the sample script @Justin_Walsh. Looking at it though, it appears to cancel the whole Task, not just the Step that has run too long? So there is no opportunity to run on-fail/recovery/clean-up/notify Steps later in the Task. I guess you could write a complex API script that cancels the Task with the stuck Step, then starts a clean-up Task. Ugh. A configurable Step timeout would be better.

If we have to use API scripts like this to polyfill for timeouts, it is a pity we can’t get Octopus to run them periodically on Octopus and/or Workers. Octopus already has scheduling for Tasks, and runs things like Tentacle health checks and retention policies periodically. So it would be good to see Tasks or script modules able to be run periodically, even if just on a rudimentary schedule like “add to queue every X minutes”.

Justin_Walsh · 2 October 2019 02:59

Hi @Aaron_R,

We actually have an upcoming feature, Runbooks, that will meet the second part of your post, which will be coming out in the near future!

The concept of a per-step timeout has been brought up internally in the past but did not get any traction, mostly due to the fact that we want our deployments to be reliable every time, and having step timeouts counteracts that somewhat. Saying that, it’s not something that’s completely off the table, so it may be something we revisit in the future.

Aaron_R · 2 October 2019 03:42

Thanks @Justin_Walsh. Runbooks sound great, especially if they come with recurrent scheduling.

We want our deployments to be reliable every time too; that is why we need Step time-outs.

Two use cases for Step time-outs are (1) retries and (2) detecting wedged/failed deployments.

Shockingly, some things on the Internet are not 100% reliable. If we can’t implement retry logic, we can’t make our deploys reliable. As others have cited, things like Azure FTP connections or CloudFormation updates (using the Octopus Step template) can hang forever, or at least for many hours. If we can’t time out, then we can’t retry.
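To make the timeout-then-retry pattern concrete, here is a minimal sketch in Python (not PowerShell, and not an Octopus feature). It assumes the unreliable operation can be run as an external command; `run_with_timeout_and_retries` is a hypothetical helper name. `subprocess.run` kills the child process when the timeout expires, which is what makes a retry possible at all.

```python
import subprocess
import sys


def run_with_timeout_and_retries(cmd, timeout_seconds, max_attempts):
    """Run cmd, killing it if it exceeds timeout_seconds, up to max_attempts times.

    Returns the attempt number that succeeded, or raises RuntimeError
    if every attempt timed out or exited with a non-zero code.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            # The hung child has already been killed by subprocess.run
            print(f"attempt {attempt}: timed out after {timeout_seconds}s",
                  file=sys.stderr)
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exit code {result.returncode}",
              file=sys.stderr)
    raise RuntimeError(f"command failed after {max_attempts} attempts")
```

Raising after the last attempt is the important part: the wedged operation becomes an ordinary failure that later steps or notifications can react to.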

Second, without a timeout, a wedged deployment never fails. The official Octopus CloudFormation Step is a good example: it has no built-in time-out/retry/rollback logic. For some resources CloudFormation will ‘stick’ for 3+ hours. When you schedule a deployment in Octopus, you can set a critical start window, but not a critical duration. So without a timeout you can’t get to the notification Step that tells you a deployment failed/wedged. It never fails because it is wedged (for hours in this case, but potentially forever).

With a Step timeout we could move to the next Step, detect that the CF action has not completed in any reasonable time, cancel it to roll it back, and notify operators to investigate. Or we could enter Guided Failure mode. Without Step timeouts this is a silent failure we can’t detect or address.

If Octopus can provide CloudFormation and other Step templates that are reliable every time, and never ever wedge, then sure, I’ll stop asking for Step timeouts.

system · 1 November 2019 03:42

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.