Crashed tentacle hangs Octopus server tasks

gavin.burke · 26 June 2017 07:54

A machine that ran a tentacle crashed at the weekend. The tentacle health check task has been running since Saturday, waiting for this tentacle, and then a deployment started this morning too.

I had to hard reset the VM and now the tentacle is online, and a new health check reports that, but the old tentacle health check task from Saturday and the deployment from a couple of hours ago are both in a state of “Cancellation requested, trying to cancel the task”.

It seems to be the same problem as the tentacle health check having no timeout at all: trying to stop a task on a tentacle that previously died also seems to have no timeout. I can see no way of clearing this other than restarting the Octopus server, which should not be necessary just because one of our dozens of tentacles went offline!
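To illustrate the kind of timeout meant here, below is a minimal Python sketch of a cancel request that gives up waiting after a deadline. It is not Octopus code and says nothing about how the server works internally; `stuck_call`, `cancel_with_deadline` and the 0.2 second deadline are all hypothetical names and values chosen for the example.

```python
import asyncio


async def stuck_call(release: asyncio.Event) -> None:
    # Hypothetical stand-in for a call to an agent that died:
    # it swallows cancel requests and keeps waiting until released.
    while not release.is_set():
        try:
            await release.wait()
        except asyncio.CancelledError:
            pass  # ignore the cancel request, like a call that never returns


async def cancel_with_deadline(task: asyncio.Task, deadline_s: float) -> str:
    # Request cancellation, but only wait deadline_s seconds for it to finish.
    task.cancel()
    done, _pending = await asyncio.wait({task}, timeout=deadline_s)
    return "cancelled" if task in done else "abandoned"


async def demo() -> tuple:
    release = asyncio.Event()
    stuck = asyncio.create_task(stuck_call(release))
    healthy = asyncio.create_task(asyncio.sleep(3600))
    await asyncio.sleep(0)  # let both tasks start before cancelling them
    healthy_result = await cancel_with_deadline(healthy, 0.2)
    stuck_result = await cancel_with_deadline(stuck, 0.2)
    release.set()  # let the stuck task finish so the event loop can shut down
    await stuck
    return healthy_result, stuck_result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of the sketch is only that the caller decides how long to wait: a task that honours the cancel is reported as cancelled, while one that never responds is marked as abandoned instead of blocking everything queued behind it.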

Am I wrong in this assumption - is there any way other than restarting the server to stop these two tasks?

I am unable to redeploy as the redeployment is now queued behind this forever pending task…

Thanks
Gavin

P.S. Screenshot attached.

fen · 26 June 2017 15:49

This is actually a pretty severe problem and seems to have been introduced recently (we only noticed it since upgrading to 3.13.x from 3.7.x).

Tasks get stuck in a cancelled state
Cancelled tasks hold up other tasks

I’m under the impression this can be mitigated by updating the DB (for us, this ain’t happening during business hours), so we have to resort to restarting the Octopus server (which screws with running TeamCity builds etc.).

The other cause I’ve seen for this is where Tentacles upgrade Calamari during a deployment and seem to sit there for hours doing nothing. If you cancel it, the upgrade task will stay cancelled but is still classed as a running task.

Shane_Gill · 27 June 2017 01:11

Hi,

Thanks for getting in touch. We have had other reports of similar issues. I have opened an investigation to determine why this is happening: https://github.com/OctopusDeploy/Issues/issues/3592

Cheers,
Shane

Shane_Gill · 7 July 2017 00:14

Hi,

Are you both using Octopus HA or a single Octopus Server? How many Tentacles do you have involved in the health checks/tentacle upgrades that are becoming stuck?

Cheers,
Shane

gavin.burke · 7 July 2017 07:14

Hi Shane

Using a single Octopus Server and there was just the one tentacle that was stuck, out of roughly 30 we have.

I could potentially test with more than one if you need. To replicate, it should just be a case of me stopping some tentacle services (or maybe just killing the .exe so it’s more like the scenario of the machine going offline), re-running the health check, then trying to cancel.
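For a test that doesn’t need a real machine to crash, a rough Python sketch of an endpoint that goes silent might look like the following. It does not speak the Tentacle protocol and is only a local stand-in: the listener accepts a connection and never replies, and `start_silent_listener`, `probe` and the timeout value are hypothetical names and values for the example.

```python
import socket


def start_silent_listener() -> socket.socket:
    # Listen on an ephemeral local port and never answer anything:
    # a stand-in for a machine that is up at the network level but hung.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server


def probe(host: str, port: int, timeout_s: float) -> str:
    # Connect, send a few bytes and wait at most timeout_s for any reply.
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping")
            data = conn.recv(1)
            return "responded" if data else "closed"
    except socket.timeout:
        return "timed out"
    except OSError:
        return "unreachable"


if __name__ == "__main__":
    server = start_silent_listener()
    port = server.getsockname()[1]
    print(probe("127.0.0.1", port, 0.2))  # the listener never replies
    server.close()
```

Without the `timeout` argument the `recv` call would block for as long as the other side stays silent, which is the behaviour being described in this thread.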

Gavin

fen · 7 July 2017 17:43

Hey Shane,

Using a single Octopus Server, with over 300 tentacles. I’ve commented on https://github.com/OctopusDeploy/Issues/issues/3592 with more information.

There seems to be a pretty high chance of it happening when upgrading environments, which may contain up to 60 machines at a time.

Michael_Noonan · 9 August 2017 01:49

Hi Gavin,

Thanks for keeping in touch! We’ve recently shipped https://github.com/OctopusDeploy/Issues/issues/3702 and https://github.com/OctopusDeploy/Issues/issues/3701 which should help alleviate pressure on your Octopus Server, and give us a clearer understanding of what else could be going wrong.

Please upgrade to Octopus Server 3.16.1 or newer and let us know if this change helps.

At this point we aren’t able to reproduce the problem in our lab, so we are going to wait and see if this surgical change improves the situation before investigating further.

We’ll continue to track progress on this investigation here: https://github.com/OctopusDeploy/Issues/issues/3592

Hope that helps!
Mike