We had a machine crash over the weekend, and that machine was running a tentacle. The tentacle health check task has been running since Saturday, waiting for this tentacle, and then a deployment started this morning too.
I had to hard reset the VM and now the tentacle is online, and a new health check confirms that, but the old tentacle health check task from Saturday and the deployment from a couple of hours ago are both stuck in a state of “Cancellation requested, trying to cancel the task”.
It seems to be the same problem as the tentacle health check having no timeout at all: trying to stop a task on a tentacle that previously died also appears to have no timeout. I can see no way of clearing this other than restarting the Octopus server, which should not be necessary just because one of our dozens of tentacles went offline!
Am I wrong in this assumption - is there anything other than restarting the server to stop these two tasks?
I am unable to redeploy, as the redeployment is now queued behind this forever-pending task…
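For anyone following along, here is a minimal sketch of the kind of bounded cancellation wait being asked for above. It is purely illustrative Python and says nothing about how Octopus Server is actually implemented; the function name `cancel_with_timeout`, the state strings, and the timeout values are all made up for the example.

```python
import threading


def cancel_with_timeout(task_done, request_cancel, timeout_s):
    """Ask a task to cancel, then wait at most timeout_s seconds for it to stop.

    task_done      -- threading.Event set by the worker once it has stopped
    request_cancel -- callable that delivers the cancel request (hypothetical)
    timeout_s      -- upper bound on how long the caller is willing to wait
    """
    request_cancel()
    if task_done.wait(timeout_s):
        return "Canceled"
    # The agent never acknowledged, so stop waiting instead of blocking
    # every task queued behind this one.
    return "TimedOut"


if __name__ == "__main__":
    # A dead agent: the cancel request goes nowhere and the event is never set.
    dead = threading.Event()
    print(cancel_with_timeout(dead, lambda: None, 0.1))

    # A healthy agent: the cancel request is acknowledged straight away.
    alive = threading.Event()
    print(cancel_with_timeout(alive, alive.set, 0.1))
```

The point of the sketch is only that the caller gives up after a fixed time, so one unreachable machine cannot hold the queue indefinitely.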
This is actually a pretty severe problem and seems to have been introduced recently (I only noticed it since upgrading to 3.13.x from 3.7.x).
Tasks get stuck in a cancelled state
Cancelled tasks hold up other tasks
I’m under the impression this can be mitigated by updating the DB (for us, this isn’t happening during business hours), so we have to resort to restarting the Octopus server (which screws with running TeamCity builds etc.).
The other cause I’ve seen is where Tentacles upgrade Calamari during a deployment and seem to sit there for hours doing nothing. If you cancel it, the upgrade task will stay in the cancelled state but is still classed as a running task.
Are you both using Octopus HA or a single Octopus Server? How many Tentacles do you have involved in the health checks/tentacle upgrades that are becoming stuck?
Using a single Octopus Server and there was just the one tentacle that was stuck, out of roughly 30 we have.
I could potentially test with more than one if you need. Replicating it should just be a case of me stopping some tentacle services (or maybe just killing the .exe, so it’s more like the scenario of the machine going offline), re-running the health check, then trying to cancel.
Please upgrade to Octopus Server 3.16.1 or newer and let us know if this change helps.
At this point we aren’t able to reproduce the problem in our lab, so we are going to wait and see if this surgical change improves the situation before investigating further.