Tasks stuck in cancelling state

Octo version 2023.1.9882
Running on Linux Docker
I’ve killed the Docker container and taken the instance out of the ASG as well

I am unable to delete the node because of the following message:

Tasks are currently executing on the node. Drain the node to prevent further tasks from running, wait until the current tasks are complete and then delete the node.

NEED HELP ASAP! Some project deployments are blocked because of this.
I have added a replacement node to the HA pool for now, but since the previous tasks are stuck, it won’t let new ones start

Need a way to forcefully cancel these stuck deployments

Hi @Naman.Kumar,

Thank you for contacting Octopus Support. I’m sorry to hear you have stuck tasks.

Rather than deleting the node from the Octopus UI, you will need to shut down the Octopus Server service on each Octopus Server node, then restart the service on each node once they have all shut down.

This should trigger each task stuck as “Cancelling” to become “Cancelled”.

Also, if there is a particular Target that is responsible for the tasks stuck as “Cancelling”, it is a good idea to restart the Tentacle service (or reboot the Target itself) on that machine to ensure there are no hung instances of Calamari.
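Before and after the rolling restart, the set of stuck tasks can be confirmed via the Octopus REST API. A minimal sketch, assuming a placeholder server URL and API key (the `states` query filter on `/api/tasks` and the `X-Octopus-ApiKey` header are standard parts of the API):

```python
import urllib.request

def list_cancelling_tasks(server_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for all tasks currently in the Cancelling state."""
    return urllib.request.Request(
        url=f"{server_url.rstrip('/')}/api/tasks?states=Cancelling",
        headers={"X-Octopus-ApiKey": api_key},  # standard Octopus auth header
    )

# Placeholder values for illustration only.
req = list_cancelling_tasks("https://octopus.example.com", "API-XXXXXXXX")
print(req.full_url)
# To actually query a live server:
# with urllib.request.urlopen(req) as resp:
#     tasks = resp.read()  # JSON; stuck tasks are under "Items"
```

Running the same query after the restart should show an empty list if the stuck tasks were successfully moved to “Cancelled”.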

Let me know if that unblocks you at your earliest convenience.

Best Regards,
Donny

This is preposterous behaviour from the application.
The container wasn’t getting killed because of the following error:

Error response from daemon: Cannot kill container: octo: container a43bcb0767e2 PID 7034 is zombie and can not be killed. Use the --init option when creating containers to run an init inside the container that forwards signals and reaps processes

Process:

root      7034 31.5  0.0      0     0 ?        Zsl  Apr16 40527:11 [Octopus.Server] <defunct>
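The `<defunct>` state in that ps output is a zombie: a child process that has exited but whose exit status nobody has reaped with wait(). When a container’s PID 1 doesn’t reap children, zombies accumulate, which is exactly what Docker’s `--init` option addresses by running a tiny init as PID 1. A minimal, Linux-only Python demonstration of the mechanism (unrelated to Octopus itself):

```python
import os
import time

# Fork a child that exits immediately; the parent deliberately does not
# call wait() yet, so the child lingers as a zombie (<defunct>).
pid = os.fork()
if pid == 0:
    os._exit(0)  # child exits right away

time.sleep(0.2)  # give the child time to exit

# On Linux, the field after the command name in /proc/<pid>/stat is the
# process state; 'Z' means zombie.
with open(f"/proc/{pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)  # 'Z' until the parent reaps it

os.waitpid(pid, 0)  # reaping removes the zombie, as an init process would
```

An init process inside the container does that final `waitpid` on behalf of orphaned children, so a hung child can never leave an unkillable `<defunct>` entry behind.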

I also saw a process block on the DB during the time all the deployments running on that server got stuck.

Ideally, there should be a better way to handle this entire situation.

  1. Better handling of process blocks when communicating with the DB
  2. Even if it does land in this state, some API call or DB call that can release the deployments from the Cancelling state, rather than “REBOOTING EVERYTHING”
  3. A quicker way to get assistance from Octopus Deploy when a production deployment is impacted
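On point 2, Octopus does expose a task-cancel endpoint (`POST /api/tasks/{id}/cancel`) that can be scripted; the gap described above is that it does not free a task that is already stuck in “Cancelling”. A hedged sketch of building that call with placeholder server URL, API key, and task ID:

```python
import urllib.request

def build_cancel_request(server_url: str, api_key: str, task_id: str) -> urllib.request.Request:
    """Build the POST that asks Octopus to cancel a server task.

    This is the same request the UI issues, so it may not release a task
    already stuck in "Cancelling" -- the limitation described above.
    """
    return urllib.request.Request(
        url=f"{server_url.rstrip('/')}/api/tasks/{task_id}/cancel",
        method="POST",
        headers={"X-Octopus-ApiKey": api_key},
    )

# Placeholder values for illustration only.
req = build_cancel_request("https://octopus.example.com", "API-XXXXXXXX", "ServerTasks-1234")
print(req.full_url)
# Send with: urllib.request.urlopen(req)
```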

Hi @Naman.Kumar,

Thank you for getting back to me. I completely understand your frustration, especially with the importance of being able to rely on your Production Octopus Server.

Regarding your concerns:

  1. Both the Support Team and our Development Team are aware of this limitation. We have features in the works on our Octopus Roadmap that should help prevent stuck tasks (such as Timeout for Steps and Improve Tentacle Resiliency for Unreliable Networks).
  2. I will re-raise this specific concern with our Development Team.
  3. While we strive to provide prompt responses on our forums, for urgent Octopus issues, we recommend contacting Octopus Support directly via email (Support@Octopus.com). This is our official method for seeking support, ensuring the quickest assistance possible.

If you have any additional questions or if we can assist with anything else, please don’t hesitate to reach out!

Best Regards,
Donny

Hi @Naman.Kumar,

Just jumping in for Donny from the Aus-based team with an update from the devs.

They’d like to double-check the logs for any potential issues that could cause these tasks to get stuck, as it looks like they might have been stuck before the request to cancel them was made. Would you be able to please send through the deployment process JSON and the task logs for some of the stuck tasks, along with the Octopus Server logs?

You should be able to upload files to our secure upload portal but please let us know if there are any issues with it and feel free to reach out with any questions at all!

Best Regards,

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.