Task hangs and deployment won't cancel

A deployment is hanging on a particular step, but when the CANCEL button is hit, the deployment continues in a CANCELLING… state for many hours afterwards.

As you can see from the log excerpt, the cancellation was requested 2 hours 40 minutes after the deployment task was started, but the overall duration was 17 hours.

Task ID:        ServerTasks-941045
Related IDs:    Deployments-151720, Channels-1730, Releases-21388, Tenants-2322, Projects-1703, Spaces-182, Environments-461
Task status:    Canceled
Task queued:    Wednesday, 31 March 2021 3:46:32 PM +00:00
Task started:   Wednesday, 31 March 2021 3:46:32 PM +00:00
Task duration:  17 hours
Server version: 2020.6.4722+Branch.release-2020.6.Sha.49d1a1050c9bb43cce9b9e9741314b2bb3df3691
Server node:    octopus-i009472-54c49b949b-nqjzl

                    | == Canceled: Deploy Certificate Manager release 1.2.0 to PreProd for DG Test ==
15:46:32   Verbose  |   Guided failure is not enabled for this task
18:26:39   Info     |   Requesting cancellation...

This does seem to be specific to our setup rather than an Octopus issue, as it only started happening after a recent change on our side (which will be reversed). There is nothing in the logs to indicate what the issue may be.

Currently a deployment is hanging on DynamicWorker 21-04-01-0701-242in, and work is unable to continue while it hangs. Is it possible for you to stop the deployment or even kill the DynamicWorker?

Happy to share the full raw logs privately if they would be helpful.

Thanks,
David

Hi David,

Thanks for getting in touch!

I’ve queued that worker for deletion, let me know if this frees up the task.
If the task was running on a deployment target when it hung, then restarting the Tentacle service on that machine may be necessary.
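A minimal sketch for restarting a Linux Tentacle, assuming the default instance name (on Windows the default service is "OctopusDeploy Tentacle", restartable from the Services console or PowerShell):

# Restart the Tentacle service on a Linux deployment target.
# The systemd unit is normally named after the Tentacle instance ("Tentacle" for the default instance).
sudo systemctl restart Tentacle
sudo systemctl status Tentacle --no-pager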

Regards,

Hi Paul,

There is now a message in the log saying that the Tentacle is shutting down, which wasn’t there during the 17-hour cancellation. And as I write this, that looks to have done the trick.

09:20:56   Verbose  |       The Tentacle service is shutting down and cannot process this request.

The task was running on a Kubernetes cluster (AWS EKS) deployment target, so unless I’m mistaken, there is no Tentacle to restart there?

Thanks,
David

That’s correct, it would just be the Worker in that case.


Hi Paul,

I’ve found the source of our problem, but I’m unsure as to why Octopus is having issues with it.

In a Kubernetes script step we are attempting to install some custom resource definitions (CRDs). When I run the script to install these CRDs locally everything works fine, but on Octopus it seems to hang on the third definition.

Again there is nothing in the logs indicating what the issue may be. Are you able to see if anything is amiss on the worker (21-04-01-0715-oe4q0)? I have no need for this task to be stopped, so take as much time as you need.

Script

# Build the kubectl command as an array so the same command can be logged and executed verbatim.
CRD_CMD=(kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.2.0/cert-manager.crds.yaml)
write_highlight "Installing cert-manager Custom Resource Definitions..."
# write_highlight/write_verbose are Octopus-provided logging functions.
write_verbose "${CRD_CMD[*]}"
"${CRD_CMD[@]}"

Output from local terminal

All six CRDs are installed and the apply command completes successfully.

dgard@SOME-PC:~/repos/ldx-analytics-infrastructure$ kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.2.0/cert-manager.crds.yaml
customresourcedefinition.apiextensions.k8s.io/certificaterequests.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/certificates.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/challenges.acme.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/clusterissuers.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/issuers.cert-manager.io created
customresourcedefinition.apiextensions.k8s.io/orders.acme.cert-manager.io created

From Octopus

The first two CRDs are installed and then the deployment hangs.

10:45:09   Verbose  |       kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.2.0/cert-manager.crds.yaml
10:45:13   Info     |       customresourcedefinition.apiextensions.k8s.io/certificaterequests.cert-manager.io configured
10:45:14   Info     |       customresourcedefinition.apiextensions.k8s.io/certificates.cert-manager.io configured

Thanks,
David

Hi David,

Thanks for the detailed explanation there.

I’ve been able to replicate this using a Windows 2019 dynamic worker (I was curious if the OS made a difference).

You’re free to cancel that deployment whenever you want, let me know if it locks up the worker again and I’ll cycle that out.

I’ll most likely need to get our cloud engineers to look into this, so due to timezone differences and Easter, it may be next week before I have an update for you.

Regards,
Paul

Hi Paul,

Thanks for looking further. I’ve cancelled the job, but it does seem to have hung again. If you could kill the worker, that would be good.

21-04-01-0715-oe4q0

Thanks,
David

No problem, that is being removed now.


Hi David,

We managed to get to the bottom of this a bit quicker than expected.

It turns out the issue looks to be related to the kubectl version.
Our Workers are currently running 1.18.0 or lower, and when testing with one of those versions I hit the same hang. Upgrading my local kubectl to the latest version works fine.

We’re going to raise a request to get 1.19 and 1.20 added to our worker images.
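In the meantime, a quick check like the sketch below, dropped at the top of the script step, will confirm which kubectl client the step is actually using (the --short flag exists on the versions in question but was removed in newer kubectl releases):

# Log which kubectl binary is on the PATH and its client version.
command -v kubectl
kubectl version --client --short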

Regards,
Paul

Thanks for investigating, Paul.

We are running the script in a Docker container that has kubectl 1.16.9 on it. Unfortunately that’s not something I can easily upgrade, as other tasks depend on a specific version, but I’m sure I’ll be able to work around it in the script.
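One possible workaround, sketched below on the assumption that the container has curl and outbound internet access (the kubectl version and URL are illustrative), would be to pull down a pinned newer kubectl alongside the script and use it only for the CRD install:

# Fetch a newer kubectl next to the script rather than replacing the container's copy.
curl -sSLo ./kubectl-new https://dl.k8s.io/release/v1.20.5/bin/linux/amd64/kubectl
chmod +x ./kubectl-new
# Use the newer client only for the CRD apply; everything else keeps using the container's 1.16.9.
./kubectl-new apply -f https://github.com/jetstack/cert-manager/releases/download/v1.2.0/cert-manager.crds.yaml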

Thanks,
David

