We have a deploy task that has stopped - not failed, just not progressing - on the Acquire Packages step. It has subsequently been manually cancelled, but the cancellation itself has been in progress for several days (over the weekend). Furthermore, the in-progress task seems to block other tasks trying to access the same resources (for example, if we try to deploy a later release of the same project), although those blocked tasks can be cancelled.
We are using Octopus Cloud. The general advice for dealing with tasks that lock up and can neither complete nor be cancelled is to restart the Octopus service, but that does not seem to be an option for Octopus Cloud.
How can we cancel this failed task and get back to normal operations?
I have discovered that restarting the server that the failed task was attempting to deploy to caused the task to finally cancel.
However, not long after that (though not immediately) our entire instance stopped responding! I can’t think of what to do about that, since I can’t interact with it in any way.
(I do not know whether there is some connection between the problems.)
Thanks for getting in touch, and sorry that you are not having a great experience at the moment.
Our alerting notified us that your instance ran out of memory after your restart, and our engineers are already looking into the cause of the issue. Just a heads up that it’s more than likely another restart will be involved; we will keep you as informed as we can.
Thank you for that information, Alex.
I’m a bit confused, though. I didn’t restart our Octopus instance, only the target server (running on AWS). What caused the Octopus instance to run out of memory?
My name is Pawel and I’m the on-call engineer taking care of your instance. I had a look at our logs and it looks like your VM used all available CPU and memory. I had to restart your instance because it became unresponsive. It should be back to normal now.
That being said, I would like to get to the bottom of this problem. Are you running any deployments/tasks that would consume a lot of CPU/memory?
I would not have thought so. We are deploying a system with a number of parts, but I think each part isn’t particularly large, and our custom scripts (when they are used) are not very complex.
However, we did have a task that failed and was cancelled on Friday, but did not stop running until today. It had been failing on the Acquire Packages step due to running out of disk space. I can’t see why that would cause it to consume extra CPU/memory.
Also, I’ve since increased the disk space of the target machine for that task, and it still fails. This time the error message reads:
mkdir: cannot create directory ‘/home/ubuntu/.octopus/HostedOctopus/Work/20190114041123-6688-24’: No space left on device
But I’ve just increased the disk size significantly, so I don’t know why it is running out of disk space for the creation of a directory. It’s not even getting as far as copying the file across.
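For anyone else hitting this: “No space left on device” (ENOSPC) can mean the filesystem is out of data blocks or out of inodes, so it’s worth checking both before resizing anything. This is a generic check, nothing Octopus-specific; the path is an assumption based on the directory in the error message:

```shell
# ENOSPC can be caused by exhausted blocks OR exhausted inodes.
# TARGET is whichever mount holds the work directory -- in our case the
# error pointed at a path under /home/ubuntu/.octopus (assumed).
TARGET="${TARGET:-/}"
df -h "$TARGET"   # block (byte) usage -- check the Use% column
df -i "$TARGET"   # inode usage -- check the IUse% column
```

If `Use%` is fine but `IUse%` is at 100%, growing the disk won’t help; something is creating huge numbers of small files.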
Looking closer… Our deployment does not get as far as running the deployment steps we have specified. It fails while acquiring packages. Four of those packages are to be deployed from an Octopus server, and the Octopus server acquires those packages with no problems.
The other two packages are to be deployed from a Linux server, so Octopus attempts to acquire them and store them on the Linux server. This fails, with the previously-given error message telling us there is no disk space available on the server.
Those two packages are 45MB and 75MB as .tar files. I don’t know if you would consider this to be unusually large, but since the acquisition fails without even creating a folder, I’m not sure it really matters how big these artifacts are.
This failure is not taking place in a custom script, as far as I can tell (noting that I’m not the person who set this up). This part of the process seems to be vanilla Octopus functionality. We do have some custom scripts, but they should be running later in the process.
Okay, I’ve now figured out what caused that failure. The disk on that server actually was full; I had increased the wrong resource when I tried to fix it earlier. But we don’t actually want that big a disk - we just need to stop storing months’ worth of test builds at 118 MB per set. I’ve deleted a huge number of old artifacts, and that has got our deployments working again for now.
(Obviously, we need to set up some sort of clean-up process so that we aren’t relying on having somebody in the office who knows what needs to be cleaned up when the disk fills. I’ll speak to the person who set this system up about it when he’s back from leave. This is a process failure on our part.)
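As a stopgap until then, the clean-up could be as simple as a cron-able script along these lines - the artifact directory and the 30-day retention window here are assumptions for illustration, not our actual values:

```shell
#!/bin/sh
# Sketch of a retention clean-up to run from cron, so old test builds
# don't fill the disk again. ARTIFACT_DIR and RETENTION_DAYS are
# assumptions -- point them at wherever old package sets accumulate.
set -eu
ARTIFACT_DIR="${ARTIFACT_DIR:-/home/ubuntu/artifacts}"
RETENTION_DAYS="${RETENTION_DAYS:-30}"
if [ -d "$ARTIFACT_DIR" ]; then
  # Delete anything not modified within the last RETENTION_DAYS days.
  find "$ARTIFACT_DIR" -mindepth 1 -mtime +"$RETENTION_DAYS" -print -delete
fi
```

Octopus also has built-in retention policies that can clean old packages off deployment targets automatically, which is probably the better long-term fix than a hand-rolled script.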
None of that explains the deployment tasks that could not be cancelled, or running out of RAM on an Octopus instance.
I had a deeper look at your instance and it looks like an OS update was using all the resources. Last night we provisioned new resources for your instance, and it should now be running on an up-to-date OS.
Please let me know if you see any other problems.
Right. That also explains why our instance’s IP address changed overnight. Cool, thank you, now I know everything I need to know at the moment.