Healthchecks hang forever, and won't chancel

estubbs · 31 August 2018 21:45

We have a very dynamic infrastructure in aws (tentacles coming and going all the time). The healthcheck is intermittently stalling when checking SSH targets. It appears to hang when a machine goes offline (expectedly).

The task eventually transitions to the canceled state, but never actually cancels (we’ve waited for days). We have to reset the octopus server every time this happens. Canceling manually has no effect.

Also canceling tasks in general doesn’t seem to work more than 10% of the time on anything.

Octopus 2018.7.13

Robert_Wagner · 3 September 2018 06:41

Hi,

Thank you for getting in touch. It is quite strange that it is hanging there. Is there a proxy between the Octopus server and SSH targets? Does the Octopus server also run on AWS?

Are you able to capture a process dump and upload it to https://file.ac/CMzAF3IaJDY/ (Write only link, you won’t be able to see the file after upload)?

For a task to cancel it requires all it’s tasks to cleanup (eg terminate any processes that are running), so that we don’t leave zombie code running. It should also cancel when it gets a timeout on a network connection. What kind of tasks fail to cancel? Is it usually to do with a machine not being available?

A process dump might also help us get to the bottom of tasks that are not cancelling (if captured while in that state).

estubbs · 5 September 2018 18:38

I will take a process dump and upload it there the next time the problem occurs.

Thank you!

system · 5 October 2018 18:44

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.