Check Health job stuck

binliu · 11 March 2017 00:16

Server running 3.7.2 HA with 2 server nodes and 1000+ tentacles (mostly listening, some polling). The Check Health task runs every few hours. The task often gets stuck, with the Task Summary page displaying a message like ‘1,031 of 1,032 health checks complete (99%)’ under Task Progress.

The message may be misleading - I downloaded the raw task log, extracted all the Success & Failed lines (see samples below), and found the total count matched what’s expected (in this case 1,032). So it appears that Check Health did finish checking each tentacle, but it might be doing something else after the last check and never got to update the count.

I’ve experienced this many times - the message always says ‘n-1 of n health checks complete (99% done)’. The only way to get it unstuck is to cancel the task.

Questions - Where/how does it get stuck? How can we resolve or work around it?

== Success: Check deployment target: ==
== Failed: Check deployment target: ==
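For anyone who wants to repeat the count described above, here is a minimal Python sketch. It assumes the raw task log is saved as a plain text file and that each result line contains the `== Success: Check deployment target:` or `== Failed: Check deployment target:` marker shown in the samples; the function name and the file path argument are hypothetical, and the real log layout may differ.

```python
import re
import sys
from collections import Counter

# Pattern based on the two sample lines quoted in this post.
# The exact layout of the raw task log is an assumption.
RESULT_LINE = re.compile(r"==\s*(Success|Failed):\s*Check deployment target:")


def count_results(log_text: str) -> Counter:
    """Count Success and Failed health check lines in a raw task log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESULT_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python count_checks.py tasklog.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        results = count_results(handle.read())
    total = results["Success"] + results["Failed"]
    print(f"Success: {results['Success']}, Failed: {results['Failed']}, Total: {total}")
```

If the total matches the expected number of targets while Task Progress still shows n-1 of n, that supports the idea that the checks themselves finished and the hang happens afterwards.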

Vanessa_Love · 12 March 2017 22:19

Hi,

Thanks for getting in touch! This is a known issue that we believe is fixed in 3.11.2.
Here is the corresponding GitHub issue: https://github.com/OctopusDeploy/Issues/issues/3203
If you are able to upgrade, please report back on whether it is resolved, as the customers who previously reported this have not yet confirmed that the fix resolved the problem.

Unfortunately, there is no known workaround, so upgrading will be the only way to resolve the issue.

Thanks,
Vanessa

binliu · 14 March 2017 12:21

We upgraded from 3.7.3 to 3.11.8. It seems to have resolved the issue.

binliu · 4 April 2017 13:09

The issue didn’t really get resolved. It completed a few times, but got stuck on all other runs.

I’ve just upgraded Octopus to 3.12.0 and re-ran the job; it still got stuck.

When it gets stuck, it always shows ‘n-1 of n health checks complete (99%)’ and stays in a state where the task can’t be cancelled either. The only way to cancel the task is to restart the server; the task is then automatically rerun, and I have to click ‘Cancel’ before it completes.

This is frustrating! We rely on Check Health to mark servers that temporarily went offline as healthy again once they become available during a Check Health run. It appears that if Check Health doesn’t complete, none of the machine statuses get updated.

Shane_Gill · 5 April 2017 00:49

Hi,

Thanks for getting in touch again. I’m sorry to hear that your health checks are still hanging.

To help us figure out why this is happening, could you please send a log for one of the checks that has failed?

It would also be an immense help to get a dump file of the Octopus Server process while a health check is hung. There are instructions here: https://octopus.com/docs/reference/process-dumps#creating-a-process-dumps

You can upload the dump securely to this location: https://file.ac/w02sON535cM/

In addition I will investigate saving the health check as each machine completes, rather than at the end of the health check.
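As a rough illustration of that idea only (this is not Octopus's implementation), the sketch below saves each machine's status as soon as its check finishes instead of collecting everything and saving at the end. The names `run_health_checks`, `check`, and `save` are hypothetical, and the thread pool is just a stand-in for however the checks are actually scheduled.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_health_checks(machines, check, save, max_workers=8):
    """Run check(machine) for every machine and call save(machine, healthy)
    as each individual result arrives, rather than once at the end."""
    saved = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the machine it is checking.
        futures = {pool.submit(check, machine): machine for machine in machines}
        for future in as_completed(futures):
            machine = futures[future]
            try:
                healthy = bool(future.result())
            except Exception:
                # A check that raises is recorded as unhealthy.
                healthy = False
            save(machine, healthy)  # persisted immediately
            saved += 1
    return saved
```

With this shape, a single check that never returns would still leave the task waiting, but the statuses of the machines that did complete would already have been saved.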

Cheers,
Shane