Unhealthy Linux target blocking global health check

Octopus Server version: 2019.9.10

An unhealthy target blocked a global health check for 2 days, causing other operations to queue up behind it and preventing them from running.

This is the excerpt from the log:

18:45:41 Verbose | Performing health check on machine
18:45:41 Verbose | Establishing SSH connection…
18:46:05 Verbose | SSH connection established
18:48:17 Verbose | SSH connection disposed.
18:48:17 Verbose | Exit code: 0
18:48:17 Verbose | Requesting upload…
18:48:17 Verbose | Establishing SSH connection…
18:48:34 Verbose | SSH connection established
18:49:04 Verbose | Beginning streaming transfer of command.sh to $HOME.octopus\OctopusServer\Work\20200616234817-15703-1921
18:49:04 Verbose | Establishing SFTP connection…

The health check hangs on that deployment target at the "Establishing SFTP connection" step.

Has this been addressed in more recent versions of Octopus? We’re in the process of upgrading to 2020.2.13, but just not there yet.

Hi @christopher.newton!

This is an interesting one - I’ll look into why there wasn’t a timeout here. For the root cause, are you able to check the system logs on that target to see why the SFTP connection didn’t go through? Depending on your distro, you can usually find this info in /var/log/secure or /var/log/messages. I wonder if you have fail2ban or some other rate-limiting software that saw the multiple connection attempts in quick succession and enacted a block.
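If it helps with the log check, here is a minimal Python sketch that pulls SSH-related lines out of those two files. The keyword list and the function names `filter_ssh_lines` and `scan_default_logs` are illustrative assumptions, not anything Octopus ships, and reading these files usually needs root on the target.

```python
from pathlib import Path

# Assumption: a starting set of keywords, not an exhaustive list.
KEYWORDS = ("sshd", "sftp", "fail2ban")

# Log locations mentioned above; which one exists depends on the distro.
LOG_PATHS = ("/var/log/secure", "/var/log/messages")


def filter_ssh_lines(lines, keywords=KEYWORDS):
    """Return the lines that mention any keyword (case-insensitive)."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def scan_default_logs(paths=LOG_PATHS):
    """Print matching lines from whichever log files exist and are readable."""
    for candidate in paths:
        path = Path(candidate)
        if not path.is_file():
            continue
        try:
            with path.open(errors="replace") as handle:
                for line in filter_ssh_lines(handle):
                    print(f"{candidate}: {line}")
        except OSError as exc:
            # Typically a permissions problem when not running as root.
            print(f"Could not read {candidate}: {exc}")
```

Calling `scan_default_logs()` on the target and comparing the timestamps with the health check log should show whether the SFTP subsystem request ever reached sshd.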

Look forward to hearing from you soon.

I’ve passed your response on to our operations team and will let you know what they find out.


We did not see any error messages, just successful connections until they stop for 2 days, and then successful connections again after we restarted everything. No errors or failed attempts.

Thanks @christopher.newton!

This is very strange. I’ve created an issue in our tracker to investigate why the SFTP step did not time out, and to ensure that it has a reasonable timeout to prevent this from happening again. Additionally, I added a link to https://github.com/OctopusDeploy/OctopusDeploy-Api/blob/master/REST/PowerShell/Deployments/CancelLongRunningTasks.ps1 in the issue. You might want to run that script on a periodic cycle to prevent any backlog from building up if you’re not closely monitoring the task queue.
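For anyone who would rather not run PowerShell, the same idea can be sketched in Python. This is not a port of the linked script, only a rough illustration under stated assumptions: that `GET /api/tasks?states=Executing` returns a page with an `Items` list, that each task carries `Id`, `State`, and an ISO 8601 `StartTime` with a UTC offset, and that `POST /api/tasks/{id}/cancel` cancels a task. Please verify these against the API of your Octopus version before relying on it. The function names and the two-hour threshold are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def find_stuck_tasks(tasks, now, max_age=timedelta(hours=2)):
    """Return tasks still executing that started more than max_age before now.

    Assumes StartTime is an ISO 8601 string with an offset, and that
    `now` is a timezone-aware datetime.
    """
    stuck = []
    for task in tasks:
        started = task.get("StartTime")
        if task.get("State") != "Executing" or not started:
            continue
        if now - datetime.fromisoformat(started) > max_age:
            stuck.append(task)
    return stuck


def octopus_request(base_url, api_key, path, method="GET"):
    """Send one request with the API key header; return parsed JSON or None."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        headers={"X-Octopus-ApiKey": api_key},
        data=b"" if method == "POST" else None,
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else None


def cancel_stuck_tasks(base_url, api_key, max_age=timedelta(hours=2)):
    """Cancel every executing task older than max_age; return their IDs."""
    # Assumed endpoint and response shape; only the first page is inspected.
    page = octopus_request(base_url, api_key, "/api/tasks?states=Executing")
    now = datetime.now(timezone.utc)
    cancelled = []
    for task in find_stuck_tasks(page["Items"], now, max_age):
        octopus_request(
            base_url, api_key, f"/api/tasks/{task['Id']}/cancel", method="POST"
        )
        cancelled.append(task["Id"])
    return cancelled
```

Scheduling something like `cancel_stuck_tasks("https://your-octopus-server", "API-KEY")` from cron would approximate the periodic run; both arguments are placeholders. The selection logic is kept separate from the HTTP calls so the threshold can be checked against sample data before anything is actually cancelled.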

I hope we can have this resolved for you in the near future. You can make use of the “Subscribe” button on the issue itself to get updates when changes happen.
