Health check failing for polling tentacles

Hi,

We have an Octopus server (3.4.14) and a number (~25) of polling tentacles on remote machines (largely 3.14.159) which started to have issues with their health checks as of yesterday. We are now unable to trigger a health check on any machine. Some deployment targets show as healthy while others show as unavailable - the health check fails on both.

For those which are still showing as healthy, when we try to do a deployment, the tentacle still picks up the deployment instruction and carries out the deployment. I have also put TentaclePing onto a polling tentacle and run it against the Octopus server. It shows variable latency but consistently successful connections:

Connect: Success! 22ms, 969 bytes read
Connect: Success! 15ms, 969 bytes read
Connect: Success! 12ms, 969 bytes read
4,430 successful connections, 0 failed connections. Hit Ctrl+C to quit any time.
Connect: Success! 17ms, 969 bytes read
Connect: Success! 32ms, 969 bytes read
Connect: Success! 17ms, 969 bytes read
Connect: Success! 15ms, 969 bytes read
Connect: Success! 23ms, 969 bytes read
Connect: Success! 12ms, 969 bytes read
Connect: Success! 20ms, 969 bytes read
Connect: Success! 14ms, 969 bytes read
Connect: Success! 22ms, 969 bytes read
Connect: Success! 943ms, 969 bytes read
4,440 successful connections, 0 failed connections. Hit Ctrl+C to quit any time.
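
For a longer TentaclePing capture, the latencies can be summarised rather than read by eye. This is a minimal sketch, assuming the output lines have exactly the format shown above; `summarize_ping` and the 500 ms "slow" threshold are hypothetical choices for illustration, not part of TentaclePing.

```python
import re
import statistics

# Matches lines such as "Connect: Success! 22ms, 969 bytes read".
LINE_RE = re.compile(r"Connect: Success! (\d+)ms, (\d+) bytes read")

def summarize_ping(text, slow_ms=500):
    """Summarise latencies from pasted TentaclePing output (format assumed)."""
    latencies = [int(m.group(1)) for m in LINE_RE.finditer(text)]
    if not latencies:
        return None
    return {
        "count": len(latencies),
        "median_ms": statistics.median(latencies),
        "max_ms": max(latencies),
        # Connections at or above the (arbitrary) threshold.
        "slow": [ms for ms in latencies if ms >= slow_ms],
    }

# The last ten connection lines from the output above; the summary line
# at the end does not match the pattern and is ignored.
sample = """\
Connect: Success! 17ms, 969 bytes read
Connect: Success! 32ms, 969 bytes read
Connect: Success! 17ms, 969 bytes read
Connect: Success! 15ms, 969 bytes read
Connect: Success! 23ms, 969 bytes read
Connect: Success! 12ms, 969 bytes read
Connect: Success! 20ms, 969 bytes read
Connect: Success! 14ms, 969 bytes read
Connect: Success! 22ms, 969 bytes read
Connect: Success! 943ms, 969 bytes read
4,440 successful connections, 0 failed connections. Hit Ctrl+C to quit any time.
"""

print(summarize_ping(sample))
```

On this sample the only connection over the threshold is the 943 ms one, which fits the "variable latency, no failures" picture.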

When I try to browse to the tentacle port on the Octopus server I get an SSL error, but I’m not sure if this was always there.
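
One way to look at the certificate independently of the browser is to fetch it and compute its SHA-1 thumbprint, then compare that with the thumbprint the tentacle is configured to trust. This is only a sketch: `thumbprint` and `fetch_thumbprint` are hypothetical helper names, the host name in the comment is a placeholder, and it assumes the polling port (10943 in our setup) completes a TLS handshake with a client that presents no certificate.

```python
import hashlib
import ssl

def thumbprint(der_bytes):
    """SHA-1 of the DER-encoded certificate, as upper-case hex."""
    return hashlib.sha1(der_bytes).hexdigest().upper()

def fetch_thumbprint(host, port=10943):
    """Fetch the server certificate without validating it and hash it.

    No CA validation is performed, so this only reports which
    certificate is presented; it does not say whether to trust it.
    """
    pem = ssl.get_server_certificate((host, port))
    return thumbprint(ssl.PEM_cert_to_DER_cert(pem))

# Example call (placeholder host name, not run here):
#   print(fetch_thumbprint("octopus.example.com"))
```

If the value printed matches the thumbprint the tentacle expects, the certificate itself is probably not the problem.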

We have tried rebooting tentacles and the octopus server itself.

What are the next steps for troubleshooting this scenario?

Thanks,

This is the tentacle log with nlog config changed to debug level after a restart of the tentacle, then an attempted health check via the octopus server UI. I have redacted some potentially sensitive variables.

2018-07-26 13:19:23.0551 7 DEBUG Loading certificate with thumbprint: {ClientThumbprintRedacted}
2018-07-26 13:19:23.0863 7 DEBUG Certificate was found in store
2018-07-26 13:19:23.1020 7 INFO Octopus Deploy: Tentacle version 3.14.159 (3.14.159+Branch.master.Sha.f1bb46d87e8c002cb3b8e2881b4b4ffda4c87f8c) instance Tentacle
2018-07-26 13:19:23.1332 7 INFO Environment Information:
OperatingSystem: Microsoft Windows NT 6.2.9200.0
OsBitVersion: x64
Is64BitProcess: True
CurrentUser: NT AUTHORITY\SYSTEM
MachineName: {ServerMachineNameRedacted}
ProcessorCount: 2
CurrentDirectory: C:\Windows\system32
TempDirectory: C:\Windows\TEMP
HostProcessName: Tentacle
2018-07-26 13:19:23.1332 7 INFO ==== RunAgentCommand ====
2018-07-26 13:19:23.1332 7 DEBUG Loading certificate with thumbprint: {ClientThumbprintRedacted}
2018-07-26 13:19:23.1332 7 DEBUG Certificate was found in store
2018-07-26 13:19:23.1332 7 DEBUG Loading certificate with thumbprint: {ClientThumbprintRedacted}
2018-07-26 13:19:23.1332 7 DEBUG Certificate was found in store
2018-07-26 13:19:23.4301 7 INFO Agent will trust Octopus servers with the thumbprint: {ServerThumbprintRedacted}
2018-07-26 13:19:23.4301 7 INFO Agent will poll Octopus server at https://{ServerUrlRedacted}:10943/ for subscription poll://nccydkjeepeqchdpze3q/ expecting thumbprint {ServerThumbprintRedacted}
2018-07-26 13:19:23.4301 7 INFO Agent will not use a proxy server
2018-07-26 13:19:23.4457 7 INFO Agent will not listen on any TCP ports
2018-07-26 13:19:23.4457 7 INFO The Windows Service has started
2018-07-26 13:19:23.5863 8 INFO https://{ServerUrlRedacted}:10943/ 8 Opening a new connection
2018-07-26 13:19:23.8917 8 INFO https://{ServerUrlRedacted}:10943/ 8 Performing TLS handshake
2018-07-26 13:19:24.0323 8 INFO https://{ServerUrlRedacted}:10943/ 8 Secure connection established. Server at [{ServerIpv6Redacted}]:10943 identified by thumbprint: {ServerThumbprintRedacted}, using protocol Tls12

On the server all I can see is this (server timezone is UTC, tentacle timezone is GMT+8):
05:24:07 Info | Starting health check for a limited set of deployment targets
05:25:07 Fatal | Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
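
Because the two logs are in different timezones, it can help to convert the tentacle timestamps to UTC before lining them up with the server log. A minimal sketch, assuming the tentacle timestamp format shown above and a fixed UTC+8 offset; `tentacle_to_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Tentacle machine is GMT+8 (fixed offset assumed, no daylight saving).
TENTACLE_TZ = timezone(timedelta(hours=8))

def tentacle_to_utc(stamp):
    """Convert a tentacle log timestamp such as '2018-07-26 13:19:24.0323' to UTC."""
    # %f accepts one to six fractional digits, so the 4-digit fraction parses.
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return local.replace(tzinfo=TENTACLE_TZ).astimezone(timezone.utc)

utc = tentacle_to_utc("2018-07-26 13:19:24.0323")
print(utc.strftime("%H:%M:%S"))  # 05:19:24
```

So the "Secure connection established" entry at 13:19:24 tentacle time is 05:19:24 UTC, a little under five minutes before the health check started at 05:24:07 on the server.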

Hi Sam,

Thanks for getting in touch and I’m sorry to hear you’re having these issues.

Are these logs from the health check task or from the Octopus Server log? They look to be from the health check.

Could you send through the Octopus Server logs as well? They might have some more information in them that could help us figure out what is going on here.

Also, we have this docs page for troubleshooting Tentacle communication issues, but it looks like you’ve already gone through all of the steps for polling Tentacles.

Thank you and kind regards,
Henrik

Hi Henrik,

Please find the Octopus server logs attached. As a reminder, the server is UTC.

Some additional information: I tried connecting an additional tentacle to the same server from a public network with no firewalls in place. I was able to register the tentacle but was not able to run a health check successfully. Additionally, despite several tentacles not having passed a health check for multiple days, they appear as Healthy for some reason and can still take part in a deployment. If I queue a health check manually for these targets, it does not work. But when I go to deploy a release, any targets showing as “Unavailable” do not collect the request quickly enough, as shown in the deployment log:

08:43:28 Error | A request was sent to a polling endpoint, but the polling endpoint did not collect the request within the allowed time (00:02:00), so the request timed out.
| Server exception:
| System.TimeoutException: A request was sent to a polling endpoint, but the polling endpoint did not collect the request within the allowed time (00:02:00), so the request timed out.
08:45:03 Info | Guidance received: Ignore

In the deployment above, which deploys this step to two targets, the other target was successful once I ignored this failure.

I have gone through the troubleshooting steps for tentacle communication with no success. Is there any more logging I can enable which can help me diagnose the problem?

Thanks,
Sam
OctopusServer.txt (117.9 KB)

Hi Sam,

Looking through the Octopus log I can’t find any calls to the subscription ID of the polling Tentacle. Could you compare what Octopus shows to what is stored on the Tentacle?

Octopus:
image

Tentacle.config Tentacle.Communication.TrustedOctopusServers key:
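
If it is easier than reading the file by eye, the subscription ID can be pulled out with a short script. This is a rough sketch under assumptions: it assumes Tentacle.config is an XML file of `<set key="...">` elements and that the value for this key is a JSON array whose entries carry a `SubscriptionId` field. Please check that layout against your own file. `subscription_ids` is a hypothetical helper, and the sample thumbprint and address are placeholders.

```python
import json
import xml.etree.ElementTree as ET

KEY = "Tentacle.Communication.TrustedOctopusServers"

def subscription_ids(config_xml):
    """Return the SubscriptionId of each trusted server entry (layout assumed)."""
    root = ET.fromstring(config_xml)
    ids = []
    for node in root.iter("set"):
        if node.get("key") == KEY and node.text:
            # The element text is assumed to be a JSON array of server entries.
            for server in json.loads(node.text):
                ids.append(server.get("SubscriptionId"))
    return ids

# Hand-written sample in the assumed layout; only the subscription ID is
# taken from the tentacle log earlier in this thread.
sample = """<octopus-settings>
  <set key="Tentacle.Communication.TrustedOctopusServers">[{"Thumbprint": "PLACEHOLDER", "Address": "https://octopus.example.com:10943", "SubscriptionId": "poll://nccydkjeepeqchdpze3q/"}]</set>
</octopus-settings>"""

print(subscription_ids(sample))
```

To use it on a real machine, read the contents of Tentacle.config into a string and pass that in place of `sample`.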

Thanks,
Henrik

Hi Henrik,

In our version of Octopus (3.4.14), I can’t see any subscriptionId associated with the tentacle in the Octopus UI. I’m looking on the Environments page and then through all the settings of the deployment target itself.

There have been some changes over the weekend - all of the targets have had successful health checks up to and including 37 minutes ago, and all but one of them are showing as healthy. Despite this, when I queue a health check manually on any of the targets, it still fails.

If I look at the Connectivity tab of one of the targets, it shows the attached pattern repeated successfully every hour over the last day or so.

The one that is marked as unhealthy still will not take part in a deployment.

Regarding the extra tentacle that is showing as unhealthy, I just logged into that machine to find the tentacle service was not running. I’ve started it again and I’ll monitor what happens.

After restarting the service I started a health check, which was not successful. It isn’t clear to me whether this is now just an issue with manual health checks. We’re considering upgrading to the latest version to see if that resolves any of our problems.

All deployment targets are now showing as healthy, however I still can’t run a manual health check.

Hi Sam,

Thanks for the update. I’ve looked through our release notes but have been unable to find any fixes related to manual health checks not working as expected. Upgrading to the latest version could fix the issue, and it’s never a bad thing to be on the latest, given all the other bug fixes and enhancements that have been made since the version that you’re on.

I’ve asked the rest of the team if the issue rings any bells for them, and I’ll let you know if we can figure it out.

Thanks.
Henrik
