Deployment Target health check marks as healthy, then immediately reverts to unhealthy

dhockett · 27 July 2023 19:24

We have a deployment target that is displaying as ‘Unhealthy’ despite operating as normal and running through deploys without issue. When I run a health check, it succeeds and is marked as ‘Healthy’. However, when I refresh the page a few seconds after the health check finishes, it shows ‘Unhealthy’ again.

The ‘Events’ page seems to indicate that it’s having issues resolving the target’s OS/architecture information; while it successfully updates the information in the health check, it immediately reverts all of it to unknown/null a few seconds after the health check finishes.

I’ve copied the Events information below, removing our target’s actual name. I don’t see anything that stands out in the ‘Recent Communication Logs’, but I also don’t have access to the full communication logs, only the last couple hundred lines that display in the Octo UI’s Connectivity page (I may be able to request them from our ops team if required). When refreshing the page a few times I think I saw a Halibut error related to ‘tried to read past the end of the stream’, but I haven’t been able to reproduce the error in order to provide the full error message.

Tentacle version: 7.0.33
Calamari version: 25.3.3

DeploymentTarget (target name) was modified

Thursday, July 27, 2023 2:02:51 PM
system

Established with: Unknown
User agent: Server
Category: Document modified

before:

    "OperatingSystem": "Unknown",
    "ShellName": "Unknown",
    "ShellVersion": "Unknown",
    "Architecture": "Unknown",
    "IsRunningInContainer": null

after:

    "OperatingSystem": "Ubuntu 18.04.6 LTS (bionic)",
    "ShellName": "Bash",
    "ShellVersion": "4.4.20(1)-release",
    "Architecture": "x86_64",
    "IsRunningInContainer": false

DeploymentTarget (target name) became healthy

Thursday, July 27, 2023 2:02:51 PM
system

Established with: Unknown
User agent: Server
Category: Machine found healthy

DeploymentTarget (target name) was modified

Thursday, July 27, 2023 2:02:58 PM
system

Established with: Unknown
User agent: Server
Category: Document modified

before:

    "OperatingSystem": "Ubuntu 18.04.6 LTS (bionic)",
    "ShellName": "Bash",
    "ShellVersion": "4.4.20(1)-release",
    "Architecture": "x86_64",
    "IsRunningInContainer": false

after:

    "OperatingSystem": "Unknown",
    "ShellName": "Unknown",
    "ShellVersion": "Unknown",
    "Architecture": "Unknown",
    "IsRunningInContainer": null

DeploymentTarget (target name) became unhealthy

Thursday, July 27, 2023 2:02:58 PM
system

Established with: Unknown
User agent: Server
Category: Machine found to be unhealthy

The remote script failed with exit code 1

clare.martin · 31 July 2023 16:08

Good afternoon @dhockett,

Thank you for contacting Octopus Support, and sorry it has taken us so long to get back to you. Our support team was at a company meetup over the weekend and some of us have only just got back.

I am also sorry you are seeing an unhealthy tentacle, though it is strange it is showing as unhealthy and still being deployed to.

If you are seeing an error referencing ‘Attempted to read past the end of the stream’, it could be for a number of reasons. We usually see this when antivirus software running on the tentacle machine is blocking part of the Octopus health check.

If you do have an antivirus, would you be able to temporarily disable it on that target and run the health check again to see whether it stays healthy? If it does stay healthy, we have a document here on whitelisting Octopus, which you can use to whitelist the required Octopus folders in your AV application.

This could also be networking so I recommend you take a look at our documentation on Tentacle Troubleshooting if you have not seen it yet. The main one to look at would be our Tentacle ping tool which you can use on your Linux box to see if you have any connection dropouts.

If none of those tests give you an answer, could you let me know whether you see any relevant log entries or errors for this authentication under /var/log/secure on the target (on Ubuntu the equivalent file is /var/log/auth.log)?

Feel free to send those to us as well, alongside the full tentacle logs if you can get them, as they would help. You can send them to our secure site here.

Let me know once those have been sent over and I can take a look at them for you,
Kind Regards,
Clare

dhockett · 4 August 2023 14:25

I’m working with our ops team to get server-side logs. As for the Tentacle logs, they’re surprisingly sparse; after running a deployment at 8:23am, and then another health check at 8:38am, these are the full contents of today’s log:

2023-08-04 08:23:06.3850    810     87  INFO  listen://[::]:10933/             87  Accepted TCP client: [::( ... )]:52620
2023-08-04 08:23:06.6927    810     82  INFO  listen://[::]:10933/             82  Client at [::( ... )]:52620 authenticated as ( ... )

Again, the deploy had no issues, the health check completed successfully and marked the tentacle as healthy, but then the server marked the tentacle as unhealthy about 5 seconds after the health check ended.

clare.martin · 4 August 2023 14:34

Hey @dhockett,

Thanks for the tentacle log snippet. You are right that it does not show us much, but I do see it is a listening tentacle. The server-side logs may show something if you can get them. If the tentacle connection is dropping out, you won’t really see that in the tentacle logs for a listening tentacle, as it just sits and waits for Octopus to send it instructions. A polling tentacle, by contrast, reaches out to the Octopus server, so a failed connection would show up in its tentacle logs.

I would definitely run the tentacle ping tool from your Octopus server against that target and see if there are any dropouts. Your Octopus server logs would show a connection issue on the health check for that target, and if a deployment to it ever fails you may find something in the server logs for that too.

But the tentacle ping is where I would look for connection dropouts if you can get that installed on your Octopus server as that is our ‘gold standard’ for connections from Octopus to a tentacle.
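As a rough first pass before the ping tool is installed, a plain TCP check can at least show whether the listening port accepts connections. This is only a sketch and not a substitute for the tentacle ping tool: it assumes a machine with bash and the coreutils `timeout` command on the same network path as the Octopus server, it performs no TLS handshake, and the function names and the host name in the usage comment are placeholders. Port 10933 is the listening port shown in the tentacle log above.

```shell
# Sketch only: checks that a TCP connection can be opened, nothing more.
# Uses bash's /dev/tcp pseudo-device, so the check is run through `bash -c`.
tcp_reachable() {
  local host="$1" port="$2" seconds="${3:-5}"
  timeout "$seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Repeat the check to look for intermittent dropouts.
# Prints one line per attempt and returns non-zero if any attempt failed.
probe_repeatedly() {
  local host="$1" port="$2" attempts="${3:-10}" i=1 failures=0
  while [ "$i" -le "$attempts" ]; do
    if tcp_reachable "$host" "$port"; then
      echo "attempt $i: connected"
    else
      echo "attempt $i: FAILED"
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  [ "$failures" -eq 0 ]
}

# Usage (placeholder host name):
#   probe_repeatedly my-target.example.internal 10933 20
```

A clean run here only rules out basic reachability problems; protocol or certificate issues would still need the ping tool or the server logs.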

Hopefully that helps. Once you have the logs you can send them over if you want; if the secure files link has expired, let me know and I will generate a new one for you.

Kind Regards,
Clare

dhockett · 30 August 2023 16:17

OK, I was able to coordinate with our ops team to run the ping/pong tests and collect logs. Could you provide a new link for the secure files?

finnian.dempsey · 30 August 2023 23:40

Hi @dhockett,

Just stepping in for Clare with a fresh link to our Secure Upload Portal.

Feel free to let us know if there are any issues with it!

Best Regards,

dhockett · 31 August 2023 13:06

Thanks, I’ve uploaded the logs from the ping/pong tests.

clare.martin · 31 August 2023 14:37

Hey @dhockett,

Thank you for sending those logs over. I took a look, and unfortunately it does look like you have networking issues with the machine ending in 01:

(Some details have been redacted and replaced with X’s)

2023-08-29T21:53:27 Connect: Failed! 44ms; connected: True; SSL: True
System.ComponentModel.Win32Exception (0x80004005): The client and server cannot communicate, because they do not possess a common algorithm

2023-08-29T21:56:33 Accepted TCP client XX01:5XXX9
Unhandled error when processing request from client: System.IO.IOException: Authentication failed because the remote party has closed the transport stream.

XX02 connects fine with no connection dropouts:

Using SSL Protocol: Tls12
Pinging XX02 on port 10933
2023-08-29T20:54:58 Connect: Success! 90ms, 1,216 bytes read

Usually when we see the message:

The client and server cannot communicate, because they do not possess a common algorithm

It means the tentacle and Octopus server do not have the same TLS protocols and so struggle to communicate with each other. We have an article on this here if you wanted to take a look.

I can see your Octopus server is using recent tentacle and calamari versions, so it must be on TLS 1.2.
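One way to see which TLS protocol versions an endpoint will negotiate is to attempt a handshake with each version in turn. The sketch below uses `openssl s_client` for that; it is an illustration rather than an Octopus-provided tool. It assumes `openssl` and `timeout` are installed, that the local openssl build accepts the protocol flags used, and that the exit status of `s_client` reflects whether the handshake completed. The function names and the host in the usage comment are placeholders.

```shell
# Sketch only: try a TLS handshake restricted to one protocol version.
# $3 is an s_client protocol flag such as -tls1, -tls1_1, -tls1_2 or -tls1_3;
# which of these are accepted depends on how the local openssl was built.
tls_handshake_ok() {
  local host="$1" port="$2" flag="$3"
  timeout 10 openssl s_client -connect "$host:$port" "$flag" </dev/null >/dev/null 2>&1
}

# Report, per protocol version, whether a handshake appeared to complete.
report_tls_versions() {
  local host="$1" port="$2" flag
  for flag in -tls1 -tls1_1 -tls1_2 -tls1_3; do
    if tls_handshake_ok "$host" "$port" "$flag"; then
      echo "$flag: handshake succeeded"
    else
      echo "$flag: handshake failed or flag unsupported"
    fi
  done
}

# Usage (placeholder host name):
#   report_tls_versions my-target.example.internal 10933
```

A failure for one version does not by itself prove a protocol mismatch, since an unreachable host or an unsupported flag produces the same result, so the output is best read alongside the ping logs.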

I did just bring this up in our daily support meeting and one of the other support members has seen something similar to this where a tentacle will fail an upgrade in Octopus which results in an unhealthy communications check, however, it will deploy to the tentacle as that does not involve an upgrade.

To rule this out, are you able to filter your Octopus Server task log for tentacle upgrades and see if there are any failed ones to that box?

For our information, are you also able to confirm which box is your Octopus server? Is it the one ending in 01 or 02?

I look forward to seeing if you do have failed tentacle upgrades to that box. If you do not, we can take a look at the TLS and SSL settings.

Kind Regards,
Clare

dhockett · 31 August 2023 20:37

I just uploaded the server task log at the same secure file link. It looks like we actually do have failing tentacle upgrades on that node.

The main things that stood out to me in those logs:

It seems to be failing on permissions issues, saying ‘are you root?’. It should be running in the context of our automated deploys user, which is not root, but does have sudo privileges.
It seems to be using tentacle version 7.0.33, but is trying to ‘upgrade to’ 6.3.305, which is a lower version, right? Not sure what that means.
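
On the second point, the ordering of the two version strings can be checked mechanically. A minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils does); the function name `version_lt` is made up for illustration:

```shell
# Succeeds when version $1 is strictly lower than version $2.
# Relies on `sort -V` (version sort), which compares dotted numbers field by field.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

if version_lt 6.3.305 7.0.33; then
  echo "6.3.305 is lower than 7.0.33"
fi
```

So the task log's "upgrade" from 7.0.33 to 6.3.305 would in fact be a downgrade.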

finnian.dempsey · 1 September 2023 03:58

Hi @dhockett,

Just stepping in for Clare while she’s offline to confirm we have received the logs, cheers for sending them through!

Using a non-root user should be fine, as long as the following requirements are met: Linux targets | Documentation and Support

Linux Tentacle Requirements:

The $HOME environment variable must be available.
bash 3+ is available at /bin/bash.
tar is available. This is used to unpack Calamari.
base64 is available. This is used for encoding and decoding variables.
grep is available.
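
The requirements above can be checked in one pass from a shell running as the account the tentacle service uses. This is a sketch under that assumption; the function names are illustrative and not part of any Octopus tooling.

```shell
# Print each command name from the arguments that is not found on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Walk the listed requirements and print one line per problem found.
# Prints nothing when every requirement is met.
check_tentacle_requirements() {
  local major
  [ -n "${HOME:-}" ] || echo "HOME is not set"
  if [ -x /bin/bash ]; then
    major="$(/bin/bash -c 'echo "${BASH_VERSINFO[0]}"')"
    [ "$major" -ge 3 ] || echo "/bin/bash is version $major, need 3+"
  else
    echo "/bin/bash not found"
  fi
  missing_commands tar base64 grep | sed 's/$/ is not available/'
}

check_tentacle_requirements
```

The result depends on whose environment it runs in, so running it as root says little about a non-root service account.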

Was this tentacle recently migrated to the non-root account?

I have a feeling the contents of the /opt/octopus/tentacle folder could be owned by another user, possibly the root user.

Could you please run the following ls command to confirm ownership of the tentacle DLLs?

ls -lah /opt/octopus/tentacle
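
If the listing is long, `find` can narrow it down to just the entries that are not owned by the expected account. A minimal sketch; the function name is illustrative and the account name `deploy` in the usage comment is a placeholder for whichever user the tentacle service is meant to run as.

```shell
# List everything under a directory that is NOT owned by the given user.
# $1: directory to inspect, $2: user name or numeric uid expected to own it.
files_not_owned_by() {
  local dir="$1" user="$2"
  find "$dir" ! -user "$user" -print
}

# Usage (placeholder account name):
#   files_not_owned_by /opt/octopus/tentacle deploy
```

Empty output means every file and directory under the path belongs to that user.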

I’ll have to dig into why it’s trying to install a lower version than is running but I have a feeling it might have something to do with the ownership of the Tentacle files.

Looking forward to hearing how you get on, feel free to reach out with any questions at all!

Best Regards,

clare.martin · 1 September 2023 08:42

Hey @dhockett,

Following on from what Finnian suggested, I took a look at the logs you uploaded myself. I can see there is a failed tentacle upgrade, which may be what is causing this tentacle to show as unhealthy even though you can deploy to it.

I can see a fair few lines like this in the logs:

Error | tar: System.Xml.Linq.dll: Cannot open: File exists

I have never seen this before on an upgrade task. It may be an idea to fully uninstall the tentacle manually using our documentation here.

Your target is currently using Tentacle 7.0.33, but I would try installing Tentacle 6.3.305, as this is what your logs show Octopus is trying to upgrade it to, so it must be the version that 2023.1.9781 shipped with.

Once you have done that I am hoping this issue is resolved, but let me know either way. If it is not resolved we can dig deeper, but it would be useful to rule out the failed upgrade as the root cause of a machine that fails health checks yet remains deployable.

I look forward to hearing from you,
Kind Regards,
Clare

dhockett · 1 September 2023 13:38

All the contents of /opt/octopus/tentacle are owned by root.

We had duplicate tentacle service files: /etc/systemd/system/Tentacle.service (uppercase) with User=root, and /etc/systemd/system/tentacle.service (lowercase) with User=(deploy user).

The lowercase tentacle service was the one currently started/enabled, but apparently the uppercase Tentacle service had been run instead at some point, so the active service was running as our deploy user and trying to access a bunch of stuff owned by root. I checked one of our other (properly operational) targets and confirmed it was running tentacle as root, not as our deploy user like I thought. For now, I disabled tentacle and started Tentacle, then ran another health check, and the target now shows as healthy. I’ll most likely try a reinstall as well at some point, since we generally try to avoid using root for these things.

I thought these .service files were auto-generated, and I don’t remember configuring them manually. Is there any chance that the Tentacle install command could have generated both of these duplicate .service files?
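
For anyone checking other targets for the same problem, unit files whose names differ only by case can be spotted with a short script. This is a sketch, assuming a case-sensitive filesystem and a `find` that supports `-maxdepth` and `-iname` (GNU and BSD find both do); the function names are made up for illustration.

```shell
# Print (lowercased) the names of .service files in a directory that
# collide once case is ignored, e.g. Tentacle.service vs tentacle.service.
case_duplicate_units() {
  local dir="${1:-/etc/systemd/system}" f
  for f in "$dir"/*.service; do
    [ -e "$f" ] || continue
    basename "$f"
  done | tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Show the User= line of every unit file matching a name, ignoring case,
# so the account each variant runs as can be compared side by side.
unit_users() {
  local dir="$1" name="$2"
  find "$dir" -maxdepth 1 -iname "$name" -exec grep -H '^User=' {} +
}

# Usage:
#   case_duplicate_units /etc/systemd/system
#   unit_users /etc/systemd/system tentacle.service
```

`systemctl is-active` and `systemctl is-enabled` on each of the two unit names then show which variant is actually in use.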

finnian.dempsey · 4 September 2023 04:56

Hi @dhockett,

Cheers for confirming that, great to hear you were able to find the cause!

I’ve been caught by this exact issue in the past and recall that lowercase tentacle.service is the default value we used for Manually configuring Tentacle to run as a service. The uppercase Tentacle.service is autogenerated when first configuring Tentacle.

Our docs about uninstalling Tentacle have more info about the folders and files being used: Linux Tentacle | Documentation and Support

Hope that helps but feel free to reach out with any questions at all!

Best Regards,

system · 5 October 2023 04:57

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.