Linux Tentacle losing connection

I’ve updated my older post, but I’m not sure whether that’s been seen or triggered a notification.

We have 2 new Linux tentacles that are consistently losing connection to the server. This happens multiple times per day, and we currently can’t make reliable deployments without a number of failures followed by babysitting the targets. Not what we need from our “automated” deployments.

Logs (as per the last few posts on the linked topic) show the target closing its connection and then never waking back up. The server tries to ping it and we see connection timeouts in the server logs.

Connectivity checks from Infrastructure tab timeout and deployments fail.
Only thing that seems to fix it is a manual service restart.
We can’t use the built-in watchdog because it doesn’t run on Linux.

Both our Ubuntu targets are 20.04 and set to Listening Tentacles.
It doesn’t seem to be a network issue at first glance, as the targets work fine once restarted.

Any ideas?

Hi,

Dane is based in Australia, so he isn’t responding at the moment due to the timezone difference.

Is there anything in the systemd/kernel logs that would indicate an issue? Anything notable in the output of ulimit -a or netstat -an?
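For example, something along these lines (the tentacle.service unit name here is a guess; use whatever your install registered):

```shell
# Kernel messages (OOM kills, TCP/driver errors) around a disconnect
journalctl -k --since "1 hour ago"

# Messages from the tentacle service itself
# (unit name is a guess; check with: systemctl list-units | grep -i tentacle)
journalctl -u tentacle.service --since "1 hour ago"

# Per-process resource limits; "open files" is the usual suspect
ulimit -a

# Socket states on the tentacle port
netstat -an | grep 10933 || echo "no sockets on 10933"
```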

Thanks,
Jeremy

Right, I think I’ve got a pretty reasonable reproduction of the issue. Not that I understand why it is happening…

  1. 9am: our service was automatically restarted by a cron job (our attempt at a watchdog)
  2. netstat -an showed:
tcp6       0      0 :::10933                :::*                    LISTEN
  3. 9:12am: the Octopus Server ran its hourly automated connectivity check and it succeeded. The tentacle log showed:
2020-09-18 08:12:24.1781  441388      5 TRACE  listen://[::]:10933/              5  Performing TLS server handshake
2020-09-18 08:12:24.6122  441388     15 TRACE  listen://[::]:10933/             15  Secure connection established, client is not yet authenticated, client connected with Tls12
2020-09-18 08:12:24.7442  441388     15 TRACE  listen://[::]:10933/             15  Begin authorization
2020-09-18 08:12:24.7442  441388     15  INFO  listen://[::]:10933/             15  Client at [::ffff:13.66.133.169]:7488 authenticated as 4C808322E57AAD30A9B009817540F5F118A54FC0
2020-09-18 08:12:24.7926  441388     15 TRACE  listen://[::]:10933/             15  Begin message exchange
2020-09-18 08:12:25.2386  441388     15 TRACE  listen://[::]:10933/             15  Received: IScriptService::StartScript[1] / bdb841e6-8783-4a5c-a72e-1fa0678d00ec
2020-09-18 08:12:25.3811  441388     18 TRACE  [ServerTasks-33621] [RunningScript] [Read Lock] [no locks] Trying to acquire lock.
2020-09-18 08:12:25.3935  441388     18 TRACE  [ServerTasks-33621] [RunningScript] [Read Lock] ["ServerTasks-33621" (has a read lock)] Lock taken.
2020-09-18 08:12:25.4722  441388     15 TRACE  listen://[::]:10933/             15  Sent: Halibut.Transport.Protocol.ResponseMessage
2020-09-18 08:12:25.9131  441388     18 TRACE  [ServerTasks-33621] [RunningScript] [Read Lock] ["ServerTasks-33621" (has a read lock)] Releasing lock.
2020-09-18 08:12:25.9653  441388     15 TRACE  listen://[::]:10933/             15  Received: IScriptService::GetStatus[2] / 4d147fdf-ca8b-43ca-97e7-08c05e5266f0
2020-09-18 08:12:26.0110  441388     15 TRACE  listen://[::]:10933/             15  Sent: Halibut.Transport.Protocol.ResponseMessage
2020-09-18 08:12:26.3434  441388     15 TRACE  listen://[::]:10933/             15  Received: IScriptService::CompleteScript[3] / 8ec172c6-f123-4364-a55a-25462b9f2343
2020-09-18 08:12:26.3592  441388     15 TRACE  listen://[::]:10933/             15  Sent: Halibut.Transport.Protocol.ResponseMessage
  4. 9:22am (10 minutes later): the tentacle “disconnects”:
2020-09-18 08:22:26.4756  441388     15 TRACE  listen://[::]:10933/             15  No messages received from client for timeout period. Connection closed and will be re-opened when required

Port was still listening

tcp6       0      0 :::10933                :::*                    LISTEN
  5. 9:31am: I ran a manual Health Check, which failed. The server log showed:
September 18th 2020 09:31:34
Info
Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ 
September 18th 2020 09:32:34
Error
The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds. 
September 18th 2020 09:32:35
Info
Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #1. 
September 18th 2020 09:32:35
Info
Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ 
September 18th 2020 09:33:35
Error
The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds. 
September 18th 2020 09:33:36
Info
Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #2. 
September 18th 2020 09:33:36
Info
Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ 
September 18th 2020 09:34:36
Error
The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds. 
September 18th 2020 09:34:37
Info
Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #3. 
September 18th 2020 09:34:37
Info
Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ 
September 18th 2020 09:35:37
Error
The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds. 
September 18th 2020 09:35:38
Info
Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #4. 
September 18th 2020 09:35:38
Info
Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ 
September 18th 2020 09:36:38
Error
The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds. 

Tentacle log showed no connection attempt whatsoever.
Port was still listening:

tcp6       0      0 :::10933                :::*                    LISTEN

So…

Whatever the tentacle is doing 10 minutes after a connection, when it closes its connection, is cutting it off completely.
This is a pretty serious issue for us. We would need to restart our tentacle every 10 minutes at this rate.
It doesn’t feel like a network infrastructure issue at the moment, given that the client still appears to be listening on the correct port and that the connection works as long as the tentacle has recently been restarted. After a restart it seems to wait longer than 10 minutes, but once a check has been performed it closes its connection after a 10-minute timeout, and it’s this disconnect that kills the connection completely.
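For reference, the only stopgap we have right now is the scheduled restart mentioned in step 1, which is just a cron entry along these lines (the tentacle.service unit name is illustrative; substitute the real one):

```shell
# /etc/cron.d/tentacle-watchdog: a blunt workaround, not a fix.
# Restarts the tentacle service every hour (unit name may differ per install).
0 * * * * root systemctl restart tentacle.service
```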

Help?!

After much wailing and gnashing of teeth… we have performed some troubleshooting steps and found an “interesting” solution.

Disable IPv6

Somewhere in the network stack the traffic gets lost; we have no idea how or why the initial connection after a tentacle restart is absolutely fine.
The Ubuntu network stack uses IPv4-to-IPv6 mapping, so this shouldn’t get in the way, and we could see in netstat that when the Octopus Server connected it was connecting fine on an IPv4 address.
Whatever happened after that 10-minute timeout was killing off the IPv4 part of the traffic: the port was still open and listening, but the tentacle process was just not responding.
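For anyone hitting the same thing, this is roughly what we did. It’s a sketch from memory using the standard Ubuntu 20.04 sysctl toggles, so check it against your own setup:

```shell
# Disable IPv6 on all interfaces at runtime (reverts on reboot;
# add the same keys to /etc/sysctl.conf to persist across reboots)
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

# Restart the tentacle so it re-binds on IPv4 only
# (service name may differ on your install)
sudo systemctl restart tentacle.service
```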

I have no idea where the issue lies now.
We have a workaround at least.

Is it at all possible to confirm that this is an issue in your environment?
If it isn’t there then it’s more likely to be in our network infrastructure somewhere (although we haven’t set up anything bespoke or specific around IPv6). All our servers are pretty vanilla as are our routers etc.

If it is an issue either in the tentacle itself or lower down in the .NET networking code, then it might be worth adding a troubleshooting hint somewhere?

Appreciate you reading this and feel free to close

Hi,

Thanks for the detailed information.

Let me talk to some colleagues about this and I will get back to you.

Please feel free to reach out in the meantime.

Thanks,
Jeremy

Thanks Jeremy.
I will just add that while this seemed to alleviate the issues with the health checks failing, we still received a number of errors when actually deploying. It was always at the Acquire Package stage, and always timeouts.
We’ll continue to experiment here but it seems as if this issue isn’t quite solved.

Hi,
I did do some testing on my own by enabling IPv6 on both my Linux tentacle and the server, and I had zero issues with disconnects, so this does seem likely to be environmental. If you do a tracert over IPv6 and IPv4, do the routes differ at all? Can you check all the configs on the appliances to make sure there isn’t anything timing out or blocking/closing ports?
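On the Linux side the equivalent would be something like this (traceroute may need installing; the hostname is the tentacle address from the health-check logs above):

```shell
# Compare the IPv4 and IPv6 routes to the tentacle; a hop where the
# IPv6 trace dies but the IPv4 one does not is worth inspecting
traceroute -4 5-102-175-149.deploy.schemeswriter.co.uk
traceroute -6 5-102-175-149.deploy.schemeswriter.co.uk
```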

Which version of tentacle and server are you using?

Thanks,
Jeremy

Thanks for checking @jeremy.miller

We’re using Octopus Cloud so the latest version AFAIK
2020.4.0-rc0006

Tentacles are also latest version I believe
6.0.0+Branch.master.Sha.603170d4c5edeac957d0d42fa77a6c302fe416b4

We are getting constant disconnects at the moment between the server and both tentacles during deployments. Some projects work, and then they don’t. Some deploys haven’t deployed successfully yet.

It’s normally at the Acquire Package stage

tracert -4 gives us a route (from my local machine); tracert -6 doesn’t resolve the target. Not sure where that’s being stopped.

We’ve had some other scripting issues that have muddied the waters a little so I’m clearing those and then redeploying.
I will update here if we are still seeing connectivity problems.

Mike

Hey Mike,

My apologies, I wasn’t aware you were on cloud. There is actually an Azure limitation with containers that doesn’t allow for IPv6. Would you be able to look at the appliances in between the cloud server and your tentacles to see if there are any policies that would be blocking/closing connections? Is there anything in the logs of your appliances in the timeframe where the tentacle connections terminate? Are you still doing IPv4->IPv6 translation? If so please check if anything is logged there.

Please let me know if you find anything.

Thanks,
Jeremy

Ha at least that rules out IPv6 on the server side!

We are not explicitly blocking anything on the network appliances we control. Everything is vanilla i.e. not set up for IPv6.
We are no longer performing the IPv4->IPv6 translation on the Ubuntu tentacles as we’ve disabled IPv6 completely.

I appreciate you looking into this with us.

We are deploying this afternoon into our development and UAT environments, which is a pretty busy operation and puts a lot of deploys onto the same server(s). If we notice anything disconnecting then we’ll look into it and try to get you more information.
If nothing appears we may re-enable IPv6 on the tentacles to see if we can replicate it fully and get logs from across the stack.

Thanks,

Mike

Hey Mike,

You’re very welcome.

That sounds like a plan. Please update me if it works as well if you have time.

Thanks,
Jeremy

Hi @jeremy.miller
We’ve been running with this for a few days now.
It still doesn’t seem to be fixed.
Unfortunately our logs are not showing anything further.
The listening tentacle closes its connection; a deploy starts and fails with a timeout; the subsequent deploy then works. But this keeps failing our CI/CD pipeline, which is not really sustainable at the moment.
Subsequent connectivity checks are now often succeeding, and subsequent deploys then work for a while, but once the tentacles go idle they refuse to wake up within the timeout.

2020-10-01 13:29:37.1071    668     18 TRACE  listen://0.0.0.0:10933/          18  No messages received from client for timeout period. Connection closed and will be re-opened when required
2020-10-01 13:46:19.9467    668      8  INFO  listen://0.0.0.0:10933/           8  Accepted TCP client: 13.66.133.169:7872
2020-10-01 13:46:19.9467    668      8 TRACE  listen://0.0.0.0:10933/           8  Performing TLS server handshake

There was an attempted and failed deploy at 13:43; absolutely nothing registered on the tentacle. The next project, deploying at 13:46, then worked.

Any suggestions?

Hi @ops,

I’m sorry to hear you’re still having issues. Have you by chance tried rolling back the tentacle version to the latest 5.0 version? It looks to be 5.0.15.

Please let me know.

Thanks,
Jeremy