I’ve updated my older post, but I’m not sure whether that has been seen or triggered a notification.
We have two new Linux Tentacles that are consistently losing their connection to the server. It happens multiple times per day, and we currently cannot make reliable deployments without a number of failures and then babysitting the targets. Not what we need from our “automated” deployments.
Logs (as per the last few posts on the linked topic) show the target closing its connection and then never waking back up. The server tries to ping it and we see connection timeouts in the server logs.
Connectivity checks from the Infrastructure tab time out and deployments fail.
The only thing that seems to fix it is a manual service restart.
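For reference, the restart is just a systemd service bounce, roughly as follows (this assumes the default instance’s unit is named Tentacle; check systemctl list-units --type=service | grep -i tentacle on your target if it differs):

sudo systemctl restart Tentacle.service
sudo systemctl status Tentacle.service --no-pager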
We can’t run the watchdog because that doesn’t run on Linux.
Both our Ubuntu targets are 20.04 and set to Listening Tentacles.
It doesn’t seem to be a network issue in the first instance, as the targets work fine once restarted.
2020-09-18 08:22:26.4756 441388 15 TRACE listen://[::]:10933/ 15 No messages received from client for timeout period. Connection closed and will be re-opened when required
The port was still listening:
tcp6 0 0 :::10933 :::* LISTEN
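(That output is just the standard listener check on the Tentacle port, run on the target with something like the following; ss is the newer equivalent of netstat:)

netstat -tln | grep 10933
ss -ltn 'sport = :10933'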
At 9:31am I ran a manual health check. This failed. The server log showed:
September 18th 2020 09:31:34  Info   Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/
September 18th 2020 09:32:34  Error  The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds.
September 18th 2020 09:32:35  Info   Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #1.
September 18th 2020 09:32:35  Info   Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/
September 18th 2020 09:33:35  Error  The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds.
September 18th 2020 09:33:36  Info   Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #2.
September 18th 2020 09:33:36  Info   Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/
September 18th 2020 09:34:36  Error  The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds.
September 18th 2020 09:34:37  Info   Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #3.
September 18th 2020 09:34:37  Info   Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/
September 18th 2020 09:35:37  Error  The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds.
September 18th 2020 09:35:38  Info   Retrying connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/ - attempt #4.
September 18th 2020 09:35:38  Info   Opening a new connection to https://5-102-175-149.deploy.schemeswriter.co.uk:10933/
September 18th 2020 09:36:38  Error  The client was unable to establish the initial connection within the timeout 00:01:00. Retrying in 1.0 seconds.
Tentacle log showed no connection attempt whatsoever.
Port was still listening:
tcp6 0 0 :::10933 :::* LISTEN
So…
Whatever the Tentacle does when it closes its connection, 10 minutes after the last activity, is cutting it off completely.
This is a pretty serious issue for us. We would need to restart our tentacle every 10 minutes at this rate.
It doesn’t feel like a network infrastructure issue at the moment, given that the client is still supposedly listening on the correct port (according to the server) and that the connection works as long as the Tentacle has recently been restarted. It seems that after a restart it will wait longer than 10 minutes, but once a check has been performed it closes its connection after a 10-minute timeout, and it’s this disconnect that kills the connection completely.
After much wailing and gnashing of teeth… we have performed some troubleshooting steps and found an “interesting” solution.
Disable IPv6
It seems that somewhere in the network stack the traffic gets lost. We have no idea how or why the initial connection after a Tentacle restart is absolutely fine.
The Ubuntu network stack uses IPv4-to-IPv6 mapping, so this shouldn’t get in the way. We could see in netstat that when the Octopus server connected it was connecting fine on an IPv4 address.
Just… whatever happened after that 10-minute timeout was killing off the IPv4 part of the traffic. The port was still open and listening, but the Tentacle process was simply not responding.
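For completeness, disabling IPv6 on Ubuntu 20.04 can be done via sysctl along these lines (a sketch only; persist the keys in /etc/sysctl.conf or a file under /etc/sysctl.d/, or use the ipv6.disable=1 kernel parameter via GRUB for a permanent change, and restart the Tentacle service afterwards):

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1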
I have no idea where the issue lies now.
We have a workaround at least.
Is it at all possible to confirm that this is an issue in your environment?
If it isn’t, then it’s more likely to be somewhere in our network infrastructure (although we haven’t set up anything bespoke or IPv6-specific). All our servers are pretty vanilla, as are our routers etc.
If it is an issue either in the Tentacle itself or lower down in the .NET code, then it might be worth adding a troubleshooting hint somewhere?
Appreciate you reading this and feel free to close
Thanks Jeremy.
I will just add that while this seemed to alleviate the issues with the health checks failing, we still received a number of errors when actually deploying. It was always at the Acquire Package stage and it was always timeouts.
We’ll continue to experiment here but it seems as if this issue isn’t quite solved.
Hi,
I did some testing on my own by enabling IPv6 on both my Linux Tentacle and the server and had zero issues with disconnects, so this does seem likely to be environmental. If you do a tracert over IPv6 and IPv4, do the routes differ at all? Can you also check all the configs on the appliances to make sure there isn’t anything timing out or blocking/closing ports?
Which version of tentacle and server are you using?
We’re using Octopus Cloud, so the latest version AFAIK:
2020.4.0-rc0006
The Tentacles are also on the latest version, I believe:
6.0.0+Branch.master.Sha.603170d4c5edeac957d0d42fa77a6c302fe416b4
We are getting constant disconnects at the moment between the server and both tentacles during deployments. Some projects work, and then they don’t. Some deploys haven’t deployed successfully yet.
It’s normally at the Acquire Package stage
tracert -4 gives us a route (from my local machine), while tracert -6 doesn’t resolve the target. Not sure where that’s being stopped.
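(For reference, those were Windows tracert commands run from my local machine; the rough equivalent from one of the Ubuntu targets would be traceroute -4 / traceroute -6, assuming traceroute is installed:)

tracert -4 5-102-175-149.deploy.schemeswriter.co.uk
tracert -6 5-102-175-149.deploy.schemeswriter.co.uk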
We’ve had some other scripting issues that have muddied the waters a little so I’m clearing those and then redeploying.
I will update here if we are still seeing connectivity problems.
My apologies, I wasn’t aware you were on Cloud. There is actually an Azure limitation with containers that doesn’t allow for IPv6. Would you be able to look at the appliances between the cloud server and your Tentacles to see if there are any policies that would be blocking/closing connections? Is there anything in your appliance logs in the timeframe where the Tentacle connections terminate? Are you still doing the IPv4->IPv6 translation? If so, please check whether anything is logged there.
Ha at least that rules out IPv6 on the server side!
We are not explicitly blocking anything on the network appliances we control. Everything is vanilla i.e. not set up for IPv6.
We are no longer performing the IPv4->IPv6 translation on the Ubuntu tentacles as we’ve disabled IPv6 completely.
I appreciate you looking into this with us.
We are deploying this afternoon into our development and UAT environments, which is a pretty busy operation and puts a lot of deploys onto the same server(s). If we notice anything disconnecting we’ll look into it and try to get you more information.
If nothing appears we may re-enable IPv6 on the tentacles to see if we can replicate it fully and get logs from across the stack.
Hi @jeremy.miller
We’ve been running with this for a few days now.
It seems to still not be fixed.
Unfortunately our logs are not showing anything further.
The listening Tentacle closes its connection, a deploy starts and fails with a timeout, and the subsequent deploy then works. But this keeps failing our CI/CD pipeline, which is not really workable at the moment.
Subsequent connectivity checks are often now succeeding. Subsequent deploys then work for a while. But once the tentacles go idle they refuse to wake up within the timeout.
2020-10-01 13:29:37.1071 668 18 TRACE listen://0.0.0.0:10933/ 18 No messages received from client for timeout period. Connection closed and will be re-opened when required
2020-10-01 13:46:19.9467 668 8 INFO listen://0.0.0.0:10933/ 8 Accepted TCP client: 13.66.133.169:7872
2020-10-01 13:46:19.9467 668 8 TRACE listen://0.0.0.0:10933/ 8 Performing TLS server handshake
There was an attempted and failed deploy at 13:43; absolutely nothing registered on the Tentacle. The next project, deploying at 13:46, then worked.
I’m sorry to hear you’re still having issues. Have you by chance tried rolling back the tentacle version to the latest 5.0 version? It looks to be 5.0.15.
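If the Tentacle was installed from the Octopus apt repository, a rollback would look roughly like the following (a sketch only; the package name and exact version string are assumptions, so please confirm them against what apt actually exposes first):

apt-cache madison tentacle
sudo apt-get install tentacle=5.0.15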