Transient communication issues with tentacles

Hi,

Since the end of November we have, on rare occasions, encountered communication issues with a small subset of our tentacles. The error looks like this:

The step failed: Activity failed with error 'An error occurred when sending a request to 'https://IPADRESS/', after the request began: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.. Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.. A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'.

Shortly after this error is logged in Octopus Deploy, the tentacle writes the following to its OctopusTentacle.txt log file:

Socket IO exception: [::ffff:IPADRESS]:52833
System.Net.Sockets.SocketException (0x80004005): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)

The issue is transient, and after one to three retries the deploy is successful. The transient nature of this issue makes it very difficult to troubleshoot what might be causing it.

Things we have already tried to solve this problem:

  • Scaled up the server that Octopus Deploy is hosted on
  • Scaled up the database that Octopus Deploy uses
  • Made sure the target servers running the tentacles were not out of hardware resources during the time of deploy (CPU/MEM/Network)

The tentacles encountering this issue are on the same network as several other tentacles that are not encountering it, so it’s likely not network exhaustion but rather something specific to these servers.

Any tips for further troubleshooting are very welcome!

Good morning @Wildpipe,

Thank you for getting in touch and sorry to hear you are having issues with some of your tentacles.

Having looked into this, it seems to point to a networking issue (firewall/proxy) affecting those specific tentacles, where something may be killing the connection after a set amount of time.

I found two pages on our forums, here and here, relating to this. Both are from 2015, though, so some of the advice will be outdated (mainly the bits about ‘improving the connection timeouts’, as we have since done that by default, and the settings getting wiped after an upgrade, which no longer happens). It may still be worth trying some of the suggested settings, such as:

<add key="Halibut.ConnectionErrorRetryTimeout" value="00:15:00"/>
<add key="Halibut.TcpClientConnectTimeout" value="00:02:00"/>
<add key="Halibut.TcpClientHeartbeatSendTimeout" value="00:02:00"/>
<add key="Halibut.TcpClientHeartbeatReceiveTimeout" value="00:02:00"/>

You can always delete those if that does not work.
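In case it saves you a search: based on those older posts, the keys above go in the `<appSettings>` section of the server’s `Octopus.Server.exe.config` file. Treat this as a sketch, as the exact file location and whether each key still applies may differ on current versions:

```xml
<!-- Sketch: fragment of Octopus.Server.exe.config (exact path/keys may vary by version) -->
<configuration>
  <appSettings>
    <add key="Halibut.ConnectionErrorRetryTimeout" value="00:15:00"/>
    <add key="Halibut.TcpClientConnectTimeout" value="00:02:00"/>
    <add key="Halibut.TcpClientHeartbeatSendTimeout" value="00:02:00"/>
    <add key="Halibut.TcpClientHeartbeatReceiveTimeout" value="00:02:00"/>
  </appSettings>
</configuration>
```

You will likely need to restart the Octopus service for any change to take effect.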

Can I ask if there were any changes to your network configuration around November? Also, were the tentacles working fine before November and have they only started hitting this issue since then?

Is there a pattern to it? Are the same Tentacles having the issue, or is it random (happening on different machines)? If they are the same machines, are they new machines (installed after November), or are there any other differences (OS version, different subnet, etc.)?

Are your tentacles using a proxy server?

What version of Octopus are you running?

Finally, are you able to upload a full raw deployment log of a failed deployment, and the tentacle log of the machine you are trying to deploy to? I have created you a secure link here which you can use to upload the logs. Could you let me know when those have been uploaded, as we don’t get notified when customers upload to our secure site.

I can then take a look at the logs and see if there is anything else I can spot.

I look forward to hearing from you,

Kind Regards,

Clare Martin

Thank you for taking the time to answer our questions.

We will try the timeout configuration right away. Thanks for the tip!

Can I ask if there were any changes to your network configuration around November? Also, were the tentacles working fine before November and have they only started hitting this issue since then?

No, the networking has not changed for the servers encountering these issues. Yes, the tentacles were working perfectly fine before November.

Is there a pattern to it? Are the same Tentacles having the issue, or is it random (happening on different machines)? If they are the same machines, are they new machines (installed after November), or are there any other differences (OS version, different subnet, etc.)?

Yes, the problem is limited to three out of nine servers, but among those three it’s somewhat random which one encounters the issue. The servers work in clusters of three, and it’s one of these clusters that has the problem. All nine servers were re-installed in late November using an Amazon Machine Image, which means they are near-identical copies of one another. They all run the same type of workload and are spread over three different subnets (with one server from each cluster in each subnet).

Are your tentacles using a proxy server?

No

What version of Octopus are you running?

2021.2.7713

Let me get back to you once I’ve grabbed a raw deployment log for you to analyze.

In the meantime thank you for all the help!

Afternoon @Wildpipe,

Thank you for those answers, they help build a bigger picture of how your network is set up. I will wait for the logs, but am I able to ask one more question if you don’t mind: do you have a load balancer anywhere in your network? We see a lot of issues that come down to load balancers, and the fact you mentioned your servers are in clusters reminded me to ask.

It seems very interesting that only one server cluster is having this issue. I will talk to the rest of the team to see if they have any ideas and will get back to you if they do.

Please reach out in the meantime if you have any other queries and I will await the logs.

Kind Regards,

Clare Martin

Hey @Wildpipe,

Sorry to spam the post here, but I spoke to the rest of the team to see if there was anything else you could try. They suggested running TentaclePing from the tentacles that are having the issue to the Octopus Server. This would show you when the connection drops out, helping to rule a networking issue in or out rather than an issue with Octopus itself. You could then take the results to your network engineers and troubleshoot with them to see if there is anything obvious going on with the network.
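If it is easier to script than to install the tool on every machine first, the core of what TentaclePing checks (can I open a TCP connection to the server right now?) can be approximated with a few lines of Python. This is just an illustrative stand-in, not the actual tool, and the host and port below are placeholders:

```python
import socket
import time

def probe(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Placeholders: point these at your Octopus Server (or tentacle) endpoint.
    host, port = "127.0.0.1", 10943
    for _ in range(3):  # run indefinitely in practice; bounded here
        if not probe(host, port):
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "connection failed")
        time.sleep(1)
```

Logging the timestamp of every failure this way gives your network engineers something concrete to correlate against.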

Does that help at all? I will take a look at the logs, but if we can get a better idea of the connectivity between the three servers having the issue and the Octopus Server, that would give us a good idea of what is going on.

Kind Regards,

Clare Martin

Thank you for those answers, they help build a bigger picture of how your network is set up. I will wait for the logs, but am I able to ask one more question if you don’t mind: do you have a load balancer anywhere in your network? We see a lot of issues that come down to load balancers, and the fact you mentioned your servers are in clusters reminded me to ask.

Yes, the connection between the tentacles and the Octopus Deploy server goes through an AWS Application Load Balancer, and all three clusters are behind the same ALB.
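One theory we are keeping in mind on our side, flagged as an assumption rather than a confirmed cause: the ALB silently drops connections that stay idle longer than its idle timeout (60 seconds by default), so a long-lived but quiet connection only survives if the client keeps it non-idle. As a rough illustration of the mechanism in Python (not anything Octopus actually does), a client would enable TCP keepalives like this:

```python
import socket

def keepalive_socket(idle=30, interval=10, count=3):
    """Create a TCP socket with keepalives enabled so an intermediate
    load balancer never sees the connection as idle. The interval values
    are illustrative, not tuned recommendations."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs are platform-specific, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):   # Linux: seconds before first probe
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):  # seconds between probes
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):    # failed probes before giving up
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return s
```

The key point is that the keepalive interval must be shorter than the load balancer’s idle timeout for the probes to do any good.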

Sorry to spam the post here, but I spoke to the rest of the team to see if there was anything else you could try. They suggested running TentaclePing from the tentacles that are having the issue to the Octopus Server. This would show you when the connection drops out, helping to rule a networking issue in or out rather than an issue with Octopus itself. You could then take the results to your network engineers and troubleshoot with them to see if there is anything obvious going on with the network.

Thanks for this. I’ll set up TentaclePing on all nine servers.

I have uploaded a couple of logs where we encounter the issue.

Thanks!

Hey @Wildpipe,

Thank you for the logs. I have reviewed them and can’t see anything jumping out at me. It seems in both logs the Octopus Server is having issues connecting to one of your tentacles, the one ending in 1-2 (shortened in case that might be a sensitive machine name).

In task 224130 it does connect to the machine to upload a package at around 07:32, it looks like the connection to that machine bins out at around 07:35.

In task 224054 the connection to that same tentacle bins out at around 07:03 and it can’t deploy a package to it as none have been staged (which makes sense if the Octopus Server cannot connect to it to tell it to upload a package).

Are you able to check Event Viewer on that machine for around those times on the date of those logs to see if there is anything you can see in there?

I look forward to hearing the results of the TentaclePing once you have had a few days to test it. It would be interesting to see if you can catch a failed deployment and whether it coincides with a dropout in the ping results.

Reach out in the meantime if there is anything else we can help you with,

Kind Regards,

Clare Martin

Hi,

We have now been able to catch a failed deploy while running the tentacle ping tool. The result is that the tentacle ping tool shows no interruption in communication during the failed deploy.

If you send me a new link where I can upload files I can provide you with the logs in case you want to more closely inspect them.

Since this issue is limited to just one out of three clusters, we decided to re-launch this cluster on new EC2 instances. The instances were re-launched on Tuesday, so I will get back to you once they’ve had time to run some deploys.

Hi @Wildpipe,

Sorry to hear the TentaclePing didn’t produce anything solid. I have sent you a private message with your support link so you can upload those files, and we can take a look at them for you.

Thank you for letting us know about the re-creation of the clusters, it will be interesting to see if those new ones allow you to deploy properly.

Let us know once the files have been uploaded and we will take a look. If we don’t find anything and the re-created cluster has the same issues, I will have a good chat with the rest of our support team to see if we can collectively brainstorm what else might be going on here.

I look forward to hearing from you,

Kind Regards,

Clare Martin

Hi! Sorry for the delay. The logs are now uploaded.

Hey @Wildpipe,

Thank you for those logs. Can you confirm whether these were taken after the cluster was rebuilt? That would confirm you are still seeing the same issue with web1-2 even after the rebuild.

I can see from the TentaclePing that you are having connection issues: three on 14th Feb:

10:07:21, 11:18:02, 17:17:28

And 4 on 15th Feb:

02:55:54, 06:15:14, 09:09:44, 09:19:53

Interestingly, all the errors seem to be the same except one from the 15th at 09:09:44.

All my attempts to work out what that one is come back with ‘check your network connection’. More importantly, those failures do not correlate with the logs you sent over: when a deployment fails in the logs, you get a successful TentaclePing at that time.

Our lead support engineer is online in a few hours so I am going to speak to him now we have the tentacle ping logs and see if he has any other ideas.

This is a bit of a hard one to diagnose, I am afraid, but we will get there!

Thank you for your patience so far throughout all of our communication, hopefully we can get to the bottom of this soon for you!

Kind Regards,

Clare Martin

The logs uploaded are from before the cluster was rebuilt. Since the cluster was rebuilt we have not had any connection issues. If any new connection issues appear for the new cluster I will let you know.

At the moment I think it’s too early to say, but if the issue goes away with the new cluster, it’s likely something was incorrectly set up on these specific machines.

Thanks for providing great support!

Hey @Wildpipe,

Thank you for letting us know. I will keep an eye on this post for updates, but I am hedging my bets that you are correct: it seems to be a cluster issue, as we suspected earlier since it was only happening to that one cluster. Fingers crossed it does not happen again, but please do reach out if it does.

Kind Regards,

Clare Martin
