Too many timeout errors occuring for no apparent reason

reliability
(Sam Kompfner) #1

This is breaking our deployments about 50% of the time.

Common observations:
If a deployment fails once, its usually halfway through the process.
The targets are Azure Windows VMs.
Reattempting the deployment from just the step where it failed last time usually works.
The health checks on the targets are all completely fine.
There does not appear to be any kind of resource saturation anywhere in the chain (neither network, CPU, RAM, etc).
The failure seems to occur when the process moves from deploying to one set of targets to a different set.
I have tried increasing the timeout values in the Octopus server config as follows, to no avail:

<add key="Logging.Level" value="Debug" />
<add key="SqlServerInstance" value="(local)\SQLEXPRESS" />
<add key="Halibut.ConnectionErrorRetryTimeout" value="00:15:00"/> 
<add key="Halibut.TcpClientConnectTimeout" value="00:02:00"/> 
<add key="Halibut.TcpClientHeartbeatSendTimeout" value="00:02:00"/> 
<add key="Halibut.TcpClientHeartbeatReceiveTimeout" value="00:02:00"/>

Typical deployment run:

2019-07-29 16:32:45.6560 2084 161 ERROR https://tb-l-agnt-1.westeurope.cloudapp.azure.com:10933/ 161 Connection initialization failed while connecting to https://tb-l-agnt-1.westeurope.cloudapp.azure.com:10933/
Halibut.Transport.Protocol.ConnectionInitializationFailedException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. —> System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. —> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
— End of inner exception stack trace —
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)
at System.Net.Security._SslStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security._SslStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)
at System.IO.StreamReader.ReadBuffer()
at System.IO.StreamReader.ReadLine()
at Halibut.Transport.Protocol.MessageExchangeStream.ReadLine() in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\Transport\Protocol\MessageExchangeStream.cs:line 124
at Halibut.Transport.Protocol.MessageExchangeStream.ExpectProceeed() in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\Transport\Protocol\MessageExchangeStream.cs:line 113
at Halibut.Transport.Protocol.MessageExchangeProtocol.PrepareExchangeAsClient() in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\Transport\Protocol\MessageExchangeProtocol.cs:line 61
— End of inner exception stack trace —
at Halibut.Transport.Protocol.MessageExchangeProtocol.PrepareExchangeAsClient() in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\Transport\Protocol\MessageExchangeProtocol.cs:line 66
at Halibut.Transport.Protocol.MessageExchangeProtocol.ExchangeAsClient(RequestMessage request) in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\Transport\Protocol\MessageExchangeProtocol.cs:line 35
at Halibut.HalibutRuntime.<>c__DisplayClass34_0.b__0(MessageExchangeProtocol protocol) in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\HalibutRuntime.cs:line 157
at Halibut.Transport.SecureListeningClient.ExecuteTransaction(Action`1 protocolHandler) in C:\buildAgent\work\fe2b45bbd4978f75\source\Halibut\Transport\SecureListeningClient.cs:line 65

(Kartik Gupta) #5

Hi Sam,

Thanks for getting in touch and congratulations on your first support ticket! We are sorry you are running into this issue. Recently we have had a few customers reach out to us with similar issues and we are investigating them at the moment.

Could we please confirm which version of Octopus you are running?

Would you be able to attach the log from the tentacle for the same time period as when the error occurred? We’re keen to see what the server and the tentacle are both reporting to see if there are consistencies across the different reports we’re seeing. Setting the Octopus and Tentacle logging to Trace may help to see more information (just remember to set it back to Info afterwards. You can find instructions on how to do that here: https://octopus.com/docs/support/log-files. Bumping the logging level up will hopefully get to the low level of the error to try to get to the cause.

To help us reproduce the error, could you please let us know which VM image (i.e. Windows 2019/2016/2012) and size (i.e. between B1lS - DS3_v2) you are running on Azure Window’s VMs? Which inbound and outbound networking ports are enabled? Does your solution have accelerated networking or load balancing turned on?

You also mentioned that this fails 50% of the time; have you noticed any patterns around the times it fails or do they fail randomly? Sometimes it helps us to compare failures on two different targets to further narrow down the problem. Would you please be able to attach the output from a different tentacle and server that has failed?

Furthermore, the error occurs when deploying from one set of targets to another. Do the first set of targets successfully deploy? What kind of packages are you deploying? How many targets in each deployment set do you have?

Cheers,
Kartik

(Sam Kompfner) #6

Hi Kartik,

firstly, how do I change the log level of the Tentacle? That help page only shows where the config for the Server log is…anyway, I’ve attached extracts from yesterday’s logs from the time when the error happened:

Our process is currently 47 steps long, and deploys a database DACPAC, then installs web applications to IIS servers, then stops a failover cluster role, then installs agent applications as Window services, then starts a cluster role again. All the install steps also perform various config manipulations with pre/post deploy scripts.

Our Azure infrastructure estate looks like this:
image
All the servers are in a West Europe vNet.
The Octopus environment consists of those VMs with the commented roles:


As you can see, it is not a particularly large environment.

The Octopus Server itself is in vnet in a different region, UK South, and connects directly to the servers via Listening Tentacles, using public IP addresses. No VPN, proxy, jump box or Express Route involved here. The Octopus Server is a D2s v3 virtual machine, and as mentioned, none of it’s metrics seem to be particularly high during one of these problematic deployments. They are all Windows Server 2016 Datacenter (latest) image.

The packages are NuPkg, pulled from the Octopus built-in repository. Please note, it is actually a simple Powershell script that often fails first -
Stop-ClusterGroup #{ClusterRole} -Wait 0

Here is where in the process that it usually fails on a full run:

I’ll actually share with you a snapshot of the environment in Azure (I’ve redacted sensitive names), so you can get an idea of what it looks like:

As you can see, the Agent VMs are indeed behind an Azure load balancer. In this environment, only the Agent VMs have Accelerated Networking enabled, as they are the only 4+ core VMs.

The NSG has a rule to allow traffic from the public IP of the Octopus VM inbound on port 10933 only, to everything in the vNet (again, IPs and names obfuscated):

This is exactly how and where it fails, and I guarantee if I was to run it again from scratch this morning, it will fail in exactly the same place:

If I run the same release again, but exclude all steps prior to the first Agent role step (which happens to be the PowerShell Stop-ClusterRole script), it works fine!

(Sam Kompfner) #7

Hi Kartik,

We are running Octopus Deploy v2019.6.0.
Running Tentacle version 4.0.5
Running latest version of Calamari: 4.20.0

I’ve just experienced the same failure on a different, much smaller environment, again at the point where the the process attempts to deploy to a different target role (agent), but in this environment, there are no load balancers in the equation (which I thought might be a factor):

I’ve attached the log files for the server and tentacle for the time period this release ran, but what is apparent is that the calls to the tentacle are not even reaching the tentacle…

OctopusServer29072019-2.txt (58.9 KB)
OctopusTentacle29072019-2.txt (2.8 KB)

(Kartik Gupta) #8

Hi Sam,

Thank you so much for all the info above and apologies that the link to the logging documentation did not have steps for Tentacle and Halibut. We have updated the documentation and you can access it here: https://octopus.com/docs/support/log-files

As we go through all the info you have provided and try to reproduce your set up on Azure, we think it might help for you to add a “Check Tentacle Health” step to check the health for the “agent” role tentacle before Step 21 on your smaller deployment.

To add this step, click add a step to your release process and search for “health” and select “Health Check by Octopus Deploy”.

Configure the health check to run on the “agent” role, as “connectivity only”. This will help you to see if Octopus can connect to that Tentacle at this point in your deployment.

Your deployment process should have a step that looks like this before Step 21:

The logs suggest the Tentacle is not online or available, so if you see that failing and have guided failures enabled you could then remote into the Tentacle machine (at the time of failure) and see if the machine is online. We’re hoping that a health check step can provide some clues here.

Cheers,
Kartik

(Sam Kompfner) #9

Cheers Kartik - that is a good point - running a manual Health Check just before kicking off a release always does seem to make the process run successfully, so I will certainly try adding a Health Check step to the process (I wasn’t aware that was available as a step template), but I can assure you the machines are definitely on and running when I run the releases. Our scheduled health check only runs every 4 hours, but always runs after the machines are powered on.

(Sam Kompfner) #10

Success!

Adding the Health Check step prior to the problematic steps works:

Because of the number of targets in total that we have, we have restricted the frequency of the scheduled machine health check to once every 2 hours:
image

We think there needs to be some kind of keepalive process added to either the Tentacle Watchdog job (which we put on every target) to stop Tentacles going zombie after comms are closed…or something like that!

Thanks for you workaround, though,

Sam

(Kartik Gupta) #11

Hi Sam,

Thank you for letting us know that the workaround worked for you and for providing more info. We are going to keep investigating why the tentacle instance went offline during your deployment process.

In the meantime you can keep an eye out on our issues page for all the issues known to us and are currently being fixed.

Please do not hesistate to get in touch if you run into any other problems.

Cheers,
Kartik

(Sam Kompfner) #12

Hi @kartik.gupta

unfortunately it is STILL failing, even with the full health check step added:

In fact, it is the health check itself that is reporting the Tentacles as offline, even though they absolutely are online (a manual health check confirms that).

This is becoming quite problematic for us, and prevent unattended deployments. Perhaps you guys could come up with some kind of PowerShell script that forces the initiation of a new thread, even if the Tentacle appears online? It definitely feels like it is zombie thread issue here…

Sam

(Michael Noonan) #15

Hi @sam.kompfner1,

This is proving to be quite a mystery. Rest assured we won’t give up until your deployments are running reliably. I have two next steps for you to try.

1. Test with TentaclePing

TentaclePing is a little utility which uses the same communications protocol as Tentacle to try and shake out and isolate networking problems. It would be good to prove the network is 100% reliable and the Tentacle protocol is also reliable over that network.

2. Scheduled deployment test project

Set up a simple test project which does something innocent on the troublesome deployment targets, like writing a log message. Then set up a project trigger to run a deployment every minute. This will give us an indication of how often communication is failing each day and may give some more insight into what is going wrong.

Hope that helps!
Mike