Network disconnect causes deployment to fail

tgillitzer1 · 12 July 2022 16:35

All of our tentacles (10,000+) are connected remotely via VPN as polling tentacles. Normally this is fine.

Occasionally, the VPN tunnel will reset in the middle of a deployment. When this happens, the Tentacle connection to the Octopus Server is disconnected. The VPN tunnel re-establishes relatively quickly (seconds).

If this happens during a deployment, the deployment fails and we have to manually re-start it. In some cases that would be fine, but a lot of our deployments are scheduled in the middle of the night.

When this happens, the Tentacle logs the disconnect. The server seems to either fail immediately with an System IO error or it times out after 1 minute, saying the polling tentacle didn’t collect the request.

I can generally re-create this issue with the use of a Windows Firewall rule on the tentacle that blocks outbound traffic to the Octopus server on the Tentacle communication port.

In the cases where the command times out on the server, if the tentacle reconnects before the timeout (1 minute), the task resumes. However, this is hit or miss due to the timing of the retry logic on the Tentacle. It seems to take anywhere from 30 second to 120 seconds for the Tentacle to retry the first time. A significant part of that window is outside of the 1 minute server side timeout. If we could do something to get a quicker/more frequent retry, that, I think, would help.

For the server side System IO error, I would like input as to what options there might be to address it.

Thanks

Server Log when it fails after timeout

The step failed: Activity <StepName> on <TentacleName> failed with error 'A request was sent to a polling endpoint, but the polling endpoint did not collect the request within the allowed time (00:01:00), so the request timed out. Server exception: System.TimeoutException: A request was sent to a polling endpoint, but the polling endpoint did not collect the request within the allowed time (00:01:00), so the request timed out.'.

Server Log when it fails immediately

The step failed: Activity <StepName> on <TentacleName> failed with error 'Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. Server exception: System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host. at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) --- End of inner exception stack trace --- at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) at System.Net.Security.SslStream.FillBufferAsync[TReadAdapter](TReadAdapter adapter, Int32 minSize) at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer) at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count) at System.IO.Compression.DeflateStream.ReadCore(Span`1 buffer) at System.IO.Compression.DeflateStream.Read(Byte[] array, Int32 offset, Int32 count) at System.IO.BinaryReader.InternalRead(Int32 numBytes) at System.IO.BinaryReader.ReadInt32() at Newtonsoft.Json.Bson.BsonDataReader.ReadNormal() at Newtonsoft.Json.Bson.BsonDataReader.Read() at Newtonsoft.Json.JsonReader.ReadForType(JsonContract contract, Boolean hasConverter) at Newtonsoft.Json.Serialization.JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent) at Newtonsoft.Json.JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType) at Newtonsoft.Json.JsonSerializer.Deserialize[T](JsonReader reader) at Halibut.Transport.Protocol.MessageExchangeStream.ReadBsonMessage[T]() at Halibut.Transport.Protocol.MessageExchangeStream.Receive[T]() at Halibut.Transport.Protocol.MessageExchangeProtocol.ProcessReceiverInternalAsync(IPendingRequestQueue pendingRequests, RequestMessage nextRequest)'.

Tentacle Log in either scenario

2022-07-12 16:14:19.7333   4248      7  INFO  The Windows Service has started
2022-07-12 16:14:19.8740   4248      8  INFO  https://<servername>/    8  Secure connection established. Server at [::ffff:<IPAddress>]:443 identified by thumbprint: AE207588254D18D02BFE082FA8A16F8CD2E087B2, using protocol Tls12
2022-07-12 16:14:43.8998   4248      8 ERROR  https://<servername>/    8  Unexpected exception executing transaction.
System.IO.IOException: Unable to read data from the transport connection: An established connection was aborted by the software in your host machine. ---> System.Net.Sockets.SocketException: An established connection was aborted by the software in your host machine
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count)
   at System.Net.Security._SslStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.SslStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.StreamReader.ReadBuffer()
   at System.IO.StreamReader.ReadLine()
   at Halibut.Transport.Protocol.MessageExchangeStream.ReadLine()
   at Halibut.Transport.Protocol.MessageExchangeStream.ExpectProceeed()
   at Halibut.Transport.Protocol.MessageExchangeProtocol.ExchangeAsSubscriber(Uri subscriptionId, Func`2 incomingRequestProcessor, Int32 maxAttempts)
   at Halibut.Transport.SecureClient.ExecuteTransaction(ExchangeAction protocolHandler, CancellationToken cancellationToken)
2022-07-12 16:14:43.8998   4248      8 ERROR  https://<servername>/    8  Halibut client exception: The server https://<servername>/ aborted the connection before it was fully established. This usually means that the server rejected the certificate that we provided. We provided a certificate with a thumbprint of '79875A7EFF305B142274E19F6095E79749458469'. Retrying in 38.0 seconds
Halibut.HalibutClientException: The server https://<servername>/ aborted the connection before it was fully established. This usually means that the server rejected the certificate that we provided. We provided a certificate with a thumbprint of '79875A7EFF305B142274E19F6095E79749458469'.
   at Halibut.Transport.SecureClient.HandleError(Exception lastError, Boolean retryAllowed)
   at Halibut.Transport.SecureClient.ExecuteTransaction(ExchangeAction protocolHandler, CancellationToken cancellationToken)
   at Halibut.Transport.PollingClient.ExecutePollingLoop(Object ignored)

paul.calvert · 12 July 2022 17:18

Hi @tgillitzer1,

Thanks for getting in touch!

The various timeouts for Polling tentacles are configured within the assigned Machine Policy. The easiest option here might be to increase the Connect Timeout value to give the tentacles the necessary time to re-connect.

In the cases where the disconnection causes an immediate failure, there isn’t a simple solution for that.
You could perhaps look at enabling Guided Failure for the relevant projects and creating a Runbook that executes an API script to check for any deployments waiting for intervention and approves them.

We have a sample script for approving interruptions that may be a good starting point. You’d need to add a section that checks the api/<spaceID>/interruptions?pendingOnly=True endpoint to retrieve the Interruption-ID. This endpoint does return both Guided Failure and Manual Intervention interruptions, so, you may need to add some logic to limit it to Guided ones only.

Regards,
Paul

hardKOrr · 15 July 2022 16:07

Hi, I work alongside tgillitzer1 and have an additional question.

The Connect Timeout setting in the web UI shows that it is representative for listening tentacles (and SSH targets). Does this affect our polling tentacles as well, or should the change be to Polling Request Queue Timeout ?

Thanks!

britton.riggs · 15 July 2022 16:28

Hi @hardKOrr,

Thanks for following up on this, in taking a deeper dive on this it looks like you are right on this one (good catch)!

In checking the error message originally provided I do see that the request is failing here:

A request was sent to a polling endpoint, but the polling endpoint did not collect the request within the allowed time

And in checking the Machine Policies page I see the following description for the Polling Request Queue Timeout policy setting:

The amount of time that the server will wait for a polling tentacle to collect the next request before cancelling the request.

Based on the error message and the specific language on that policy setting I would say this is the timeout you’d want to modify in this case.

As @paul.calvert mentioned previously this will not help in the cases where there is an immediate connection failure, but should help when there is some latency in the request collection process by the tentacle.

Apologies for the miscommunication there, but thank you for raising this request back up.

I hope this helps, but feel free to reach back out if you have any other questions or concerns.

Best regards,

Britton

system · 15 August 2022 16:28

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.