Tentacle connection issues

Hi,

After the upgrade to Octopus version 3 we are seeing more and more tentacle connection issues of the type shown in this error message:

10:36:40 Info | Deploying package 'd:\Octopus\Files\feeds-my-feed\My-Package.1.0.1845_6E09938C7B67EC47A6AFE994415ECC64.nupkg' to machine 'https://mymachine:10933/'
10:38:05 Error | An error occurred when sending a request to 'https://mymachine:10933/', before the request could begin: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

This happens when deploying to Tentacles at remote locations. We never had this type of problem with Octopus 2.6. We now have to re-run a deployment three times or more before it succeeds. It may fail on one Tentacle or on several, so deployments are very unstable.

Any ideas on what is going on here and how to fix it?

/Jan

Hi Jan,

As part of the move to 3.0, we made substantial changes to Octopus/Tentacle communications.

We recently realized that 2.6 had some retry capability that wasn’t implemented in 3.0. We re-implemented retries on communication failures (in certain scenarios) very recently. This shipped in version 3.0.20. If you are not using a version >= 3.0.20, are you able to upgrade your Octopus server to the latest version?

This should make the behavior much more similar to 2.6, and will hopefully resolve your issue. Please let me know if this helps.

Regards,
Michael

Hi,

We are using version 3.0.21 and are still having this issue.

/Jan

Jan,

After investigating this further, it seems the retry capability we added applies only to copying packages to Tentacles. Your issue seems to occur when simply sending a message to the Tentacle. Unfortunately, 2.6 was more tolerant of message transmission failures.

I have raised an issue to improve this. You can track this.

I apologize for the inconvenience this is causing you.

Regards,
Michael

Hi Michael,

Thanks for investigating this further and identifying the problem.

I hope that you can give this issue high priority. At the moment this is our main deployment roadblock with Octopus. Our one-click deployment is now perceived as having to click once, twice, three times and more and pray that the deployment will complete. It is not inspiring confidence in the operations people deploying to our production environments.

All the best,
Jan

Hi,

Any news on this? When can we expect a fix? This is really hurting our deployments.

/Jan

I’m seconding this.

Running deployments is now a constant pain: we wait for package acquisition to fail, re-run, have some other command fail, re-run, and so on.

I’m thirding this.

We are running into WAY more failures on 3.x than we did on 2.x.
This is causing substantial delays in my work.

We have the same issue as well. Looking forward to the fix.

Hi,

We are also facing the same issue and our deployment activities are blocked.
Do you have a workaround for this?

Regards,
Allen

Paul and Allen,
If possible, could you capture the task log of a failed deployment, and upload it here? This is a secure location.

It will assist us in verifying it is the same issue.

Regards,
Michael

Hi,

Thanks for looking into this. I’ve now uploaded two task logs. Both fail with the same error. One fails during package acquisition and one during step deployment.

Regards,
Jan

Hi

I’m also getting this error a lot. It happens when acquiring packages, running deployment steps, or applying retention policies. I’ve uploaded a task log as well.

Thanks

Hi folks,

Thank you to those who uploaded logs. They were very helpful.

The communication library used by Octopus Deploy 3.x (Halibut) has some time limits set, but these can be overridden in config. If you are willing to experiment a little, I think the following settings are worth trying.

<add key="Halibut.ConnectionErrorRetryTimeout" value="00:15:00"/>
<add key="Halibut.TcpClientConnectTimeout" value="00:02:00"/>
<add key="Halibut.TcpClientHeartbeatSendTimeout" value="00:02:00"/>
<add key="Halibut.TcpClientHeartbeatReceiveTimeout" value="00:02:00"/>

The config lines above are those I believe may assist in your situations. These should be added to the appSettings section of Octopus.Server.exe.config, located by default at C:\Program Files\Octopus Deploy\Octopus.
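
For anyone unsure where these lines go, here is a minimal sketch of the relevant part of Octopus.Server.exe.config. It assumes the standard .NET application config layout (a `configuration` root containing an `appSettings` element); your actual file will contain other sections, which should be left untouched. The values are the sample ones from this post, which I am reading as hours:minutes:seconds.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <!-- Other existing sections of the file stay exactly as they are. -->
  <appSettings>
    <!-- Keep any keys that are already present; add these alongside them. -->
    <!-- How long Halibut keeps retrying a failed message (sample: 15 minutes). -->
    <add key="Halibut.ConnectionErrorRetryTimeout" value="00:15:00"/>
    <!-- Timeout for opening a new connection to a Tentacle (sample: 2 minutes). -->
    <add key="Halibut.TcpClientConnectTimeout" value="00:02:00"/>
    <!-- Timeouts for the heartbeat check on pooled connections (sample: 2 minutes each). -->
    <add key="Halibut.TcpClientHeartbeatSendTimeout" value="00:02:00"/>
    <add key="Halibut.TcpClientHeartbeatReceiveTimeout" value="00:02:00"/>
  </appSettings>
</configuration>
```

Since .NET applications normally read this file at startup, the Octopus server service will most likely need a restart before the new values take effect. It is also worth taking a backup copy of the file before editing it.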

I’ll briefly explain what each of these values does, and you can see the default values in the code.

ConnectionErrorRetryTimeout: If an error occurs when sending a message, and Halibut believes it is safe to retry, it will retry up to 5 times, or until this timeout is exceeded. Increasing this value will allow your failed messages to be retried for longer.

TcpClientConnectTimeout: When there are no connections in the pool, or the pooled connections have expired, Halibut makes a new connection. This timeout applies when making this connection.

TcpClientHeartbeatSendTimeout and TcpClientHeartbeatReceiveTimeout: When taking a connection from the pool, small heartbeat request/response messages are sent to verify the connection is still valid. These timeouts apply to this process.

The sample values I have provided above are significantly longer than the defaults. I would be very interested to hear if configuring these eases your issues.

Regards,
Michael

Thanks Michael, I’ll try your settings.

We are also trying out these settings. We will have to run with them for a couple of weeks to see if they help.

/Jan

Hi,

Any updates on your experiences with these settings?

We have encountered some timeout issues with a server we have a poor connection to and will use these settings to see what difference they make.

Hi,

These settings seem to have helped a lot with these issues during the Acquire packages step, but we now see this issue when applying retention policies on Tentacles.

An error occurred when sending a request to 'https://xxxxxxx:10933/', before the request could begin: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

/Jan

For us this actually happened while running a deployment script, after the packages had already been uploaded.

If the settings don’t apply to the connection post-upload, is there something else that could be tried?

It does look a lot better now, although we still see some issues from time to time.

-Thomas