Tentacle connection issues

Jan_Lnsetteig · 3 September 2015 09:22

Hi,

After the upgrade to Octopus version 3 we are seeing more and more tentacle connection issues of the type shown in this error message:

10:36:40 Info | Deploying package ‘d:\Octopus\Files\feeds-my-feed\My-Package.1.0.1845_6E09938C7B67EC47A6AFE994415ECC64.nupkg’ to machine 'https://mymachine:10933/'
10:38:05 Error | An error occurred when sending a request to ‘https://mymachine:10933/’, before the request could begin: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

This happens when deploying to tentacles on remote locations. We never had this type of problem with Octopus 2.6. We always have to re-run deployment 3 times or more to get a successful deployment. It may fail on one of the tentacles or on multiple tentacles. It’s very unstable.

Any ideas on what is going on here and how to fix it?

/Jan

Michael_Richardson · 3 September 2015 22:55

Hi Jan,

As part of the move to 3.0, we made substantial changes to Octopus\Tentacle communications.

We recently realized that 2.6 had some retry capability that wasn’t implemented in 3.0. We re-implemented retries on communication failures (in certain scenarios) very recently. This shipped in version 3.0.20. If you are not using a version >= 3.0.20, are you able to upgrade your Octopus server to the latest version?

This should make the behavior much more similar to 2.6, and will hopefully resolve your issue. Please let me know if this helps.

Regards,
Michael

Jan_Lnsetteig · 4 September 2015 05:57

Hi,

We are using version 3.0.21 and are still having this issue

/Jan

Michael_Richardson · 7 September 2015 23:10

Jan,

After investigating this further, it seems the retry capability we added in 3.0.20 applies only to copying packages to Tentacles. Your issue seems to occur when simply sending a message to the Tentacle. Unfortunately, 2.6 was more tolerant of message transmission failures.

I have raised an issue to improve this. You can track this.

I apologize for the inconvenience this is causing you.

Regards,
Michael

Jan_Lnsetteig · 8 September 2015 06:09

Hi Michael,

Thanks for investigating this further and identifying the problem.

I hope that you can give this issue high priority. At the moment this is our main deployment roadblock with Octopus. Our one-click deployment is now perceived as having to click once, twice, three times and more and pray that the deployment will complete. It is not inspiring confidence in the operations people deploying to our production environments.

All the best,
Jan

Jan_Lnsetteig · 16 September 2015 10:55

Hi,

Any news on this? When can we expect a fix? This is really hurting our deployments.

/Jan

thomas_rikardsen · 16 September 2015 11:20

I’m seconding this.

Running deployments is now a constant pain: we wait for package acquisition to fail, re-run, hit a failure in some other command, re-run, and so on.

peter_bollwerk · 16 September 2015 16:17

I’m thirding this.

We are running into WAY more failures on 3.x than we did on 2.x.
This is causing substantial delays in my work.

Paul_Nguyen · 12 October 2015 22:19

We have the same issue as well. Looking forward to the fix.

Allen_Thomas · 15 October 2015 04:52

Hi,

We are also facing the same issue and our deployment activities are blocked.
Do you have a workaround for this?

Regards,
Allen

Michael_Richardson · 15 October 2015 05:48

Paul and Allen,
If possible, could you capture the task log of a failed deployment, and upload it here? This is a secure location.

It will assist us in verifying it is the same issue.

Regards,
Michael

Jan_Lnsetteig · 15 October 2015 06:18

Hi,

Thanks for looking into this. I’ve now uploaded two task logs. Both fail with the same error. One fails during package acquisition and one during step deployment.

Regards,
Jan

Warren_Reed · 23 October 2015 08:43

Hi

I’m also getting this error a lot. It happens when acquiring packages, running deployment steps, or applying retention policies. I’ve uploaded a task log as well.

Thanks

Michael_Richardson · 26 October 2015 23:25

Hi folks,

Thank you to those who uploaded logs. They were very helpful.

The communication library used by Octopus Deploy 3.x (Halibut) has some time limits set, but these can be overridden by config. If you are willing to experiment a little, I think the following settings are worth trying.

<add key="Halibut.ConnectionErrorRetryTimeout" value="00:15:00"/>
<add key="Halibut.TcpClientConnectTimeout" value="00:02:00"/>
<add key="Halibut.TcpClientHeartbeatSendTimeout" value="00:02:00"/>
<add key="Halibut.TcpClientHeartbeatReceiveTimeout" value="00:02:00"/>

The config lines above are those I believe may assist in your situations. These should be added to the appSettings section of Octopus.Server.exe.config, located by default at C:\Program Files\Octopus Deploy\Octopus.
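
For anyone who would rather script the change than edit the file by hand, here is a minimal Python sketch that merges the four keys into the appSettings section of a config document. The function name `apply_halibut_settings` is made up for illustration, and the sketch works on an XML string rather than on the file itself. Note that the standard-library parser drops XML comments, so back up the real Octopus.Server.exe.config before writing anything back to it.

```python
import xml.etree.ElementTree as ET

# Keys and sample values copied from the post above.
HALIBUT_SETTINGS = {
    "Halibut.ConnectionErrorRetryTimeout": "00:15:00",
    "Halibut.TcpClientConnectTimeout": "00:02:00",
    "Halibut.TcpClientHeartbeatSendTimeout": "00:02:00",
    "Halibut.TcpClientHeartbeatReceiveTimeout": "00:02:00",
}


def apply_halibut_settings(config_xml, settings=HALIBUT_SETTINGS):
    """Return config_xml with each key added to, or updated in, <appSettings>."""
    root = ET.fromstring(config_xml)
    app_settings = root.find("appSettings")
    if app_settings is None:
        # Create the section if the document does not have one yet.
        app_settings = ET.SubElement(root, "appSettings")
    existing = {add.get("key"): add for add in app_settings.findall("add")}
    for key, value in settings.items():
        if key in existing:
            # Key already present: overwrite its value instead of duplicating it.
            existing[key].set("value", value)
        else:
            ET.SubElement(app_settings, "add", {"key": key, "value": value})
    return ET.tostring(root, encoding="unicode")
```

Calling `apply_halibut_settings("<configuration />")` returns a document with a new appSettings section holding the four entries. A restart of the Octopus Server service is probably needed before the new values take effect, although the thread does not confirm this.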

I’ll briefly explain what each of these values does, and you can see the default values in the code.

ConnectionErrorRetryTimeout: If an error occurs when sending a message, and Halibut believes it is safe to retry, it will retry up to 5 times, or until this timeout is exceeded. Increasing this value will allow your failed messages to be retried for longer.
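
To make that policy concrete, here is a rough Python sketch of "retry up to 5 times, or until the timeout is exceeded". It is not Halibut's actual code (Halibut is a .NET library). The name `send_with_retry`, the one-second pause between attempts, and the reading of "5 times" as five attempts in total are all assumptions made for illustration.

```python
import time
from datetime import timedelta


def send_with_retry(send, max_attempts=5, retry_timeout=timedelta(minutes=15),
                    clock=time.monotonic, sleep=time.sleep, pause_seconds=1.0):
    """Call send() until it succeeds, attempts run out, or retry_timeout elapses."""
    deadline = clock() + retry_timeout.total_seconds()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError as error:
            # ConnectionError stands in for failures that are "safe to retry".
            last_error = error
            if attempt == max_attempts or clock() >= deadline:
                break  # give up: out of attempts or past the retry window
            sleep(pause_seconds)  # assumed pause; the real back-off is not stated
    raise last_error
```

With a short retry window, a link that stays down for a minute or two uses up the window before the later attempts can run, which is why a longer ConnectionErrorRetryTimeout may help on slow remote connections.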

TcpClientConnectTimeout: When there are no connections in the pool, or the pooled connections have expired, Halibut makes a new connection. This timeout applies when making this connection.

TcpClientHeartbeatSendTimeout and TcpClientHeartbeatReceiveTimeout: When taking a connection from the pool, small heartbeat request/response messages are sent to verify the connection is still valid. These timeouts apply to this process.
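
The pooling behaviour described in the last two paragraphs can be pictured with a toy Python sketch. This is only an illustration of the idea, not how Halibut is implemented. The class `ConnectionPool` and the `connect` and `heartbeat` callables are hypothetical names.

```python
from collections import deque


class ConnectionPool:
    """Toy pool: reuse a pooled connection only if it answers a heartbeat."""

    def __init__(self, connect, heartbeat):
        self._connect = connect      # opens a new connection (connect timeout applies here)
        self._heartbeat = heartbeat  # True if the connection responds (heartbeat timeouts apply here)
        self._idle = deque()

    def take(self):
        while self._idle:
            connection = self._idle.popleft()
            if self._heartbeat(connection):
                return connection
            # No heartbeat response: discard it and try the next pooled connection.
        # Pool is empty or every pooled connection was stale: open a new one.
        return self._connect()

    def give_back(self, connection):
        self._idle.append(connection)
```

In this picture, a heartbeat timeout that is too short for a slow link makes a healthy pooled connection look dead, and a connect timeout that is too short makes the fallback connection fail as well.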

The sample values I have provided above are significantly longer than the defaults. I would be very interested to hear if configuring these eases your issues.

Regards,
Michael

Warren_Reed · 27 October 2015 11:33

Thanks Michael, I’ll try your settings.

Jan_Lnsetteig · 30 October 2015 08:31

We are also trying out these settings. Will have to run a couple of weeks to see if it helps.

/Jan

vasily_sliouniaev · 18 November 2015 13:45

Hi,

Any updates on your experiences with these settings?

We have encountered some timeout issues with a server we have a poor connection to and will use these settings to see what difference they make.

Jan_Lnsetteig · 18 November 2015 13:51

Hi,

These settings seem to have helped a lot with these issues during Acquire packages, but we now see this issue during Apply retention policy on Tentacles.

An error occurred when sending a request to ‘https://xxxxxxx:10933/’, before the request could begin: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

/Jan

vasily_sliouniaev · 18 November 2015 13:55

For us this actually happened while running a deployment script; the packages had already been uploaded.

If the settings don’t apply to the connection post-upload, is there something else that could be tried?

thomas_rikardsen · 18 November 2015 14:08

It does look a lot better now, although we still see some issues from time to time.

-Thomas