Connection recovery

Server and Tentacles: 3.4.0-beta0002

We have Tentacles on some physical servers (i.e. not VMs) in our test lab. These are created from the same base image but with potential slight differences in the installed version of an application we are testing. They are physically plugged into the same switch and share the same external connection to the Internet.

We noticed that one of the Tentacles was consistently worse than the others with its health checks and struggled to maintain a connection to the OD server. I experimented with Tentacles recovering from a severed connection, with the following results.

Test #1 - Good Tentacle

Time      Action
-         Health check OK.
11:03:00  Pull out network cable. Tentacle logs no change.
-         Health check fail (timeout 2 mins).
-         Waiting on Tentacle to notice…
11:12:55  Noticed. Retries connection every ~5 seconds.
11:14:00  Put network cable back in.
11:14:03  Connected.
-         Health check OK.

The Tentacle noticed in ~10 mins and reconnected successfully, from the OD server’s point of view, once the network was available. A good result, with 10 mins being acceptable for our purposes.

Test #2 - Good Tentacle

Time      Action
-         Health check OK.
14:24:10  Pull out network cable. Tentacle logs no change.
-         Health check fail (timeout 2 mins).
-         Waiting on Tentacle to notice…
14:33:35  Noticed. Retries connection every ~5 seconds.
14:38:26  Put network cable back in.
14:38:28  Connected.
-         Health check OK.

Ran the same test again to see if the first result was a one-off: the Tentacle noticed in ~9.5 mins and reconnected successfully.

Test #3 - Bad Tentacle

Time      Action
-         Health check OK.
14:53:15  Pull out network cable. Tentacle logs no change.
-         Health check fail (timeout 2 mins).
-         Waiting on Tentacle to notice…
15:02:55  Noticed. Retries connection every ~5 seconds.
15:05:11  Put network cable back in.
15:05:19  Connected.
-         Health check fail (timeout 2 mins).
-         Health check fail (timeout 2 mins).
15:12:47  Restarted Tentacle from Octopus Manager and it connected.
-         Health check fail (timeout 2 mins).
-         Health check fail (timeout 2 mins).
15:29:15  Reinstalled Tentacle from Octopus Manager and it connected.
-         Health check fail (No response was received from the endpoint within the allowed time) after 12 mins.
-         Uninstalled from Programs and Features, installed fresh, and it connected. This reuses the Octopus install folder, so we ended up with the same Tentacle thumbprint.
-         Health check fail (No response was received from the endpoint within the allowed time) after 12 mins.
-         Uninstalled from Programs and Features, deleted the Octopus folder, installed fresh, and it connected. New Tentacle with a new thumbprint.
-         Health check OK. Finally…

This is where it fell apart completely. Both test rigs are connected to the same switch, so I expected them to have similar connectivity.

I’m not sure where to go from here. Is there a known problem with having to reissue a Tentacle thumbprint?

Hi Shannon,

Thanks for reaching out! Sorry to hear that you’re having connection issues.

There has been some further work on connection stability in the more recent 3.3.x releases since the 3.4.0-beta0002 release. Specifically, 3.3.21 dealt with some connection issues. These fixes will be rolled up into the RTW of the 3.4 release, which will be out soon.
Downgrading from 3.4 -> 3.3.x is unfortunately not supported, so to try the latest 3.3.x release, you’d have to re-install.

Overall though, the fact that the Tentacle never came back online is disconcerting. Can you confirm if they are polling or listening Tentacles?

Was the Tentacle installed before the base image was taken? If so, we might be hitting an issue with Tentacles sharing the same thumbprint (https://github.com/OctopusDeploy/Issues/issues/2637). You can run

Tentacle.exe show-thumbprint --nologo

on each Tentacle to see each value. If they are the same, that might help us track the issue down.
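
If you’d rather compare them all in one pass, something like this rough sketch could work (the hostnames are placeholders, and it assumes PowerShell Remoting is enabled on the boxes and the default Tentacle install path):

# Placeholder hostnames; assumes WinRM/Remoting and the default Tentacle path.
$machines = 'rig-good', 'rig-bad'
$thumbprints = foreach ($m in $machines) {
    Invoke-Command -ComputerName $m -ScriptBlock {
        & 'C:\Program Files\Octopus Deploy\Tentacle\Tentacle.exe' show-thumbprint --nologo
    }
}
# Any group with a count above 1 means two Tentacles share a thumbprint.
$thumbprints | Group-Object | Where-Object Count -gt 1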

Also, can you send through the logs from the bad Tentacle?

One final thought - do the servers have the same hardware configuration? Specifically, do they all have the same NIC?

Regards,
Matt

Thanks for the impressively quick response. A quick reply in return; I’ll have to wait until I’m back in the office on Monday to look at some of these things.

We upgraded to 3.4.0-beta0002 from 3.3.17, so we possibly missed some of the connection fixes. We need elastic environments, so we will wait until the 3.4 RTW and hopefully pick those fixes up then. Do you have a timeline for RTW?

The Tentacles are polling. The final install locations are, again, physical machines, distributed at customer sites behind NAT, which means listening mode won’t work for us.

Tentacles are installed after imaging; I assumed that reusing thumbprints wouldn’t be supported, so we were going to work on something like sysprepping systems after imaging to take care of unique things like Tentacles. The way I read that GitHub issue, it sounds like reusing thumbprints is in fact supported? If so, that could ease our development.
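
Roughly what we had in mind as the post-image step (just a sketch: the install path and instance name below are the defaults, and I’m assuming new-certificate is the right command to force a fresh thumbprint):

# Post-imaging step to give each box a unique Tentacle identity (sketch only).
$tentacle = 'C:\Program Files\Octopus Deploy\Tentacle\Tentacle.exe'  # default path, adjust if needed
& $tentacle new-certificate --instance Tentacle  # regenerate the certificate -> new thumbprint (assumed)
& $tentacle show-thumbprint --nologo             # confirm the new value
Restart-Service 'OctopusDeploy Tentacle'         # restart so the new certificate takes effect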

I’m afraid the logs from the bad Tentacle got deleted along with everything else when I reinstalled and removed all traces of the Tentacle.

The hardware is these kinds of industrial boxes, which should have non-changing components, so the NICs and everything else should be identical (I’ll have to double-check on Monday). We basically lock in a hardware configuration for 5+ years.

http://www.nexcom.com/Products/industrial-computing-solutions/industrial-fanless-computer/core-i-performance

On the subject of connection issues, I wrote a PowerShell script to analyze the log files and pull out outage information (start/end/duration).
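
In outline it does something like this (a simplified sketch rather than the verbatim script; the log path, timestamp format, and marker strings are placeholders to match against what the logs actually contain):

# Sketch: walk a Tentacle log and pair up connection-lost / connection-restored
# lines into outage records. Path, timestamp format, and marker strings are
# placeholders - adjust them to the real log contents.
$logPath         = 'C:\Octopus\Logs\OctopusTentacle.txt'
$lostPattern     = 'Unable to connect|Connection closed'
$restoredPattern = 'Connection established'

$outages = @()
$start   = $null

Get-Content $logPath | ForEach-Object {
    # Assumed line shape: "2016-07-15 15:02:55 ... message"
    if ($_ -match '^(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})') {
        $ts = [datetime]::ParseExact($Matches['ts'], 'yyyy-MM-dd HH:mm:ss', $null)
        if (-not $start -and $_ -match $lostPattern) {
            $start = $ts                       # outage begins
        }
        elseif ($start -and $_ -match $restoredPattern) {
            $outages += [pscustomobject]@{     # outage ends; record it
                Start    = $start
                End      = $ts
                Duration = $ts - $start
            }
            $start = $null
        }
    }
}

$outages | Format-Table -AutoSize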

For one day that I looked at, the bad Tentacle dropped the TCP connection 40+ times, but it seemed to re-establish with the server in ~5 seconds each time. I suspect, though, that the health checks were not registering it as connected.

Hi Shannon,

We are very close to a 3.4 RTW - keep your eyes peeled! One of the other new features in 3.4 is proxy support (see http://docs.octopusdeploy.com/display/OD/Proxy+Support) - this might be helpful in your situation.

In theory, the thumbprints are meant to be unique. We are aiming to make it less of a problem if they are not unique, but installing Tentacles after imaging is probably best.

The reason I asked about NICs is that I’ve read of issues where weirdness at the network driver / hardware level is very difficult to diagnose. If the NIC was different between the boxes, that might give weight to that theory.

Is the issue still occurring on that Tentacle after the reinstall? If it is, we have a tool - https://github.com/OctopusDeploy/TentaclePing - that is designed to help diagnose these issues. If you run TentaclePong on the Octopus server and TentaclePing on the Tentacle that is having issues, that will help show whether there are issues at the NIC/network level.

Let me know how you get on with TentaclePing/TentaclePong and hopefully we can track this down.

Hope that helps!

Cheers,
Matt

Looking forward to 3.4 RTW! Do you mean proxy support could be helpful in the general sense? I don’t see how it is applicable right now, so just making sure I’m not missing a specific meaning. Initially I thought there may be a way to use listening Tentacles behind NAT through a proxy, but after reading the docs I don’t think so.

Unique thumbprints: great to have best practice confirmed.

TentaclePing: can you confirm that it can be used to diagnose polling Tentacles? The README specifically mentions listening Tentacles only.

Checked the bad Tentacle to confirm: it’s still getting ~30-70 mini outages a day (I removed some outliers that were 10+ minutes).
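
A per-day summary pass over the earlier script’s output would look roughly like this (again just a sketch, reusing $outages from above):

# Drop 10+ minute outliers, then count outages per day.
$outages |
    Where-Object { $_.Duration.TotalMinutes -lt 10 } |
    Group-Object { $_.Start.ToString('yyyy-MM-dd') } |
    Select-Object @{ n = 'Day'; e = { $_.Name } }, Count |
    Format-Table -AutoSize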

So it looks like the reinstall hasn’t helped the bad Tentacle. For contrast, the good Tentacle has ~5 outages/day, but with durations of ~1-5 minutes.

Each box has the same dual NICs, but on the good Tentacle one of the NICs was disabled in the BIOS. I’m not sure if that is a contributing factor, because the NIC that is physically connected is the same on both boxes:

NIC            Bad Tentacle  Good Tentacle  Physically connected
Intel 82574L   Enabled       Disabled       No
Intel 82579LM  Enabled       Enabled        Yes
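
A quick way to see the OS-side view of this on each box (Get-NetAdapter needs Windows 8 / Server 2012 or later; a NIC disabled in the BIOS won’t appear in the output at all):

# List every adapter the OS can see, with link state and speed.
Get-NetAdapter |
    Select-Object Name, InterfaceDescription, Status, LinkSpeed |
    Format-Table -AutoSize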

To rule it out, I’ve reversed the situation:

NIC            Bad Tentacle         Good Tentacle        Physically connected
Intel 82574L   Enabled -> Disabled  Disabled -> Enabled  No
Intel 82579LM  Enabled              Enabled              Yes

I’ll let them run like this for a week or so to see if it affects things.

Hi Shannon,

Ahh - the fact that they have different NIC configurations sounds interesting. I look forward to hearing how it goes from there. I know I’ve had issues around multi-NIC servers in the past.

Regarding proxy support, I think I misread your comment about NAT’ed customer sites - apologies for that.

In the TentaclePing download, we have both TentaclePing, which is designed to be run from the Octopus Server to talk to a listening Tentacle, and TentaclePong, which is designed to run on the Tentacle server and mimic a listening Tentacle.

So, in your case, you’d want to:

Run TentaclePong.exe 10945 on the bad Tentacle.
Run TentaclePing.exe <badtentacle> 10945 from the Octopus Server.

This should help us determine if it’s a network issue or something within the Tentacle.

Let me know how it goes.

Cheers,
Matt