Connectivity lost to Tentacle. Service restart does not fix

ops · 15 April 2020 20:13

We have a (admittedly busy) Dev server with a Tentacle installed that deploys a number of sites (20-30) and services (8-10) and a Cloud hosted server instance.

For no particular reason the service loses connection with our Cloud instance and never reconnects. We simply get the 01:00:00 timout messages repeatedly on the health-checks and it constantly fails on the Deploy Site step when acquiring the package.

Our other Tentacles seem fine and fast.

A service restart does not fix the issue.

There is nothing in the logs indicating an issue with connectivity around the time.

A Reinstall via the Octopus app does not seem to fix it.

A Delete and then Install New Tentacle did appear to finally fix something.

We checked the firewall rules on the server and network and everything appeared correct and nothing had changed in either respect.

We don’t know what to do about this (this is not the first time it has happened on this server) however we cannot see any reasons or factors in causing the problem. We deployed to that server successfully earlier today and yet 6 hours later the Tentacle is not responding.

Without any further leads or information we don’t know what to do, and at the same time we understand that you might not be able to do anything either! I just wanted to submit this report that there might be an issue somewhere that might need looking at if someone can repro it? Vague I know.

If anyone however has seen this behaviour before then any quicker workarounds would be much appreciated.

dane.falvo · 21 April 2020 01:01

Hi Ops,

Thank you for reaching out! That is an odd issue but not neccesarily completely foreign to us here on the support team.

Firsty, we have quite a few troubleshooting steps available here for both Linux and Windows based Tentacles:
https://octopus.com/docs/infrastructure/deployment-targets

Specifically:
https://octopus.com/docs/infrastructure/deployment-targets/windows-targets or
https://octopus.com/docs/infrastructure/deployment-targets/linux

I would recommend giving that document a thorough read-through. If none of the suggestions in the above link help you, could you please provide me with the answers to the questions below so I can get some more context and provide you with better answers.

What type of tentacle are you deploying?
Is it on-prem or a cloud server that the tentacle is on? (if cloud, which cloud?)
Is the server the tentacle on behind a Proxy?
What OS is on the server the tentacle is on?
From your experience, does this seem to happen at a specific time everyday or perhaps a set amount of time after the install or completely random? Does it coincide with a scheduled task on that server?
Can you provide a quick run-down of what you tried from the aforementioned troubleshooting links and what were the results

I’m looking forward to hearing back from you so we can solve this issue.

ops · 21 April 2020 10:37

Thanks for getting back to me, and I totally appreciate this is a vague report. We are not sure how to reproduce currently but I can let you know the answers to your queries.

We only use Listening Tentacles
The Tentacles are on on-prem (HyperV) VMs in our hosting centre
The server is behind a proxy for serving public requests but we have a specific deployment DNS entry and firewall rule to allow 10933 traffic from our Octopus Cloud instance directly.
All our Tentacles are currently on Windows Server (2016/2019).
I will see if I can narrow down some circumstances around the failure. So far it has seemed random. No set time after install and we have other Tentacles running that have not so far failed. I will see about coinciding scheduled tasks
We haven’t seen the issue occur again this week so I will go through the troubleshooting guide(s) next time it happens and feedback the result.

Thanks again for looking at this for us. I will let you know if we discover anything else about this issue.

ops · 30 April 2020 16:24

Our DB jump-box is currently the biggest culprit.
This time there wasn’t the issue with it being unable to restart.
We deployed a version of our software at 10:05 this morning and that was all green.
Went to deploy the next version at 16:36 (BST) and the Acquire Packages stepped failed with a timeout connecting to the tentacle

An error occurred when sending a request to ‘https://tentacle domain:10933/’, before the request could begin: The client was unable to establish the initial connection within the timeout 00:01:00.
15:42:02 Verbose | at Halibut.Transport.SecureListeningClient.HandleError(Exception lastError, Boolean retryAllowed)
| at Halibut.Transport.SecureListeningClient.ExecuteTransaction(Action1 protocolHandler, CancellationToken cancellationToken) | at Halibut.HalibutRuntime.SendOutgoingHttpsRequest(RequestMessage request, CancellationToken cancellationToken) | at Halibut.HalibutRuntime.SendOutgoingRequest(RequestMessage request, CancellationToken cancellationToken) | at Halibut.ServiceModel.HalibutProxy.Invoke(MethodInfo targetMethod, Object[] args) | --- End of stack trace from previous location where exception was thrown --- | at System.Reflection.DispatchProxyGenerator.Invoke(Object[] args) | at generatedProxy_1.StartScript(StartScriptCommand ) | at Octopus.Server.Orchestration.Targets.Tentacles.TentacleRemoteEndpointFacade.ExecuteCommand(StartScriptCommand command) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\Targets\Tentacles\TentacleRemoteEndpointFacade.cs:line 61 | at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.Immediate.ExecutionTargets.TentacleExecutionTarget.Execute(ScriptCollection bootstrapperScripts, IReadOnlyList1 boostrapperArguments, IReadOnlyList1 files, Nullable1 forceIsolationLevel, Boolean raw, Nullable1 isolationMutexTimeout) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ActionExecution\Immediate\ExecutionTargets\TentacleExecutionTarget.cs:line 47 | at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.Immediate.ImmediateExecutor.RunOnTarget(ScriptCollection finalBootstrapperScripts, IReadOnlyList1 bootstrapperArguments, IReadOnlyList1 allFiles, Nullable1 isolation, Nullable1 isolationMutexTimeout, ICalamariFlavour flavour, IReadOnlyList1 deploymentTools) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ActionExecution\Immediate\ImmediateExecutor.cs:line 183
| at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.Immediate.ImmediateExecutor.ExecuteCalamari(ICalamariFlavour calamariFlavour, String calamariCommand, IReadOnlyList1 calamariArguments, IReadOnlyList1 files, IReadOnlyList1 deploymentTools, VariableCollection extraVariables, TargetManifest targetManifest, Nullable1 isolationMutexTimeout) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ActionExecution\Immediate\ImmediateExecutor.cs:line 138
| at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionExecution.CommandBuilders.CalamariCommandBuilder.Execute() in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ActionExecution\CommandBuilders\CalamariCommandBuilder.cs:line 143
| at Octopus.Server.Orchestration.ServerTasks.Deploy.Steps.Acquire.PackageAcquisitionExecutionHandlers.<>c__DisplayClass3_0.b__0(IActionHanderContext ctx) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\Steps\Acquire\PackageAcquisitionExecutionHandlers.cs:line 32
| at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.AdHocActionDispatcher.InvokeActionHandler(Machine target, ActionHandlerInvocation actionHandler, ActionAndTargetScopedVariables actionAndTargetScopedVariables, IExecutor executor, TargetManifest targetManifest) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ActionDispatch\AdHocActionDispatcher.cs:line 167
| at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.AdHocActionDispatcher.Dispatch(Machine machine, ActionHandlerInvocation actionHandler, VariableCollection variables) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ActionDispatch\AdHocActionDispatcher.cs:line 62
| at Octopus.Server.Orchestration.ServerTasks.Deploy.Steps.Acquire.PackageAcquisitionExecutionHandlers.FindPackage(String packageId, IVersion version, String packageHash, PackageSearchOptions packageSearchOptions) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\Steps\Acquire\PackageAcquisitionExecutionHandlers.cs:line 40
| at Octopus.Server.Orchestration.Targets.Common.PackageStagingMediator.TryFindPackageInCache(StoredPackage packageToPush, PackageAcquisitionExecutionHandlers execs, StagedPackage& stagedPackage) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\Targets\Common\PackageStagingMediator.cs:line 118
| at Octopus.Server.Orchestration.Targets.Common.PackageStagingMediator.PushPackage(PackageIdentity packageIdentity, StoredPackage packageToPush, IDeploymentPackageDownloader deploymentPackageDownloader, DeltaCompressionOptions deltaCompressionOptions, PackageCacheOptions packageCacheOptions, PackageAcquisitionExecutionHandlers execs) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\Targets\Common\PackageStagingMediator.cs:line 99
| at Octopus.Server.Orchestration.Targets.Common.PackageStagingMediator.DownloadPackageAndPushToAgent(PackageIdentity packageIdentity, IDeploymentPackageDownloader deploymentPackageDownloader, DeltaCompressionOptions deltaCompressionOptions, PackageCacheOptions packageCacheOptions, PackageAcquisitionExecutionHandlers execs) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\Targets\Common\PackageStagingMediator.cs:line 68
| at Octopus.Server.Orchestration.ServerTasks.Deploy.Steps.Acquire.AcquireMachinePackageTask.Acquire(AcquiredPackageMap acquiredPackageMap, ITaskContext taskContext, LogContext logContext) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\Steps\Acquire\AcquireMachinePackageTask.cs:line 101
| at Octopus.Server.Orchestration.ServerTasks.Deploy.ExecutionTaskController1.ExecuteWithTransientErrorDetection(Action action, DeploymentTarget deploymentTarget) in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\Deploy\ExecutionTaskController.cs:line 636 | at Octopus.Server.Orchestration.ServerTasks.OctoThreadClosure1.Execute() in C:\buildAgent\work\47b5b27d3625d6ca\source\Octopus.Server\Orchestration\ServerTasks\OctoThreadClosure.cs:line 71

We logged into the server and found that the service was not running.
There is nothing logged in the Tentacle logs regarding an error or shutdown. The server uptime is over 9 days so it’s not a restart. Nothing in the Windows Event log.
Just… stopped.

It’s becoming frustrating that the tentacles just seem unreliable.
We have enough work to do without constantly babysitting Windows services. However we do also understand that this really is quite a vague issue to deal with. There’s no network interruptions from what we can see. Is there anything else we might be missing? Or that we can try?

Thanks.

dane.falvo · 8 May 2020 02:47

Hi Ops,

Sorry for the delay in getting back to you.

I suspect the issue you’re seeing is most likely going to be an environmental issue, as we haven’t been getting reports of this behaviour elsewhere. So at the moment, the best I can do is give you some ideas around where else to look. If a service has stopped, you would expect there to be some details around what terminated it, whether it’s in the Octopus Logs, server logs or elsewhere.

Common culprits that I have seen affect Tentacle services include Anti-virus and DSC configurations.

I can propose a workaround, however I fully understand this is not the advice you are after.

There is a tool built-in to the Octopus Tentacles called the Service Watchdog.
The watchdog will automatically restart the tentacle service if it stops.
Details are here: https://octopus.com/docs/administration/managing-infrastructure/service-watchdog

Please let me know if you discover anything, as you’ve got me very curious about the issues you are facing.

Regards,

Dane

ops · 13 May 2020 14:32

Hi Dane,

Thanks for getting back to me.

I too have been slow getting back to you so apologies for that.

The watchdog service might be useful… and the document you linked is actually very telling!

“This typically occurs during a restart after Windows Updates have been installed.”

This would be our most likely scenario as we are mainly noticing this issue after updates, and we have recently improved our Windows Update schedule.

The “Why Does It Happen” section is also ringing bells for us, not only in the fact that we don’t really know why it happens, but the suspicion that it might be to do with slow CLR initialisation might also lead us to the cause of other issues we also have.

We recently switched to Net Core for our main service worker, and I have been investigating why this service too does not start or stop reliably. It could all be related.

I feel that this might be the best solution to our problem for now so we are very grateful to you for pointing out this hidden gem of functionality.

I will update you how we get on.

Thanks,

Mike

ops · 15 May 2020 13:55

Another day another timeout.
It’s not an issue that can be solved with Watchdog it seems.
That is running normally. However our build server (TeamCity) has tried to contact the Tentacle and recieved a timeout error.

[14:36:13]
Failed: Acquire packages
[14:36:13]
Acquiring packages
[14:36:13]
Making a list of packages to acquire
[14:36:13]
The step failed: Activity failed with error 'An error occurred when sending a request to ‘https://“tentacle url”:10933/’, before the request could begin: The client was unable to establish the initial connection within the timeout 00:01:00.
[14:36:13]
The client was unable to establish the initial connection within the timeout 00:01:00.’.

I’ve checked the machine and it is running, the service is running, and the logs show the client authenticating at that time. No other log lines until a watchdog check shows up 4 minutes later.

I say contact the Tentacle. I of course mean TC is talking to Octopus Server (Cloud) and that in turn can’t talk to the Tentacle

ops · 15 May 2020 14:17

Is there anything we might be missing in network gateway setup that might be causing issues with the Server talking to our Tentacle?
If the Server Azure worker restarts and / or moves IP is it going to mess up a routing table somewhere?
I’m just guessing as network engineering is not my first discipline and my colleague set up our firewalls and routers. Is there anything I should be checking there?
I might get them to look at the logs at least and see if there is any unusual blocking occurring.

dane.falvo · 19 May 2020 09:48

Hey Mike,

Next step will be to get a copy of the Trace logs for your DB Jump-box Tentacle.

Turn on your trace logging: https://octopus.com/docs/support/log-files#Logfiles-Changingloglevelstentacle

Wait for the next time it fails - unless you’ve figured out a way to force the failure.
Get a copy of the Tentacle log file and send it over. (https://octopus.com/docs/support/log-files#Logfiles-Findingthelogfiles)

I will turn this ticket into a Private message so that you can safely send through your log files without exposing any sensitive details.

I’m thinking it might be a failing dependency for the octopus tentacle. Hopefully the tentacle trace log will be able to shed some light on it.

Regards,

Dane

ops · 19 May 2020 12:09

Thanks Dane.

I’ve enabled the trace log and we will respond next time we spot it.

Many thanks,

Mike

ops · 17 September 2020 12:52

Looks like I might have to reopen this issue.
Our DB jump box seems ok for now, but our new linux tentacles are falling apart.
They seem to fail every day.
Server logs show it being unable to communicate with the tentacle. The tentacle just seems to stop listening. It’s logs match the times shown. One hour it’ll be there and the next hour it won’t
I’ve already enabled Trace logging on one of them to see if we can catch the error.

This is causing us issues as it is failing deploys for no real reason.

ops · 17 September 2020 14:35

2020-09-17 14:10:45.5140  86913      5  INFO  listen://[::]:10933/              5  Accepted TCP client: [::ffff:13.66.133.169]:7360
2020-09-17 14:10:45.5140  86913      5 TRACE  listen://[::]:10933/              5  Performing TLS server handshake
2020-09-17 14:10:45.8390  86913     58 TRACE  listen://[::]:10933/             58  Secure connection established, client is not yet authenticated, client connected with Tls12
2020-09-17 14:10:45.9698  86913     58 TRACE  listen://[::]:10933/             58  Begin authorization
2020-09-17 14:10:45.9698  86913     58  INFO  listen://[::]:10933/             58  Client at [::ffff:13.66.133.169]:7360 authenticated as 4C808322E57AAD30A9B009817540F5F118A54FC0
2020-09-17 14:10:45.9698  86913     58 TRACE  listen://[::]:10933/             58  Begin message exchange
2020-09-17 14:10:46.3056  86913     58 TRACE  listen://[::]:10933/             58  Received: IScriptService::StartScript[1] / c6426903-685c-44f5-8d95-62de6f6fc742
2020-09-17 14:10:46.3056  86913     59 TRACE  [ServerTasks-33555] [RunningScript] [Read Lock] [no locks] Trying to acquire lock.
2020-09-17 14:10:46.3056  86913     59 TRACE  [ServerTasks-33555] [RunningScript] [Read Lock] ["ServerTasks-33555" (has a read lock)] Lock taken.
2020-09-17 14:10:46.3155  86913     58 TRACE  listen://[::]:10933/             58  Sent: Halibut.Transport.Protocol.ResponseMessage
2020-09-17 14:10:46.6080  86913     59 TRACE  [ServerTasks-33555] [RunningScript] [Read Lock] ["ServerTasks-33555" (has a read lock)] Releasing lock.
2020-09-17 14:10:46.8122  86913     58 TRACE  listen://[::]:10933/             58  Received: IScriptService::GetStatus[2] / 495b22bb-7906-41bd-8675-da97e3a1bbf2
2020-09-17 14:10:46.8243  86913     58 TRACE  listen://[::]:10933/             58  Sent: Halibut.Transport.Protocol.ResponseMessage
2020-09-17 14:10:47.1229  86913     58 TRACE  listen://[::]:10933/             58  Received: IScriptService::CompleteScript[3] / 2dd72212-8e7e-4f9a-a4e1-b028e584839d
2020-09-17 14:10:47.1247  86913     58 TRACE  listen://[::]:10933/             58  Sent: Halibut.Transport.Protocol.ResponseMessage
2020-09-17 14:20:47.1265  86913     58 TRACE  listen://[::]:10933/             58  No messages received from client for timeout period. Connection closed and will be re-opened when required

Logs are showing it closing it’s connection and then it doesn’t respond to the next call. Both our linux targets are currently hanging.
A Check Health connectivity call is also failing.
At the moment only a restart of the target service seems to bring it back but this is a poor solution. We also cannot run a watchdog as this is Ubuntu

Any suggestions?

ops · 17 September 2020 14:37

@dane.falvo not sure if you’re able to pick this up again?
Or anyone in @hello?

dane.falvo · 18 September 2020 00:01

Thanks Mike, Jeremy,

We will close this ticket, but reference it in the new ticket that Jeremy has opened up as this seems to be different to the initial topic.

Regards,

Dane