OD Error on some servers when triggering Server Restart on deployment target

Lastbuilders · 23 September 2022 08:30

Hi,
OD Version 2022.1.2121

We are working on a project to replace an existing deployment process with OD and have encountered an issue with some servers where a step to restart the server is failing regularly.

The step triggering the restart is as follows:

The step following this polls the server to confirm it has restarted before the remaining steps run.

The error we are seeing is:
Name Value
21:52:55 Verbose | ---- -----
21:52:55 Verbose | PSVersion 5.1.17763.2931
21:52:55 Verbose | PSEdition Desktop
21:52:55 Verbose | PSCompatibleVersions {1.0, 2.0, 3.0, 4.0…}
21:52:55 Verbose | BuildVersion 10.0.17763.2931
21:52:55 Verbose | CLRVersion 4.0.30319.42000
21:52:55 Verbose | WSManStackVersion 3.0
21:52:55 Verbose | PSRemotingProtocolVersion 2.3
21:52:55 Verbose | SerializationVersion 1.1.0.1
21:52:55 Verbose | PowerShell Environment Information:
21:52:55 Verbose | OperatingSystem: Microsoft Windows NT 10.0.17763.0
21:52:55 Verbose | OsBitVersion: x64
21:52:55 Verbose | Is64BitProcess: True
21:52:55 Verbose | CurrentUser: NT AUTHORITY\SYSTEM
21:52:55 Verbose | MachineName: {}
21:52:55 Verbose | ProcessorCount: 4
21:52:55 Verbose | CurrentDirectory: C:\Octopus\Work\20220922205252-357084-120
21:52:55 Verbose | CurrentLocation: C:\Octopus\Work\20220922205252-357084-120
21:52:55 Verbose | TempDirectory: C:\Windows\TEMP
21:52:55 Verbose | HostProcess: powershell (8144)
21:52:55 Verbose | TotalPhysicalMemory: 16776756 KB
21:52:55 Verbose | AvailablePhysicalMemory: 13034448 KB
21:52:55 Verbose | Invoking target script C:\Octopus\Work\20220922205252-357084-120\Script.ps1 with parameters.
21:52:56 Error | error occurred when sending a request to ‘https://{}:10933/’, after the request began: messageEnvelope is null
** | messageEnvelope is null
22:04:38 Error | The task was canceled
22:04:38 Verbose | Octopus.Server.Orchestration.ServerTasks.Deploy.ForcedGuidedFailureException
| at Octopus.Server.Orchestration.ServerTasks.Deploy.Guidance.HandleInterruption(Exception ex, String actionName, Boolean actionIsRequiredToRun, Maybe1 callbackOnExclude, ITaskLog taskLog, CancellationToken cancellationToken, Boolean wasLastAttempt) in Guidance.cs:line 181
| at Octopus.Server.Orchestration.ServerTasks.Deploy.Guidance.ExecuteWithGuidance(EitherAsyncOrSync callback, String actionName, Boolean actionIsRequiredToRun, Maybe1 callbackOnExclude, ITaskLog taskLog, CancellationToken cancellationToken) in Guidance.cs:line 111
| at Octopus.Server.Orchestration.ServerTasks.Deploy.Guidance.Execute(EitherAsyncOrSync callback, String actionName, Boolean actionIsRequiredToRun, ITaskLog taskLog, Maybe1 callbackOnExclude, CancellationToken cancellationToken) in Guidance.cs:line 78
| at Octopus.Server.Orchestration.ServerTasks.Deploy.PlannedStepControllers.ProcessStepController.<>c__DisplayClass10_1.<ExecuteActionAndInitLoggingContext in ProcessStepController.cs:line 276
| at Octopus.Server.Orchestration.ServerTasks.Deploy.TransientErrorDetectionExecutor.Execute(Func2 action, ExecutionPlan plan, ITaskLog taskLog, CancellationToken cancellationToken, DeploymentTarget deploymentTarget) in TransientErrorDetectionExecutor.cs:line 49
| at Octopus.Server.Orchestration.ServerTasks.Deploy.PlannedStepControllers.ProcessStepController.<>c__DisplayClass10_0.<ExecuteActionAndInitLoggingContext in ProcessStepController.cs:line 281
| at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.<>c__DisplayClass6_04.<Execute in UnitOfWorkExecutor.cs:line 147
| at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75
| at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75
| at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.Execute[T1,T2,T3,T4](Func6 action, CancellationToken cancellationToken, String name) in UnitOfWorkExecutor.cs:line 150
| at Octopus.Server.Orchestration.ServerTasks.Deploy.PlannedStepControllers.ProcessStepController.ExecuteActionAndInitLoggingContext(ExecutionPlan plan, ExecutionPlanner planner, PlannedStep step, DeploymentTarget targetContext, PlannedAction action, ITaskLog taskLogForTarget, ITaskLog taskLogRoot, CancellationToken cancellationToken) in ProcessStepController.cs:line 300
| --Inner Exception--
| The task was canceled
22:04:38 Fatal | The action RestartServer on {} failed

Is there any guidance on how we resolve this as it is blocking our project currently?

I also tried running the following command from the OD Server against the deployment target:
Invoke-Command -ScriptBlock { Restart-Computer } -ComputerName ServerName
but received this error:
OpenError: Connecting to remote server servername failed with the following error message : Access is denied.
Kind Regards,
Lastbuilders

clare.martin · 23 September 2022 09:54

Good morning @Lastbuilders,

Thank you for contacting Octopus Support, and sorry to hear you are running into errors with one of your scripts running inside Octopus.

The first error, messageEnvelope is null, is interesting. We have a forum post here from a user with a similar error, and our response there was that this error is usually environmental.

Are you able to connect to that target at all from the Octopus Server (via a health check in the Octopus UI, and also a ping from the Octopus Server itself)?

The second error, which you see when running the command from the OD server against the target, looks like a permissions issue. Is your Octopus Server service running as the Local System account or as a domain account? If it runs under a domain account, does that account have permission to restart machines remotely?

I know there are a few group policy settings that will block users from running certain commands (restarts of machines remotely is usually one an IT admin would set on a domain).

Could you run the same script on the Octopus Server outside of Octopus and see whether it lets you restart the server? Use PowerShell running as the domain account of the Octopus Server service if you have one set up; if the service runs as Local System, just run the script in PowerShell as an admin.

It may also be that the Tentacle service account doesn't have permission to restart the server. If your Octopus Server won't let you run the script via PowerShell, it may be worth checking your group policy settings to see whether those accounts are blocked from restarting machines.

You alluded to this being intermittent, though, which is slightly confusing. Are there machines that always work? If so, I wonder whether their Tentacle services run under the same accounts as the one that is erroring out.

Let me know what you think and the outcomes of running the script outside of Octopus and we can dive into this a bit further.

Kind Regards,

Clare

Lastbuilders · 23 September 2022 10:20

Hi @clare.martin, thanks for getting back to me so promptly.

Unfortunately I don’t have access to these servers to run commands on them as they are managed by our Operations teams but I can request them to run any tests we think useful.

The unusual thing is that it is intermittent. As background, the OD solution is configured as follows:

A Hosted OD install in EMEA which is deploying to 5 environments across 4 globally dispersed Regions.

When we run the OD Release on our QA\System Test instances in EMEA and our Sandbox instance in North America it is running fine including the Server restart.

When we try to run it on our NA UAT instances it gives this error until our Ops team does a restart outside of OD. It will then typically succeed, but fail again on the next deployment with the same messageEnvelope is null error.

The steps prior to the restart are installing some windows features, libraries and other project dependencies though these steps always succeed. It is the Restart step which seems to be the issue.

The restart step is used to release any files etc which could block the deployment.

After the restart step lots of other steps run to install Websites\COM+ libraries etc.

Kind Regards,
Lastbuilders

clare.martin · 23 September 2022 11:25

Hey @Lastbuilders,

Thank you for that information; it really helps explain what is going on here.

Are your QA\System Test and NA UAT instances in the same region or in different regions? Does one environment have tighter restrictions than the other, given that one of them is a test environment? Are the Tentacles the same spec across the two environments (same OS and Tentacle versions)?

Unfortunately it definitely looks environmental to me. If deployments work to one environment but not to the other (your NA UAT instances), and the NA UAT instances sometimes allow a restart and sometimes do not, then something seems to be going on with the Tentacles associated with the NA UAT environment.

From what you have said, you update Windows and some of its features and that always succeeds, but the restart does not happen every time (a manual one by the ops team gets the deployment going again). It could be that the updates you are applying to that machine lock files that are used when running remote commands against it. I am just guessing here, based on my server infrastructure knowledge.

It would be worth doing the following:

Run the deployment on your NA UAT environment, but to a single Tentacle.
If it fails to restart, the first thing I would do is get the Tentacle log from that machine (feel free to send it to us if you want; we can send you a secure link to upload it to). Check the log for any errors that jump out at you.
Log on to the Tentacle machine and make sure there are no Windows messages saying the server needs a restart. Certain Windows updates can block remote tasks from running on that server (i.e. the remote restart), in which case you may have to restart the server manually.
I would get your ops team to go onto the Octopus Server and try running the restart script through PowerShell (remember to run it as the domain account that runs the Octopus service, or, if that is Local System, run the PowerShell window as admin). If you get permissions issues, those need investigating. If you can reboot the server, we can rule that out.

You did not mention how you are running the task, but the best way to do this would be to have a worker within that environment that can run remote PowerShell commands on all the targets, and then use:

Invoke-Command -ScriptBlock { Restart-Computer } -ComputerName #{Octopus.Machine.Name}
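
As a rough sketch only (not something built into Octopus), a worker-side script step could pair that restart with a polling helper so the deployment only continues once the target answers again. The helper name `Wait-ForCondition`, the timeout values, and the use of the Tentacle port 10933 as the "back up" signal are assumptions for illustration; the usage is left commented out so that loading the file never restarts anything.

```powershell
# Poll a script block until it returns $true or the timeout expires.
# Returns $true on success and $false on timeout.
# Hypothetical helper name; not part of Octopus or PowerShell.
function Wait-ForCondition {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $Condition,
        [int] $TimeoutSeconds = 600,
        [int] $IntervalSeconds = 10
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        if (& $Condition) { return $true }
        Start-Sleep -Seconds $IntervalSeconds
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Usage sketch for a script step running on the worker (assumptions:
# the Octopus machine name resolves in DNS, and remoting is permitted).
#
# $target = $OctopusParameters["Octopus.Machine.Name"]
# Invoke-Command -ComputerName $target -ScriptBlock { Restart-Computer -Force }
# Start-Sleep -Seconds 60   # give the machine time to actually go down
# $up = Wait-ForCondition -TimeoutSeconds 900 -Condition {
#     Test-NetConnection -ComputerName $target -Port 10933 -InformationLevel Quiet
# }
# if (-not $up) { throw "$target did not come back within the timeout" }
```

Running the restart from a worker rather than from the Tentacle itself means the step does not depend on the Tentacle connection surviving its own shutdown, which may be relevant to the messageEnvelope is null error above.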

I hope my suggestions help, let me know if it happens again and if you manage to get a log of the tentacle. Also, if the tentacle machine specs are different on the two environments that may factor in too.

Kind Regards,

Clare

Lastbuilders · 23 September 2022 13:27

Hi, we ran the step to restart the server on a single node in each of 4 regions; 2 failed and 2 succeeded. Which logs do you require? The build log or the OD logs at [drive]:\Octopus Deploy\Tentacle\Logs?

Thanks,
Lastbuilders

Lastbuilders · 23 September 2022 13:39

Hi @clare.martin,
Just to answer your queries.

Regarding Tentacle versions, one environment without the issue is on 6.1.1271 and one with the issue is on 6.1.1304.

Kind Regards,
Lastbuilders

clare.martin · 23 September 2022 13:54

Hey @Lastbuilders,

The Tentacle logs at [drive]:\Octopus Deploy\Tentacle\Logs would be perfect, please. You can send them via the secure link I created for you here.

Thank you for the clarification on the tentacles too, are they all the same OS environments?

Let me know once the tentacle logs have been uploaded and I can take a look for you.

Kind Regards

Clare

Lastbuilders · 23 September 2022 14:23

Hi @clare.martin ,

Those logs are uploaded now. All servers are 2019.

Kind Regards,
Lastbuilders

Lastbuilders · 23 September 2022 15:23

Hi,

We also completed deployments without the Restart Server step and they all succeeded so the issue does definitely appear related to that action on some environments whereas the same restart step works elsewhere.

Kind Regards,
Lastbuilders

clare.martin · 23 September 2022 15:31

Hey @Lastbuilders,

Thanks for the logs. I have taken a look, and in today's logs the Tentacle service starts and stops fine on all of them.

Is the machine BRSPVXAGEMSUAT2 for Octopus EMEA working? The other two logs each have one error message (shown below), but the EMEA logs look fine.

The Tentacle from the US (QASFVUA----) has this in the logs (notice it mentions Boolean shutdownCheck):

2022-09-23 08:35:14.7050   3160     28  INFO  Stopping the Windows Service
2022-09-23 08:35:14.7206   3160     28  INFO  listen://[::]:10933/             28  Listener stopped
2022-09-23 08:35:14.7206   3160     28  INFO  The Windows Service has stopped
2022-09-23 08:35:14.7519   3160     33  INFO  listen://[::]:10933/             33  Socket IO exception: [::ffff:172.27.4.140]:14227
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'SslStream'.
   at System.Net.Security.SslState.CheckThrow(Boolean authSuccessCheck, Boolean shutdownCheck)
   at System.Net.Security._SslStream.ProcessFrameBody(Int32 readBytes, Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security._SslStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)

The same with the Octopus CA logs from TORLVUAX-----:

2022-09-23 09:16:51.7710   2580     17  INFO  Stopping the Windows Service
2022-09-23 09:16:51.7710   2580     17  INFO  listen://[::]:10933/             17  Listener stopped
2022-09-23 09:16:51.7710   2580     17  INFO  The Windows Service has stopped
2022-09-23 09:16:51.8574   2580      6  INFO  listen://[::]:10933/              6  Socket IO exception: [::ffff:172.27.4.140]:14790
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'SslStream'.

Do those timings correspond to when you deployed and saw the failures? There is a forum post here where one of our users had the same issue and mentioned he was not able to deploy apps until he restarted the server. Unfortunately that issue does not have a resolution.
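
To help line those timings up with the failed deployments, a small script along these lines could pull the Socket IO exception entries out of a Tentacle log. This is a minimal sketch that assumes the line layout shown in the excerpts above (timestamp with four fractional digits at the start of the line); the function name and the `$logDir` variable in the usage comment are hypothetical.

```powershell
# Extract "Socket IO exception" entries from Tentacle log lines.
# Emits one object per match with the parsed timestamp and the remote endpoint.
# Assumes the log line layout shown in the excerpts in this thread.
function Get-TentacleSocketError {
    param([string[]] $Line)

    $pattern = '^(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{4})\s.*Socket IO exception:\s*(?<remote>\S+)'
    foreach ($l in $Line) {
        if ($l -match $pattern) {
            [pscustomobject]@{
                Timestamp = [datetime]::ParseExact(
                    $Matches['ts'],
                    'yyyy-MM-dd HH:mm:ss.ffff',
                    [System.Globalization.CultureInfo]::InvariantCulture)
                Remote    = $Matches['remote']
            }
        }
    }
}

# Usage sketch ($logDir is a placeholder for the Tentacle log folder):
#
# Get-ChildItem -Path $logDir -Filter *.txt |
#     ForEach-Object { Get-TentacleSocketError -Line (Get-Content $_.FullName) } |
#     Sort-Object Timestamp
```

The resulting timestamps can then be compared against the times in the Octopus task log for the failed RestartServer action.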

There is also this user who ran into the same error message. It's not Octopus related, but this user's code would not run (he got the error message you are seeing) until he rebooted his machine.

This looks to be potentially coming from .NET; there is an issue relating to it here. From the sounds of it, something has been disposed of that the Tentacle needs access to in order to deploy. Perhaps this stops Octopus from actually running the restart script?

Once your ops team restarts the server, the Tentacle has access to the object it requires and continues to deploy. Unfortunately I don’t know why this would happen, what the disposed object is, or why it only happens on Tentacle version 6.1.1304. Have you tried upgrading the Tentacles at all? We have not seen this error come up other than in that one forum post, so this does look to be environmental rather than an issue with Octopus, but it would be worth upgrading a Tentacle to see if newer versions handle that disposed object better.

Also, are the Tentacles all running the same .NET versions?

Does the issue happen on BRSPVXAGEM---- for Octopus EMEA? That machine doesn't have the error in its log and is running 6.1.1304, the same as the other two Tentacles.

I look forward to hearing from you,

Kind Regards,

Clare

Lastbuilders · 27 September 2022 11:20

Hi @clare.martin,

I have asked our Operations team to investigate the issue further. Unfortunately I am unable to recreate the issue on the Development and System Test environments I have access to. I do not believe it is tentacle related as we have seen the issue on servers with a given tentacle version i.e. 6.1.1304 whereas other servers with the same version work as expected.

Any guidance on how to troubleshoot this further would be appreciated.

Kind Regards,
Lastbuilders

clare.martin · 27 September 2022 14:54

Hey @Lastbuilders,

We have been discussing this ticket in our 3pm support meeting as it is quite confusing as to why some servers are not rebooting.

Our Lead engineer remembered a forum post on this topic which may be of use to you.

Apart from what I have already suggested, I can't think of anything else that would help. The only other thing that was mentioned was to include a health check step on the target after the Windows update install step:

Install Windows Updates
Run Octopus Tentacle Health Check
If health check passes run reboot script

This might allow you to narrow it down. If the health check fails, it means the Octopus Server is not able to communicate with the Tentacle in order to get it to run the shutdown script. There is no evidence in the logs that the Tentacle has lost connection to the server other than the SSL error I posted, but again that seems more .NET related. The Tentacle might need Octopus to try to talk to it before the logs show a failed connection attempt.

Let me know if that article and my suggestions help. The Ops team definitely needs to try a remote restart of those servers after a failed Octopus deployment, to see whether that script actually runs after Windows updates have been installed.

The health check option after a failed deployment is also a good idea. For testing, this doesn’t really need to be added to the project: you would need a deployment to fail with the messageEnvelope is null error, and then you could manually run a health check via the Octopus UI to see if it passes. If it fails, then you know why the restart doesn’t work (because Octopus can’t talk to the Tentacle to send the script over) and you can look at why that is happening.
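
Alongside the health check in the UI, the Ops team could also test raw TCP reachability of the Tentacle port from the Octopus Server straight after a failed deployment. The sketch below is an illustration under assumptions: the function name is hypothetical, 10933 is taken from the listening port in the logs above, and a successful connect only shows that something accepts connections on that port. It does not prove the Tentacle handshake works, so it complements the Octopus health check rather than replacing it.

```powershell
# Return $true if a TCP connection to the given host and port succeeds
# within the timeout, otherwise $false. Hypothetical helper name.
function Test-TcpPort {
    param(
        [Parameter(Mandatory = $true)] [string] $ComputerName,
        [int] $Port = 10933,      # Tentacle listening port seen in the logs
        [int] $TimeoutMs = 3000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) {
            return $false         # timed out before the connection completed
        }
        $client.EndConnect($async)  # throws if the connection was refused
        return $true
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}

# Usage sketch (the target name is a placeholder):
#
# Test-TcpPort -ComputerName 'target-server-name' -Port 10933
```

If this returns false while the machine itself responds to ping, that points at the Tentacle service or a firewall rule rather than at the restart script.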

My guess is that some .NET library or similar has been updated, which means the Tentacle loses connection to the server.

Let me know how you get on.

Kind Regards,

Clare
