Octopus Deployments Not working

Pradeep.Duraisamy · 13 September 2023 05:02

Hi Team,

When we trigger a new deployment, on some of the targets the deployment does not complete. It stays in progress for a long time until we cancel the task. When we trigger it again, we are able to deploy. We are not sure why it randomly fails to deploy the applications. Please advise us on the issue.

Thanks.

finnian.dempsey · 13 September 2023 05:33

Hi @Pradeep.Duraisamy,

Great to hear from you again, thanks for reaching out!

I’d be happy to take a look into what’s going on with this hanging task. Could you please send through the task logs to our secure upload portal?

We have made a bunch of improvements recently which could help here, such as Step Retries in 2023.2, which automatically retries any steps that fail after a specific time.

Looking forward to getting to the bottom of this, feel free to reach out with any questions at all!

Best Regards,

Pradeep.Duraisamy · 13 September 2023 06:04

hi finnian.dempsey

I have uploaded the file

finnian.dempsey · 13 September 2023 06:24

Hi @Pradeep.Duraisamy,

Cheers for that, confirming I’ve received it ok!

It looks like it’s failing to check a machine’s health. Could you please confirm whether the target machine is using the default policy and whether the health check has been modified at all?

Example of Default Health Check:

I can see that the health check script exited with exit code -1. Could you please send through the Tentacle logs for that timeframe as well?

There’ll likely be a corresponding Windows event for that exit code -1 with more info about what caused the failure. I’d also be curious whether this happens repeatedly or just sometimes. Are you manually triggering the deployment or is it from a build server? Do you use the same method each time?

Feel free to let us know if you have any questions at all!

Best Regards,

Pradeep.Duraisamy · 13 September 2023 07:54

Hi Dempsey,

Task logs from before cancelling:

| == Running: Release packages ==
|
| Running: XXXXXXXXX
02:48:12 Info | Releasing package lock for XXXXXXXXXXXXXXXXXXXXXXXX.
02:48:12 Verbose | Acquiring isolation mutex RunningScript with NoIsolation in ServerTasks-708643
02:48:12 Verbose | Executable directory is C:\Windows\system32\WindowsPowershell\v1.0
02:48:12 Verbose | Executable name or full path: C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe

clare.martin · 13 September 2023 08:04

Hey @Pradeep.Duraisamy,

Thank you for those logs, but I am afraid they don’t tell us much. Are you able to answer the questions Finnian mentioned specifically:

Does this happen repeatedly or just sometimes?
Does it happen to the same machines or different machines each time?
Are you manually triggering the deployment or is it from a build server?
Do you use the same method for deployment each time?

Can you also send over the Tentacle logs Finnian mentioned? You can use the same secure link he sent over.

One thing to try also would be our tentacle ping tool, as that is our gold standard tool for connection testing between your Octopus Server and your Tentacle. If you install tentacle ping on your Tentacle and ping your Octopus Server whilst you are deploying, it will tell you if there are any connection errors.

Can you please send over the results of a tentacle ping taken during a deployment that fails with the same errors you are seeing above?

I look forward to hearing from you and getting those logs.

Kind Regards,
Clare

Pradeep.Duraisamy · 13 September 2023 08:51

I have attached the tentacle logs.

clare.martin · 13 September 2023 09:32

Hey @Pradeep.Duraisamy,

Thank you for those logs, they paint a good picture of what is going on here. It looks like you have an intermittent networking issue on the affected machines. A few things from the logs that stand out:

System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
2023-09-10 19:55:11.8999 15564 13 ERROR listen://[::]:10933/ 13 Unhandled error when handling request from client: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.
2023-09-10 19:59:32.7114 15564 34 ERROR listen://[::]:10933/ 34 Unhandled error when handling request from client: xxxxxxxxxxxxxxxxxxxxxxxxxxxx
System.IO.IOException: The handshake failed due to an unexpected packet format.
System.Security.Authentication.AuthenticationException: A call to SSPI failed, see inner exception. —> System.ComponentModel.Win32Exception: The client and server cannot communicate, because they do not possess a common algorithm
System.Security.Authentication.AuthenticationException: A call to SSPI failed, see inner exception. —> System.ComponentModel.Win32Exception: The token supplied to the function is invalid.

We do have a page about the latter two errors which may be of help, but the fact that this works intermittently would usually mean your SSL and TLS settings are correct. Usually when we see those errors, the Tentacle will not communicate with the Octopus Server at all.

I would speak to your networking team here as I suspect something like a proxy or firewall is blocking your tentacles from communicating with the Octopus Server sometimes.

I would also run the tentacle ping I suggested. You should see random connection dropouts, since your Tentacles are regularly unable to communicate with your Octopus Server.

Hopefully the networking logs show blocked requests and you can get this sorted, reach out if you need further assistance but the tentacle ping and networking logs should show connection errors and dropouts.
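To get a feel for how often each kind of error shows up, the Tentacle log can be tallied by error type. The following is only a rough sketch in Python: the bucket names and regular expressions are assumptions made for illustration (they are not part of Octopus or Tentacle), and the sample lines are the exception messages quoted above.

```python
import re
from collections import Counter

# Hypothetical buckets; the patterns come from the exception text quoted in this thread.
CATEGORIES = {
    "connection reset": re.compile(r"SocketException"),
    "unexpected EOF": re.compile(r"unexpected EOF"),
    "TLS handshake": re.compile(r"handshake failed|SSPI failed"),
}

def classify(lines):
    """Count how many log lines fall into each error bucket."""
    counts = Counter()
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                counts[name] += 1
                break  # the first matching bucket wins
    return counts

sample = [
    "System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host",
    "System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.",
    "System.IO.IOException: The handshake failed due to an unexpected packet format.",
    "System.Security.Authentication.AuthenticationException: A call to SSPI failed, see inner exception.",
]

for name, count in classify(sample).items():
    print(f"{name}: {count}")
```

Running the same function over a whole Tentacle log file (for example `classify(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever your log lives) would show whether the handshake errors are rare outliers or as frequent as the connection resets.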

Kind Regards,
Clare

Pradeep.Duraisamy · 13 September 2023 09:42

I am manually cancelling the tasks as they are taking too much time. Are the errors caused by manually cancelling the tasks?

clare.martin · 13 September 2023 09:49

Hey Pradeep,

In my experience those are networking errors and do not come from manually cancelling tasks. The error is coming from a health check from your Octopus Tentacle.

The initial task you sent over was a health check task which is the one that was failing.

If you have a deployment task where the Tentacle is getting stuck and you then manually cancel it, please upload the task log for that hanging deployment task to the secure site so we can take a look. It does look like it may be getting stuck due to the errors I mentioned previously: rather than Octopus cancelling the task, it’s getting stuck because it can’t contact the Tentacle, so you then have to cancel it manually.

If we can get some logs which show the task getting stuck we may be able to spot the networking issues.

Kind Regards,
Clare

Pradeep.Duraisamy · 13 September 2023 09:53

Hi Clare.Martin,

Attached the task logs for the reference.

Tentacle Ping Result:
D:\TentaclePing.1.1.0>TentaclePong.exe 10933
Listening on port 10933
Listening, hit to exit…

Thanks

clare.martin · 13 September 2023 10:23

Hey @Pradeep.Duraisamy,

Thank you for those logs, it looks like it’s pending at the Acquire packages stage:

02:56:54   Verbose  |       Successfully finished IIS AppPool - Stop on XXXXXXXXXXXXXXXXX
                    |     
                    |   == Pending: Acquire packages ==

But this would happen if the Tentacle connection did drop out. The tentacle ping results you posted are not quite what we are after, though; the output needs to look like this:

If you use this page to install tentacleping.exe (Not tentaclepong) on your tentacle and then run the command:

tentaclePing.exe <IPAddressOfServer>

ie tentaclePing.exe 192.168.0.201

You should see successes as I have, but you will need to do this during a deployment, as that is when it is getting stuck. Unfortunately this will be a case of running tentacleping.exe continuously against a target that you either deploy to regularly, or a test target that you can deploy to in order to test the networking.

I would also speak to your networking team and see if there are connections getting blocked as they will be able to tell you straight away rather than having to keep the tentacle ping running.
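If it helps to line connection results up against the task log, a timestamped probe can be left running alongside the deployment. The sketch below is a stand-in written in Python, not TentaclePing itself: it only checks that a plain TCP connection opens, so it will not catch TLS handshake problems. The function names `probe` and `monitor` are made up for this example.

```python
import socket
import time
from datetime import datetime

def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(host, port, attempts, interval=1.0):
    """Probe repeatedly; return (timestamp, ok) pairs to compare with the task log."""
    results = []
    for _ in range(attempts):
        ok = probe(host, port)
        stamp = datetime.now().strftime("%H:%M:%S")
        results.append((stamp, ok))
        print(stamp, "OK" if ok else "FAILED")
        time.sleep(interval)
    return results
```

For example, `monitor("192.168.0.201", 10933, attempts=600)` would probe roughly once a second for ten minutes, using the example address above and the port shown in the Tentacle logs; substitute whichever endpoint you are testing. Any FAILED timestamps can then be matched against the point where the task log stops moving.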

Kind Regards,
Clare

Pradeep.Duraisamy · 13 September 2023 11:08

Hi Clare,

We triggered a deployment in one server and we could see that it got stuck.

While running the tentacle ping continuously from the Octopus Server to the target deployment server, we could see that it connected successfully during the issue.

Attached the recent task log and tentacle ping details in the secure portal

Thanks

clare.martin · 13 September 2023 11:56

Hey @Pradeep.Duraisamy,

Thank you for those recent logs, I can see a successful ping which is great!

Your recent log has confused me a bit though as that failure is a failure to stop the app services pool on one of your machines.

The machine below (ending in APP2) fails on the IIS stop and you then cancel the deployment as it looks like it gets stuck there not the acquire package stage:

                    |   == Failed: Step 2: XXXXXXX.Services-AppPool- Stop ==
05:55:39   Fatal    |     The step failed: The operation was canceled.
05:55:39   Info     |     The operation was canceled.

That step failed due to a bootstrap error which is usually down to antivirus software blocking a PowerShell script that needs to run to complete the deployment:

                    |       MachineName: XXXXXXAPP2
                    |       ProcessorCount: 4
                    |       CurrentDirectory: C:\Windows\system32
                    |       TempDirectory: C:\Users\XXXXXXX\AppData\Local\Temp\
                    |       HostProcessName: Octopus.Server
                    |       PID: 5792
05:54:11   Verbose  |       Executing XXXXXXX.Services-AppPool- Stop (type Run a Script) on XXXXXXXAPP2
05:54:11   Verbose  |       Acquiring isolation mutex RunningScript with NoIsolation in ServerTasks-708671
05:54:11   Verbose  |       Executable directory is C:\Windows\system32\WindowsPowershell\v1.0
05:54:11   Verbose  |       Executable name or full path: C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe
05:55:39   Verbose  |       Starting C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe in working directory 'D:\Octopus\Work\20230913105411-708671-276' using 'OEM United States' encoding running as 'NT AUTHORITY\SYSTEM'
05:55:39   Verbose  |       Swallowing OperationCanceledException while waiting for last of the process output.
05:55:39   Verbose  |       Swallowing OperationCanceledException while waiting for last of the process output.
05:55:39   Verbose  |       Bootstrapper did not return the bootstrapper service message
                    |       System.Exception
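In the excerpt above, nothing is logged between 05:54:11 and 05:55:39, which is where the step sits before it is cancelled. When comparing a stuck task log with a successful one, pauses like this can be picked out automatically. The following is a small Python sketch under assumptions: it expects each timestamped line to start with `HH:MM:SS` as in the excerpts in this thread, and the 30-second threshold is an arbitrary choice.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2})\s")

def find_gaps(lines, threshold_seconds=30):
    """Return (previous_time, current_time, seconds) for each pause above the threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation lines carry no timestamp
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta < 0:
                delta += 86400  # the log rolled past midnight
            if delta > threshold_seconds:
                gaps.append((previous.strftime("%H:%M:%S"), match.group(1), int(delta)))
        previous = current
    return gaps

log = [
    r"05:54:11   Verbose  |       Executable directory is C:\Windows\system32\WindowsPowershell\v1.0",
    r"05:55:39   Verbose  |       Bootstrapper did not return the bootstrapper service message",
    "                    |       System.Exception",
]

print(find_gaps(log))  # [('05:54:11', '05:55:39', 88)]
```

Running it over both the failed and the successful task log shows whether the PowerShell start-up pause only appears in the deployments that hang.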

Do we have a task log of that same deployment running successfully so we can compare the two? If so, are you able to upload that to our secure site please?

Alongside this can we also get a process JSON of your deployment so we can see what scripts are being run, could you upload that to the secure site too please.

It may be that AV is blocking some PowerShell scripts from running and that’s why this task is getting stuck; we see this a lot on customer instances. We do have some documentation on what you need to whitelist for your deployments to work, if you want to check that out and make sure you have those exclusions set up in your AV whitelist.

Hopefully we can get to the bottom of this for you soon!
Kind Regards,
Clare

Pradeep.Duraisamy · 13 September 2023 12:22

Hi Clare,

I have attached the successful task log and JSON files in the secure site. We have already excluded the D:\Octopus path and the PowerShell folders from the antivirus agent. If antivirus blocks the path, then it should be blocked permanently, am I right?

Thanks

clare.martin · 13 September 2023 14:45

Hey @Pradeep.Duraisamy,

Thanks for the other logs and for confirming you have whitelisted those folders. It could be that you have on-access scanning enabled in your AV, which scans those files initially per logon session, and some deployments are timing out because AV is re-scanning those files.

I am wondering if you are able to temporarily disable AV on the machine ending in CAPP2 and see if that alleviates this issue.

If it does not, can we get a process dump from that same machine? The Calamari process will need to be captured, so please follow the process below the line that states:

When capturing a process dump for Tentacle.exe , please also capture any child Calamari.exe processes. To do this, follow the process below.

You can upload that to our secure site once it has completed. Unfortunately you will need to grab it on every deployment until the deployment gets stuck, as our engineers will need to see the deployment from the start of the process to see why it’s getting stuck.

Hopefully that machine fails on the very next deployment, so you don’t have to keep grabbing process dumps until it does fail.

I am sorry we have not managed to get to the bottom of this issue yet, hopefully the AV disable and / or a process dump will help.

Kind Regards,
Clare

s.subramanian · 14 September 2023 05:37

Hi Clare,

The requested dumps are attached in secure site. Kindly check and guide us further.

clare.martin · 14 September 2023 08:21

Hey @s.subramanian and @Pradeep.Duraisamy,

Thank you for providing us with the dump and some more screenshots. I will collate all the information we have so far, along with the dump, and send it to our engineers. Support are unable to analyse process dumps, so I will let you know once I hear back from the engineers.

They work in Australia and have gone for the day, so unfortunately it will be tomorrow morning before I can get any updates out to you.

In the meantime, I noted on your latest screenshot of the task in Octopus that you have a warning saying:

‘Waiting for the script in ServerTasks708715 to finish as this script requires that no other Octopus scripts are executing on this target at the same time.’

Usually when we see this, a reboot of the machine that is running that task helps to clear the hanging task, and you are then able to re-deploy. I realise you can re-deploy after manually cancelling the task in Octopus, and this usually works, but when was the last time those 4 machines in that task were rebooted? It may be worth giving them all a quick reboot, if you can, to see if that helps alleviate the issue permanently.

I will get the engineers to take a look at the dump but I would recommend a reboot of all of those machines in that deployment if you can just to rule out something internally that is getting stuck on those machines which a reboot may help fix.

I will get back to you once I hear back from our engineers.

Kind Regards,
Clare

clare.martin · 15 September 2023 08:10

Good morning @Pradeep.Duraisamy and @s.subramanian,

Our engineers took a look at all the logs we have collected and the process dump and they do agree with my thoughts that this is antivirus related unfortunately.

They did agree the bootstrapper error in your logs is consistent with what we see when antivirus is checking a file Octopus needs to run a script (in this case a PowerShell file).

They then took a look at the process dump and mentioned that it is attempting to run the PowerShell profile script:

C:\Windows\system32\config\systemprofile\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1

The engineers mentioned you could check that script’s contents, but they suspect that AV is preventing it from running. The image below shows the part of the process dump where this is happening:

The next step in the troubleshooting process would be to rule out your antivirus. Ideally you would temporarily disable it on the machine ending in CAPP2 (to begin with, as that is the one that seems to get stuck all the time in your logs). We see customers who appear to have whitelisted all the relevant folders, and when they then disable AV the deployment works, so that would be the best way to rule out AV as the cause of this issue.

It is likely that AV is scanning those PowerShell scripts on occasion when they run which is why this works sometimes and not others.

I am sorry I do not have better news on this for you but let me know if this continues to happen after disabling AV on the affected machines and we can take another look at this for you.

Kind Regards,
Clare

Pradeep.Duraisamy · 15 September 2023 08:31

Hi Clare,

Thanks for your support. Could you please advise which folders need to be whitelisted in AV? Do we need to whitelist this folder as well: C:\Windows\system32\config\systemprofile\Documents\WindowsPowerShell ?

Thanks