When we trigger a new deployment on some of the targets, the deployment does not complete. It stays in progress for a long time until we cancel the task. When we trigger it again, we are able to deploy. We are not sure why it randomly fails to deploy the applications. Please advise us on the issue.
I can see that the health check script exited with exit code -1. Could you please send through the Tentacle logs for that timeframe as well?
There will likely be a corresponding Windows event for that exit code -1 with more information about what caused the failure, but I'd also be curious whether this happens repeatedly or just sometimes. Are you manually triggering the deployment, or is it from a build server? Do you use the same method each time?
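If it helps to narrow that down, the corresponding error events are usually in the Application event log on the target machine. A rough sketch of a query (assuming Windows PowerShell is available on the Tentacle machine; adjust the time window to your failure timestamp):

```shell
# List recent Application-log errors around the failure window
Get-WinEvent -FilterHashtable @{
    LogName   = 'Application'
    Level     = 2                       # 2 = Error
    StartTime = (Get-Date).AddHours(-2) # widen or narrow as needed
} | Format-Table TimeCreated, ProviderName, Id, Message -AutoSize
```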
Feel free to let us know if you have any questions at all!
One thing to try would be our Tentacle Ping tool, which is our gold-standard tool for connection testing between your Octopus Server and your Tentacle. If you install Tentacle Ping on your Tentacle and ping your Octopus Server while a deployment is running, it will tell you whether there are any connection errors.
Can you please send over the results of a Tentacle Ping run during a deployment that fails with the same errors you are seeing above?
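As a quick first check before installing anything, you can also confirm basic TCP reachability of the Tentacle from the Octopus Server. A sketch, assuming a listening Tentacle on the default port 10933 and PowerShell on the server (`tentacle-host` is a placeholder for your machine name):

```shell
# From the Octopus Server: test TCP connectivity to the listening Tentacle
Test-NetConnection -ComputerName tentacle-host -Port 10933
```

Note this only proves the port is open at that instant; Tentacle Ping during a deployment is still the better test for intermittent drops.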
I look forward to hearing from you and getting those logs.
Thank you for those logs; they paint a good picture of what is going on here. It looks like you have an intermittent networking issue on the affected machines. A few things from the logs stand out:
System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
2023-09-10 19:55:11.8999 15564 13 ERROR listen://[::]:10933/ 13 Unhandled error when handling request from client: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.
2023-09-10 19:59:32.7114 15564 34 ERROR listen://[::]:10933/ 34 Unhandled error when handling request from client: xxxxxxxxxxxxxxxxxxxxxxxxxxxx
System.IO.IOException: The handshake failed due to an unexpected packet format.
System.Security.Authentication.AuthenticationException: A call to SSPI failed, see inner exception. ---> System.ComponentModel.Win32Exception: The client and server cannot communicate, because they do not possess a common algorithm
System.Security.Authentication.AuthenticationException: A call to SSPI failed, see inner exception. ---> System.ComponentModel.Win32Exception: The token supplied to the function is invalid.
We do have a page about the latter two errors which may be of help, but the fact that this works intermittently usually means your SSL and TLS settings are correct; when we see those errors, the Tentacle will typically not communicate with the Octopus Server at all.
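If you do want to double-check the Schannel settings on both machines while you are there, any protocol overrides live under the standard Schannel registry path. A hedged sketch (PowerShell; if no keys exist, the OS defaults are in effect):

```shell
# List any configured Schannel protocol overrides (absent keys = OS defaults)
Get-ChildItem 'HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols' -Recurse |
    Format-List PSPath, Property
```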
I would speak to your networking team here, as I suspect something like a proxy or firewall is intermittently blocking your Tentacles from communicating with the Octopus Server.
I would also run the Tentacle Ping I suggested: if your Tentacles are regularly unable to communicate with your Octopus Server, you should see random connection dropouts.
Hopefully the networking logs show blocked requests and you can get this sorted. Reach out if you need further assistance, but the Tentacle Ping and the networking logs should both show the connection errors and dropouts.
In my experience, those errors are networking errors rather than the result of manually cancelling tasks; the error is coming from a health check on your Octopus Tentacle.
The initial task you sent over was a health check task which is the one that was failing.
If you have a deployment task where the Tentacle is getting stuck and you then manually cancel it, please upload the task log for that hanging deployment to the secure site so we can take a look. It does look like the task may be getting stuck because of the errors I mentioned previously: rather than being cancelled cleanly in Octopus, it hangs because it cannot contact the Tentacle, and you then have to cancel it manually.
If we can get some logs which show the task getting stuck we may be able to spot the networking issues.
If you use this page to install TentaclePing.exe (not TentaclePong) on your Tentacle and then run a command such as:
tentaclePing.exe 192.168.0.201
You should see successes as I did, but you will need to do this during a deployment, as that is when it gets stuck. Unfortunately, this will mean running TentaclePing.exe continuously against a target that you either deploy to regularly, or a test target you can deploy to in order to exercise the networking.
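To keep it running and timestamp any dropouts, you could wrap it in a simple loop. A sketch, assuming PowerShell, that each TentaclePing run exits when finished, and a placeholder log path:

```shell
# Run TentaclePing repeatedly, appending timestamped output to a log
while ($true) {
    "=== $(Get-Date -Format o) ===" | Out-File -Append C:\Temp\tentacleping.log
    .\tentaclePing.exe 192.168.0.201 2>&1 | Out-File -Append C:\Temp\tentacleping.log
    Start-Sleep -Seconds 10
}
```

Any run that logs a connection error with a timestamp inside a deployment window is exactly what your networking team will want to see.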
I would also speak to your networking team to see whether any connections are being blocked, as they will be able to tell you straight away rather than you having to keep Tentacle Ping running.
Thank you for those recent logs; I can see a successful ping, which is great!
Your recent log has confused me a bit, though, as that failure is a failure to stop the application pool on one of your machines.
The machine below (ending in APP2) fails at the IIS stop step, and you then cancel the deployment; it looks like it gets stuck there, not at the acquire-package stage:
| == Failed: Step 2: XXXXXXX.Services-AppPool- Stop ==
05:55:39 Fatal | The step failed: The operation was canceled.
05:55:39 Info | The operation was canceled.
That step failed due to a bootstrap error, which is usually down to antivirus software blocking a PowerShell script that needs to run to complete the deployment:
| MachineName: XXXXXXAPP2
| ProcessorCount: 4
| CurrentDirectory: C:\Windows\system32
| TempDirectory: C:\Users\XXXXXXX\AppData\Local\Temp\
| HostProcessName: Octopus.Server
| PID: 5792
05:54:11 Verbose | Executing XXXXXXX.Services-AppPool- Stop (type Run a Script) on XXXXXXXAPP2
05:54:11 Verbose | Acquiring isolation mutex RunningScript with NoIsolation in ServerTasks-708671
05:54:11 Verbose | Executable directory is C:\Windows\system32\WindowsPowershell\v1.0
05:54:11 Verbose | Executable name or full path: C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe
05:55:39 Verbose | Starting C:\Windows\system32\WindowsPowershell\v1.0\PowerShell.exe in working directory 'D:\Octopus\Work\20230913105411-708671-276' using 'OEM United States' encoding running as 'NT AUTHORITY\SYSTEM'
05:55:39 Verbose | Swallowing OperationCanceledException while waiting for last of the process output.
05:55:39 Verbose | Swallowing OperationCanceledException while waiting for last of the process output.
05:55:39 Verbose | Bootstrapper did not return the bootstrapper service message
Do we have a task log of that same deployment running successfully so we can compare the two? If so, are you able to upload that to our secure site, please?
Alongside this, could we also get a process JSON of your deployment so we can see what scripts are being run? Could you upload that to the secure site too, please?
It may be that AV is blocking some PowerShell scripts from running, and that is why this task is getting stuck; we see this a lot on customer instances. We do have documentation on what you need to whitelist for your deployments to work, if you wanted to check that out and make sure those exclusions are set up in your AV whitelist.
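If the machines happen to use Microsoft Defender, you can inspect and add path exclusions from PowerShell. A sketch, assuming Defender is the AV in play and D:\Octopus is your install root (adjust the paths to match our exclusions documentation):

```shell
# Show current Defender path exclusions
(Get-MpPreference).ExclusionPath

# Add the Octopus working folder and the deployment tooling process
Add-MpPreference -ExclusionPath 'D:\Octopus'
Add-MpPreference -ExclusionProcess 'Calamari.exe'
```

For third-party AV the equivalent settings live in that product's console, but the idea is the same.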
Hopefully we can get to the bottom of this for you soon!
I have attached the successful task log and JSON files on the secure site. We have already excluded the path D:\Octopus and the PowerShell folders from the antivirus agent. If the antivirus blocks the path, then it should be blocked permanently, am I right?
Thanks for the other logs and for confirming that you have whitelisted those folders. It could be that your AV has on-access scanning enabled, which scans those files once per logon session; the deployment may be timing out on some runs while the AV re-scans the files.
I am wondering whether you are able to temporarily disable AV on the machine ending in CAPP2 to see whether that alleviates the issue.
If it does not, can we get a process dump from that same machine? The Calamari process will need to be captured, so please follow the process below the line in the documentation that states:
When capturing a process dump for Tentacle.exe, please also capture any child Calamari.exe processes. To do this, follow the process below.
You can upload that to our secure site once it has completed. Unfortunately, you will need to capture it on every deployment until one gets stuck, as our engineers will need to see the deployment from the start of the process to work out why it is hanging.
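For the capture itself, Sysinternals ProcDump is one common way to take a full dump of both the Tentacle and its child Calamari processes. A sketch, assuming ProcDump has been downloaded to the machine and that you substitute the process IDs you find at capture time:

```shell
# Find the Tentacle and any child Calamari processes
Get-Process Tentacle, Calamari -ErrorAction SilentlyContinue |
    Format-Table Id, ProcessName, StartTime

# Capture a full memory dump of each (replace <pid> with an Id from above)
.\procdump.exe -ma <pid> C:\Temp\
```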
Hopefully the next deployment on that machine fails the first time, so you don't have to keep grabbing process dumps until it does fail.
I am sorry we have not managed to get to the bottom of this issue yet, hopefully the AV disable and / or a process dump will help.
Thank you for providing the dump and some more screenshots. I will collate all the information we have so far, including the dump, and send it to our engineers; Support are unable to analyse process dumps, so I will let you know once I hear back from them.
The engineers work in Australia and have finished for the day, so unfortunately it will be tomorrow morning before I can get any updates to you.
In the meantime, I noted on your latest screenshot of the task in Octopus that you have a warning saying:
‘Waiting for the script in ServerTasks708715 to finish as this script requires that no other Octopus scripts are executing on this target at the same time.’
Usually when we see this, a reboot of the machine running that task helps clear the hanging task, and you are then able to re-deploy. I realise you can re-deploy after manually cancelling the task in Octopus, and this usually works, but when were the four machines in that task last rebooted? It may be worth giving them all a quick reboot, if you can, to see whether that alleviates the issue permanently.
I will get the engineers to take a look at the dump, but I would recommend rebooting all of the machines in that deployment, if you can, to rule out something stuck internally on those machines that a reboot may fix.
I will get back to you once I hear back from our engineers.
The next step in the troubleshooting process is to rule out your antivirus. Ideally, you would temporarily disable it on the machine ending in CAPP2 (to begin with, as that is the one that seems to get stuck every time in your logs). We see customers who appear to have whitelisted all the relevant folders, yet when they disable AV the deployment works, so that is the best way to rule AV in or out as the cause of this issue.
It is likely that the AV is occasionally scanning those PowerShell scripts when they run, which is why this works sometimes and not others.
I am sorry I do not have better news on this for you but let me know if this continues to happen after disabling AV on the affected machines and we can take another look at this for you.
Thanks for your support. Could you please advise which folders need to be whitelisted from AV? Do we need to whitelist this folder as well: C:\Windows\system32\config\systemprofile\Documents\WindowsPowerShell?