Slowness issue in many environments when we deploy in Parallel

Hi,
I have raised this issue before, but we are still seeing slow deployments when they run in parallel.

Error 1: The main issue, which we are facing in all environments.

We trigger deployments at the environment level. Each environment has more than 3 labs (tenants), and at least 30 applications deploy in parallel to one server.

At that point most of the applications get stuck at the step below, which checks that the web application is configured at the root level. We rebooted the server and that resolved the issue temporarily.
We are now facing the same issue again across all environments and all tenants.

Note: We deployed the same way earlier, but at that time we did not face any issue.

Making sure a Web Application “/CIT43/GETULU/SAM/1.0” is configured as a child of “PSH Services” at

As you can see it is taking 3+ hours and usually only takes 30 minutes.

Error 2:

InvalidOperation: The remote server returned an error: (403) Forbidden.
January 12th 2022 09:02:07
Error
At C:\Octopus\Work\20220112033157-2701915-4432\Script.ps1:29 char:18
January 12th 2022 09:02:07
Error

  • … $temp= (Invoke-RestMethod -Uri "$($Octoconnection.OctopusURL)/api …

But inside the script we are only calling the Octopus URL:
$temp= (Invoke-RestMethod -Uri "$($Octoconnection.OctopusURL)/api/Spaces-1/environments?name=$EnvironmentName" -Headers $Octoconnection.OctoAPIKey)

Kind Regards,
Micheal Power

Good morning @mikepower79,

Thanks for getting in touch and sorry to hear you are experiencing slowness on some of your deployments.

Is this ticket the same as the one posted here? If so, would you mind if we closed that one and kept all responses on this ticket? That would make fault finding easier, and it would also make it easier for other customers to find the issue and the fix if they run into it themselves.

From what I can see, the deployment is “getting stuck” whilst waiting for IIS to return a response, and since there are more than 30 deployments to the same server, my guess is that IIS is queuing responses.

Also, are you rebooting the target machine (which then allows you to deploy for a few days) or is it the Octopus Server you are rebooting that temporarily fixes the issue? If it is the target machine then it does seem like the issue is environmental with that particular server.

Are you able to run a performance trace on the target machine during a deployment to confirm where the bottleneck lies? If we have that information we can hopefully help diagnose the issue.

I have created a private upload link so you can send the performance trace to us once you have it. Please let us know when you have sent it, as we don’t get a notification when a customer uploads a file.

If there is anything else you need in the meantime please reach out!

Kind Regards,

Clare Martin

Hi @clare.martin,
Yes it is the same ticket. Go ahead and close the other ticket.
A colleague of mine raised the issue and has access to the server, so I can get them to respond to the ticket.

Kind Regards,
Micheal Power

Hey @mikepower79,

Thank you for confirming we can close that ticket. I have now done that. I will await the performance trace from you and we can go from there.

@manikantanmca - Please note I have closed your ticket, can you please check this one now as all responses will be posted here for this issue you are experiencing.

Kind Regards,

Clare Martin

Thanks for your mail.

We cannot use the dotTrace tool because it is not approved by our company.

We are using the Performance Monitor tool with selected counters. If that is acceptable, please tell us the name of the process we need to trace.

Please advise.

Hi @manikantanmca,

Just stepping in for Clare from the Australia-based team.

Have you noticed whether each of the 30 apps is slow to deploy, or are some slow and others OK? Or does the deployment slow down at a particular section for each app?

Could you please send through Process Dumps of the Tentacle.exe and any Calamari.exe processes on the target instance? That should allow us to see where it’s slow and the command that is being executed.

If you could also send through the Task Logs for a slow deployment with Variable Logging enabled, that should provide all the information we’ll need.

Looking forward to receiving the files, which can be uploaded securely here, and getting these performance issues resolved!

Best Regards,

Hi finnian.dempsey,

Yes, all 30+ apps are slow during deployments. All were stuck at the Deploy Website step.

I have shared the task log here. We will take the process dumps during a deployment and share them with you.

The step logs “Making sure a Web Application is configured as a child of ‘Root Level’”.
Here all applications are checking whether it exists or not on the same server and at the same IIS root level.
Please refer to the first post in this ticket.

Making sure a Web Application “/CIT43/GETULU/SAM/1.0” is configured as a child of “PSH Services” at

21:40:50 Info | Making sure a Web Application “/CIT43/GLOBALG/QTV/1.1” is configured as a child of “PSH Services” at “D:\inetpub\wwwroot\CIT43\GLOBALG\QTV\1.1”…
21:40:55 Verbose | Cannot start this IIS website related task yet. There is already another task running that cannot be run in conjunction with any other task. Please wait…
21:41:00 Verbose | Cannot start this IIS website related task yet. There is already another task running that cannot be run in conjunction with any other task. Please wait…
21:41:05 Verbose | Cannot start this IIS website related task yet. There is already another task running that cannot be run in conjunction with any other task. Please wait…
21:41:10 Verbose | Cannot start this IIS website related task yet. There is already another task running that cannot be run in conjunction with any other task. Please wait…
21:41:15 Verbose | Cannot start this IIS website related task yet. There is already another task running that cannot be run in conjunction with any other task. Please wait…
21:41:20 Verbose | Cannot start this IIS website related task yet. There is already another task running that cannot be run in conjunction with any other task. Please wait…

At this step most of the deployments wait at least 1 hour, and some wait several hours. If you check the task log you can see how long each one waits.

This issue happens when we trigger more than 10 applications together. If we reboot and then trigger 30+ applications, the issue does not happen.
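
The wait can be estimated directly from the task log. The sketch below is a minimal Python example, not part of Octopus. It assumes log lines shaped like the excerpt above (`HH:MM:SS Level | message`) and measures the time between the first and last “Cannot start this IIS website related task yet” message. The function name `iis_lock_wait` is hypothetical, and a backwards jump in the time of day is assumed to mean the log crossed midnight.

```python
import re
from datetime import datetime, timedelta

WAIT_MARKER = "Cannot start this IIS website related task yet"
# Assumed line shape, based on the excerpt in this thread: "HH:MM:SS Level | message"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+\w+\s+\|\s+(.*)$")


def iis_lock_wait(log_lines):
    """Return the time between the first and last lock-wait message."""
    first = last = None
    previous = None
    day_offset = timedelta(0)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        # The log shows the time of day only, so a backwards jump is
        # treated as crossing midnight (an assumption of this sketch).
        if previous is not None and stamp < previous:
            day_offset += timedelta(days=1)
        previous = stamp
        if WAIT_MARKER in match.group(2):
            moment = stamp + day_offset
            if first is None:
                first = moment
            last = moment
    if first is None:
        return timedelta(0)
    return last - first


# Timestamps taken from the excerpt above; the message text is shortened.
sample = [
    f"{stamp} Verbose | {WAIT_MARKER}. Please wait…"
    for stamp in ("21:40:55", "21:41:00", "21:41:05", "21:41:10", "21:41:15", "21:41:20")
]
print("Time spent waiting in the excerpt:", iis_lock_wait(sample))
```

Running the same function over the full task log of each stuck deployment would show which deployments waited longest.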

Hi @manikantanmca and @mikepower79,

Thanks for getting us those logs. As you mentioned, the logs show it is taking over 2 hours each time to make sure the web application is configured at the root level. Unfortunately, the logs point to the target server struggling to manage that many tasks at once. This is supported by your observation that deploying just under 10 applications to it works fine.

When you reboot the target server, I imagine the restart of the IIS service kills the backlog of tasks it has to perform. There is probably one task getting stuck somewhere and creating that backlog.

Unfortunately without the performance trace we cannot help you as we need to see where the server is struggling. Even with the performance trace all we can really do is diagnose that the fault does lie with the target server and not Octopus itself.

There is a great website here which gives you some ideas of how to monitor the performance of IIS to see if the issue lies there.

Have you looked in event viewer for IIS events to see if you can see any processes that are getting stuck or erroring out?

You may be able to get some information from Microsoft’s Log Parser tool, which can query IIS logs.

As you mentioned you are using the Performance Monitor tool, this page will probably help you with server diagnostics, as it shows which counters to select for IIS performance.

I am sorry we could not do more to help at this time. We will await the upload of the process dumps, as they will also give us a good idea of where it might be getting stuck, which package is involved, and so on. Please let us know when you have uploaded them, as we do not get a notification when a customer uploads files.

In the meantime it would be worth having a look at those websites I mentioned to try and set up some IIS performance monitors so you are able to fault find if we can definitely confirm it is the target server.

Kind Regards,

Clare Martin

Thank you for your reply.
We will check once we re-trigger all the deployments together and come back to you.

Could you please share the link to upload the dump file again?
The last shared link has expired.

Also, our zipped dump file is around 1 GB. We took more than four dump files from calamari.exe and powershell.exe.

Afternoon @manikantanmca,

Thank you for getting back to us. Here is the secure link for you to upload those files.

1GB should upload with no issues but please let us know if you are having trouble.

Please reply once those files have uploaded, as we don’t get a notification when they do and rely on customers to tell us the files are there for us to look at.

Kind Regards,

Clare Martin

Good evening, and thanks clare.martin for your quick reply.

We have uploaded the dump file. Kindly check and let us know.

These are the processes running while the deployments are stuck.

The blacked-out area is the service account, which I redacted.

Good afternoon @manikantanmca,

Thank you for sending those process dumps and the screenshot. I will send the process dumps to engineering to see if we can pinpoint a potential environmental issue, and I will get back to you as soon as I have an answer.

Please reach out if you need anything in the meantime,

Kind Regards,

Clare Martin

Hi Clare, Good Afternoon.
Just FYI, we already have the variable you suggested earlier set in all our projects.
Please find the value below.

image

Good morning @manikantanmca,

Thank you for that information. I did see you had set that variable in one of your other forum posts, but thank you for confirming it is set on your instance.

I have got some information back from our engineers which I have posted below:

IIS doesn’t like being modified by two different processes in parallel. All sorts of things can happen, from config file corruption to hangs. We go to some lengths to put locks around our steps that modify IIS so they run in sequence when required. However, if they have other steps, from the library or self-written, that modify IIS, those will cause problems. The “Cannot start this IIS website related task yet” message is printed when we are waiting on the lock mentioned above. It’s possible that if a step terminates, the lock is left hanging (which is why a reboot helps).
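
The locking behaviour the engineers describe can be illustrated with a small sketch. This is not Octopus or Calamari code: `threading.Lock` stands in for whatever machine-wide lock is used in practice, and the names `iis_lock`, `run_iis_step` and `demo` are hypothetical. It shows several steps queuing on one lock, each printing the wait message while another step holds it.

```python
import threading
import time

iis_lock = threading.Lock()  # stand-in for the real cross-process lock (assumption)
WAIT_MESSAGE = "Cannot start this IIS website related task yet. Please wait…"


def run_iis_step(name, work, poll_seconds=0.01, log=print):
    """Run work() while holding the lock, logging each time the wait times out."""
    while not iis_lock.acquire(timeout=poll_seconds):
        log(f"{name}: {WAIT_MESSAGE}")
    try:
        return work()
    finally:
        # If a holder hangs or dies before releasing, every other step keeps
        # waiting, which matches the "lock left hanging" description above.
        iis_lock.release()


def demo(step_count=3, step_seconds=0.05):
    """Start several steps at once; return peak concurrency and wait messages."""
    messages = []
    guard = threading.Lock()
    running = 0
    peak = 0

    def work():
        nonlocal running, peak
        with guard:
            running += 1
            peak = max(peak, running)
        time.sleep(step_seconds)  # placeholder for the real IIS change
        with guard:
            running -= 1

    threads = [
        threading.Thread(target=run_iis_step, args=(f"app-{i}", work, 0.01, messages.append))
        for i in range(step_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak, messages


peak, messages = demo()
print(f"steps inside the lock at once: {peak}, wait messages logged: {len(messages)}")
```

Even though the steps start together, only one is ever inside the locked section, and the others spend their time logging the wait message.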

The engineers have looked at your dumps and unfortunately none of them point in the right direction. They have asked whether you would mind providing us with the dump for this specific PowerShell process (the one running inside Calamari).

That way they can check what command is actually being run in each of the deployments.

image

The engineers do agree it’s environmental. IIS deployments will run sequentially regardless of whether they’re set up as parallel or not, so it’s pointless to run them in parallel. All the IIS operations would just get stuck behind each other. So we’d expect to see one deployment doing something IIS-related and everything else waiting on a mutex.

The Tentacle mutex is more robust than the IIS mutex, so the engineers have said customers should not be running these deployments in parallel, as it would cause more harm than good. We do allow turning this off, but we warn that it is at the customer’s own risk because of this kind of problem.
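
To see why queuing behind one mutex adds up, here is a rough back-of-the-envelope sketch. The 6-minute step duration is purely hypothetical (the thread does not give a per-step time), and `queue_waits` is an illustrative name: it computes how long each deployment waits when the IIS steps run strictly one after another.

```python
def queue_waits(step_minutes):
    """Minutes each deployment waits before its IIS step can start."""
    waits = []
    elapsed = 0
    for minutes in step_minutes:
        waits.append(elapsed)  # this step starts once all earlier ones finish
        elapsed += minutes
    return waits


# Hypothetical figures: 30 deployments, each holding the lock for 6 minutes.
waits = queue_waits([6] * 30)
print(f"first deployment waits {waits[0]} min, last one waits {waits[-1]} min")
```

Under those assumed numbers the last deployment in the queue waits close to three hours, even though every deployment was started at the same moment.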

If you are happy to provide us with those dumps we can look into them for you, but I am afraid the above information does match what you are experiencing (deployments getting stuck because they are waiting on others). This suggests you may need to separate your projects and deploy, say, 10 web applications at once (which you have suggested works) instead of 30, so that you deploy consistently and do not have to kill a process if one of the 30 deployments gets stuck.

I hope this helps. Please let me know how you would like to proceed. If you would like us to look at the dumps for you, I can create another secure link for you to upload them to. If you would rather trial separating those applications into different projects to see if that works better for you, we are happy to advise.

I look forward to hearing from you,

Kind Regards,

Clare Martin

Thank you.

I have also uploaded the PowerShell dump in the same .zip file.

The same dump file is in the last shared location.

Hi @manikantanmca,

As you mentioned, I can see you have uploaded the process dumps, and it seems those PowerShell ones are associated with the Calamari processes. I have gone back to our engineers to clarify exactly what they need to help you here. They may need something else, and once I know I will let you know.

I am sorry this is taking a while to resolve. As suggested though, this issue is happening because IIS does not scale well with many concurrent deployments. If our engineers do find where that particular deployment got stuck, it doesn’t mean it will get stuck with the same application the next time. Due to the nature of the way you have this set up, you are likely to see different applications getting stuck each time, so there would be no way of knowing what got stuck on each deployment, if one does get stuck.

The only way to stop this from happening is to change your deployment process so that fewer applications are deployed at once. You can do this by lowering the Octopus.Action.MaxParallelism variable to 15 or 10 (you have it at 20 at the moment) and seeing whether that helps. I only say this because I suspect this will be the answer even if our engineers find out why that deployment got stuck, since it could be a different one each time.
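
As a rough analogy for what a parallelism cap does, the sketch below runs 30 stand-in deployments through a thread pool whose size plays the role of the cap. How Octopus applies Octopus.Action.MaxParallelism internally is not shown in this thread, so this is only an illustration. The application names and the function `run_with_cap` are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_cap(app_names, max_parallelism, step_seconds=0.01):
    """Run a stand-in deployment per app, never more than max_parallelism at once.

    Returns the highest number of deployments seen running together and the
    list of finished app names.
    """
    guard = threading.Lock()
    running = 0
    peak = 0

    def deploy(name):
        nonlocal running, peak
        with guard:
            running += 1
            peak = max(peak, running)
        time.sleep(step_seconds)  # placeholder for the real deployment work
        with guard:
            running -= 1
        return name

    # The pool size plays the role of the parallelism cap.
    with ThreadPoolExecutor(max_workers=max_parallelism) as pool:
        finished = list(pool.map(deploy, app_names))
    return peak, finished


apps = [f"app-{n:02d}" for n in range(1, 31)]  # hypothetical application names
peak, finished = run_with_cap(apps, max_parallelism=10)
print(f"{len(finished)} deployments finished, at most {peak} ran together")
```

Lowering the cap does not reduce the total amount of work. It only limits how many deployments compete for the target server at the same moment.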

I will let you know what our engineers say, but I am mindful that their answer is likely to result in us advising you to change your deployment configuration anyway.

I am curious to know your thoughts on this. Considering what the evidence suggests, are you content to change your deployment process so you can be sure it will work every time, rather than creating another process dump only for us to suggest a deployment change once the results are in?

Kind Regards,

Clare Martin

Hi @manikantanmca,

I have a response from our engineers, and unfortunately they have confirmed that the dumps you sent were not the ones they need.

Please see their response as it explains it better than I can:

The process dump files the customer provided were:

  • Tentacle → This is Tentacle.exe. Unfortunately this isn’t helpful for troubleshooting, as it only says “I’m running PowerShell”.
  • Calamari → This is Calamari.exe. It also doesn’t provide any useful info, for the same reason.
  • conhost → This is conhost.exe. This doesn’t help us identify this particular issue.
  • PowerShell → This is the powershell.exe residing directly under Tentacle.exe. The problem with this one is that it says “I’m running Calamari”. We want the dump of the powershell.exe residing directly under Calamari.exe, as this is the one that will actually say what command is being run.
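
The selection rule in the list above can be sketched as a filter over a process snapshot. Everything here is illustrative: the `Proc` record, the PIDs and the function name `powershell_under_calamari` are hypothetical, and the snapshot would need to come from a process listing that includes parent process IDs. The point is only that the wanted process is the powershell.exe whose direct parent is Calamari.exe, not the one whose parent is Tentacle.exe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    pid: int
    parent_pid: int
    name: str


def powershell_under_calamari(processes):
    """Return powershell.exe processes whose direct parent is Calamari.exe."""
    by_pid = {proc.pid: proc for proc in processes}
    wanted = []
    for proc in processes:
        parent = by_pid.get(proc.parent_pid)
        if (
            proc.name.lower() == "powershell.exe"
            and parent is not None
            and parent.name.lower() == "calamari.exe"
        ):
            wanted.append(proc)
    return wanted


# Hypothetical snapshot mirroring the chain described above:
# Tentacle.exe -> powershell.exe -> Calamari.exe -> powershell.exe
snapshot = [
    Proc(100, 4, "Tentacle.exe"),
    Proc(200, 100, "powershell.exe"),  # says "I'm running Calamari", not useful
    Proc(210, 200, "conhost.exe"),
    Proc(300, 200, "Calamari.exe"),
    Proc(400, 300, "powershell.exe"),  # the one whose dump is needed
]
for proc in powershell_under_calamari(snapshot):
    print(f"dump PID {proc.pid} ({proc.name})")
```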

So, what the engineers need is the dump for the PowerShell process shown in the screenshot below (the PowerShell under Calamari, not the one under Tentacle.exe):

I hope this clarifies what the engineers are after in order to troubleshoot your issue. I would lower your Octopus.Action.MaxParallelism variable to 15 or 10 first and test whether that improves things before getting dumps for the engineers, as setting that variable lower might actually fix the issue. Luckily, all they would need is the PowerShell dumps, not any of the others (Tentacle, Calamari, etc.).

If changing the variable doesn’t fix the issue please let me know when those PowerShell dumps are created and I can re-issue the secure link for you and get those new dumps sent off.

Sorry for the confusion about the dump files. I admit I don’t have much experience analysing dump files, as this is usually done by our engineers, so I was not sure whether the files you provided were the right ones.

I look forward to hearing whether the variable change works.

Kind Regards,

Clare Martin

Hi clare.martin,

We cannot make this change because other environments may be impacted (we have 20+ environments and 30+ tenants).

So, as you suggested, we have taken a dump of the PowerShell process.

Please share the link to upload the dump files.

Good morning @manikantanmca,

Thank you for getting back to us. Here is the secure link for you to upload those dump files. Please let me know once they are uploaded and I will send them over to the engineers for you.

Kind Regards,

Clare Martin