Azure step failure - Azure credentials have not been set up or have expired

It seems we have settled all the other issues with blocked deployment tasks (“waiting on script tasks”), we have v2018.12.1 deployed, etc.
However, the “Azure credentials expiration” problem is unfortunately still here, and I can’t say it happens any less often.

Deploying early in the morning, it usually doesn’t happen. However, deploying during working hours, when more deployments are running in the background (not just our team’s deployments), it happens on a fairly regular basis - almost always at least once during our deployment chain of ~8 deployment projects.

Do you have any more suggestions? Were there any other possible fixes after v2018.2.1 that could help?

@MarkSiedle, @John_Simons, would you have any more input on this issue?

Hi Aleksandras,

Thanks for the additional information and sorry this is still causing you trouble.

Are you using a worker pool for this work, or is everything running from your Octopus Server? If you’re noticing this occur more often when things get busy, can you confirm whether moving this work to a separate worker / worker pool makes any difference?

The server task log that was uploaded showed this occurring from the “Swap and delete deployment slot” step (it looks like the call to Invoke-AzureRmResourceAction triggered the credentials error). When it does fail, is it consistently failing from this Azure PowerShell step? Or have you seen this occur from other Azure PowerShell steps as well?
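
For reference, that kind of slot-swap script usually boils down to something like the sketch below (this is only a rough illustration; the resource group, site, and slot names are placeholders, not values taken from your task log):

```powershell
# Rough sketch of a slot swap via the AzureRM modules; all names are placeholders.
$resourceGroup = "my-resource-group"
$siteName      = "my-web-app"
$slotName      = "staging"

# This is the kind of call that reported the credentials error in the uploaded log.
Invoke-AzureRmResourceAction `
    -ResourceGroupName $resourceGroup `
    -ResourceType "Microsoft.Web/sites/slots" `
    -ResourceName "$siteName/$slotName" `
    -Action "slotsswap" `
    -Parameters @{ targetSlot = "production" } `
    -ApiVersion "2015-08-01" `
    -Force

# After the swap, the slot can be deleted.
Remove-AzureRmWebAppSlot -ResourceGroupName $resourceGroup -Name $siteName -Slot $slotName -Force
```

Knowing whether the failure is always on the first Azure call in the script, or partway through, would also help narrow things down.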

Also, where did you stand on spinning up a separate instance of Octopus to test this in isolation?

While setting this up initially may seem tedious, once you’ve reproduced it on a separate/isolated instance of Octopus, it opens up some possibilities and allows you to explore options without risk to your production instance. For example, you could then bundle some of the newer (or older - although not recommended) PowerShell modules to confirm whether these would work for you. Or, since the error message is asking you to run Connect-AzureRmAccount, you could experiment with calling Connect-AzureRmAccount directly in your scripts and compare this to how the Calamari AzureContext script connects (as perhaps something about the cmdlets you are calling is incompatible with the latest Azure cmdlets Octopus is using).
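
To be clear about what I mean by calling it directly, something along these lines at the top of the failing script would do (the “MyAzure.*” variable names below are placeholders for project variables you would define yourself for the experiment; they are not values Octopus injects):

```powershell
# Minimal sketch: authenticate explicitly with a service principal before running
# the rest of the script. The "MyAzure.*" project variables are placeholders you
# would need to create yourself for this experiment.
$clientId       = $OctopusParameters["MyAzure.ClientId"]
$clientSecret   = $OctopusParameters["MyAzure.ClientSecret"] | ConvertTo-SecureString -AsPlainText -Force
$tenantId       = $OctopusParameters["MyAzure.TenantId"]
$subscriptionId = $OctopusParameters["MyAzure.SubscriptionId"]

$credential = New-Object System.Management.Automation.PSCredential ($clientId, $clientSecret)

Connect-AzureRmAccount -ServicePrincipal -Credential $credential -TenantId $tenantId | Out-Null
Set-AzureRmContext -SubscriptionId $subscriptionId | Out-Null

# ...then run the same cmdlets that are currently failing, e.g. Invoke-AzureRmResourceAction.
```

If that version survives a busy period while the standard step fails, it would tell us a lot about where the context is being lost.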

If it works in isolation, then we could start comparing what’s different. If it fails in isolation, then it narrows it down to the PowerShell you’re running in this project.

We have tried with multiple deployments running concurrently and have not been able to reproduce. The fact that it works sometimes and not others must be very frustrating, so in these cases, it helps to try and reproduce on an isolated instance to confirm that the same scripts (connecting to the same Azure subscription) function as expected, then go from there.

Cheers
Mark

Thank you, @MarkSiedle, for your extensive input and willingness to help.

We are not using a worker pool yet; everything still runs directly on the Octopus Server target. The “migration” to introduce worker pools in our Octopus instance seems to have been postponed for some reason (I don’t know exactly why, but probably because it needs more preparation, since it would affect many teams in our company). I don’t own that process in our company, so I can’t say anything very specific.

Regarding the exact step - it doesn’t occur only in the “Swap and delete deployment slot” step. But as I mentioned previously, we have a strong feeling that it occurs mainly in Azure PowerShell steps, not in other Azure-related steps (like “Deploy an Azure Web App” or “Deploy an Azure Resource Manager template”).

Trying to reproduce it in a test instance of Octopus would indeed be very tedious in our case… We don’t own the Octopus instance or anything inside it - e.g. the configured enterprise-level Azure subscriptions and other “high-level” secrets. Also, since the issue is probably related to the general load on the Octopus Server, it might be hard to reproduce a similar load on a test instance. I understand you don’t have any other better ideas or ways to reproduce it yourselves, but a test Octopus instance also looks like a last resort for us…

Speaking about your experiments trying to reproduce it - it seems I asked this previously but didn’t get an answer: how many concurrent projects (5? 10? 15?) did you execute at the same time, and how long did they run (in case there is some timing issue)? In our case it might be up to 10-15 concurrent projects, I think, and execution might take 10-20 minutes.
Are there any Octopus Server-level logs we could provide that would help you investigate something across the different concurrent deployment projects?

Hi Aleksandras, sorry for the slow reply,

It does make sense that you would experience this issue with the steps that make use of the Azure PowerShell modules, since steps like “Deploy an Azure Web App” would only exhibit this behavior if you have pre/post-deployment scripts.

The Azure PowerShell issue that both John and Mark linked has been closed in favor of this issue. The comments seem to suggest that this should have been fixed in version 1.0.0 of the Az PowerShell modules. Although we don’t explicitly support these modules, you should be able to try them out in compatibility mode by installing them on a worker and adding the OctopusUseBundledAzureModules variable with a value of False to your project. I highly recommend making use of workers to test this, as the Az modules cannot be used on the same machine as the AzureRm modules.
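
If you can get a worker to test with, the rough shape of that setup would be something like the sketch below (the installation command is just the standard PowerShell Gallery one, and the alias step assumes a recent Az release):

```powershell
# On the worker itself (not on a machine that already has the AzureRM modules installed):
Install-Module -Name Az -AllowClobber -Scope AllUsers

# Then, inside the Azure PowerShell step - with the project variable
# OctopusUseBundledAzureModules set to False so the bundled AzureRM modules are
# not loaded - enable the compatibility aliases so existing AzureRm-style cmdlet
# names resolve to their Az equivalents:
Enable-AzureRmAlias -Scope Process
```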

In terms of experiments, we used 10 projects with a number of Azure PowerShell scripts, executing concurrently. We also tried setting up the initial context and executing commands after a 30-minute delay in order to eliminate timeout issues. Unfortunately, we have not been able to reproduce this exact behavior, which suggests there is something environmental at play. We have done everything we can on our end to try and reproduce this, and at this point we have to rely on you to provide a simple reproduction.
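
For completeness, the delayed test was roughly of this shape (a simplified sketch; the final call is just an arbitrary cmdlet that requires valid credentials):

```powershell
# Simplified sketch of the timing test: confirm the context the step established,
# idle for 30 minutes to simulate a long-running deployment, then make another
# call to see whether the credentials have expired in the meantime.
Get-AzureRmContext | Format-List

Start-Sleep -Seconds 1800

Get-AzureRmResourceGroup | Select-Object -First 1
```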

Regards,
Shaun

Ok, thank you @Shaun_Marx.
It seems I need to hand this over to our Octopus management team… I don’t feel there is anything more I can try on my side either…

Thanks @Aleksandras,

I just wanted to take the opportunity to say that I am really sorry we couldn’t provide a better answer for you at this point in time. It does, however, seem that we don’t have any options available unless we can make use of workers, since that would go a long way towards isolating and testing things. We would be more than happy to assist in any way we can to make that happen, as we believe it is the best option for getting to the bottom of this issue for you.

Regards,
Shaun