Azure step failure - Azure credentials have not been set up or have expired


(Mark Siedle) #6

Thanks for the additional information and for sending your process export through, Aleksandras.

Sorry, I should have mentioned: that SmartFile/upload location is only visible to Octopus staff and will be deleted when we’ve finished analysing your file. Your process won’t be imported anywhere and will only be read manually, so we can see what scripting is being done.

We set up a mini version of your deployment process using our subscription, but unfortunately we haven’t been able to reproduce this yet. We do think there are some options worth exploring, though…

One thing that stood out slightly was the use of Save-AzureRmContext in your “Cleanup old ARM deployments” step. We initially thought this might be causing some things to stomp on each other, but if it’s just saving the auth information to a file so it can be used in a separate process (via Start-Job and Import-AzureRmContext), it shouldn’t be a problem. To rule it out though, you could temporarily disable this cleanup step and see if that has any effect.
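For reference, here is a minimal sketch of that Save/Import pattern, assuming the context is written to a file in the working directory (the path and the job body are illustrative, not taken from your actual process):

# Persist the current Azure RM context so a separate process can reuse it.
$contextPath = Join-Path $PWD "azure-context.json"
Save-AzureRmContext -Path $contextPath -Force

# The background job re-imports the saved context before doing any Azure RM work.
$job = Start-Job -ScriptBlock {
    param($path)
    Import-AzureRmContext -Path $path
    # ... Azure RM cmdlets run here using the imported credentials ...
} -ArgumentList $contextPath

Wait-Job $job | Receive-Job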

Between the versions of Octopus you mentioned, we upgraded the Azure PowerShell cmdlets from 5.8 to 6.11, so we’re wondering if something there may have changed behaviour. To test that theory, you could install your own version of the Azure PowerShell modules instead of the ones we bundle: try grabbing the older 5.8, or even the latest version (6.13, I believe), and configure Octopus to use your custom version instead (see instructions here).
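If you try that, a rough sketch of grabbing a specific module version with Save-Module follows (the version number and target folder are purely illustrative; the linked instructions cover how to point Octopus at the custom location):

# Illustrative only: download a specific AzureRM release into a folder that
# Octopus can then be configured to use instead of the bundled modules.
Save-Module -Name AzureRM -RequiredVersion 6.13.1 -Path "C:\CustomAzureModules"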

Another possibility (although we think it unlikely) is the introduction of the Azure CLI since your upgrade. Since you don’t appear to be using the CLI anywhere in your scripts, you could try disabling the Azure CLI feature altogether by creating an OctopusDisableAzureCLI variable and setting its value to True (full details here), then monitor whether that change has any effect on your deployments.

Sorry we don’t have any exact answers at this stage, but hopefully we can get a reproduction to narrow things down.

Cheers
Mark


(Aleksandras) #7

Hi Mark,

Thanks for trying to reproduce it. Let me try to clarify some of the points you raised.

Did you try to reproduce a not-so-mini version? :slight_smile: And maybe with several similar projects running at the same time? I suspect it might be related to timing - our failing step runs about 10-15 minutes after the deployment starts (and the deployment starts with an Azure step).

Regarding Save-AzureRmContext - you are absolutely correct, it saves the context for parallelized jobs running with Start-Job. I will try disabling it, but it was working without problems until recently, so as you say, it’s probably not the cause.

I’m afraid I won’t be able to try installing older Azure PowerShell cmdlets. As I understand it, they would have to be installed on the Octopus Server itself, but our Octopus instance is used across quite a big company by many teams, and I’m just a regular user there. It would be hard and time-consuming to convince the responsible team to try this while the error is so unclear and transient.

I will try disabling the Azure CLI with the variable - that part is not a blocker. :slight_smile:


(John Simons) #8

Hi Aleksandras,

I am sorry that you are running into this issue.
I have done a bit of research and it seems this is a known issue, see https://github.com/Azure/azure-powershell/issues/7110.
The solution Microsoft suggests is described in https://github.com/Azure/azure-powershell/issues/7110#issuecomment-426750910.

So I have raised an issue for us to call Disable-AzureRMContextAutosave -Scope Process before authenticating to Azure. For now, could you try running Disable-AzureRMContextAutosave -Scope CurrentUser (you need to run this as the user that executes the PowerShell scripts - see the sketch below) and let us know if that does indeed fix the randomness you are seeing?
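A minimal sketch of that one-off command, assuming you can open a PowerShell session as the account in question (the whoami check is just a sanity step):

# Run once in a PowerShell session started as the account that executes the
# PowerShell scripts on the Octopus Server.
whoami   # confirm the session is running as the expected account
Disable-AzureRmContextAutosave -Scope CurrentUser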

Regards
John


(Aleksandras) #9

Thanks for the follow-up! Great that you tracked it down!
Any idea why it only started appearing recently? It seems to be the same old Azure PowerShell cmdlets…

Can you clarify a bit - where should this command be executed? In a particular Azure deployment step? In all Azure-related steps? In the first Azure step of the deployment process?
Or just once, on the Octopus Server? Just a note - I don’t have direct access to our Octopus Server, but if the benefits are clear and it’s clear how to do it, I could raise a ticket with our support team.


(John Simons) #10

Hi Aleksandras,

Because you are running an Azure PowerShell step, I would call Disable-AzureRMContextAutosave -Scope Process at the beginning of the script, as in the sketch below.
Does this make sense?
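A minimal example of what the top of such a step’s script body could look like (the resource group lookup is only a placeholder, not taken from your process):

# Disable per-process context autosave before any Azure RM work in this step.
Disable-AzureRmContextAutosave -Scope Process

# ... the existing step script continues as before, e.g. ...
Get-AzureRmResourceGroup -Name "my-resource-group"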

Cheers
John


(Aleksandras) #11

We have not only Azure PowerShell steps, but also built-in Azure deployment steps (like “Deploy an Azure Resource Manager template”). In that case we would probably need to enable the Custom PowerShell feature?

We have a bunch of projects with a bunch of different Azure-related steps, so it would take a while to add your suggested script everywhere. I just want to be sure we are doing it right. :slight_smile:


(John Simons) #12

Hi Aleksandras,

In that case, maybe run a one-off Disable-AzureRMContextAutosave -Scope CurrentUser. This should do it permanently for the user account that you are running Octopus as.

Hope this helps

Cheers
John


(Aleksandras) #13

Hi again :slight_smile:

I tried a smarter workaround instead (I wasn’t able to run Disable-AzureRMContextAutosave for CurrentUser because I don’t have access to the Octopus Server, and I didn’t want to do it per Process because it would be needed in too many steps). So I ended up creating a new script module in the Octopus Script Modules library with this simple content:

Write-Host "==> Executing: Setting Disable-AzureRMContextAutosave"
Disable-AzureRMContextAutosave -Scope Process

Then I included this script module in all relevant processes. I can see it being executed in every step, however another error is thrown right after it. It seems like it breaks something else. Please see the log attached below:

15:23:47   Verbose  |       Importing Script Module 'Octopus AzureRMContextAutosave' from 'D:\Octopus\Work\20190121142344-290555-7348\Library_OctopusAzureRMContextAutosave_636836810261879409.psm1'
15:23:49   Info     |       ==> Executing: Setting Disable-AzureRMContextAutosave
15:23:49   Verbose  |       PowerShell Environment Information:
15:23:49   Verbose  |       OperatingSystem: Microsoft Windows NT 6.3.9600.0
15:23:49   Verbose  |       OsBitVersion: x64
15:23:49   Verbose  |       Is64BitProcess: True
15:23:49   Verbose  |       CurrentUser: NT AUTHORITY\SYSTEM
15:23:49   Verbose  |       MachineName: SPCS-SPV-APP-46
15:23:49   Verbose  |       ProcessorCount: 4
15:23:49   Verbose  |       CurrentDirectory: D:\Octopus\Work\20190121142344-290555-7348
15:23:49   Verbose  |       CurrentLocation: D:\Octopus\Work\20190121142344-290555-7348
15:23:49   Verbose  |       TempDirectory: C:\Windows\TEMP\
15:23:49   Verbose  |       HostProcessName: powershell
15:23:49   Verbose  |       TotalPhysicalMemory: 16776756 KB
15:23:49   Verbose  |       AvailablePhysicalMemory: 11168908 KB
15:23:49   Verbose  |       Authenticating with Service Principal
15:23:50   Verbose  |       Account          : xxx
15:23:50   Verbose  |       SubscriptionName : xxx
15:23:50   Verbose  |       SubscriptionId   : xxx
15:23:50   Verbose  |       TenantId         : xxx
15:23:50   Verbose  |       Environment      : AzureCloud
15:23:51   Verbose  |       Invoking target script "D:\Octopus\Work\20190121142344-290555-7348\Script.ps1" with  parameters
15:23:51   Info     |       voncore-int westeurope
15:23:51   Info     |       Checking if resource group voncore-int exists
15:23:51   Error    |       CheckAndCreateResourceGroup : The 'Get-AzureRmResourceGroup' command was found
15:23:51   Error    |       in the module 'AzureRM.Resources', but the module could not be loaded. For
15:23:51   Error    |       more information, run 'Import-Module AzureRM.Resources'.
15:23:51   Error    |       At D:\Octopus\Work\20190121142344-290555-7348\Script.ps1:4 char:2
15:23:51   Error    |       +     CheckAndCreateResourceGroup -Name $OctopusParameters["RM.VonCore. ...
15:23:51   Error    |       +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15:23:51   Error    |       + CategoryInfo          : ObjectNotFound: (Get-AzureRmResourceGroup:String
15:23:51   Error    |       ) [CheckAndCreateResourceGroup], CommandNotFoundException
15:23:51   Error    |       + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule,CheckAndCreateRes
15:23:51   Error    |       ourceGroup

(Mark Siedle) #14

Hi Aleksandras,

Regarding the error you’re seeing re. “module could not be loaded”, this was a known issue that has recently been fixed in version 2018.12.0. If you are able to upgrade your Octopus to the latest version, we believe this will solve your problem.

Let me know how you go.

Thanks
Mark


(Aleksandras) #15

Hi MarkSiedle, thank you for the response!

We can update Octopus to the latest version, but we cannot do it every day or at a similar pace (usually our company does it once every 1-2 months), so I would like to clarify some things.

  1. https://github.com/OctopusDeploy/Issues/issues/5221 says it affects versions from v2018.11.0, however we currently have v2018.9.15 in our environment. Can it still be relevant?
  2. “module could not be loaded” started to occur only after I added the workaround suggested by @John_Simons. Once I removed it, everything started working again.
  3. Can you please confirm that our initial issue was also fixed in the v2018.12.0 release? There seem to be related notes in the release log, I would just like to confirm. If that is the case, it will be much easier to convince our administration team to upgrade Octopus. :slight_smile:

(Mark Siedle) #16

Hi Aleksandras,

Sorry, I linked the wrong issue in my last response. This is the issue I should have linked. That will now call Disable-AzureRMContextAutosave -Scope Process for every Azure step, which will save you from having to inject it everywhere manually yourself.

In terms of how we arrived here, an upgrade from 2018.6.2 to 2018.9.17 included an upgrade of the PowerShell modules from 5.7.0 to 6.8.1, and the earlier version of those modules may not have suffered from this Azure PowerShell issue, which may explain why these problems only started occurring after your upgrade.

Unfortunately we have not been able to reproduce this issue, so we cannot confirm that this will fix the problems you are seeing. But we believe your issue was caused by this known Azure PowerShell issue, so all of these recommendations are attempts to resolve or work around that.

In terms of your specific points:

  1. This was my bad, see the link I mentioned above.
  2. If you’re seeing “module could not be loaded”, that may be a sign that the fix we’re suggesting may not work with the older version of the Azure cmdlets included in 2018.9.17. We saw this same error recently and had to roll the Azure PowerShell modules back to a specific version (which is included in 2018.12.0), so if you’re seeing this in your older version, then it may have been suffering from a similar problem (in which case, we’re hoping the Azure cmdlets included in 2018.12.0 will solve this)
  3. See above. We have not been able to reproduce, so we cannot confirm at this stage.

If you wanted to confirm this as a fix before upgrading your main/large Octopus instance, you could consider spinning up a separate test instance of Octopus. As long as you matched the version you currently have in production (2018.9.17) and reproduced the issues you are seeing against your Azure subscription, you could then test an upgrade to the latest version in isolation and confirm the fixes that went into 2018.12.0.

Again, sorry for the confusion. Hope this helps to clarify things.

Mark


(Aleksandras) #17

Hi,

sad news… Our main Octopus Deploy instance was upgraded to v2018.12.1, and I have just got the same issue:

Set-AzureRmResource : Your Azure credentials have not been set up or have expired, please run Connect-AzureRmAccount to set up your Azure credentials.

Update: it seems it was a promotion of a deployment that was created prior to the upgrade - so maybe something was “snapshotted”? I will try a fully clean, new deployment and then monitor a bit more…


(Aleksandras) #18

I was going to say that it finally works… It was working quite well for a few days without an issue. There were nightly deployments every night, but other than that we haven’t been deploying very actively recently, so I wanted to give it more time to settle down.

Unfortunately, today I saw the error pop up again… on a completely new deployment.
Maybe it happens less often now, or maybe that’s just a feeling, but it seems it’s not 100% fixed after all.

I uploaded a new full log to https://file.ac/l5rV7JDa4pQ/ - maybe you can spot something interesting. :wink:
However, it might be because of other deployments running in parallel…


(Aleksandras) #20

One more finding - we are experiencing some new locks on our deployments (where a task is blocked by some other task), like this:

Waiting on scripts in tasks ServerTasks-295474, ServerTasks-295476, ServerTasks-295486, ServerTasks-295488, ServerTasks-295489, ServerTasks-295495 and ServerTasks-295499 to finish. This script requires that no other Octopus scripts are executing on this target at the same time.

We had such issues previously, but not as critical (mainly fixed by setting OctopusBypassDeploymentMutex=true). However, after we upgraded to v2018.12.1 it seems to have become much worse. I don’t own Octopus maintenance in our company, so I don’t know for sure whether something additional was done during the upgrade to v2018.12.1, or whether many new projects were added - I don’t know, just wild guessing…

Anyway, these locks are probably a different issue, but they might be related to my previous comment. Since steps can be stuck for 5-20 minutes, could that be the reason the Azure credentials are expiring?


(John Simons) #21

Hi Aleksandras,

The Azure credentials expiring should not be due to the wait since we only open a new connection after acquiring the lock.

Regarding the “waiting on scripts in tasks…” message, could the problem be a lack of capacity, in which case you may need to add more workers?

Regards
John


(Aleksandras) #22

OK, if you say so, then the upgrade to v2018.12.1 doesn’t solve the “Azure credentials expiration” issue… It does seem to occur less often now, but that might be a placebo effect (i.e. based on assumptions).

Regarding “waiting on script tasks” - as I mentioned, it’s probably a different issue, and our Octopus maintenance team is investigating it. I pointed them to the article you mentioned, thank you!


(Aleksandras) #23

It seems we have settled all the other issues with blocked deployment tasks (“waiting on scripts in tasks”), we have v2018.12.1 deployed, etc.
However, the “Azure credentials expiration” issue is unfortunately still here, and I couldn’t say it happens any less often after all.

Deploying early in the morning, it usually doesn’t happen. However, deploying during working hours, when there are more deployments in the background (not just our team’s deployments), it happens quite regularly - at least once during our deployment chain of ~8 deployment projects.

Do you have any more suggestions? Were there any other possible fixes after v2018.12.1 which could help somehow?


(Aleksandras) #24

@MarkSiedle, @John_Simons, would you have any more input on this issue?


(Mark Siedle) #25

Hi Aleksandras,

Thanks for the additional information and sorry this is still causing you trouble.

Are you using a worker pool for this work, or is everything running from your Octopus Server? If you’re noticing this occur more often when things get busy, can you confirm whether moving this work to a separate worker / worker pool makes any difference?

The server task log that was uploaded showed this occurring from the “Swap and delete deployment slot” step (it looks like the call to Invoke-AzureRmResourceAction triggered the credentials error). When it does fail, is it consistently failing from this Azure PowerShell step? Or have you seen this occur from other Azure PowerShell steps as well?

Also, where did you stand on spinning up a separate instance of Octopus to test this in isolation?

While setting this up initially may seem tedious, once you’ve reproduced it on a separate/isolated instance of Octopus, it opens up some possibilities and allows you to explore options without risk to your production instance. For example, you could then bundle some of the newer (or older - although not recommended) PowerShell modules to confirm whether these would work for you. Or, since the error message asks you to run Connect-AzureRmAccount, you could experiment with calling Connect-AzureRmAccount directly in your scripts and compare this to how the Calamari AzureContext script connects (as perhaps something about the cmdlets you are calling is incompatible with the latest Azure cmdlets Octopus is using).
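Here is a minimal sketch of what that direct connection could look like (the variables holding the service principal details are placeholders, not actual Octopus variables from your project):

# Sketch: authenticate explicitly with a service principal inside the step script,
# to compare against the context Octopus/Calamari normally sets up.
Disable-AzureRmContextAutosave -Scope Process

$securePassword = ConvertTo-SecureString $servicePrincipalSecret -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential ($servicePrincipalAppId, $securePassword)

Connect-AzureRmAccount -ServicePrincipal -Credential $credential -TenantId $tenantId
Set-AzureRmContext -SubscriptionId $subscriptionId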

If it works in isolation, then we could start comparing what’s different. If it fails in isolation, then it narrows it down to the PowerShell you’re running in this project.

We have tried with multiple deployments running concurrently and have not been able to reproduce. The fact that it works sometimes and not others must be very frustrating, so in these cases, it helps to try and reproduce on an isolated instance to confirm that the same scripts (connecting to the same Azure subscription) function as expected, then go from there.

Cheers
Mark


(Aleksandras) #26

Thank you, @MarkSiedle, for your extensive input and willingness to help.

We are not using a worker pool yet; everything still runs directly on the Octopus Server target. The “migration” to introduce worker pools in our Octopus instance seems to have been postponed for some reason (I don’t know exactly why, but probably because it needs more preparation, since it would affect many teams in our company). I don’t own that process in our company, so I cannot say anything more specific.

Regarding the exact step - it doesn’t occur only in the “Swap and delete deployment” step. But as I mentioned previously, we have a strong feeling that it occurs mainly in Azure PowerShell steps, not in other Azure-related steps (like “Deploy an Azure Web App” or “Deploy an Azure Resource Manager template”).

Trying to reproduce it in a test instance of Octopus would indeed be very tedious in our case… We don’t own the Octopus instance or everything that’s inside it - e.g. the configured enterprise-level Azure subscriptions and other “high level” secrets. Also, since the issue is probably related to the general load on the Octopus Server, it might be hard to reproduce a similar load on a test instance. I understand you don’t have any better ideas or ways to reproduce it yourselves, but a test Octopus instance also looks like a last resort for us…

Speaking of your experiments trying to reproduce it - it seems I asked this previously but didn’t get an answer: how many concurrent projects (5? 10? 15?) did you try to execute at the same time, and how long were they running (in case there is some timing issue)? In our case it might be up to 10-15 concurrent projects, I think, and execution can take 10-20 minutes.
Are there any Octopus Server-level logs we could provide which could help you investigate across the different concurrent deployment projects?