Azure step failure - Azure credentials have not been set up or have expired

Hi,

After upgrading Octopus Server from the quite old v2018.6.2 to v2018.9.17, we started to encounter a recurring error when executing Azure steps:

Invoke-AzureRmResourceAction : Your Azure credentials have not been set up or have expired, please run Connect-AzureRmAccount to set up your Azure credentials.

A few facts:

  • Before upgrading from Octopus v2018.6.2 we didn't see this issue at all
  • It's a transient issue, and a retry always succeeds
  • There are several Azure steps in the same deployment that pass - which step fails is pretty much random
  • Out of 8-10 daily deployments, at least 1-2 fail with this issue

It might be related to the Azure SDK/CLI updates bundled with the newer Octopus version.
Or could it be something at our infrastructure level? Any ideas or possible workarounds?

Full log from failed step:
octopus-azure-failure-log.txt (6.3 KB)

Hi Aleksandras,

Thanks for getting in touch and for including the task log.

To help us reproduce this, would it be possible for you to provide a deployment process JSON export for one of the projects that typically fails? We've set up a secure place where you can upload this here. For example, on your project's deployment process screen, there's an overflow menu with "Download as JSON".

If you’ve just noticed this occurring after upgrading (and nothing else about your project has changed), then this sounds like a bug with the new Azure CLI updates we have bundled with Octopus. We’ll attempt to reproduce this today with various app services and scripts, but an export from your project will hopefully narrow it down.

Cheers
Mark

Also, when these deployments fail, could you confirm whether there are multiple Azure-related deployments (or Azure steps) running concurrently in your Tasks list?

Cheers
Mark

Thank you for your quick response!

I can't blame the new Octopus version for sure, since we've seen many different transient errors related to Azure/Octopus/networking over time. However, this time it roughly coincides with when we upgraded our Octopus Server - it's reproducible quite often and hasn't disappeared for ~2 weeks already.

I uploaded the project JSON to the location you provided. At first sight there doesn't seem to be any sensitive information included, but to be safe, please apply some retention policy to this file once you've finished investigating it.

Regarding additional facts about the failures - I don't have many exact examples of when it happened before, but I'm investigating it closely and collecting more data from now on.
It might be true that there are some Tasks or Projects running concurrently. The JSON file I attached was from a project which failed tonight (just a few hours ago); the failed step itself wasn't executed in parallel, but it seems there were other Azure deployment projects running at the same time.

I have a feeling it might occur mainly in Azure PowerShell Script steps. However, that might just be because such steps usually come last in a process (when some connection might have expired), while other Azure steps like resource group deployment or web app deployment come at the beginning.

There is nothing out of the ordinary in our Azure PowerShell Script steps - just plain logic, without any manual manipulation of the Azure context, subscription switching, or anything like that.

Thanks for the additional information and for sending your process export through, Aleksandras.

Sorry, I should have mentioned that the SmartFile/upload location is only visible to Octopus staff and the file will be deleted when we've finished analysing it. Your process won't be imported anywhere and will only be read manually, so we can see what scripting is being done.

We set up a mini version of your deployment process using our subscription, but unfortunately we haven't been able to reproduce this yet. We do think there are some options worth exploring, though…

One thing that stood out slightly was the use of Save-AzureRmContext in your "Cleanup old ARM deployments" step. We initially thought this might be causing things to stomp on each other, but if it's just saving the auth information to a file so it can be used in a separate process (via Start-Job and Import-AzureRmContext), it shouldn't be a problem. To rule it out though, you could temporarily disable this cleanup step and see if it has any effect.
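
For reference, the kind of pattern we'd consider safe looks roughly like this - the file path and the job body below are placeholders, not your actual step:

# Save the current AzureRm auth context to a file so a separate process can reuse it.
$contextPath = Join-Path $env:TEMP "azure-context.json"
Save-AzureRmContext -Path $contextPath -Force

$job = Start-Job -ScriptBlock {
    param($path)
    # Re-hydrate the saved context inside the background PowerShell process.
    Import-AzureRmContext -Path $path
    # ... the actual cleanup of old ARM deployments would go here ...
} -ArgumentList $contextPath

Wait-Job $job | Receive-Job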

Between those versions of Octopus, we have upgraded the Azure PowerShell cmdlets from 5.8 to 6.11, so we're wondering if something here may behave differently all of a sudden. To test that theory, you could install your own version of the Azure PowerShell modules instead of the ones we bundle (try grabbing the older 5.8, or even the latest version, 6.13 I believe) and configure Octopus to use your custom version instead (see instructions here).
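
As a rough sketch, installing a specific AzureRM version on the server could look like the following - the exact patch version here is an assumption (pick whichever 5.8.x or 6.13.x release you want to test), and you'd still need to point Octopus at your custom modules per the instructions linked above:

# Run this as the account that executes your PowerShell steps on the Octopus Server.
Install-Module -Name AzureRM -RequiredVersion 6.13.1 -Scope AllUsers -Force
# Confirm which versions PowerShell can now resolve:
Get-Module -ListAvailable -Name AzureRM | Select-Object Name, Version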

Another possibility (although we think it's unlikely) may be the introduction of the Azure CLI since your upgrade. Since you don't appear to be using the CLI anywhere in your scripts, you could try disabling the Azure CLI feature altogether by creating an OctopusDisableAzureCLI variable and setting its value to True (full details here), then monitor to see if that change has any effect on your deployments.

Sorry we don’t have any exact answers at this stage, but hopefully we can get a reproduction to narrow things down.

Cheers
Mark

Hi Mark,

Thanks for trying to reproduce it. I'll try to clarify some of the points you raised.

Did you try to reproduce a not-so-mini version? :slight_smile: And maybe with several similar projects running at the same time? I guess it might be related to timing. Our failing step runs about 10-15 minutes after the deployment starts (and the deployment starts with an Azure step).

Regarding Save-AzureRmContext - you are absolutely correct, it saves the context for parallelized jobs running via Start-Job. I will try disabling it, but it was working without problems until recently, so yes - probably not the culprit, as you say.

I'm afraid I won't be able to try installing older Azure PowerShell cmdlets. As I understand it, they would need to be installed on the Octopus Server itself, but our Octopus instance is used across quite a big company by many teams, and I'm just a regular user there. It would be hard and time-consuming to convince the responsible team to try this while the error is so unclear and transient.

I will try disabling the Azure CLI with the variable - that part doesn't seem to be a blocker. :slight_smile:

Hi Aleksandras,

I am sorry that you are running into this issue.
I have done a bit of research and it seems this is a known issue, see https://github.com/Azure/azure-powershell/issues/7110.
The solution Microsoft suggests is described in https://github.com/Azure/azure-powershell/issues/7110#issuecomment-426750910.

So I have raised an issue for us to call Disable-AzureRMContextAutosave -Scope Process before authenticating to Azure. But for now, do you want to try executing Disable-AzureRMContextAutosave -Scope CurrentUser (you need to run this as the user that executes the PowerShell scripts) and let us know if this does indeed fix the randomness you are seeing?

Regards
John

Thanks for the follow-up! Very nice that you nailed it down!
Any idea how this could be related to the fact that it only started appearing recently? It seems to be the same old Azure PowerShell cmdlets…

Can you clarify a bit - where should this command be executed? In some Azure deployment step? In all Azure-related steps? In the first Azure step of the deployment process?
Or just once, on the Octopus Server? Just a note - I don't have direct access to our Octopus Server, but if the benefits are clear and it's clear how to do it, I could raise a ticket with our support team.

Hi Aleksandras,

Because you are running an Azure PowerShell step, I would call Disable-AzureRMContextAutosave -Scope Process at the beginning of the script.
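
For example, something like this at the top of the script in that step (a minimal sketch; the rest of your existing script stays unchanged):

# Stop AzureRM from autosaving/reloading its context for this PowerShell process,
# so concurrent deployments can't interfere with each other's saved context.
Disable-AzureRMContextAutosave -Scope Process

# ... the rest of your existing script follows here ...
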
Does this make sense?

Cheers
John

We have not only Azure PowerShell steps, but also built-in Azure deployment steps (like "Deploy an Azure Resource Manager template"). In that case, would we need to enable the Custom PowerShell feature for those?

We have a bunch of projects with a bunch of different Azure-related steps, so it would take a while to add your suggested script everywhere. I just want to be sure we're doing it right. :slight_smile:

Hi Aleksandras,

In that case, maybe run a one-off Disable-AzureRMContextAutosave -Scope CurrentUser. This should do it permanently for the user account that you are running Octopus as.

Hope this helps

Cheers
John

Hi again :slight_smile:

I tried to make a smarter workaround instead (I wasn't able to run Disable-AzureRMContextAutosave for CurrentUser because I don't have access to the Octopus Server, and I didn't want to do it per Process because it would be needed in too many steps). So I ended up creating a new script module in the Octopus Script Modules library with this simple content:

Write-Host "==> Executing: Setting Disable-AzureRMContextAutosave"
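# Note: -Scope Process only affects the current PowerShell process, which is why the module is included in every relevant step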
Disable-AzureRMContextAutosave -Scope Process

Then I included this script module in all relevant processes. I can see it being executed in every step; however, right after it another error is thrown. It seems like it breaks something else. Please see the log attached below:

15:23:47   Verbose  |       Importing Script Module 'Octopus AzureRMContextAutosave' from 'D:\Octopus\Work\20190121142344-290555-7348\Library_OctopusAzureRMContextAutosave_636836810261879409.psm1'
15:23:49   Info     |       ==> Executing: Setting Disable-AzureRMContextAutosave
15:23:49   Verbose  |       PowerShell Environment Information:
15:23:49   Verbose  |       OperatingSystem: Microsoft Windows NT 6.3.9600.0
15:23:49   Verbose  |       OsBitVersion: x64
15:23:49   Verbose  |       Is64BitProcess: True
15:23:49   Verbose  |       CurrentUser: NT AUTHORITY\SYSTEM
15:23:49   Verbose  |       MachineName: SPCS-SPV-APP-46
15:23:49   Verbose  |       ProcessorCount: 4
15:23:49   Verbose  |       CurrentDirectory: D:\Octopus\Work\20190121142344-290555-7348
15:23:49   Verbose  |       CurrentLocation: D:\Octopus\Work\20190121142344-290555-7348
15:23:49   Verbose  |       TempDirectory: C:\Windows\TEMP\
15:23:49   Verbose  |       HostProcessName: powershell
15:23:49   Verbose  |       TotalPhysicalMemory: 16776756 KB
15:23:49   Verbose  |       AvailablePhysicalMemory: 11168908 KB
15:23:49   Verbose  |       Authenticating with Service Principal
15:23:50   Verbose  |       Account          : xxx
15:23:50   Verbose  |       SubscriptionName : xxx
15:23:50   Verbose  |       SubscriptionId   : xxx
15:23:50   Verbose  |       TenantId         : xxx
15:23:50   Verbose  |       Environment      : AzureCloud
15:23:51   Verbose  |       Invoking target script "D:\Octopus\Work\20190121142344-290555-7348\Script.ps1" with  parameters
15:23:51   Info     |       voncore-int westeurope
15:23:51   Info     |       Checking if resource group voncore-int exists
15:23:51   Error    |       CheckAndCreateResourceGroup : The 'Get-AzureRmResourceGroup' command was found
15:23:51   Error    |       in the module 'AzureRM.Resources', but the module could not be loaded. For
15:23:51   Error    |       more information, run 'Import-Module AzureRM.Resources'.
15:23:51   Error    |       At D:\Octopus\Work\20190121142344-290555-7348\Script.ps1:4 char:2
15:23:51   Error    |       +     CheckAndCreateResourceGroup -Name $OctopusParameters["RM.VonCore. ...
15:23:51   Error    |       +     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15:23:51   Error    |       + CategoryInfo          : ObjectNotFound: (Get-AzureRmResourceGroup:String
15:23:51   Error    |       ) [CheckAndCreateResourceGroup], CommandNotFoundException
15:23:51   Error    |       + FullyQualifiedErrorId : CouldNotAutoloadMatchingModule,CheckAndCreateRes
15:23:51   Error    |       ourceGroup

Hi Aleksandras,

Regarding the "module could not be loaded" error you're seeing, this was a known issue that has recently been fixed in version 2018.12.0. If you are able to upgrade your Octopus to the latest version, we believe this will solve your problem.

Let me know how you go.

Thanks
Mark

Hi MarkSiedle, thank you for the response!

We can update Octopus to the latest version, but we cannot do it every day or at a similar pace (usually our company does it once every 1-2 months). So I would like to clarify some things.

  1. https://github.com/OctopusDeploy/Issues/issues/5221 says it affected versions from v2018.11.0, however we currently have v2018.9.15 in our environment. Can it still be relevant?
  2. The "module could not be loaded" error only started to occur after I added the workaround suggested by @John_Simons. Once I removed it, everything started working again.
  3. Can you please confirm that our initial issue was also fixed in the v2018.12.0 release? There seem to be related notes in the release log, I just would like to confirm. If that's the case, it will be much easier to convince our administration team to upgrade Octopus. :slight_smile:

Hi Aleksandras,

Sorry, I linked the wrong issue in my last response. This is the issue I should have linked. That change will call Disable-AzureRMContextAutosave -Scope Process for every Azure step, which will save you from having to inject it everywhere manually yourself.

In terms of how we arrived here, the upgrade from 2018.6.2 to 2018.9.17 included an upgrade of the Azure PowerShell modules from 5.7.0 to 6.8.1 respectively, so the earlier version of those modules may not have suffered from this Azure PowerShell issue. That may explain why these problems only started occurring after your upgrade.

Unfortunately we have not been able to reproduce this issue, so we cannot confirm that this will fix the problems you are seeing. But we believe your issue was caused by this known Azure PowerShell issue, so all of these recommendations are attempts to resolve or work around that.

In terms of your specific points:

  1. This was my bad, see the link I mentioned above.
  2. If you're seeing "module could not be loaded", that may be a sign that the fix we're suggesting doesn't work with the older version of the Azure cmdlets included in 2018.9.17. We saw this same error recently and had to roll the Azure PowerShell modules back to a specific version (which is included in 2018.12.0), so if you're seeing this in your older version, it may have been suffering from a similar problem (in which case, we're hoping the Azure cmdlets included in 2018.12.0 will solve this).
  3. See above. We have not been able to reproduce, so we cannot confirm at this stage.

If you wanted to confirm this as a fix before upgrading your main/large Octopus instance, you could consider spinning up a separate test instance of Octopus. As long as you matched the version you currently have in production (2018.9.17) and reproduced the issues you are seeing against your Azure subscription, then you could test an upgrade to latest in isolation and confirm the fixes that have gone into 2018.12.0.

Again, sorry for the confusion. Hope this helps to clarify things.

Mark

Hi,

Sad news… Our main Octopus Deploy instance was upgraded to v2018.12.1, and I've just got the same issue:

Set-AzureRmResource : Your Azure credentials have not been set up or have expired, please run Connect-AzureRmAccount to set up your Azure credentials.

Update: it seems this was a promotion of a deployment that was created prior to the upgrade - so maybe something was "snapshotted"? I will try a fully clean, new deployment and then monitor a bit more…

I was going to say that it finally works… It was working quite well for a few days without an issue. There were nightly deployments every night, but other than that we weren't actively deploying much recently - so I wanted to give it more time to settle down.

Unfortunately, today I saw the error pop up again… on a completely new deployment.
Maybe it's not as frequent now, or maybe that's just a feeling, but it seems it's not 100% fixed after all.

I uploaded a new full log to https://file.ac/l5rV7JDa4pQ/ - maybe you can spot something interesting. :wink:
However, it might be because of other deployments running in parallel…

One more finding - we are experiencing some new locks on our deployments (where a task is blocked by some other task), like this:

Waiting on scripts in tasks ServerTasks-295474, ServerTasks-295476, ServerTasks-295486, ServerTasks-295488, ServerTasks-295489, ServerTasks-295495 and ServerTasks-295499 to finish. This script requires that no other Octopus scripts are executing on this target at the same time.

We had such issues previously, but not as critical (mainly fixed by setting OctopusBypassDeploymentMutex=true). However, after we upgraded to v2018.12.1 it seems to have become much worse. I don't own Octopus maintenance in our company, so I don't know for sure whether anything else was changed along with the upgrade to v2018.12.1. Or maybe many new projects were added - I don't know, just wild guessing…

Anyway, these locks are probably a different issue, but they might be related to my previous comment. Since steps can be stuck for 5-20 minutes, might that be the reason the Azure credentials are expiring?

Hi Aleksandras,

The Azure credentials expiring should not be due to the wait since we only open a new connection after acquiring the lock.

Regarding the "waiting on script tasks…" messages, could the problem be a lack of capacity, meaning you may need to add more workers?

Regards
John

OK, if you say so - then the upgrade to v2018.12.1 doesn't solve the "Azure credentials expiration" issue… It does seem to occur less often now, but that might be a placebo effect (i.e. just an assumption).

Regarding “waiting on script tasks” - as I mentioned it’s probably a different issue, and our Octopus maintenance team is investigating it. I pointed them your mentioned article, thank you!