Azure Application Services Health Check Fails

Ruan.vanderMerwe · 19 August 2021 09:18

Hi,

We are experiencing a strange issue. We have multiple Azure-hosted Application Services that we deploy to from our on-prem Octopus server. This has been working fine for a long time, but after we upgraded to the latest version of Octopus, the health checks for these application services started failing with the following error:
Running rollback behaviours…
August 19th 2021 11:10:23 Error | Object reference not set to an instance of an object.
August 19th 2021 11:10:23 Error | System.NullReferenceException
August 19th 2021 11:10:23 Error | at Microsoft.Azure.Management.ResourceManager.Fluent.Core.RestClient.RestClientBuilder.WithEnvironment(AzureEnvironment environment)
August 19th 2021 11:10:23 Error | at Microsoft.Azure.Management.ResourceManager.Fluent.Core.AzureConfigurable`1.BuildRestClient(AzureCredentials credentials)
August 19th 2021 11:10:23 Error | at Microsoft.Azure.Management.Fluent.Azure.Configurable.Microsoft.Azure.Management.Fluent.Azure.IConfigurable.Authenticate(AzureCredentials credentials)
August 19th 2021 11:10:23 Error | at Calamari.Azure.AzureClient.CreateAzureClient(ServicePrincipalAccount servicePrincipal)
August 19th 2021 11:10:23 Error | at Calamari.AzureAppService.HealthCheckBehaviour.d__2.MoveNext()
August 19th 2021 11:10:23 Error | --- End of stack trace from previous location where exception was thrown ---
August 19th 2021 11:10:23 Error | at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
August 19th 2021 11:10:23 Error | at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
August 19th 2021 11:10:23 Error | at Calamari.Common.Plumbing.Pipeline.PipelineCommand.d__16.MoveNext()
August 19th 2021 11:10:23 Error | --- End of stack trace from previous location where exception was thrown ---
August 19th 2021 11:10:23 Error | at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
August 19th 2021 11:10:23 Error | at Calamari.Common.Plumbing.Pipeline.PipelineCommand.d__13.MoveNext()
August 19th 2021 11:10:23 Error | --- End of stack trace from previous location where exception was thrown ---
August 19th 2021 11:10:23 Error | at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
August 19th 2021 11:10:23 Error | at Calamari.Common.Plumbing.Pipeline.PipelineCommand.d__13.MoveNext()
August 19th 2021 11:10:23 Error | --- End of stack trace from previous location where exception was thrown ---
August 19th 2021 11:10:23 Error | at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
August 19th 2021 11:10:23 Error | at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
August 19th 2021 11:10:23 Error | at Calamari.Common.CalamariFlavourProgramAsync.d__4.MoveNext()
August 19th 2021 11:10:23 Fatal | The remote script failed with exit code 100

We are still able to deploy to these applications, but we have health checks scheduled every hour and they fail with the above error. We have done and confirmed the following:

Ensured the SPN account is valid and working
Ensured the slot we are deploying to and health-checking is active and running
Deleted the old connection and established a new one (same error)
Removed the infrastructure on the Azure side and recreated it

None of these steps resolved the issue. We went further and updated the registry of the VM hosting the Octopus server (Server 2012 R2) to allow TLS versions 1.1 and 1.2, and ensured all patches are up to date. We also confirmed that all the ports are open to and from this VM. Lastly, we asked our DB team to reindex the database.

Any advice or help would be highly appreciated.

Extra info:
VM: Server 2012 R2
Current OD Version: 2021.1.7595

Thanks

paul.calvert · 19 August 2021 10:31

Hi @Ruan.vanderMerwe,

Thanks for getting in touch!

Does the connection pass through a proxy to access the Azure services?
The reason I ask is that we have discovered an issue where Octopus is failing to use the custom proxy configured when performing health checks on Azure Web Apps.

Regards,
Paul

Ruan.vanderMerwe · 19 August 2021 10:44

Hi @paul.calvert ,

Thanks for getting back to me.
We do not make use of any proxy or load balancers. It is a direct connection.

Hope this helps

paul.calvert · 19 August 2021 10:51

Would you be able to download and attach the task log from one of the health checks?
Also, what version of Octopus were you using before the upgrade?

Ruan.vanderMerwe · 19 August 2021 11:05

Sure, I will quickly compile the files for you and send them through.

The last version that worked with the health checks was 2020.6.4671, installed on the 9th of March, and we ran that version until the 14th of June. We then upgraded to 2021.1.7316, where we first started experiencing this issue.

ServerTasks-237647.log.txt (7.9 KB)

Please also note that the option to upgrade the Calamari version also fails.

paul.calvert · 19 August 2021 11:45

Thanks for that. The error is extremely vague, so I will raise this with our engineers to see if they can shed some more light on it.

Would it be possible to get a copy of a successful deployment to this target? You can send it to me in a private message if there is any chance of confidential details being in there.

Ruan.vanderMerwe · 19 August 2021 12:03

Perfect and thanks for the help here @paul.calvert

Let me dig in my archives and send it to you

Ruan.vanderMerwe · 19 August 2021 12:18

@paul.calvert

I am no DB expert, but I just ran an integrity check against the VM from the UI and it reported these issues.

It could be our root cause here, but I do not want to go and “willy nilly” drop these tables. Is there a way we can resolve these issues and then check again?

Update:
I have sent these logs to our DB Guru - He is taking a look and will provide me with an update

Ruan.vanderMerwe · 19 August 2021 12:43

Okay, we have fixed that now by removing the tables (we have a backup).

I tried the health check again and it still fails. I will reboot the server tonight to ensure a fresh connection is made; maybe that will resolve the issue?

paul.calvert · 19 August 2021 13:18

There is no harm in trying a reboot.

I’ll still bring this up with our engineers and also provide them with the deployment log for comparison once I get it.

We have recently come across some similar errors from a few users in relation to Azure, but they were all occurring during the deployment and the fixes were specific to the deployment steps. The fact that this is only affecting your health checks is quite strange.

paul.calvert · 19 August 2021 14:26

Bit of a long shot, but have you tried creating a fresh Azure deployment target within Octopus? I’m wondering whether there is some issue with the deployment target data, in which case creating a fresh one might get a different result.

Another possible test would be to create one of our free Cloud instances, add an Azure deployment target there and re-test the health check.

Ruan.vanderMerwe · 19 August 2021 14:38

Thanks for the feedback @paul.calvert,

I have tried creating a fresh target in Azure and connecting it to the Octopus instance, but the same error keeps popping up. The issue with new targets is that they are then marked as unhealthy, and you cannot deploy to them because Octopus does not allow deployments to unhealthy targets.

We are considering moving our On-Prem solution to the Cloud Hosted environment, but we have not done all the research as of yet. With that being said, can one of the engineers (or yourself) set up a meeting with us to talk through the migration? It would be very helpful

paul.calvert · 19 August 2021 19:55

In regards to migrating to a Cloud instance, we recently added a feature that allows users to do this independently.

This feature gives you the ability to export any number of projects at a time and then import them into the Cloud instance. It will include all linked items such as variable sets, tenants and accounts.

The Cloud instances are free for up to 10 deployment targets, so, you could create one to test out the project import/export feature and also test the Azure target health check to see if this is something within the software code or something specific to your on-premise database/environment.

Ruan.vanderMerwe · 20 August 2021 05:09

Awesome, thanks Paul

I took the time yesterday to start the first couple of steps of the transition
Let me poke around a bit and set up the infra on the free tier and report back to you

Thanks for your help thus far
Much appreciated

Ruan.vanderMerwe · 20 August 2021 09:15

Hi @paul.calvert

Interesting, on the Cloud Hosted tier I have set up everything (New account connected with new SPN as well)
Same infrastructure connected - Got the same error

ServerTasks-48.log.txt (11.2 KB)

So now I am not sure if the issue resides with Octopus or with the application service.
I am digging through the logs now to see if I can find anything.

paul.calvert · 20 August 2021 09:30

Would you be happy for me and our engineers to log in to your Cloud instance to investigate this further?

Being able to see the exact configuration may help them figure out the problem.

Ruan.vanderMerwe · 20 August 2021 09:35

Yes sure

The more eyes on this the better

paul.calvert · 20 August 2021 10:10

I think I may have found the issue.

Within the Azure Account configuration there is a setting to enable the Isolated Azure Environment connection.

If this is disabled, your health check returns successfully.
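For anyone with many Azure accounts, a small helper can flag which ones have the isolated environment option set. This is only a sketch: it assumes each account is available as a dictionary (for example, exported from the Octopus REST API's accounts listing), and the field names `AccountType`, `AzureEnvironment` and `Name`, as well as the assumption that an empty `AzureEnvironment` means the option is disabled, are my recollection of the account resource and should be checked against your own instance. The sample data is invented.

```python
from typing import Any, Dict, List


def accounts_with_isolated_environment(accounts: List[Dict[str, Any]]) -> List[str]:
    """Return the names of service principal accounts that specify an Azure environment.

    Assumption: a blank or missing "AzureEnvironment" field means the isolated
    Azure Environment option is disabled for that account.
    """
    flagged = []
    for account in accounts:
        # Only service principal accounts are relevant here (assumed type name).
        if account.get("AccountType") != "AzureServicePrincipal":
            continue
        environment = (account.get("AzureEnvironment") or "").strip()
        if environment:
            flagged.append(account.get("Name", "<unnamed>"))
    return flagged


# Invented sample data, shaped like the assumed account resource.
sample_accounts = [
    {"Name": "Prod SPN", "AccountType": "AzureServicePrincipal", "AzureEnvironment": "AzureCloud"},
    {"Name": "Test SPN", "AccountType": "AzureServicePrincipal", "AzureEnvironment": ""},
    {"Name": "Build user", "AccountType": "UsernamePassword"},
]

flagged_names = accounts_with_isolated_environment(sample_accounts)
print(flagged_names)
```

Accounts returned by the helper are the candidates for disabling the setting and re-running the health check.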

Ruan.vanderMerwe · 20 August 2021 10:21

Interesting. I would like to know what changed between the previous version and the current one that caused this setting to break the health check.

Thanks for pointing this out and helping to get this resolved, we really appreciate this

paul.calvert · 20 August 2021 10:28

I am raising this with our engineers so they can take a deeper look into this.
With the account check being successful with that setting enabled, I would expect the health check to also be successful. Judging from the error stack it may be that the API call we’re sending to Azure for the health check isn’t being formed correctly.
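The top frame of the stack trace is `RestClientBuilder.WithEnvironment(AzureEnvironment environment)`, so one plausible reading (not a confirmed diagnosis) is that the health check passes along an environment object that was never resolved. The sketch below illustrates that failure mode in Python rather than in Calamari's own C#. Every name in it (`Environment`, `from_name`, the two `build_base_uri_*` functions) is hypothetical and is not taken from Calamari or the Azure SDK. The endpoints are the publicly documented Azure Resource Manager endpoints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    name: str
    management_endpoint: str


# Illustrative lookup table; keys are lower-cased environment names.
KNOWN_ENVIRONMENTS = {
    "azureglobalcloud": Environment("AzureGlobalCloud", "https://management.azure.com/"),
    "azurechinacloud": Environment("AzureChinaCloud", "https://management.chinacloudapi.cn/"),
    "azureusgovernment": Environment("AzureUSGovernment", "https://management.usgovcloudapi.net/"),
}


def from_name(name: Optional[str]) -> Optional[Environment]:
    """Resolve an environment by name; returns None for blank or unknown names."""
    if not name:
        return None
    return KNOWN_ENVIRONMENTS.get(name.lower())


def build_base_uri_unguarded(name: Optional[str]) -> str:
    # Mirrors the suspected bug: the lookup result is used without a null check.
    # An unresolved name raises AttributeError here, the Python counterpart of
    # a NullReferenceException.
    env = from_name(name)
    return env.management_endpoint


def build_base_uri_guarded(name: Optional[str]) -> str:
    # Defensive variant: fail with a message that names the bad input.
    env = from_name(name)
    if env is None:
        raise ValueError(f"Unknown Azure environment name: {name!r}")
    return env.management_endpoint


print(build_base_uri_guarded("AzureChinaCloud"))
```

If something like this is what happens, a guard of the second kind would at least turn the vague "Object reference not set to an instance of an object" into an error that points at the account's environment setting.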