Randomly getting 'System.BadImageFormatException: Index not found' error in Web API after some deployments to an Azure App Service

cloud

(Agustín Ibars) #1

I have an ASP.NET Web API application running on an Azure App Service which is, in turn, deployed via Octopus after a TFS build definition is completed.

After certain successful deployments (no errors happen nor in TFS nor in Octopus), the whole API stops working and all endpoints respond with the following error:

System.BadImageFormatException
Index not found. (Exception from HRESULT: 0x80131124)

and the stack-trace looks like this:

System.BadImageFormatException: Index not found. (Exception from HRESULT: 0x80131124)
   at System.Web.Mvc.Controller.BeginExecute(RequestContext requestContext, AsyncCallback callback, Object state)
   at System.Web.Mvc.Controller.System.Web.Mvc.Async.IAsyncController.BeginExecute(RequestContext requestContext, AsyncCallback callback, Object state)
   at System.Web.Mvc.MvcHandler.<BeginProcessRequest>b__4(AsyncCallback asyncCallback, Object asyncState, ProcessRequestState innerState)
   at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncVoid`1.CallBeginDelegate(AsyncCallback callback, Object callbackState)
   at System.Web.Mvc.Async.AsyncResultWrapper.WrappedAsyncResultBase`1.Begin(AsyncCallback callback, Object state, Int32 timeout)
   at System.Web.Mvc.Async.AsyncResultWrapper.Begin[TState](AsyncCallback callback, Object callbackState, BeginInvokeDelegate`1 beginDelegate, EndInvokeVoidDelegate`1 endDelegate, TState invokeState, Object tag, Int32 timeout, SynchronizationContext callbackSyncContext)
   at System.Web.Mvc.MvcHandler.BeginProcessRequest(HttpContextBase httpContext, AsyncCallback callback, Object state)
   at System.Web.Mvc.MvcHandler.BeginProcessRequest(HttpContext httpContext, AsyncCallback callback, Object state)
   at System.Web.Mvc.MvcHandler.System.Web.IHttpAsyncHandler.BeginProcessRequest(HttpContext context, AsyncCallback cb, Object extraData)
   at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
   at System.Web.HttpApplication.ExecuteStepImpl(IExecutionStep step)
   at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)

The error is a bit mysterious because it doesn’t seem to related to any particular set of changes in the source code or to any particular action. That’s why it looks random. At the moment, the only way I found to temporarily solve this is re-deploying the same Octopus’ package. That restores the API.

My suspicion is that something happens between Octopus and the Azure Web App.

What I also found is that after the first deployment in Octopus (the one that produces the failure), I saw the following line in the logs:

17:16:00 Info | Successfully deployed to Azure. 0 objects added. 48 objects updated. 0 objects deleted.

Now, after the second deploy in Octopus (the re-deploy of the release that failed), one would expect to have “0 objects updated”, but instead I got the following line:

18:48:57 Info | Successfully deployed to Azure. 0 objects added. 3 objects updated. 0 objects deleted.

Some additional information:
Octopus version is v2018.6.2
TFS version is 15.117.27414.0
ASP.NET app targets .NET 4.5.2

What could be possibly going wrong?


(Shannon Lewis) #3

Hi Agustín,

Thanks for getting in touch. BadImageFormatExceptions are usually the result of a 32bit/64bit mismatch. Is it possible that it’s either building the wrong one or the setting is somehow changing in the Azure App Service configuration?

Regards
Shannon


(Agustín Ibars) #4

Hi Shannon,

Thanks for your answer. I’m aware that BadImageFormatExceptions could be caused by mismatch in the architecture. All the projects within the .NET solution are set as “Any CPU”, should I adjust some of that?

Regarding the Azure App Service configurations, I don’t think they have changed. As I said, the error occurs randomly and deploying for a second time solves the issue, so in that short period of time is too hard for the configurations to change.

I tried now enabling “AppOffline” and see how next depoys behave.

Any additional information you might have, it’ll be really helpful.

Regards,
Agustín


(Shannon Lewis) #5

Apologies for the delay, “Any CPU” should be fine and the config doesn’t sound related if the redeploys that close together can work/not work.

It sounds almost like the files are sometimes being corrupted in transit at times, or something like that. I haven’t been able to reproduce the issue and we haven’t had any other reports, so it’s probably going to be a tricky one to track down.

The only thing I can think of at this point is, could you try changing the File Comparison Method on the step? It will be Timestamp by default, Checksum is slower but is based on comparing the file contents so may help if the files aren’t stamped correctly somehow. The performance difference may not be too bad, depending on your package size.

Please let me know how you go.


(Agustín Ibars) #6

Thanks, I will try that and come back with an answer.
Enabling “AppOffline” didn’t do the trick apparently.


(Agustín Ibars) #7

Hi Shannon,

Unfortunately, the File Comparison change didn’t work. I had again the failure today, but what was weird, is that in the 2nd deploy’s logs, no objects were updated.

1st

2nd


(Agustín Ibars) #8

Would be useful to at least determine if the problem is on the source code of the application I have issues with or, with the deployment process itself.


(Shannon Lewis) #9

Is the 2nd deployment the same package version, or subsequent build with a different patch number or something like that in the version?

When a deploy fails, if you redeploy the same package without another build does this resolve the issue in the site?


(Agustín Ibars) #10

The second deploy is of the same package and same version of the first one. No new releases are created when this happen.
And yes, it solves the issue and the application works fine after that.

In the variables table of the project, there is a PGP Key that has to be inserted in a web.config value, this pgp Key is a long string and contains line breaks. Could this be causing issues?

In addition, I enabled the “remove App_Data” option after each deploy. I’ll post the result of enabling this and if it prevents the failure to happen.

Thank you


(Agustín Ibars) #11

Ok, so I had a new failure today. My current config looks like this


(Agustín Ibars) #12

I think this error started to happen after the Elmah package was added to the solution. Have you heard any issue related to this package?


(Agustín Ibars) #13

Removing ELMAH didn’t change anything and the failure persisted.


(Shaun Marx) #14

Hi Agustín,

If you are consistently seeing an error then there is a chance that it is related to the package, but can’t really confirm unless we can get a bit more information from you. Is there a stack trace or anything in the logs that you can share so that we can help you?

Regards,
Shaun


(Agustín Ibars) #15

Hi Shaun,

The only stack trace that I get when the deployment fails is the attached one in the main body of the post. All API endpoints return the same exception as well as the homepage plus some other pages used for logs and API documentation.
Let me know if there is something else that I should provide.

Thank you.


(Shaun Marx) #17

Hi Agustín,

Apologies, I was under the impression that you were getting a different and more consistent error after your last replies.

If you aren’t consistently getting the bad image exception and no errors while deploying then it’s doubtful that the issue is with the package. My gut feel would be something network related. Do you possibly have a proxy or firewall between Octopus and Azure? Do you have any other web apps being deployed that exhibit the same problem?

We use web deploy under the covers which can be a bit chatty and can get upset if you have a large number of files that need to be uploaded such as npm modules. If upgrading is an option, it could be worthwhile to wait for our next monthly release. The new release closes the loop on our workers’ feature which should allow you to push steps that run on the server such as the deploy web app step to an edge node which may be closer to Azure. This could potentially eliminate any network, firewall or proxy related issues from the equation.

Capturing some network traffic on your octopus server using a tool such as Wireshark when deploying could potentially provide some additional insights as well.

Regards,
Shaun


(Agustín Ibars) #18

I don’t know yet if we have a proxy or firewall between Azure and Octopus, I’ll investigate that. Unfortunately, this error is only happening for this particular solution (no other errors exactly like this were reported by the company).


(Shaun Marx) #21

Hi Agustín, just wanted to follow up on this and see how you are going with this issue?

Have you managed to gather some additional information which could possibly explain what we are seeing here?

One thing we did notice after reviewing the original logs that you provided, is that the timestamps for all files during the comparison seemed to be the same even for things which probably would not have changed i.e. fonts. There used to be a bug in NuGet which could account for some of the discrepancies that you were seeing between the CRC and timestamp based comparison methods. It’s a bit of a long shot, but just wanted to check with you as to how you are creating your package?

Regards,
Shaun


(Agustín Ibars) #22

Hi Shaun,

Thanks for asking. I have not collected any additional information around this.
However, I have not seen this issue anymore for the past month. I don’t know what could have changed.

Regards


(Agustín Ibars) #24

Greetings,

I have started to have this issue again really often. Did you find anything related to it?

Thank you.


(John Simons) #25

Hi Agustín,

Really sorry to hear that this issue is back again.
We haven’t found anything related to it, or any other users reporting the same issue as you.

I have reviewed all the conversations in this thread and the first message you sent us, you say:

What I also found is that after the first deployment in Octopus (the one that produces the failure), I saw the following line in the logs:

17:16:00 Info | Successfully deployed to Azure. 0 objects added. 48 objects updated. 0 objects deleted.

Now, after the second deploy in Octopus (the re-deploy of the release that failed), one would expect to have “0 objects updated”, but instead I got the following line:

18:48:57 Info | Successfully deployed to Azure. 0 objects added. 3 objects updated. 0 objects deleted.

This points to something wrong when doing the sync with Azure, as you can see in our code, https://github.com/OctopusDeploy/Calamari/blob/a5d18674827e852e2808bb9fca6a864c614a8b15/source/Calamari.Azure/Deployment/Conventions/AzureWebAppConvention.cs#L64-L75 we are using the DeploymentManager (https://msdn.microsoft.com/en-us/library/dd543271(v=vs.90).aspx) from Microsoft to publish the code to Azure.

So what I would like to try next is for you to publish the code via Visual Studio to that same app service and see if that works consistently.

It could also be that there is something wrong with the app service/web app in Azure, so it may be useful to recreate a fresh app service and web app and see if that fixes this issue.

I will wait for your results.

Regards
John