Some of our deployments intermittently fail with the error “Stack empty”, thrown somewhere down in the Octopus Server stack. The full call stack is included in the paste below.
This seems to happen especially when the deployment runs right after another deployment in the previous environment, or right after the release is created, although I am not 100% sure this is a reliable pattern.
The process being deployed in this example is a Windows Service deployment, but execution never reaches the actual steps in the process; it fails even before pulling packages. The failing deployment is part of a larger parent project in which we have multiple “Deploy a release” steps. Only one of them failed with this error; the rest passed. On other occasions we have seen different releases fail to deploy, so it seems a bit random.
Is this a known issue? It seems to have started after we migrated to a new Octopus server running a newer version of Octopus Deploy. We currently use 2021.2 (Build 7727).
Thanks a lot for the quick reply! I will certainly try this out.
Is it just a matter of adding a variable named OctopusBypassDeploymentMutex to the parent (orchestrator) project and setting it to false? I assume it defaults to “true”, since I haven’t added it with any value so far.
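For anyone who wants to script this rather than use the UI, here is a minimal sketch of adding the variable through the Octopus REST API. The server URL, API key, and variable-set ID are placeholders, and the exact payload fields are assumptions based on the standard variable-set endpoint, so treat this as illustrative only:

```python
import json
import urllib.request

OCTOPUS_URL = "https://octopus.example.com"   # placeholder server URL
API_KEY = "API-XXXXXXXX"                      # placeholder API key
VARIABLE_SET_ID = "variableset-Projects-1"    # placeholder variable-set ID


def bypass_mutex_variable(value: bool) -> dict:
    """Build the variable entry for OctopusBypassDeploymentMutex."""
    return {
        "Name": "OctopusBypassDeploymentMutex",
        "Value": str(value),   # Octopus stores variable values as strings
        "Type": "String",
        "IsSensitive": False,
    }


def add_variable(variable: dict) -> None:
    """Fetch the project's variable set, append the variable, and PUT it back."""
    headers = {"X-Octopus-ApiKey": API_KEY, "Content-Type": "application/json"}
    url = f"{OCTOPUS_URL}/api/variables/{VARIABLE_SET_ID}"
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as resp:
        variable_set = json.load(resp)
    variable_set["Variables"].append(variable)
    req = urllib.request.Request(
        url,
        data=json.dumps(variable_set).encode(),
        headers=headers,
        method="PUT",
    )
    urllib.request.urlopen(req)
```

Note that the variable value must be the string "True" or "False" rather than a boolean, since Octopus variable values are plain strings.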
We haven’t done any resource monitoring. We can do that; however, to my eyes this does seem like a bug in the code. But we’ll have a look at the resource usage anyway.
We have strict rules about what we can include in logs we share, so I have “anonymized” the project and server names by hashing them (you should still be able to correlate the different logs). In any case, I’ve uploaded:
Task execution log for the orchestrator project that fails (note that even though the final “Deploy a release” task failed in the orchestrator project, the subproject deployment itself actually succeeded).
Process definition JSON for the orchestrator project.
Server logs covering the time period in question.
Let me know if you need anything else related to this. It is unfortunately causing a lot of disruption for us at the moment, so we’re really hoping for a workaround or a quick solution.
Just some more info: one of the environments we deploy to has only two of the steps in this orchestration project enabled (via environment conditions), and even that environment often fails with the same error. So this doesn’t seem to be load-related, nor tied to a high number of simultaneous jobs.
We agree that the issue lies somewhere in the code; we’re just aiming to gather as much data as possible to compare your situation with the other users who have encountered this, which will help our engineers track down the cause more quickly.
If you follow the issue my colleague linked earlier, our engineers will update it with a note on which versions the fix is released in once it’s complete.
Are there any updates on this issue that could be shared? This is getting a bit painful for us, so knowing when a fix is planned would be very helpful.
I am following the GitHub issue, but so far the only activity has been more confirmed cases, and the engineers who were assigned have since been unassigned. Is there anything you can share about what’s going on behind the scenes?
I can completely understand that this issue is awkward to work around. Unfortunately I don’t have any updates at the moment. The issue is with our engineering team and needs to work through our usual processes (e.g. triage, assignment, development, testing, and so on). I’ll let the team know you’re keen for this to be fixed, and I will let you know if I can provide any further context.
I hope this is helpful. Please let me know if you have any questions.