"Raw Kubernetes Yaml" check fails with "An item with the same key has already been added."

I have a deployment which contains various “Deploy Raw Kubernetes YAML” steps, all of which worked until recently.

I made some small changes, not sure if they're related, but since today all of the steps successfully create the YAML resources, yet fail at what seems like the end of the step, where the Kubernetes resource status is checked, even though that check is disabled.

Octopus Server version: 2023.3.1807

According to the log, the Kubernetes resources are successfully created.

Successfully authenticated with the Azure CLI
Creating kubectl context to AKS Cluster in resource group RG.DEV called connectedquality-aks (namespace master) using a AzureServicePrincipal
Applying Batch #1 for YAML matching 'customresource.yml'
'Deployment/…' created.
'Service/…' created.
"/usr/local/bin/kubectl" version --client --short --request-timeout=1m
Client Version: v1.21.3+k3s1
Found kubectl and successfully verified it can be executed.

But right after that:

System.ArgumentException: An item with the same key has already been added. Key: f49a9cfa-c29b-4b00-ac3d-5726aa616723
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
   at Calamari.Kubernetes.ResourceStatus.ResourceStatusChecker.CheckStatusUntilCompletionOrTimeout(IEnumerable`1 resourceIdentifiers, ITimer timer, Kubectl kubectl, Options options) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\ResourceStatus\ResourceStatusChecker.cs:line 59
   at Calamari.Kubernetes.ResourceStatus.ResourceStatusReportExecutor.ReportStatus(String workingDirectory, ICommandLineRunner commandLineRunner, Dictionary`2 environmentVars) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\ResourceStatus\ResourceStatusReportExecutor.cs:line 81
   at Calamari.Kubernetes.Conventions.ResourceStatusReportConvention.Install(RunningDeployment deployment) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\Conventions\ResourceStatusReportConvention.cs:line 21
   at Calamari.Deployment.ConventionProcessor.RunInstallConventions() in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari.Shared\Deployment\ConventionProcessor.cs:line 71
   at Calamari.Deployment.ConventionProcessor.RunConventions() in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari.Shared\Deployment\ConventionProcessor.cs:line 29
Running rollback conventions…
System.ArgumentException: An item with the same key has already been added. Key: f49a9cfa-c29b-4b00-ac3d-5726aa616723
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
   at Calamari.Kubernetes.ResourceStatus.ResourceStatusChecker.CheckStatusUntilCompletionOrTimeout(IEnumerable`1 resourceIdentifiers, ITimer timer, Kubectl kubectl, Options options) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\ResourceStatus\ResourceStatusChecker.cs:line 59
   at Calamari.Kubernetes.ResourceStatus.ResourceStatusReportExecutor.ReportStatus(String workingDirectory, ICommandLineRunner commandLineRunner, Dictionary`2 environmentVars) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\ResourceStatus\ResourceStatusReportExecutor.cs:line 81
   at Calamari.Kubernetes.Conventions.ResourceStatusReportConvention.Install(RunningDeployment deployment) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\Conventions\ResourceStatusReportConvention.cs:line 21
   at Calamari.Deployment.ConventionProcessor.RunInstallConventions() in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari.Shared\Deployment\ConventionProcessor.cs:line 71
   at Calamari.Deployment.ConventionProcessor.RunConventions() in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari.Shared\Deployment\ConventionProcessor.cs:line 29
   at Calamari.Kubernetes.Commands.KubernetesApplyRawYamlCommand.Execute(String commandLineArguments) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Kubernetes\Commands\KubernetesApplyRawYamlCommand.cs:line 116
   at Calamari.Program.ResolveAndExecuteCommand(IContainer container, CommonOptions options) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari\Program.cs:line 57
   at Calamari.Common.CalamariFlavourProgram.Run(String args) in C:\BuildAgent\work\e0cefbed4ad11812\source\Calamari.Common\CalamariFlavourProgram.cs:line 80
Process /bin/bash in /etc/octopus/default/Work/20230626085851-3880-689 exited with code 100
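
For what it's worth, the exception itself is just what LINQ's Enumerable.ToDictionary does when the key selector produces the same key twice. This minimal sketch (made-up identifiers, not the actual Calamari code) reproduces the exact message:

// Minimal sketch with made-up identifiers - not the actual Calamari code.
// Enumerable.ToDictionary throws exactly this ArgumentException as soon as
// the key selector yields a duplicate key.
using System;
using System.Collections.Generic;
using System.Linq;

class DuplicateKeyRepro
{
    record Resource(string Uid, string Kind, string Name);

    static void Main()
    {
        // The same resource collected twice, as if it had been gathered
        // once per matching source during the apply phase.
        var resources = new List<Resource>
        {
            new("f49a9cfa-c29b-4b00-ac3d-5726aa616723", "Deployment", "app"),
            new("f49a9cfa-c29b-4b00-ac3d-5726aa616723", "Deployment", "app"),
        };

        // Throws: System.ArgumentException: An item with the same key has
        // already been added. Key: f49a9cfa-c29b-4b00-ac3d-5726aa616723
        var byUid = resources.ToDictionary(r => r.Uid);
        Console.WriteLine(byUid.Count);
    }
}

So my guess is the status checker is somehow collecting the same resource more than once.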

Hi @goce.dokoski

We haven’t had any reports of failures such as this recently. What small changes did you make to the project?

Would you mind if we logged into your cloud instance and took a look? Could you let me know your instance URL and what project is being affected?

Kind Regards
Sean

@sean.stanway thank you for the quick response. I wrote you a personal message.

The changes included:

  • adding a Linux listening worker
  • adding a Kubernetes deployment target using the worker and corresponding environment, as well as a new target role
  • configuring the Kubernetes deployment steps to also use the new environment, with a few variable values differing per environment and target role

This all worked yesterday, so I'm not sure whether any of it is related.

Hi @goce.dokoski

Thanks for sending your details over to me via DM.

Those changes all sound standard. I suspect it's an issue with the experimental feature that checks resource status. It seems that even with it disabled, the step is still attempting to run the check, or at least to add items to it.

I'll probably have to get our engineers involved and have them check your instance to see what might be happening. The skip doesn't look to be working, and I suspect there's an underlying issue, since duplicate keys are being added to the dictionary object we build for the check. It may be related to the new deployment target you added and a key that isn't cycling, but we can find that out.
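
For context, the way we'd expect the skip to behave is that a disabled check short-circuits before the status dictionary is ever built. A hypothetical sketch (not our actual code; the flag is made up for illustration):

// Hypothetical sketch of the skip we would expect - not the actual Calamari
// code. A disabled check should short-circuit before the status dictionary
// is ever built, so a duplicate key could never be hit.
using System;
using System.Linq;

class StatusCheckGuardSketch
{
    static void Main()
    {
        var statusCheckEnabled = false;                // the setting you disabled
        var resourceUids = new[] { "uid-1", "uid-1" }; // duplicates present

        if (!statusCheckEnabled)
        {
            Console.WriteLine("Status check disabled; skipping.");
            return; // never reaches ToDictionary, so never throws
        }

        // With the guard bypassed, this is where the ArgumentException occurs.
        var byUid = resourceUids.ToDictionary(uid => uid);
        Console.WriteLine(byUid.Count);
    }
}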

Out of curiosity, if you run this with the status check enabled, does it fail with the same error?

Kind Regards

Hi @sean.stanway,

Yes, I suspected the same thing from looking at the log.

I tried a few times with and without the check, always with the same result.
I also just went through all the steps and disabled it everywhere, but it's still the same.

To note, this check was enabled on 2-3 steps before I added the new environment, and as I wrote, things seemed to work fine for some time.

The problem only started occurring today.

Kind regards,
Goce

Hi @goce.dokoski

Strange that it only started happening today. I guess there must be some item we keep trying to add even though the feature is disabled. I'll pass this on to our engineers and hopefully we can get this fixed for you. They're based out of Australia, so unfortunately it may not be until tonight that someone can take a look at it. If this is urgent for you, I can look at getting the feature disabled in the meantime and seeing if that allows your deployments to go through.

Let me know and I’ll see what I can do.
Kind Regards
Sean

Hi @sean.stanway

It's not production, so it's not extremely urgent.

But of course, if the feature gets disabled, that would be very helpful, as we are in the process of defining the deployments and could then continue debugging and fine-tuning other configuration issues.

Best,
Goce

Hi @goce.dokoski

I can disable this in the meantime, but I'm unsure whether our engineers will want it turned back on when they do their testing. We can turn it off for now, and one of my colleagues can ask later if they need it re-enabled for testing.

I'm just checking internally whether we need to reprovision your instance when disabling experimental features; if so, it may require about 20-30 minutes of downtime. If we do need downtime, would you want this done as soon as possible?

Kind Regards

Hi @sean.stanway,

Yes, downtime is fine whenever you need it to disable the feature.
If the engineers need the feature back on, they can reprovision again as needed.

Best, Goce

Hi @goce.dokoski

That’s no problem, I’ll get that sorted out now. Should be about 20-30 minutes from when it goes down.

Let me know if you encounter any issues once it’s back up.

Kind Regards
Sean

Hi Sean,

I'm not sure if the instance was restarted; I saw a maintenance mode, and after a few minutes it was up again.

Unfortunately the error still happens.

Another blind guess: maybe the script files on the agent are not being refreshed properly.

One more thing that confused me: the failing script is shown with a Windows path in the log, whereas the step runs on a Linux worker. I guess Octopus directs the step to the worker from its own internal Windows agent?

Best,
Goce

Hi @goce.dokoski

I’m sorry to hear that disabling the feature hasn’t resolved this.

The Windows path you're seeing comes from our own tooling, as you guessed. The step itself executes on your leased worker, which I can see is an Ubuntu image, and our code runs in the background there.

Since it runs on the worker, I could release the particular worker you are using so a new one is generated. This may fix the issue; it could be cached data on the worker itself. Do you mind if I release the worker you're currently using so that you get a new one?

Kind Regards

Hi Sean,

Yes, please do; let's see if that solves it.

Kind regards

Hi @goce.dokoski

No problem. I've set it to be kept for debugging in case the engineers want to have a look at it. When you run a new deployment, it should lease a new worker. Let me know how it goes; I've got my fingers crossed.

Kind Regards
Sean

Thanks for the support; unfortunately it's still the same.
I guess it'll be up to the engineering team to work it out.

Please let me know whenever there's something new :wink:
Best, Goce

Dang, I was really hoping that might be it. Sorry Goce, I hope our engineers can see better what might be happening here. It might be tomorrow when I get back to you with an update.

Apologies for the inconvenience once again!
Kind Regards

No problem, let’s see.

In case it's useful: a colleague also created a new project with a simple single-YAML deployment, and the same thing happens. So it's at the instance level, or at least scoped to the namespace.

Hi Sean,

Today the deployments go through without problems.
We haven't dared to enable the "k8s success check" yet :slight_smile:

Kind regards

Hi @goce.dokoski

Our engineers found that another feature (MultiGlobPathsForRawYaml), which was enabled, was causing the problem. It has now been disabled for all affected customers, of which there were a few, and everything seems to be working correctly.

I was curious why the exception was thrown in the remote check area when another Kubernetes feature was causing it, but I logged on too late to ask our engineers about this.
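
If I had to guess at the mechanism (a hypothetical sketch only, not the actual Calamari implementation): if two of the glob paths matched the same YAML file, the file would be applied once per glob, the same resource UID would be collected twice, and the status checker's ToDictionary would then hit a duplicate key even though the apply itself succeeded.

// Hypothetical sketch only - not the actual implementation.
// If each glob is expanded independently and the matches are concatenated,
// a file matched by two globs shows up twice in the list to be applied.
using System;
using System.Linq;

class GlobOverlapSketch
{
    static void Main()
    {
        var globs = new[] { "customresource.yml", "*.yml" };
        var files = new[] { "customresource.yml", "service.yml" };

        // Naive expansion: customresource.yml matches both globs, so it is
        // collected twice before the status check builds its dictionary.
        var matched = globs
            .SelectMany(glob => files.Where(file => Matches(file, glob)))
            .ToArray();
        Console.WriteLine(string.Join(", ", matched));
        // customresource.yml, customresource.yml, service.yml

        // De-duplicating before ToDictionary would avoid the duplicate-key crash.
        var distinct = matched.Distinct().ToArray();
        Console.WriteLine(string.Join(", ", distinct));
        // customresource.yml, service.yml
    }

    // Toy matcher: supports only a literal file name or a "*.ext" pattern.
    static bool Matches(string file, string glob) =>
        glob.StartsWith("*") ? file.EndsWith(glob.TrimStart('*')) : file == glob;
}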

Just to test, would you mind if I re-enabled the success check feature to see whether the skip is still not working? The failures should be gone now according to our engineers' findings, but I'm keen to see if this is something we need to look at. It would require another reprovision, so if that would disturb your work day, I can schedule it to be done overnight in your maintenance window. Let me know how you feel about this.

Kind Regards
Sean

Hi Sean,

We needed the instance until now; you can go ahead and re-enable the setting.

Kind Regards,
Goce