"Raw Kubernetes Yaml" check fails with "An item with the same key has already been added."

sean.stanway · 27 June 2023 14:21

I’ll re-enable the setting, but I’ll let it reprovision overnight so the feature will be back in the morning.

If you could run the same test (run deployment with the check to be skipped) and let me know if the skip still doesn’t work, then I can get that raised up as a separate problem.

Kind Regards
Sean

goce.dokoski · 28 June 2023 07:19

Hi Sean, today running a k8s deployment…

the resource gets created successfully
but afterwards the step again gets stuck,

the previous error message is no longer shown, but instead the following message has been printed for 20 minutes already:

The experimental k8s check is not enabled, and is not even available to choose from the GUI.

sean.stanway · 28 June 2023 09:06

Hi @goce.dokoski

I’ve kicked off disabling the feature, and I’ll raise this with our team to see what’s going on with the messages you were getting. I’ll leave it off going forward until I get solid confirmation that the feature is working as intended.

Let me know if any other issues come up on the back of this.
Kind Regards
Sean

goce.dokoski · 28 June 2023 09:33

Hi @sean.stanway

thanks, now the deployments work again,

showing a message
"Resource status check completed successfully because all resources are deployed successfully "

Kind regards,
Goce

sean.stanway · 28 June 2023 09:57

Hi @goce.dokoski

How very odd that those messages are displaying when that feature is disabled. I wonder if the configuration for it was saved somewhere in the background. I’m glad that it’s working now though.

Kind Regards
Sean

goce.dokoski · 28 June 2023 10:26

Hi @sean.stanway

ah unfortunately it is happening again, this time the first two steps go through,
ending with:
“Resource status check completed successfully because all resources are deployed successfully”

and then the third one gets stuck with
"Resource status reported: 1 updates, 0 removals "

sean.stanway · 28 June 2023 10:44

Hi @goce.dokoski

I’ll sign into your instance and see what might be happening. Not sure what could be causing this so I’ll have a look at the latest task log.

Sean

sean.stanway · 28 June 2023 10:58

Hi again @goce.dokoski

While it is doing a resource status check, it does look like your deployment is still going through (as you mentioned in your DM), albeit slowed down by this check. I can see there were a couple that got cancelled by yourself, presumably because they were taking a long time, but the latest looks to actually be progressing.

I’ll keep an eye on it and see what happens. It may be certain points of the deployment that might stall out.

I’ve also put a request in to our engineers to see if there is someone who can tell me more about this feature and check its functionality. Having the check turned off, and the feature off as well, should remove all this from your deployments. I’m not sure why the messages are still showing. I’m hoping someone can have a look overnight and shed some light on this.

I apologise for the inconvenience this is causing; I hope it doesn’t stall out any further deployments today.
Kind Regards
Sean

goce.dokoski · 28 June 2023 11:14

Hi Sean,

yes there was now one successful run where each step took at most 40-50 seconds,
but the previous runs were getting stuck and not ending after 40 minutes,

and now there is another run where a step has again not finished after 6 minutes,

so if possible please disable the feature completely, so that we can use the deployments.

Best regards,
Goce

goce.dokoski · 28 June 2023 12:14

The run that passed had disabled the first step that got stuck before.
But the subsequent runs then got stuck on the next one.

Once I got:

Resource status check terminated because the timeout has been reached but some resources are still in progress
The remote script failed with exit code 255

But now again just a hanging step.

sean.stanway · 28 June 2023 13:14

Hi @goce.dokoski

The feature is currently off on your instance, so I’m not sure what is going on with this here. I’ll ask some of my senior support colleagues and see if there is something we can change on your instance to resolve this.

Kind regards
Sean

sean.stanway · 28 June 2023 13:50

Hi @goce.dokoski

I’ve checked with my other colleagues and it looks like I’ve done all I can from my side regarding this feature, so I’ve raised up a separate request for this with our engineering team. Hopefully it’s just something that is cached in the process or the toggle flag isn’t being respected.

I’ve got fingers crossed that it will be a simple fix once one of the engineers can take a look and we can get your workflow going again without the delays.

Kind Regards
Sean

goce.dokoski · 28 June 2023 14:14

Hi @sean.stanway

Thanks a lot for the assistance.

Currently I could narrow down the problem to one particular step that always gets stuck,
all the other ones work fine now.

It always fails when checking for this resource (slightly anonymized),
also when it is created in another step than the original.

When this is disabled, all other resources, including others that are of the Deployment kind, still work fine.

---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: s...
  labels:
    app: ...
    garbage-collection: okay
  annotations:
    reloader.stakater.com/auto: "true"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ...
  template:
    metadata:
      labels:
        app: ...
    spec:
      containers:
        - name: s...
          image: ...
          imagePullPolicy: IfNotPresent
          command: ...
          ports:
            - name: web
              containerPort: ...
          env:
            ...
          volumeMounts:
            - name: ...
              mountPath: ...
              readOnly: true
            - name: ...
              mountPath: ...
              readOnly: false
            - name: ...
              mountPath: ...
          resources:
            requests:
              memory: "2Gi"
              cpu: "100m"
            limits:
              memory: "2Gi"
              cpu: "500m"
      volumes:
        - name: ...
          persistentVolumeClaim:
            claimName: ...
        - name: ...
          secret:
            secretName: ...
        - name: ...
          configMap:
            name: ...
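
For anyone trying to see why a status check keeps waiting on a Deployment like the one above, the sketch below classifies a Deployment's rollout state from its JSON. It follows the same general conditions that `kubectl rollout status` reports for Deployments; it is not how Octopus implements its resource status check, and the function name `rollout_state` is hypothetical. The thread does not establish why this particular Deployment never settles, so treat it as a diagnostic aid only.

```python
def rollout_state(deployment: dict) -> str:
    """Classify a Deployment's rollout from its JSON representation.

    Hypothetical helper for illustration; it mirrors the usual Deployment
    rollout conditions, not Octopus Deploy's internal check.
    """
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})

    # The controller has not yet processed the latest spec change.
    if meta.get("generation", 0) > status.get("observedGeneration", 0):
        return "waiting: spec update not yet observed"

    # Kubernetes marks a stalled rollout with this Progressing reason.
    for cond in status.get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return "failed: progress deadline exceeded"

    desired = spec.get("replicas", 1)            # spec.replicas defaults to 1
    updated = status.get("updatedReplicas", 0)   # zero-valued fields are omitted
    total = status.get("replicas", 0)
    available = status.get("availableReplicas", 0)

    if updated < desired:
        return f"waiting: {updated} of {desired} replicas updated"
    if total > updated:
        return f"waiting: {total - updated} old replicas pending termination"
    if available < updated:
        return f"waiting: {available} of {updated} updated replicas available"
    return "complete"
```

The input can be the parsed output of `kubectl get deployment <name> -o json`. A result that stays on the same "waiting" line for a long time points at the pods (image pull, volume mounts, readiness) rather than at the deployment tooling.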

sean.stanway · 28 June 2023 14:56

Hi @goce.dokoski

Thanks for passing that information over. I’ll make sure the engineers are aware of this as it might point them towards a potential sticking point.

Kind Regards
Sean

finnian.dempsey · 29 June 2023 05:05

Hi @goce.dokoski,

Just jumping in from the Australia-based team with a quick update from the devs.

It looks like the toggle for MultiGlobPathsForRawYamlFeatureToggle is causing the KubernetesDeploymentStatusFeatureToggle feature to use a different code path that isn’t respecting the ResourceStatusCheck value set on the deployment process.

I’ve created a GitHub issue for this here to track it while the devs work on getting a fix out. In the meantime, a workaround would be to disable the toggle for MultiGlobPathsForRawYamlFeatureToggle, which would allow the Kube Object Status feature to actually be turned off.
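
As a rough illustration of the interaction described above, here is a minimal Python model of the routing. It is only a sketch: the toggle names come from this thread, but the `Toggles` class, the function `should_run_status_check`, and the exact branching are assumptions, not the actual Octopus code.

```python
from dataclasses import dataclass


@dataclass
class Toggles:
    # Field names echo the toggles named in the thread; the structure is hypothetical.
    multi_glob_paths_for_raw_yaml: bool
    kubernetes_deployment_status: bool


def should_run_status_check(toggles: Toggles, resource_status_check: bool) -> bool:
    """Model of the reported behaviour, not the real implementation."""
    if toggles.multi_glob_paths_for_raw_yaml:
        # Alternate code path: the step's ResourceStatusCheck value is never consulted,
        # so the check runs even when it has been switched off.
        return True
    # Original code path: both the feature toggle and the step setting are honoured.
    return toggles.kubernetes_deployment_status and resource_status_check


# With the multi-glob toggle on, the check runs although everything else is off.
print(should_run_status_check(Toggles(True, False), resource_status_check=False))
# With the multi-glob toggle off (the workaround), the step setting is respected.
print(should_run_status_check(Toggles(False, False), resource_status_check=False))
```

Under these assumptions the first call reports that the check still runs and the second that it is skipped, which matches the symptoms in this thread and why disabling the multi-glob toggle works as a stopgap.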

Please let us know if you’d like MultiGlobPathsForRawYamlFeatureToggle disabled during your next maintenance window or if you have any questions at all!

Best Regards,

goce.dokoski · 29 June 2023 08:19

Hi @finnian.dempsey,

yes, then please disable the feature at the next possible moment.

Best,
Goce

sean.stanway · 29 June 2023 10:07

Hi @goce.dokoski

I can get this disabled immediately if this is currently blocking your workflow. Just let me know, and I’ll kick off the reprovision and downtime.

Kind Regards
Sean

goce.dokoski · 29 June 2023 11:00

Hi @sean.stanway

yes, please go ahead.

Best,
Goce

sean.stanway · 29 June 2023 11:10

Hi @goce.dokoski

I’ve kicked off the reprovision, so it will be the usual downtime until your instance is back up. Once that’s done, you shouldn’t see any status reporting in your logs and everything should go back to normal.

Kind Regards
Sean

goce.dokoski · 29 June 2023 12:44

Hi @sean.stanway

the deployments now go through, without printing any resource status checks.
So it looks like the check is now finally skipped.

Best,
Goce