Kubernetes deployment target health checks failing

Good afternoon,

Earlier today, all of our Kubernetes deployment targets started reporting as unhealthy.

Upon inspection, it seems that kubectl is no longer available when running health checks directly on a worker.

For us the fix is easy: we can simply run the health check using a custom Docker image in a container. However, this change/regression was unexpected - the options required to run health checks using a container image appear to have only been added in the current version, 2021.2 - so I thought it worth bringing to your attention.

For anyone who is interested, the script we use is below, and the logs of the failed health check are below that.

Thanks,
David


Script to create deployment target

param(
    ### The name of the EKS cluster to create a deployment target for.
    [Parameter(Mandatory = $true)]
    [string]$ClusterName,

    ### The name of the AWS account to use to connect to the EKS cluster.
    [Parameter(Mandatory = $true)]
    [string]$AwsAccountName
)

### Describe the EKS cluster that the deployment target will be for.
Write-Highlight "Describing EKS cluster..."
$cluster = aws eks describe-cluster --name $ClusterName --query "cluster" | ConvertFrom-Json

### Create a hash table containing arguments to pass to the function that will create the deployment target.
$newTargetArgs = [ordered]@{
    name = $cluster.name.ToLower()
    octopusRoles = "aws-eks, aws-eks-{0}" -f $cluster.name.ToLower()
    clusterUrl = $cluster.endpoint
    octopusAccountIdOrName = $AwsAccountName
    clusterName = $cluster.name
    namespace = "default"
    updateIfExisting = $true
    skipTlsVerification = $true
    healthCheckContainerImageFeedIdOrName = "Feeds-1482"
    healthCheckContainerImage = "ldx-analytics/deployment:latest"
}

### Create the new deployment target.
Write-Highlight "Creating deployment target..."
New-OctopusKubernetesTarget @newTargetArgs
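
For reference, the script is invoked roughly like this (the file name and argument values here are just placeholders, not our real names):

.\New-EksDeploymentTarget.ps1 -ClusterName "my-cluster" -AwsAccountName "AWS Production"

The target name, roles and cluster URL are all derived from the describe-cluster output, so the only inputs we need are the EKS cluster name and the Octopus AWS account used to authenticate.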

Health check logs

Task ID:        ServerTasks-1057521
Related IDs:    Machines-3657, Spaces-182
Task status:    Failed
Task queued:    Wednesday, 15 September 2021 3:31:40 PM +00:00
Task started:   Wednesday, 15 September 2021 3:31:57 PM +00:00
Task completed: Wednesday, 15 September 2021 3:32:01 PM +00:00
Task duration:  4 seconds
Server version: 2021.2.7428+Branch.release-2021.2.Sha.d771d6437f879f789be3f86d8c7d4ffa53eb3867
Server node:    octopus-i009472-85f99c4fb-wjsmm

                    | == Failed: Check my-cluster health ==
15:31:57   Info     |   Starting health check for a limited set of machines.
15:31:57   Verbose  |   Health check was requested for 1 machine
15:31:57   Verbose  |   Found 1 matching machine
15:31:57   Info     |   Performing health check on 1 machine.
15:32:01   Verbose  |   Checking for Tentacles to update
15:32:01   Fatal    |   The health check failed. One or more machines were not available.
                    | 
                    |   == Failed: Check deployment target: my-cluster ==
15:31:57   Verbose  |     Performing health check on machine
15:31:57   Verbose  |     Leased worker octopus-worker from pool AWS ECS Linux (lease WorkerTaskLeases-45323).
15:31:57   Verbose  |     Script isolation level: NoIsolation
15:31:58   Verbose  |     Executable directory is /bin
15:31:58   Verbose  |     Executable name or full path: /bin/bash
15:31:58   Verbose  |     No user context provided. Running as current user.
15:31:58   Verbose  |     Starting /bin/bash in working directory '/etc/octopus/Work/20210915153157-1057521-54' using 'Unicode (UTF-8)' encoding running as 'root' with the same environment variables as the launching process
15:31:58   Verbose  |     Process /bin/bash in /etc/octopus/Work/20210915153157-1057521-54 exited with code 0
15:31:58   Verbose  |     Using Calamari.linux-x64 19.4.8
15:31:58   Verbose  |     Script isolation level: NoIsolation
15:31:59   Verbose  |     Executable directory is /bin
15:31:59   Verbose  |     Executable name or full path: /bin/bash
15:31:59   Verbose  |     No user context provided. Running as current user.
15:31:59   Verbose  |     Starting /bin/bash in working directory '/etc/octopus/Work/20210915153159-1057521-55' using 'Unicode (UTF-8)' encoding running as 'root' with the same environment variables as the launching process
15:32:00   Verbose  |     Calamari Version: 19.4.8
15:32:00   Verbose  |     Environment Information:
15:32:00   Verbose  |     OperatingSystem: Unix 4.14.243.185
15:32:00   Verbose  |     OsBitVersion: x64
15:32:00   Verbose  |     Is64BitProcess: True
15:32:00   Verbose  |     Running on Mono: False
15:32:00   Verbose  |     CurrentUser: root
15:32:00   Verbose  |     MachineName: 862e32519743
15:32:00   Verbose  |     ProcessorCount: 2
15:32:00   Verbose  |     CurrentDirectory: /etc/octopus/Work/20210915153159-1057521-55
15:32:00   Verbose  |     TempDirectory: /tmp/
15:32:00   Verbose  |     HostProcess: Calamari (3177)
15:32:00   Verbose  |     Performing variable substitution on '/etc/octopus/Work/20210915153159-1057521-55/Script.sh'
15:32:00   Verbose  |     Executing '/etc/octopus/Work/20210915153159-1057521-55/Script.sh'
15:32:00   Verbose  |     Setting Proxy Environment Variables
15:32:00   Verbose  |     "chmod" u=rw,g=,o= "/etc/octopus/Work/20210915153159-1057521-55/kubectl-octo.yml"
15:32:00   Verbose  |     Temporary kubectl config set to /etc/octopus/Work/20210915153159-1057521-55/kubectl-octo.yml
15:32:00   Error    |     Could not find kubectl. Make sure kubectl is on the PATH. See https://g.octopushq.com/KubernetesTarget for more information.
15:32:00   Verbose  |     Process /bin/bash in /etc/octopus/Work/20210915153159-1057521-55 exited with code 1
15:32:01   Verbose  |     Released worker octopus-worker from lease WorkerTaskLeases-45323
15:32:01   Verbose  |     Exit code: 1
15:32:01   Fatal    |     The remote script failed with exit code 1
15:32:01   Verbose  |     The remote script failed with exit code 1
                    |     Octopus.Server.Orchestration.Targets.Tasks.ActionHandlerFailedException: The remote script failed with exit code 1
                    |     at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.SuccessArbitrator.ThrowIfNotSuccessful(IActionHandlerResult result) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionDispatch/SuccessArbitrator.cs:line 22
                    |     at Octopus.Server.Orchestration.ServerTasks.Deploy.ActionDispatch.AdHocActionDispatcher.Dispatch(Machine machine, ActionHandlerInvocation actionHandler, ITaskLog taskLog, VariableCollection variables) in ./source/Octopus.Server/Orchestration/ServerTasks/Deploy/ActionDispatch/AdHocActionDispatcher.cs:line 56
                    |     at Octopus.Server.Orchestration.ServerTasks.HealthCheck.Controllers.VirtualTargetHealthController.CheckHealth(Machine machine, ITaskLog taskLog) in ./source/Octopus.Server/Orchestration/ServerTasks/HealthCheck/Controllers/VirtualTargetHealthController.cs:line 93
                    |     at Octopus.Server.Orchestration.ServerTasks.HealthCheck.HealthCheckService.PerformHealthCheck(Machine machine, ITaskLog taskLogForMachine, CancellationToken cancellationToken, IHealthResultCollator healthResultCollator, ExceptionHandling exceptionHandling, Action`2 customAction) in ./source/Octopus.Server/Orchestration/ServerTasks/HealthCheck/HealthCheckService.cs:line 92
                    |     Octopus.Server version 2021.2.7428 (2021.2.7428+Branch.release-2021.2.Sha.d771d6437f879f789be3f86d8c7d4ffa53eb3867)
15:32:01   Verbose  |     Recording health check results
                    |   
                    |   == Failed: Summary ==
15:32:01   Info     |     Unhealthy:
15:32:01   Info     |     - [my-cluster](~/app#/Spaces-182/infrastructure/machines/Machines-3657/settings) at https://BBED18E2E9B9474BBCF7EB88117FA686.gr7.eu-west-2.eks.amazonaws.com, error: The remote script failed with exit code 1
15:32:01   Fatal    |     One or more machines were not available. Please see the output Log for details.

Hi @dgard1981!

Thanks for reaching out, and for the great question.

I wonder, has your worker been rebuilt since the health checks were last working? We don’t actually ship kubectl as part of Octopus, because of the compatibility issues between whatever version we would ship and the Kubernetes versions our customers run on their clusters. If the worker was rebuilt or otherwise modified, that might explain why it suddenly stopped working.
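
If you want to double-check what that worker currently has available, a quick sanity check you could run as a script against it would be something like the following (a rough sketch, assuming PowerShell Core is installed on the worker - the same idea works in bash with command -v kubectl):

# Rough sketch: report whether kubectl is on the PATH of this worker, and which version it is.
$kubectl = Get-Command kubectl -ErrorAction SilentlyContinue
if ($null -eq $kubectl) {
    Write-Warning "kubectl was not found on the PATH for this worker."
}
else {
    Write-Host "kubectl found at $($kubectl.Source)"
    kubectl version --client
}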

I’m glad you got this working by supplying a health check container - it’s a new feature for k8s clusters, and it’s great that it got you unstuck!

Please let us know if you have any further questions, and we’ll do our best to get you an answer!

Hi @Justin_Walsh ,

Thanks for the reply. To confirm, nothing changed on our end, and regardless of whether or not we made a deployment, all of our Kubernetes targets became unhealthy.

We were never using our own worker for these health checks previously, so I guess the Octopus worker used for them has been updated and kubectl has been removed entirely.

Thanks,
David

Thanks for the update, David.

I’ve just been digging into your Octopus Cloud instance’s upgrade history, and I’m not seeing any upgrades since September 1st - does that line up with when the health checks started failing?

I can see from your log that the health checks are happening on your worker here:

15:31:57 Verbose | Leased worker octopus-worker from pool AWS ECS Linux (lease WorkerTaskLeases-45323).

It might be worth checking the previously successful health checks to see what they were running and where.
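
If it helps, one rough way to see where the earlier health checks ran is to pull the recent task list from the REST API and filter it down to health check tasks. This is just a sketch - the instance URL and API key below are placeholders:

# Rough sketch: list recent health check tasks so you can compare where they ran.
$octopusUrl = "https://your-instance.octopus.app"              # placeholder - your Octopus URL
$headers    = @{ "X-Octopus-ApiKey" = $env:OCTOPUS_API_KEY }   # placeholder - your API key

$tasks = Invoke-RestMethod -Uri "$octopusUrl/api/tasks?take=100" -Headers $headers
$tasks.Items |
    Where-Object { $_.Name -eq "Health" } |   # assumes health check tasks use the task name "Health"
    Select-Object Id, Description, State, StartTime |
    Format-Table

The raw task log for each of those (like the one you pasted above) should show which worker and pool the check was leased to.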

If you would like, we can log in to your cloud instance and take a look too. Just let us know!

Best,

Hi Justin,

It’s possible the behaviour started on the 1st of September and was only picked up yesterday when we did some releases.

Unfortunately, I’ve now deleted everything to do with those deployment targets, and they will be recreated tomorrow (going from pre-prod to prod), so there is nothing to look at.

From our point of view we’re sorted, because we can run the health check in a container. I really just logged this for awareness, in case someone else hits the issue and finds the post useful.

Thanks,
David

