Health Check Task Immediately Deletes Itself Following Completion?

Hello.

Our company has been using a script referenced in your documentation to run health checks via PowerShell. Recently we have run into a major issue: as soon as the health check task completes, it deletes itself, which makes the check pointless because we can’t see its status. See here for reference: Run a health check - Octopus Deploy

Any idea why this is happening all of a sudden? How can I prevent the task from being deleted automatically?

Thanks,
Mike

Hi Mike,

Thanks for reaching out, and sorry to hear you’re having trouble with a task being deleted instantly!

I have to admit that is some strange behaviour, but what I believe could be happening is that the Lifecycle being used only keeps the last x releases instead of a number of days.

I’d be happy to look into exactly what’s going on. Could you please share the task log of a health check? It should indicate towards the end of the file which retention policies are being applied.

However, if you are wanting to execute tasks once a machine becomes healthy or other similar events, I would definitely recommend looking into Deployment Target Triggers as they were created with this use case in mind.

I’d be happy to answer any questions you might have, let me know how you get on!

Best Regards,

Thanks for the reply!

So I grabbed the task log before it deleted itself and I don’t see any information about retention policies.

Aren’t lifecycle retention policies related to releases?

This is a manual task kicked off via PowerShell, not connected to a release or deployment.
I’m using the exact code from Run a health check - Octopus Deploy

Thanks,
Mike

Hi @micharpentier,

Thanks for the update and for the extra information!
Can I ask what version of Octopus you’re running, please?

Kind Regards,
Adam

Hey Adam,

We are currently running Octopus Server version 2021.3.12217.

Thanks,
Mike

@adam.hollow

Hi @micharpentier,

Thank you for being patient while I looked into this.

I ran a health check for all machines in an environment, just as you have, using the example Health Check script. However, the resulting task wasn’t immediately deleted.

I’m currently spinning up the same version of Octopus as yours via Docker to rule out a version-specific error. In case it’s not that, it may be worth retrieving some server logs from you to troubleshoot further.

Would you be able to run the health check script once more and then, after the task is deleted, take a copy of the logs in C:\Octopus\Logs on the Octopus Server?
Hopefully we can see if anything strange is going on and work towards a resolution that way.

Kind Regards,
Adam

Just a follow-up here to let you know I’ve completed testing on 2021.3.12217 and unfortunately wasn’t able to reproduce the issue mentioned.

I’ll await your response RE server logs to see if we can assist in troubleshooting further.

Kind Regards,
Adam

Hey @adam.hollow

Could it possibly be an Octopus Client DLL version problem? I am currently using version octopus.client.11.2.3319.

I attached a 30 second server log snippet from when I sent a Health Check request.

OctoServerLogHeatlhCheck.txt (44.3 KB)

FYI, I just tried using octopus.client.13.0.3778 and it gave me the same issue.

I did some more testing. Running the health check from PowerShell via Octopus.Client doesn’t work, but running it from PowerShell via the REST API does. Not sure if that will help you triage.

Hi @micharpentier,

Thanks for that additional info and logs!

I’ll get working on trying to reproduce this on my end using that same version of the Octopus Client.

I’ll keep you posted with any updates or if I have any questions!

Best Regards,

Any luck reproducing? @finnian.dempsey @adam.hollow

This is still blocking us.

Thanks!

Hi @micharpentier,

Apologies for not getting back to you sooner.
We’ve unfortunately been unable to reproduce the error you’re experiencing.

On your version of Octopus I was able to successfully run numerous health checks using both the REST API and Octopus.Client without any of the tasks being deleted afterwards.
This was done using the code from the example page that you linked earlier in the thread.

I wondered if you would be able to send over the code that you’re using to run the health checks?
If it’s possible that you’ve made any modifications then it would allow us to test the exact code that you’re using vs our instance internally.

I saw that earlier in the thread you sent a screenshot of the audit entries for these deletions. Is there any further information in those audit logs when you expand the arrow on the right?

Thanks and Kind Regards,
Adam

Below is what I am running. You will need to define octopusURL, octopusAPIKey and EnvironmentName.

$octopusURL = ""
$octopusAPIKey = ""
$spaceName = "Default"
$octopusClientFolderPath = "D:\octopus.client.11.2.3319"

# Choose an Environment, a set of machine names, or both.
# Defined before $Description so the environment name is interpolated into it.
$EnvironmentName = ""
$MachineNames = @()

# Named $Description to match the variable passed to ExecuteHealthCheck below.
$Description = "Health check initiated for environment $EnvironmentName"
$taskTimeOutAfterMinutes = 2
$machineTimeoutAfterMinutes = 2

# Import the Octopus Client DLL
Add-Type -Path "$octopusClientFolderPath\lib\net452\Octopus.Client.dll"

$endpoint = New-Object Octopus.Client.OctopusServerEndpoint $octopusURL, $octopusAPIKey
$repository = New-Object Octopus.Client.OctopusRepository $endpoint
$client = New-Object Octopus.Client.OctopusClient $endpoint

try
{
    # Get space
    $space = $repository.Spaces.FindByName($spaceName)
    $repositoryForSpace = $client.ForSpace($space)

    # Get EnvironmentId
    $EnvironmentID = $null
    if([string]::IsNullOrWhiteSpace($EnvironmentName) -eq $False) 
    {
        $EnvironmentID = $repositoryForSpace.Environments.FindByName($EnvironmentName).Id
    }
    
    # Get MachineIds
    $MachineIds = $null
    if($MachineNames.Count -gt 0)
    {
        $MachineIds = ($repositoryForSpace.Machines.GetAll() | Where-Object {$MachineNames -contains $_.Name} | Select-Object -ExpandProperty Id) -Join ", "
    }
    
    # Execute health check
    $healthCheckTaskId = $repositoryForSpace.Tasks.ExecuteHealthCheck($Description,$taskTimeoutAfterMinutes,$machineTimeoutAfterMinutes,$EnvironmentID,$MachineIds).Id
    Start-Sleep -Seconds 30;
    ($repositoryForSpace.Tasks.Get($healthCheckTaskId)).State;
}
catch
{
    Write-Host $_.Exception.Message
}

Powershell Script Output

Hi @micharpentier,

Thanks for providing that extra info!

Unfortunately our reproductions still haven’t been successful, and we’ve raised this with the engineers to help investigate.

I couldn’t help but notice that the code you’ve provided and the output from PowerShell have a few differences, such as the missing argument for the WorkerPoolId. Sorry to press this, but could you please double-check that the code you’ve sent through is exactly the same as the code that produces the deleted task?

.ExecuteHealthCheck($Description,$taskTimeoutAfterMinutes,$machineTimeoutAfterMinutes,$EnvironmentID,$MachineIds).Id


(The queue time being 01/01/0001 is intriguing also!)

Could you also please run a System Integrity Check and send through the results if there are any errors?

I’ll keep you posted with any updates or if we require any further info, looking forward to getting to the bottom of this!

Best Regards,

Here’s a screenshot that shows my exact code as well as proof that the task immediately deletes.

We noticed the QueueTime as well today, maybe because the queue time is 01/01/0001 the timeout is being hit immediately or something of that sort?

System Integrity Check

Hi @micharpentier,

Thanks so much for confirming that!

An error in the server logs does suggest that time could be the cause of the issue:

 ---> Microsoft.Data.SqlClient.SqlException (0x80131904): The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

It seems that because we calculate the time difference in seconds, the gap between the current time and 01/01/0001 (the default value of an unset date) is far too large for the function to return. Could you please confirm that the instance time is correct? Does the command [DateTime]::UtcNow return a non-null value?
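To illustrate the overflow, here is a minimal sketch that can be run in any PowerShell session. It relies on the fact that SQL Server's DATEDIFF returns a 32-bit integer, so a difference in seconds can only span roughly 68 years. The variable names are just for illustration and are not taken from the Octopus code.

```powershell
# Illustrative check: why a DATEDIFF in seconds from 01/01/0001 overflows.
# DATEDIFF returns a 32-bit int, so it can hold at most [int]::MaxValue seconds.
$unsetQueueTime = [DateTime]::MinValue   # 01/01/0001 00:00:00, the default DateTime
$now            = [DateTime]::UtcNow

# Seconds between the unset value and the current time.
$secondsApart = ($now - $unsetQueueTime).TotalSeconds
$overflows    = $secondsApart -gt [int]::MaxValue

"Unset DateTime value : {0:yyyy-MM-dd}" -f $unsetQueueTime
"Seconds since then   : {0:N0}" -f $secondsApart
"Int32 maximum        : {0:N0}" -f [int]::MaxValue
"Exceeds Int32 max    : $overflows"
```

If the last line reports True, any seconds-based DATEDIFF against a QueueTime of 01/01/0001 would fail in the same way as the error above.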

I’ll make sure to keep you posted with any updates from the devs. In the meantime, to get you unblocked, are you able to use the REST API method while we investigate this potential issue with the client?
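For reference, a rough sketch of what the REST API approach could look like is below. It is not a definitive implementation: the task name "Health", the argument names, and the "Spaces-1" ID are assumptions based on the REST version of the documented example, and New-HealthCheckTaskBody is a hypothetical helper, so please check them against your own instance. The request is only sent once $octopusURL and $octopusAPIKey are filled in.

```powershell
# Sketch only: queue a health check through the REST API instead of Octopus.Client.
$octopusURL    = ""          # your Octopus Server URL
$octopusAPIKey = ""          # your API key
$spaceId       = "Spaces-1"  # assumed ID of the Default space
$environmentId = $null       # for example "Environments-1", or leave as $null
$machineIds    = @()         # optional list of machine IDs

# Hypothetical helper that builds the task payload (field names are assumptions).
function New-HealthCheckTaskBody {
    param(
        [string]$Description,
        [int]$TimeoutMinutes,
        [int]$MachineTimeoutMinutes,
        $EnvironmentId,
        [string[]]$MachineIds,
        [string]$SpaceId
    )
    @{
        Name        = "Health"
        Description = $Description
        SpaceId     = $SpaceId
        Arguments   = @{
            Timeout        = "$(New-TimeSpan -Minutes $TimeoutMinutes)"
            MachineTimeout = "$(New-TimeSpan -Minutes $MachineTimeoutMinutes)"
            EnvironmentId  = $EnvironmentId
            MachineIds     = $MachineIds
        }
    }
}

$body = New-HealthCheckTaskBody -Description "Health check via REST API" `
    -TimeoutMinutes 2 -MachineTimeoutMinutes 2 `
    -EnvironmentId $environmentId -MachineIds $machineIds -SpaceId $spaceId

if ($octopusURL -and $octopusAPIKey) {
    $header = @{ "X-Octopus-ApiKey" = $octopusAPIKey }

    # Queue the task, wait, then read its state back.
    $task = Invoke-RestMethod -Method Post -Uri "$octopusURL/api/$spaceId/tasks" `
        -Headers $header -Body ($body | ConvertTo-Json -Depth 5) `
        -ContentType "application/json"
    Start-Sleep -Seconds 30
    (Invoke-RestMethod -Method Get -Uri "$octopusURL/api/$spaceId/tasks/$($task.Id)" -Headers $header).State
}
```

Building the payload separately from sending it makes it easy to inspect $body before anything is queued on the server.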

Feel free to reach out with any questions!

Best Regards,