What is the "Check Tentacle health for Default Machine Policy" in Octopus Deploy, and why does it seem to be running all the time?

Bob_Walker · 31 March 2020 14:47

I’ve noticed we have a task on our instance called Check Tentacle health for Default Machine Policy that seems to kick off every hour and takes quite a bit of time to finish. What is this task? Why does it exist? How can I speed it up?

Bob_Walker · 31 March 2020 15:30

Octopus Deploy tests the connectivity and health of your deployment targets via machine policies. A subscription can be created to notify the appropriate people if Octopus Deploy is unable to connect to machines. The original thinking was that if Octopus Deploy cannot see a deployment target, any deployment to that target will fail.

When a deployment target is created it needs to be assigned to a machine policy. If no machine policy is specified then the Default Machine Policy will be chosen automatically. The Default Machine Policy has these default settings:

Run once an hour
Check connectivity
Run a script to check the remaining space on the hard drives

There are also multiple settings which affect how fast a machine policy will run. By default, Octopus will wait 1 minute to connect to a listening tentacle and will retry that up to 5 times. This means it can take up to 5 minutes before a machine fails a health check. For polling tentacles, the default is 5 retries with a 2 minute wait, resulting in 10 minutes before failure.

Originally, Octopus attempted to connect to all machines at once during a health check. As time went on, that proved to be a problem once someone had hundreds of machines. It would exhaust the available network connections and cause Octopus to lock up. Octopus now checks 10(ish) machines at a time. If an Octopus Deploy instance had 100 listening tentacles, and all of them were down, it could take upwards of 50 minutes for the health check to finish. If they were all polling tentacles, it would take 1 hour, 40 minutes.
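To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It is not how Octopus schedules health checks internally. It assumes exactly 10 machines per batch, batches that run one after another, and every machine taking the full time to fail (5 minutes for a listening tentacle, 10 minutes for a polling tentacle). The function name and parameters are made up for illustration.

```python
import math

def worst_case_minutes(machines, minutes_to_fail, batch_size=10):
    """Estimate the worst-case health check duration in minutes.

    Assumes every machine is unreachable, machines are checked in
    sequential batches of `batch_size`, and each batch takes the full
    `minutes_to_fail` before the next one starts. These are simplifying
    assumptions, not documented Octopus behavior.
    """
    batches = math.ceil(machines / batch_size)
    return batches * minutes_to_fail

# 100 listening tentacles, 5 minutes each before failing
print(worst_case_minutes(100, minutes_to_fail=5))   # 50

# 100 polling tentacles, 10 minutes each before failing
print(worst_case_minutes(100, minutes_to_fail=10))  # 100, i.e. 1 hour 40 minutes
```

Plugging in your own machine count and timeout settings gives a rough upper bound on how long a health check can run when many targets are offline.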

Whether doing all that work for every machine in every environment adds value is debatable. Typically we have found:

Deployment targets in Dev and Test environments are deployed to multiple times a day, often several times an hour, which negates the need for a health check.
Deployment targets in Staging and Production have one or two deployments per day. If there is any sort of problem the admins would like to know about it.

To speed things up, we recommend the following, in order; keep adjusting until you find something that works.
You can adjust machine policies by going to Infrastructure -> Machine Policies:

Determine whether running the hard drive space check script adds value. If it does not, alter the machine policy to check connectivity only.
Determine how often you want to run the health check. If once an hour is too often, you can back it down to once or twice a day. You can also configure a CRON expression to specify the exact time a health check runs (or not at all).
Adjust the timeout and retry settings.
If you have 100+ machines, we recommend creating a machine policy per environment. When you add targets, assign them the machine policy for their environment.