Octopus.Server.Schedules.ProcessAutoDeployments taking 30+ minutes

We have recently enabled many of our projects to auto-deploy when new instances register with Octopus. During our large scale-out periods, three times per week, Octopus gets very slow to deploy. Many projects get queued (40-60 projects with 4-40 targets each) and the single node that we have appears offline. Reviewing the logs, we see this happen many times, and the gap between completions of the Octopus.Server.Schedules.ProcessAutoDeployments task grows longer and longer.

According to the error, we should contact the support team if this happens frequently. Can you give us some guidance, please?

Full error here:
The scheduled task Octopus.Server.Schedules.ProcessAutoDeployments started at 9/24/2018 8:26:32 PM +00:00 and took 00:31:12.3072275 which can make this Octopus Server node ‘DKBUILD-OCTO’ appear to be offline. If this happens frequently please report this to the Octopus Support team.
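For what it’s worth, to quantify how that duration grows I threw together a quick script that scrapes these warnings out of the Octopus Server log and prints each run’s duration. The log path below is the default install location, so adjust it for your setup; the message format is taken from the error above.

```python
import re
from pathlib import Path

# Matches the warning quoted above, e.g.
# "...ProcessAutoDeployments started at 9/24/2018 8:26:32 PM +00:00
#  and took 00:31:12.3072275 which can make this Octopus Server node..."
PATTERN = re.compile(
    r"Octopus\.Server\.Schedules\.ProcessAutoDeployments "
    r"started at (?P<start>.+?) and took (?P<duration>[\d:.]+)"
)

# Default Octopus Server log location; adjust for your install.
LOG_PATH = Path(r"C:\Octopus\Logs\OctopusServer.txt")

for line in LOG_PATH.read_text(errors="ignore").splitlines():
    match = PATTERN.search(line)
    if match:
        print(f"{match['start']}  ->  {match['duration']}")
```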

I should mention we’re currently using Octopus v2018.4.12

Hi,

Thanks for getting in touch.

The issue that causes your node to appear offline has been mitigated as of 2018.6.8: https://github.com/OctopusDeploy/Issues/issues/4533

30 minutes is an extremely long time for that task to run. Can you elaborate on “Octopus gets very slow to deploy”? Do you mean the actual deployments start getting slow, or just the scheduling of automatic deployments?

Seeing your auto deploy logs and the full Octopus Server log after one of the scale-out periods would be a huge help in figuring out why the task is taking so long. You can email them to support@octopus.com (reference this topic) to maintain your privacy.

Cheers,
Shane

I will email the logs today. The symptoms are as follows:

  1. We scale out, and between 100-200 instances join Octopus over the course of 15-20 minutes.
  2. Roughly 40-50 Octopus projects have auto-deploy triggers that run when machines become available for deployment.
  3. Our task cap is currently 10, but projects start deploying and then at some point simply stay queued with fewer than 10 tasks running. During this period we observe that our single Octopus node displays as offline (see the polling sketch after this list).
  4. After 5 or so minutes, it comes back online and starts deploying queued tasks.
  5. Another 10-15 tasks complete before the Octopus node appears offline again and queued tasks stop deploying.
  6. Each offline period seems to grow longer. For example, the second time it stays offline for 10-15 minutes, the third time ~20 minutes, and the final one during that period was 30+ minutes.
  7. Once this long queue of tasks finally gets processed, we have no (apparent) offline periods until the next time we scale out.
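To put numbers on steps 3-6 instead of eyeballing the portal, I’ve been polling the task list during the scale-out window with a script along these lines. The /api/tasks endpoint and State values reflect my reading of the Octopus REST API, and the server URL and API key are placeholders:

```python
import time
import requests

OCTOPUS_URL = "https://your-octopus-server"     # placeholder
HEADERS = {"X-Octopus-ApiKey": "API-XXXXXXXX"}  # placeholder

def task_counts():
    """Tally tasks by state from the first page of /api/tasks."""
    resp = requests.get(f"{OCTOPUS_URL}/api/tasks?take=200", headers=HEADERS)
    resp.raise_for_status()
    counts = {}
    for task in resp.json()["Items"]:
        counts[task["State"]] = counts.get(task["State"], 0) + 1
    return counts

# Sample once a minute while the scale-out is in flight.
while True:
    counts = task_counts()
    print(time.strftime("%H:%M:%S"),
          "queued:", counts.get("Queued", 0),
          "executing:", counts.get("Executing", 0))
    time.sleep(60)
```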

Questions:

  1. Is 10 a reasonable task cap, or should we increase it?
  2. We are currently running Octopus Server on an AWS c5.2xlarge instance. Are there guidelines for right-sizing it?

Hi,

Upgrading to 2018.6.8 or later will resolve the issue with the node appearing offline and the task queue pausing.

It’s difficult to recommend a task cap or instance size because “it depends”. The task cap is designed to let you control how the instance is utilised. If you have resources to spare while 10 deployments are running, increase the cap. If the instance is under pressure but you want more deployment parallelism, increase the instance size. Observe the instance while you are scaling out: is it CPU bound while trying to process the auto-deploy triggers, handle web requests, and run 10 deployments? How is the SQL Server usage?
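To make that concrete, here is a rough sketch of the kind of sampling I mean: CPU on the Octopus box alongside the active request count on SQL Server while a scale-out is in flight. The libraries (psutil, pyodbc) and the connection string are just one way to capture it; adapt them to whatever monitoring you already have:

```python
import time
import psutil   # pip install psutil
import pyodbc   # pip install pyodbc

# Adjust the connection string for your SQL Server (placeholder).
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=your-sql-server;DATABASE=Octopus;Trusted_Connection=yes;")
conn = pyodbc.connect(CONN_STR)

while True:
    cpu = psutil.cpu_percent(interval=1)
    # Active user requests are a rough proxy for SQL Server pressure.
    active = conn.execute(
        "SELECT COUNT(*) FROM sys.dm_exec_requests WHERE session_id > 50"
    ).fetchone()[0]
    print(f"{time.strftime('%H:%M:%S')}  cpu={cpu:5.1f}%  sql_active={active}")
    time.sleep(30)
```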

Thanks for sending the logs; I am having a look at them and will let you know what I spot.

Cheers,
Shane

Shane,

After reviewing the instance itself, we found it really isn’t using many resources. We’ve increased the task cap from 10 to 60, and that seemed to work better: it at least processed 60 tasks before appearing offline again. We will schedule an upgrade to at least 2018.6.8 for the near future, and hopefully that resolves our issue.
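In case it helps anyone else reading, we bumped the cap via the REST API rather than the UI, roughly as below. The /api/octopusservernodes endpoint and the MaxConcurrentTasks property name are my best guess at the resource model, so treat this as a sketch; the URL and key are placeholders.

```python
import requests

OCTOPUS_URL = "https://your-octopus-server"     # placeholder
HEADERS = {"X-Octopus-ApiKey": "API-XXXXXXXX"}  # placeholder

# Fetch our (single) server node, raise its task cap, and save it back.
resp = requests.get(f"{OCTOPUS_URL}/api/octopusservernodes", headers=HEADERS)
resp.raise_for_status()
node = resp.json()["Items"][0]                  # we only run one node

node["MaxConcurrentTasks"] = 60                 # property name is an assumption
update = requests.put(f"{OCTOPUS_URL}/api/octopusservernodes/{node['Id']}",
                      headers=HEADERS, json=node)
update.raise_for_status()
print("Task cap updated on node", node["Id"])
```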

Thanks again,

David Andrews