Octo HA: on server termination, deployments auto-cancel for the node

Octo HA: on server termination, the deployments running on that node are auto-cancelled. Ideally, those deployments should have transferred to the other nodes present in the cluster. We are trying to run an HA cluster in ECS. We are not able to run more than 17-18 tasks concurrently per server before the container dies due to a failing health check. On a dedicated Windows box, we were able to execute around 40 tasks before seeing timeouts. It also takes around 8 minutes for the deployments to be auto-cancelled, which seems like a long time. We have attached an EFS drive for file storage and use SQL Server 2019 for the DB.

Details:
Octo version: 2020.2.15
Instance type: c5.2xlarge
Number of nodes in cluster: 2

Can someone please help us understand the HA behaviour and whether we are missing any setting?

Thanks and Regards,
Devan


Hi Devan,

Thank you for reaching out to us with your query.

Please find my responses in-line with your message below:

on server termination, deployments auto cancel for the node. Ideally those deployments should have transferred to the other nodes present in the cluster.

I can confirm that this is the correct behaviour for the system at present - tasks that are in-flight will be cancelled if a node is lost. We recognise that this might not be ideal so we are in the early stages of work which should improve how this is handled. I’d recommend keeping an eye on our blog, the what’s new page and the release notes as any updates will likely be published there.

We are trying to run HA cluster in ECS. We are not able to run more than 17-18 tasks concurrently per server before container dies due to failing health check. On windows dedicated box, we were able to execute around 40 tasks before seeing timeouts.

My understanding from your message is that you started with a single Windows instance and are now running two containers instead. If the containers are running on the same host, they will be sharing resources, so the overall performance will be similar to what you originally had (if not slightly worse due to the container overhead).
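As a rough illustration of that comparison, here is a minimal Python sketch using only the figures from the original post. It assumes both containers share one host and that each sustains the same load, neither of which has been confirmed; the function name `cluster_capacity` is made up for this example.

```python
def cluster_capacity(tasks_per_node: int, node_count: int) -> int:
    """Total concurrent tasks across the cluster.

    Assumption: every node sustains the same number of tasks.
    """
    return tasks_per_node * node_count


# Figures quoted in the original post
containers = 2
per_container_low, per_container_high = 17, 18
windows_single_box = 40  # approximate task count before timeouts appeared

low = cluster_capacity(per_container_low, containers)
high = cluster_capacity(per_container_high, containers)

print(f"Two containers combined: {low}-{high} concurrent tasks")
print(f"Single Windows box: about {windows_single_box} concurrent tasks")
print(
    "Combined total as a share of the single-box figure: "
    f"{low / windows_single_box:.0%}-{high / windows_single_box:.0%}"
)
```

If the two containers really are on the same host, the combined total sits a little below the single-box figure, which would be consistent with shared resources plus some container overhead.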

Also it takes around 8 minutes for auto-cancelling the deployments, which seems to be a significant time period.

It will take some time for deployments to be cancelled: the system first needs to recognise the loss of a node, and the affected tasks may then need to time out. A delay of this length doesn’t sound particularly unusual and shouldn’t be a cause for concern.

I hope this is helpful. Please let me know if you have any questions.

Best Regards,

Charles