We have migrated to a High Availability cluster design with 30 server nodes (Linux containers) and 30 listener worker containers (Linux workers). We are seeing a huge number of database connections (~1.0k - 1.4k) at any given moment and want to know whether the server is disposing of them correctly. Is this high number of DB connections normal?
Just jumping in for Daniel, as he is currently off shift as part of our Australia-based team. 30 server nodes is quite a lot; in my year at Octopus, that is the most nodes I have seen a customer run.
I am not sure about the database connection numbers, so I have asked our engineering team whether this is common given the number of nodes you have. I will also ask them if there is a specific tool they use to look at DB connections and disconnections, as that might help answer your question.
If our engineering team say that is a high number of connections for the size of your instance I will get back to you and we can grab some logs and see if we can diagnose what is going on here.
I will keep you updated, reach out in the meantime if you need anything further.
I have an update for you from our engineers: they mentioned the SQL connection numbers are what they would expect to see given the high number of Server nodes.
Each node will be using some connections while performing normal background activities. These connections will be “pooled” and held open for a while to mitigate the cost of re-establishing them every time they’re needed.
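To make the pooling behaviour concrete, here is a toy sketch in Python. It uses sqlite3 purely as a stand-in for SQL Server, and the `SimplePool` class is illustrative only (Octopus itself relies on ADO.NET connection pooling): connections are opened once and kept alive between uses, which is why even an idle node still shows open connections.

```python
import queue
import sqlite3

class SimplePool:
    """Toy connection pool: connections are opened once and reused,
    so an idle pool still holds open connections -- the same reason
    each Octopus node keeps a baseline of SQL connections alive."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # sqlite3 stands in for SQL Server; the pooling idea is the same.
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))
        self.size = size

    def acquire(self):
        return self._idle.get()   # reuse an existing open connection

    def release(self, conn):
        self._idle.put(conn)      # return it to the pool, still open

pool = SimplePool(size=4)
conn = pool.acquire()
conn.execute("SELECT 1")
pool.release(conn)
# All 4 connections remain open even though only 1 was used.
```

This is why the connection count tracks the node count rather than the moment-to-moment workload: each node holds its pool open regardless of how busy it is.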
The engineers did mention that Octopus is not really a lightweight service that benefits from a large scale-out. They said that running a handful of more powerful instances rather than many nodes would likely be better, and you would then see fewer SQL connections.
I did think 30 nodes was a lot, and having spoken to our engineers who have been around a while, the highest number they have seen was 10, and that was on a huge instance. Was there any particular reasoning behind the node count when your Octopus infrastructure was designed? Would you be able to scale down the number of nodes and increase the resources on some of them so you can run fewer? That would not only reduce your connection numbers but would also be much easier to administer from a server infrastructure point of view.
Obviously the design of your infra is completely up to you and will be different depending on your circumstances but I just wanted to offer our professional opinion to see if it would benefit you.
I hope that helps alleviate any concerns you had about the SQL connection numbers. They are in line with the number of nodes you have, and the main way to decrease them is to scale down the number of nodes while increasing the resources of the nodes you keep in order to maintain capacity.
Let me know if there is anything else we can help you with.
Sorry for the delayed response, it's been a crazy week. The reason we need a high number of server nodes is Task Cap concurrency. We regularly run 150-300 deployments simultaneously during a given change window and need high throughput on the number of concurrent tasks we are able to run.
When it comes to the number of concurrent deployments on the large instances your engineers are referencing, what are the metrics for those? Is there a different way we can achieve that level of deployment concurrency while decreasing the number of nodes?
Thank you for sharing the use case behind the number of nodes you are running. I am wondering if you have seen our Octopus Server Task Cap feature. By default this is set to 5, which means the Octopus instance will run 5 deployments in parallel; the rest will wait in the queue until one finishes and the next queued task starts.
I am not sure what you have yours set to, but you can set this value as high as you like. Note that capacity does not scale linearly with resources: if 2 CPU / 8 GB RAM can run 10 tasks, 12 CPU / 64 GB RAM should be able to run around 25.
A task cap of 30 seems to be a good starting point for tuning: set that, start with plenty of resources, and then scale the resources down and/or the task cap up from there.
Our engineer said he has run with a task cap of 100, and that works OK on his desktop (though he mentioned it does depend on the deployment).
If you already have that task cap set, you can still raise it while increasing the resources on the VM/server, which should allow you to run more tasks in parallel and retire some of those nodes.
We are also in contact with our Solutions team, who may be able to suggest some things to help you here, but the first thing they will ask is what your task cap is set to.
Were you aware of that setting at all, and if so, what do you have it set to, so I can relay that to our Solutions team?
Just to introduce myself, I am Doug and I work as part of the Solutions team here at Octopus.
I have been asked to help have a look at your questions surrounding your Octopus node count.
From reviewing the ticket, I think it is fair to say that having such a high number of Octopus nodes will have its drawbacks when scaling Octopus this way; in your case, the number of connections to the database.
Can I ask what led you to the decision to run such a high number of Octopus nodes? I appreciate that you covered that you deploy between 150-300 deployments concurrently, but I just wanted to double-check in case there were any other business reasons before I made any assumptions.
In case you haven’t seen it in our documentation, we do have recommendations for large-scale configurations that explain in a little more detail what to think about when scaling out your Octopus setup.
I think in this instance I would consider the following for your configuration.
Firstly, I would look to increase the compute for a handful of your Octopus nodes (not all of them; for illustration, say 10 of them, but you can adjust that number accordingly).
For the nodes with the increased compute, I would look to increase the task cap on each node; you can do this by following this guide. For example, with the correct compute (following the compute recommendations in the docs), look to increase the task cap to 30.
I would then look to decrease the task caps on the remaining 20 nodes (those that haven’t had their compute increased) until you hit a point where you can comfortably start removing nodes from your HA group as they become redundant (you can keep these servers around in case you hit a performance spike and need to commission them again).
Closely monitor your remaining Octopus nodes performance.
Following this, you could go further and reduce your node count again if you wanted to; this way you can do it in a controlled manner that is comfortable for your needs.
The idea here is that you offload the work from 30 nodes down to a select few nodes that have the appropriate amount of compute.
I appreciate that there may be some work on your side here, but if you do have any further questions, please do reach out to us and we would be happy to continue this conversation to help get you to where you need to be.
So the 150-300 concurrent deployments (or higher in the future as growth occurs) is a requirement, given that we are deploying the same microservice to hundreds of different tenants, each with its own unique configuration. We need to maintain that level of capability while also ensuring each deployment takes no longer than a few minutes at most.
We are aware of the Task Cap, and yes, the Linux server container starts up with 5, which we can modify in the startup configuration script or in the GUI after start-up. Prior to HA, we found that on a single EC2 instance Octopus would tip over around a Task Cap of 30-35, and that is why we run the number of nodes we currently have: 30 nodes * 10 Task Cap = 300. So for our current numbers, not taking rapid company growth into account, we couldn't go lower than 10 server nodes because of Octopus dying around a Task Cap of 30-35. We started with 30 to provide room for growth, but yes, we could scale down slightly for now as you mentioned.
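The sizing arithmetic above can be sketched quickly (a hedged illustration only; `nodes_needed` is a hypothetical helper, and real capacity also depends on deployment shape and per-node resources):

```python
import math

def nodes_needed(peak_concurrency, task_cap_per_node):
    """Minimum node count to cover a peak of simultaneous deployments,
    given each node's task cap (illustrative sizing only)."""
    return math.ceil(peak_concurrency / task_cap_per_node)

# Current layout from the ticket: a cap of 10 per node means 30 nodes
# are needed to cover a 300-deployment peak.
print(nodes_needed(300, 10))   # 30

# With beefier nodes safely running a cap of 30, the same peak
# needs far fewer nodes (and far fewer pooled SQL connections).
print(nodes_needed(300, 30))   # 10
```

The same helper makes the growth question concrete: a future peak of 450 at a cap of 30 would still only need 15 nodes, half the current fleet.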
I’ll look at increasing the CPU requests and limits on the containers, seeing if we can scale down the current number of nodes while raising the cap from its current value of 10, and checking whether we see any kind of performance improvement from the reduced connection pooling.