Hi team,
Use case: Parallel deployment to around 50 production nodes at once.
Current Octo Server Setup: Linux Docker HA mode, 16 vCPU, 32 GiB RAM (c5.4xlarge instances)
Issue: Whenever the project is deployed to this environment, the CPU usage spikes up massively.
Things Observed:
When previously running on an 8 vCPU Windows machine, CPU spiked up to 60%
When previously running on 8 vCPU Linux Docker, CPU spiked up to 80-90%
Now running on 16 vCPU Linux Docker, CPU spikes up to 80%
While the CPU is maxed out, latency on the application increases as well, a number of API requests made during the deployment error out, and overall performance drops.
Need help in investigating and optimizing the setup ensuring platform reliability.
Thanks for reaching out, and sorry to hear that you’re having issues with load on your HA cluster.
The first place I would suggest investigating is your backend DB - making sure that its metrics (CPU, memory, etc.) look healthy. While there, you should additionally check your DB indexes and statistics - if you’re dealing with a heavily fragmented database, the slow queries can account for higher CPU usage and timeouts.
If you do find that your database indexes are fragmented, we have a community step template that you could run periodically via an Octopus runbook to check and report on fragmentation, which helps with monitoring if a regular maintenance plan can’t be implemented for some reason.
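For anyone triaging a fragmentation report by hand, here is a minimal sketch of how the numbers might be bucketed. The 5% / 30% thresholds and the 1000-page floor are assumptions based on commonly quoted SQL Server rules of thumb, not Octopus-specific guidance, and the table and index names are hypothetical.

```python
from dataclasses import dataclass

# Assumed rule-of-thumb thresholds; tune them for your own database.
REORGANIZE_ABOVE_PCT = 5.0
REBUILD_ABOVE_PCT = 30.0
MIN_PAGES = 1000  # very small indexes are usually not worth maintaining


@dataclass
class IndexStat:
    table: str
    index: str
    fragmentation_pct: float
    page_count: int


def recommend(stat: IndexStat) -> str:
    """Suggest a maintenance action for one index."""
    if stat.page_count < MIN_PAGES:
        return "ignore"
    if stat.fragmentation_pct > REBUILD_ABOVE_PCT:
        return "rebuild"
    if stat.fragmentation_pct > REORGANIZE_ABOVE_PCT:
        return "reorganize"
    return "ok"


if __name__ == "__main__":
    # Hypothetical rows, e.g. copied from a fragmentation report.
    rows = [
        IndexStat("Event", "IX_Event_Hypothetical", 45.0, 5000),
        IndexStat("ServerTask", "IX_ServerTask_Hypothetical", 12.5, 2500),
        IndexStat("Project", "PK_Project_Hypothetical", 80.0, 40),
    ]
    for row in rows:
        print(f"{row.table}.{row.index}: {recommend(row)}")
```

Small indexes often show high fragmentation percentages that have little practical impact, which is why the sketch checks the page count first.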
Once that’s complete, if you’re still experiencing slowness, feel free to email a copy of your server logs to support@octopus.com, and we can see if there’s anything that sticks out.
The issue occurs during package extraction. We are extracting packages on 64 servers, across 3 steps concurrently. During the extraction, the APIs are extremely slow and there is noticeable latency in the UI as well. Do you think database fragmentation comes into the picture during package extraction too? As soon as package extraction completes and the deployment moves on to the remaining steps, Octo starts working normally.
The package extraction on remote machines is generally handled via Calamari on the deployment target itself, but it could potentially be a disk/network I/O issue on the server if it’s concurrently trying to push your packages to 64 machines at once - do you have any metrics available for your cluster’s shared storage that could help check this?
If task logs are being written to this same shared storage, and your API calls are trying to read the task logs, this could potentially be a culprit if there’s a large disk I/O queue due to the demands of the files being pushed.
We can definitely help dig into this some more, if you would like to send through (via support@octopus.com) the following data:
The Raw task logs for one of the concurrent deployments that are running.
The Octopus server logs from after the concurrent deployments have finished.
A HAR file of the slow API requests while it’s happening.
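If you want to look through the HAR file yourself before sending it, a minimal sketch along these lines can list the slowest requests. It relies only on the standard HAR layout (`log.entries[].time` in milliseconds, plus `request.method` and `request.url`); the 1000 ms threshold and the sample URLs are assumptions for illustration.

```python
import json


def load_har(path):
    """Read a HAR file exported from the browser's network tab."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def slow_requests(har, threshold_ms=1000.0):
    """Return (time_ms, method, url) for slow entries, slowest first."""
    rows = [
        (entry["time"], entry["request"]["method"], entry["request"]["url"])
        for entry in har["log"]["entries"]
        if entry["time"] >= threshold_ms
    ]
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Hypothetical capture with made-up URLs and timings.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"time": 120.0, "request": {"method": "GET", "url": "https://octopus.example.com/api/users/me"}},
            {"time": 8200.0, "request": {"method": "GET", "url": "https://octopus.example.com/api/tasks"}},
            {"time": 1500.0, "request": {"method": "GET", "url": "https://octopus.example.com/api/dashboard"}},
        ]
    }
}

if __name__ == "__main__":
    for time_ms, method, url in slow_requests(SAMPLE_HAR):
        print(f"{time_ms:8.0f} ms  {method} {url}")
```

For a real capture, replace `SAMPLE_HAR` with `load_har("capture.har")`, where the file name is whatever you saved from the browser.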
Additionally, have you increased the value of the Octopus.Acquire.MaxParallelism project variable? This governs how many concurrent machines will be sent packages at once. If this is set too high, it could lead to the problems you’re seeing with I/O exhaustion.
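As a rough illustration of why this cap matters, the sketch below works out how many machines could be receiving packages at the same moment and the minimum number of rounds needed to cover every target. The cap of 10 in the example is an assumed value for illustration, and the real scheduler may not work in discrete rounds, so treat the round count as a lower bound only.

```python
import math


def acquisition_plan(targets: int, max_parallelism: int) -> tuple:
    """Return (peak concurrent transfers, minimum rounds) for a parallelism cap."""
    if targets < 0 or max_parallelism < 1:
        raise ValueError("targets must be >= 0 and max_parallelism >= 1")
    peak = min(targets, max_parallelism)
    rounds = math.ceil(targets / max_parallelism)
    return peak, rounds


if __name__ == "__main__":
    # 64 targets as reported in this thread; both caps are assumed values.
    for cap in (10, 64):
        peak, rounds = acquisition_plan(64, cap)
        print(f"cap={cap}: up to {peak} transfers at once, at least {rounds} round(s)")
```

Raising the cap shortens the deployment but puts more simultaneous load on the server and its shared storage, so the trade-off is worth measuring rather than guessing.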
We will provide the logs and everything else shortly. Just to be clear, it’s a single deployment, which has 64 deployment targets running package extractions (not acquire) in parallel for 3 steps (extraction only, with XML transformation and variable substitution).
AWS FSx is the shared storage in use, with 32mbps throughput. We haven’t noticed any disk I/O or FSx issues; it’s just CPU usage that is topping the charts.
Could we also please get an Octopus Server log from the same time period of those 2 deployment task logs you uploaded? These are in C:\Octopus\Logs by default.
Also can we please get the HAR File showing the slowness in the UI when this is occurring next time?
Unfortunately, we don’t have the HAR files, and we have reduced parallelism between servers for now as a workaround. To highlight: with the same parallelism we were not facing the issue on the Windows master node without any container, but with Linux containers the issue occurs. Both instances had the same capacity. Later on we increased the Linux container capacity, but the issue is still visible. We can provide the logs from a deployment that ran on the Windows node if that helps.
That’s interesting, we have a similar report of someone having performance issues in Kubernetes but not Windows in another internal ticket.
I see in your logs you are on 2020.2.15. Did the issues crop up after upgrading Octopus to this version, or was it working fine at one point on this version?
Can you please give me more details about your Linux containers to see if there are any similarities with this other ticket?
Are they hosted locally, and what version of the chosen software are you running?
Had that been upgraded prior to the issue occurring?
When you say you’re running a Windows master node and not seeing slowdown, does that mean the other nodes are also Windows when you aren’t seeing performance degradation? Can you give me a very high-level overview of your setup of nodes with no issues, and your setup of nodes with issues (OS/container descriptions)?
Anything else you can think of in your setup that might help correlate the problem to the other ticket?
If necessary for privacy reasons please feel free to DM me.
Hi Jeremy,
We are facing performance issues when running the server on Linux containers.
We started facing these issues once we migrated to the Linux-hosted server, while running the same Octo version we were running on the Windows-hosted server. As mentioned in the original ticket, these are the observations:
While running the Production Deployment:
When previously running on an 8 vCPU Windows machine, CPU spiked up to 60%
When previously running on 8 vCPU Linux Docker, CPU spiked up to 80-90%
Now running on 16 vCPU Linux Docker, CPU spikes up to 80%
While the CPU is maxed out, latency on the application increases as well, a number of API requests made during the deployment error out, and overall performance drops.
We can set up a call as well to help you better understand the scenario. Let me know if you’d be okay with that.
I will pass back what you have given me and I will let you know if a call or something else is necessary. Please feel free to reach out in the meantime.
Also, please find attached an Excel file showing the fragmentation level of the DB.
Is that something that needs attention too? octodb-fragmentation.xlsx (14.4 KB)
I am still discussing this with our engineers. They are thinking that there are certain functions in Linux that are more costly than in Windows causing this. They are investigating currently.
Let me ping them to see if there’s anything to pass back.
createdump is an older tool that might produce dumps that are hard to debug
dotnet-dump is something Octopus Cloud is moving to right now. We are yet to learn how good it is.
The easiest way to take a snapshot of a running process is to shell into a running container using either kubectl or https://k8slens.dev/ and run one of the above tools. You would have to attach persistent storage to the pod so the dump is preserved in case the process terminates.
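As a rough sketch of what that could look like with dotnet-dump, the helper below only builds and prints the command line so you can review it before running anything. The pod name, namespace, process ID 1 and the `/dumps` mount path are all assumptions, and it presumes dotnet-dump is already available inside the container.

```python
import shlex


def dump_command(pod, namespace, pid=1, output="/dumps/octopus.dmp"):
    """Build a kubectl command that runs `dotnet-dump collect` inside a pod."""
    return [
        "kubectl", "exec", pod, "-n", namespace, "--",
        "dotnet-dump", "collect", "-p", str(pid), "-o", output,
    ]


if __name__ == "__main__":
    # Hypothetical pod and namespace names; /dumps is assumed to be the
    # persistent volume mount so the dump survives a container restart.
    command = dump_command("octopus-0", "octopus")
    print(shlex.join(command))
```

Writing the dump to the mounted volume rather than the container filesystem is what keeps it available if the process terminates afterwards.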
Our developers have found that newer OS images on AKS are causing performance issues. Which platform are you using? Which kernel and version is it running on? Had you recently updated the kernel you were running on before you noticed the performance issues? Would you be able to test upgrading to the latest version of that kernel and see if the performance issues go away? We saw that on our side (in AKS). It’s also possible you could downgrade a bit and see if that helps.
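To gather the kernel details asked about above, a small sketch like this can be run on each node (or inside the container, which reports the host's kernel because containers share it). It uses only Python's standard `platform` module; the field names in the returned dictionary are my own choice.

```python
import platform


def kernel_report():
    """Collect basic OS and kernel details for a support ticket."""
    info = platform.uname()
    return {
        "system": info.system,           # e.g. Linux or Windows
        "kernel_release": info.release,  # kernel release string
        "kernel_version": info.version,  # build/version details
        "machine": info.machine,         # CPU architecture
    }


if __name__ == "__main__":
    for key, value in kernel_report().items():
        print(f"{key}: {value}")
```

Comparing this output between a node from before the slowdown and a current one should show whether the kernel changed.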