Deployments hogging too much CPU Resource

Naman.Kumar · 7 December 2021 11:35

Hi team,
Use case: Parallel deployment to around 50 production nodes at once.
Current Octo Server Setup: Linux Docker HA mode, 16 vCPU 32 GiB ram (c5.4xlarge instances)
Issue: Whenever the project is deployed to this environment, the CPU usage spikes up massively.

Things Observed:
When previously running on 8 vCPU windows machine, CPU spikes were till 60%
When previously running on 8 vCPU Linux Docker, CPU spikes were till 80-90%
When now running on 16 vCPU Linux Docker, CPU spikes are till 80%
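
To compare these figures across machine sizes, the percentages can be converted into approximate busy vCPUs. This is only a minimal sketch: it assumes each percentage is an average across all vCPUs on the host, and it takes 85% as the midpoint of the 80-90% range. Neither assumption is stated in the observations themselves.

```python
# Convert peak CPU percentages into approximate busy vCPUs so that
# machines of different sizes can be compared directly.
# Assumption: each percentage is an average across all vCPUs on the host.

def busy_vcpus(vcpus: int, peak_percent: float) -> float:
    """Return the approximate number of fully busy vCPUs at the peak."""
    return vcpus * peak_percent / 100


observations = [
    ("8 vCPU Windows", 8, 60),
    ("8 vCPU Linux Docker", 8, 85),   # assumed midpoint of the 80-90% range
    ("16 vCPU Linux Docker", 16, 80),
]

for label, vcpus, peak in observations:
    print(f"{label}: ~{busy_vcpus(vcpus, peak):.1f} busy vCPUs")
```

Under these assumptions the same deployment keeps roughly 4.8 vCPUs busy on Windows and roughly 12.8 on the 16 vCPU Linux host, so doubling the cores appears to have been absorbed rather than to have relieved the load.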

While the CPU maxes out, the latency on the application increases as well erroring out a bunch of API requests made during the deployment and overall usage does see a drop in performance.

Need help in investigating and optimizing the setup ensuring platform reliability.

Justin_Walsh · 7 December 2021 15:04

Hi @Naman.Kumar!

Thanks for reaching out, and sorry to hear that you’re having issues with load on your HA cluster.

The first place I would suggest investigating is your backend DB: make sure that its metrics (CPU, memory, etc.) look healthy. While there, you should also check your DB indexes and statistics. If you’re dealing with a heavily fragmented database, the slow queries can account for higher CPU usage and timeouts.

If you do find that your database indexes are fragmented, we have a community step template that you could run periodically via an Octopus runbook to check and report on this, which can help with monitoring if a regular maintenance plan can’t be implemented for some reason.
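
As a rough illustration of what such a periodic check might flag, the sketch below classifies indexes by fragmentation percentage. The 5% and 30% thresholds and the 1,000-page minimum are commonly cited SQL Server rules of thumb, not values taken from the step template, and the index names are hypothetical.

```python
from typing import NamedTuple


class IndexStat(NamedTuple):
    name: str                     # hypothetical index name
    fragmentation_percent: float  # average fragmentation reported for the index
    page_count: int               # number of pages in the index


def recommend(stat: IndexStat,
              min_pages: int = 1000,        # assumed rule-of-thumb minimum size
              reorganize_at: float = 5.0,   # assumed rule-of-thumb threshold
              rebuild_at: float = 30.0) -> str:
    """Suggest a maintenance action for one index."""
    if stat.page_count < min_pages:
        return "ignore"       # small indexes rarely benefit from maintenance
    if stat.fragmentation_percent > rebuild_at:
        return "rebuild"
    if stat.fragmentation_percent > reorganize_at:
        return "reorganize"
    return "ok"


sample = [
    IndexStat("IX_Example_A", 72.4, 15000),
    IndexStat("IX_Example_B", 12.0, 4000),
    IndexStat("IX_Example_C", 90.0, 40),
]

for stat in sample:
    print(f"{stat.name}: {recommend(stat)}")
```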

If you’re still experiencing slowness once that’s complete, feel free to email a copy of your server logs to support@octopus.com, and we can see if there’s anything that sticks out.

d.jain · 15 December 2021 12:05

hey @Justin_Walsh ,

The issue occurs during package extraction. We are extracting packages on 64 servers, across 3 steps concurrently. During the extraction, API calls are extremely slow and there is noticeable latency in the UI as well. Do you think database fragmentation comes into the picture during package extraction too? As soon as package extraction completes and the deployment moves on to the remaining steps, Octo starts working normally again.

This is impacting our production deployments.

Thanks and Regards,
Devan

Justin_Walsh · 15 December 2021 14:34

Hi @d.jain!

The package extraction on remote machines is generally handled via Calamari on the deployment target itself, but it could potentially be a disk/network I/O issue on the server if it’s concurrently trying to push your packages to 64 machines at once - do you have any metrics available for your cluster’s shared storage that could help check this?

If task logs are being written to this same shared storage, and your API calls are trying to read those task logs, this could potentially be a culprit if there’s a large disk I/O queue due to the demands of the files being pushed.

We can definitely help dig into this some more, if you would like to send through (via support@octopus.com) the following data:

The Raw task logs for one of the concurrent deployments that are running.
The Octopus server logs from after the concurrent deployments have finished.
A HAR file of the slow API requests while it’s happening.

Additionally, have you increased the value of the Octopus.Acquire.MaxParallelism project variable? This governs how many concurrent machines will be sent packages at once. If this is set too high, it could lead to the problems you’re seeing with I/O exhaustion.
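
To illustrate why this cap matters, here is a rough sketch of how a parallelism limit splits a set of targets into waves. It assumes the limit behaves like a simple batch size, which is a simplification of whatever the real scheduler does. The target names are hypothetical, and the caps of 10 and 64 are example values only.

```python
def waves(targets: list[str], max_parallelism: int) -> list[list[str]]:
    """Split targets into consecutive batches of at most max_parallelism."""
    if max_parallelism < 1:
        raise ValueError("max_parallelism must be at least 1")
    return [
        targets[i:i + max_parallelism]
        for i in range(0, len(targets), max_parallelism)
    ]


# Hypothetical target names standing in for 64 deployment targets.
targets = [f"node-{n:02d}" for n in range(1, 65)]

for cap in (10, 64):  # example caps, not recommended values
    batches = waves(targets, cap)
    largest = max(len(batch) for batch in batches)
    print(f"cap={cap}: {len(batches)} wave(s), up to {largest} transfers at once")
```

With the lower cap the server handles at most 10 transfers at a time across 7 waves; with the cap raised to 64 everything is pushed in a single wave, which concentrates the I/O and CPU demand into one burst.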

d.jain · 15 December 2021 14:51

hey @Justin_Walsh ,

Will provide the logs and everything in some time. Just to be clear, it’s a single deployment with 64 deployment targets running package extractions (not acquire) in parallel for 3 steps (extraction only, with XML transformation and variable substitution).
AWS FSx is the shared storage in use, with 32mbps throughput. We haven’t noticed any disk I/O or FSx issues; it’s just CPU usage that is topping the charts.

Regards,
Devan

Naman.Kumar · 21 December 2021 06:50

We have provided the logs which you asked for.

Can you please analyse and guide us further? The issue still persists and is hampering our production deployments regularly.

adam.hollow · 21 December 2021 12:43

Hi @Naman.Kumar,

Thank you for confirming you’ve provided those logs!
@Naman.Kumar, @d.jain; can you confirm where you’ve uploaded the logs to?

Once we confirm that, we’ll take a look and get back to you with anything that we find.

If you’d like to, you can upload the logs to this secure upload link:
Link attached to @Naman.Kumar: Support Files - Naman.Kumar
Link attached to @d.jain: Support Files - d.jain

Kind Regards,
Adam

Naman.Kumar · 21 December 2021 12:56

Emailed it to: “support@octopus.com”
Subject: “Support case 27457”
Date: 15 December

adam.hollow · 21 December 2021 14:02

Hi @Naman.Kumar,

Thanks for letting me know.

I’ve had a look through the inbox and can’t find anything that matches those details.

Would you please upload the files to the link above? The email may have been picked up by our e-mail filtering.

Thanks and Kind Regards,
Adam

Naman.Kumar · 27 December 2021 12:58

Hey, sorry I missed this. I’ve uploaded the files now. Please check.

jeremy.miller · 27 December 2021 15:23

Hey @Naman.Kumar,

Could we also please get an Octopus Server log from the same time period of those 2 deployment task logs you uploaded? These are in C:\Octopus\Logs by default.

Also can we please get the HAR File showing the slowness in the UI when this is occurring next time?

Please let me know if you have any questions.

Best,
Jeremy

d.jain · 28 December 2021 15:08

Hey Jeremy,

Unfortunately, we don’t have the HAR files, and we have reduced parallelism between servers for now as a workaround. To highlight: with the same parallelism we did not face the issue on the Windows master node without any container, but with Linux containers the issue appears. Both instances had the same capacity; we later increased the Linux container capacity, but the issue is still visible. We can provide the logs from a deployment that ran on the Windows node if that helps.

Regards,
Devan

jeremy.miller · 28 December 2021 16:26

Hi Devan,

Thanks for the info.

That’s interesting, we have a similar report of someone having performance issues in Kubernetes but not Windows in another internal ticket.

I see in your logs that you are on 2020.2.15. Did the issues crop up after upgrading Octopus to this version, or was it working fine at one point on this version?
Can you please give me more details about your Linux containers to see if there are any similarities with this other ticket?
Are they hosted locally, and what version of the chosen software are you running?
Had that been upgraded prior to the issue occurring?
When you say you’re running a Windows master node and not seeing slowdown, does that mean the other nodes are also Windows when you aren’t seeing performance degradation? Can you give me a very high-level overview of your setup of nodes with no issues, and your setup of nodes with issues (OS/container descriptions)?
Anything else you can think of in your setup that might help correlate the problem to the other ticket?

If necessary for privacy reasons please feel free to DM me.

Best,
Jeremy

Naman.Kumar · 6 January 2022 13:45

Hi Jeremy,
We are facing performance issues when running the server on Linux containers.
We started facing these issues once we migrated to the Linux hosted server, whilst running the same Octo version we were running on Windows hosted server. As mentioned in the original ticket following are the observations:

While running the Production Deployment:

When previously running on 8 vCPU windows machine, CPU spikes were till 60%
When previously running on 8 vCPU Linux Docker, CPU spikes were till 80-90%
When now running on 16 vCPU Linux Docker, CPU spikes are till 80%
While the CPU maxes out, the latency on the application increases as well erroring out a bunch of API requests made during the deployment and overall usage does see a drop in performance.

We can set up a call as well to help you better understand the scenario. Let me know if you’d be okay with that.

Regards,
Naman

jeremy.miller · 6 January 2022 14:20

Hi Naman,

I will pass back what you have given me and I will let you know if a call or something else is necessary. Please feel free to reach out in the meantime.

Best,
Jeremy

Naman.Kumar · 10 January 2022 16:14

Hey, any update?

Also, please find attached an Excel file showing the fragmentation level on the DB.
Is that something that needs attention too?
octodb-fragmentation.xlsx (14.4 KB)

jeremy.miller · 10 January 2022 16:16

Hi Naman,

I am still discussing this with our engineers. They are thinking that there are certain functions in Linux that are more costly than in Windows causing this. They are investigating currently.

Let me ping them to see if there’s anything to pass back.

Best,
Jeremy

jeremy.miller · 12 January 2022 13:54

Hi Naman,

Thanks for being patient. We are still investigating whether or not Linux instances are slower than Windows.

If you do find yourself in a place to grab some data for us, I did get some information regarding that.

You have at least 2 different tools you can use.

createdump is an older tool that might produce dumps that are hard to debug
dotnet-dump is something Octopus Cloud is moving to right now. We are yet to learn how good it is.

The easiest way to take a snapshot of a running process is to shell into a running container using either kubectl or https://k8slens.dev/ and run one of the above tools. You would have to attach persistent storage to the pod so the dump is preserved in case the process terminates.
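
As a rough sketch of what that could look like with dotnet-dump, the helper below only assembles the command line; it does not run anything. It assumes dotnet-dump is already installed inside the container. The pod or container name, the process ID, and the output path are hypothetical placeholders, and the output path should point at the persistent storage mentioned above. The Docker variant is included as an assumption for setups that run plain Docker rather than Kubernetes.

```python
import shlex


def dump_command(target: str, pid: int, output_path: str,
                 use_docker: bool = False) -> list[str]:
    """Build the argv for collecting a dump inside a pod or container."""
    collect = ["dotnet-dump", "collect", "-p", str(pid), "-o", output_path]
    if use_docker:
        # docker exec CONTAINER COMMAND [ARG...]
        return ["docker", "exec", target] + collect
    # kubectl exec POD -- COMMAND [ARG...]
    return ["kubectl", "exec", target, "--"] + collect


# Hypothetical names: adjust the target, PID, and path to your environment.
print(shlex.join(dump_command("octopus-0", 1, "/dumps/octopus.dmp")))
print(shlex.join(dump_command("octopus-server", 1, "/dumps/octopus.dmp",
                              use_docker=True)))
```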

Here is the link to use if you do end up wanting to upload those files: Support - Octopus Deploy

Please let me know if you are able to upload those or if you have any other questions or concerns in the meantime.

Best,
Jeremy

jeremy.miller · 19 January 2022 14:57

Hi Naman,

Our developers have found that newer OS images on AKS are causing performance issues. Which platform are you using? Which kernel and version is it running on? Had you recently updated the kernel you were running on before you noticed the performance issues? Would you be able to test upgrading to the latest version of that kernel and see if the performance issues go away? We saw that on our side (in AKS). It’s also possible you could downgrade a bit and see if that helps.
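
For gathering the kernel details, a minimal sketch using only the Python standard library is shown below. Containers share the host's kernel, so running this inside a container reports the host kernel rather than anything specific to the image.

```python
import platform


def kernel_info() -> dict[str, str]:
    """Collect the OS name, kernel release, and kernel build string."""
    return {
        "system": platform.system(),    # e.g. the OS family name
        "release": platform.release(),  # kernel release string
        "version": platform.version(),  # kernel build/version string
    }


for key, value in kernel_info().items():
    print(f"{key}: {value}")
```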

Please let us know.

Best,
Jeremy

Naman.Kumar · 2 February 2022 07:04

Hey, since the beginning of this thread, I’ve been saying that we are not using AKS, but Docker rather.

EC2 machines running Docker.