Octopus UI/API slowness after upgrade to 2020.3.9

We have upgraded our production Octopus Server to 2020.3.9, and since then it has been experiencing extreme slowness in the UI and in API calls.

The scale of the server is:

  • Approximately 8000 Polling tentacles (some 1500ish offline, the rest healthy(ish))
  • Approximately 3000 Tenants for these tentacles
  • Approximately 3000 Tenant Tag Sets (the original implementation automatically created a tag set per tenant for simple association during unattended installations of the tentacle)

The specifications for the VM server are:

  • 12 CPU (usage remains consistently low)
  • 24 GB RAM (usage remains consistently low)

There are occasional network spikes from Octopus.Server.exe, most of which are connections to the DB server. Our DBAs have investigated the DB server and found no long-running queries or other issues that concerned them as far as return time for calls.

Some UI pages and API calls are more problematic than others. So far, Infrastructure > Overview is nearly unusable, as is Infrastructure > Deployment Targets. I currently have the Overview page loading with the developer console open in Chrome, and I am seeing api/serverstatus/health calls take almost 3 minutes; the time for these calls steadily increased as the page loaded.

We have enabled metrics logging, and the highest number of active requests I have seen is 7. All logging is currently at INFO level; I can switch to Trace or Debug and provide those files if necessary. The OctopusServer log is riddled with various connection errors, which relates to a previous topic I had going about tentacles performing a denial of service attack on the server. This upgrade to 2020.3.9 was in part to help alleviate that concern; however, the slowness on the server now is remarkably similar to when tentacles would attempt to connect with an incorrect server thumbprint.

Most of the connection errors listed are of ‘System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host’. There is the occasional ‘A client at [] connected, and attempted a message exchange, but it presented a client certificate with the thumbprint ‘’ which is not in the list of thumbprints that we trust’, and I understand this error, it is also not very common.
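To put numbers on "most" and "occasional", a rough tally of the two error types can help. The following is a minimal sketch, assuming the server log is plain text with one entry per line; the substrings are taken from the messages quoted above, and the function name and log path argument are hypothetical.

```python
from collections import Counter

# Substrings copied from the error messages quoted above. Nothing else about
# the log line format is assumed.
PATTERNS = {
    "socket_10054": "SocketException (10054)",
    "untrusted_thumbprint": "which is not in the list of thumbprints that we trust",
}


def tally_errors(lines):
    """Count how many lines contain each known error substring."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python tally_errors.py OctopusServer.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for label, n in tally_errors(fh).most_common():
            print(label, n)
```

Running it against logs from before and after an upgrade gives a simple way to see whether the error mix has changed.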

I am working on a faster turnaround for getting 2020.4.11 installed, to see if that alleviates some of the slowness we are experiencing. Please let me know what additional information I can provide and I will take care of it.

Hi @hardKOrr,

Thank you for reaching out.

So it seems like there are two issues -

  1. the general slowness of the API/UI calls, and
  2. the System.Net.Sockets.SocketException issue

For both issues, you are correct that a Trace log is exactly what we need to look at. If you could switch the server logging to Trace, work with your Octopus Server as you normally would for around an hour, and then send us a copy of the Server Log, that would be a great amount of data for us to get started with.

Secondly, if you could find one (or multiple) of your relatively troublesome UI page loads, and download a HAR file from it to send across - that would also be very handy. Here’s a quick guide on downloading HAR files. https://octopus.com/blog/selenium/13-capturing-har-files/capturing-har-files
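Before sending a HAR file, it can be useful to see which requests dominate the page load. The following is a minimal sketch, assuming a standard HAR 1.2 export where each entry carries a `time` in milliseconds and a `request.url`; the function name and file path are hypothetical.

```python
import json


def slowest_requests(har, top=10):
    """Return (time_ms, url) pairs for the slowest entries of a parsed HAR."""
    entries = har.get("log", {}).get("entries", [])
    timed = [(entry.get("time", 0), entry["request"]["url"]) for entry in entries]
    # Sort by elapsed time only, slowest first.
    return sorted(timed, key=lambda pair: pair[0], reverse=True)[:top]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python slowest_requests.py overview.har
    with open(sys.argv[1], encoding="utf-8") as fh:
        for time_ms, url in slowest_requests(json.load(fh)):
            print(f"{time_ms / 1000:8.1f}s  {url}")
```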

SocketExceptions can be notoriously tricky to diagnose. Let's see if the Trace files uncover anything. I would also ask you to consider whether there are any common traits between the remote hosts that you are seeing issues from.

For example, similarities in their:

  • Firewall setup
  • Antivirus
  • Shared network ports (potential congestion)
  • Proxy setups, etc.

If you login to octopus.com, you will see a support tab, that will allow you to upload your support files. Otherwise, I should be able to set this ticket to a Private ticket - so you could attach the files directly to this ticket.

Please let me know when it’s done, (or if you would like me to change this ticket to a Private Message) so I can start diving into your issues.

Regards,

Dane

I have sent log and har files to support@octopus.com for investigation.
Thus far the issue seems to have mainly affected the front end and API; it does not seem to affect the connections to the tentacles. At least, I have not heard anything directly saying so.

Let me know if there is additional information I can provide, or more discovery I can do for you.

thanks

Thanks @hardKOrr.
I’ve received the logs and will be looking through them today.

Hi @hardKOrr,

Wow, those are some extremely poor loading times. I am very sorry that you are facing those issues at the moment. The good news is that, looking through the logs you provided, it became clear that the issue is related to a known bug, and a fix for it has already been included in 2020.4.8.

Please refer to the following issue in Github for more information: https://github.com/OctopusDeploy/Issues/issues/6638

This version is already publicly available (https://octopus.com/downloads). I would be really intrigued to know how much time this will save you when loading your environments summary page.

Unfortunately, apart from the upgrade, there is no other solution for this. There is a chance that the “System.Net.Sockets.SocketException” may in fact be related. Please let me know when you perform the upgrade, and whether the SocketException errors continue to appear in the logs after the upgrade.

The next thing we need to look at when troubleshooting a polling tentacle is looking for similarities. As you mentioned - you have a lot of Polling tentacles. Have a look for similarities between the polling tentacles that are reporting the socketexceptions.

Check their firewall rules, Anti-virus, Operating system, whether they connect via VPNs or Proxies, etc.
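With thousands of tentacles, this comparison is easier to do from an inventory than by eye. The following is a minimal sketch, assuming you can export a list of machines with a few traits each; the field names (`name`, `os`, `proxy`) and the helper name are hypothetical and not something Octopus provides.

```python
from collections import Counter


def trait_breakdown(machines, failing_names, trait):
    """For one trait, return {value: (failing_count, total_count)}."""
    total = Counter(m[trait] for m in machines)
    failing = Counter(m[trait] for m in machines if m["name"] in failing_names)
    # Counter returns 0 for values that never appear among the failing machines.
    return {value: (failing[value], total[value]) for value in total}


if __name__ == "__main__":
    # Hypothetical inventory; replace with your own export.
    machines = [
        {"name": "t1", "os": "2012R2", "proxy": "yes"},
        {"name": "t2", "os": "2016", "proxy": "yes"},
        {"name": "t3", "os": "2016", "proxy": "no"},
    ]
    failing = {"t1", "t2"}  # machines reporting SocketExceptions
    for trait in ("os", "proxy"):
        print(trait, trait_breakdown(machines, failing, trait))
```

A trait value where the failing count is close to the total count, while other values show few failures, is a candidate for closer inspection.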

Also, the majority of our really common issues with tentacles (and their fixes) are often listed here.

Please let me know how you go. If you do get stuck, let me know what you’ve checked and I can help you determine what the next thing to try should be.

Regards,

Dane

Excellent news about the fix in 2020.4, I am working to push 2020.4.11 through our systems to get onto our production server and hope it will get there by the end of the week. I will report back after the upgrade gets pushed and I get a chance to test and scope out some of the socket error logs as well.

Thanks again for the assistance here!

We have upgraded our production server to 2020.4.11, and while that has helped we are still seeing Infrastructure > Overview times of up to 15 minutes for loading. I am gathering some additional trace and har logs to provide.

We are still seeing socket connection errors, however it is hard to determine if the issue is from the octopus server being slow/taxed, or the tentacle itself. We can health check a machine individually and have the health check fail due to connectivity, and then turn around a few minutes later to try again and have it succeed without any apparent issue or change on the tentacle side. This makes it difficult to narrow down whether specific connections are something we need to investigate on the tentacle side or not.
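One way to separate intermittent connectivity from server load is to sample the same connection repeatedly and record the failure rate over time. The following is a minimal sketch, not an Octopus health check: it only tests whether a TCP connection can be opened, and the host, port, and function names are hypothetical.

```python
import socket
import time


def tcp_probe(host, port, timeout_s=5.0):
    """Build a probe that succeeds if a TCP connection to host:port opens."""
    def probe():
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    return probe


def sample_probe(probe, attempts=10, interval_s=30.0, sleep=time.sleep):
    """Run the probe repeatedly; return a list of True/False outcomes."""
    results = []
    for i in range(attempts):
        try:
            ok = bool(probe())
        except OSError:  # covers timeouts, resets, and refused connections
            ok = False
        results.append(ok)
        if i < attempts - 1:
            sleep(interval_s)
    return results


def failure_rate(results):
    """Fraction of failed attempts, or 0.0 if there were none."""
    return results.count(False) / len(results) if results else 0.0
```

Run from a tentacle host against the server's listening port, for example `sample_probe(tcp_probe("octopus.example", port))`, with the hostname and port replaced by your own. A non-zero failure rate while the server is idle would point toward the network or the tentacle side rather than server load.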

Hey @hardKOrr,

Not great news re: overview time of 15mins+ even after the upgrade.
I’ve received your new trace logs and will start digging.

Thanks,

Dane.

Hi @hardKOrr,

After more discussions with the team regarding this issue, I would like to request a Performance Trace. We have a recommended way to provide a performance trace: https://octopus.com/docs/support/record-a-performance-trace

If you could provide the performance trace file to the same support email address as you’ve provided the other logs, that would be perfect.

Regards,

Dane.

Additional performance logging has been sent. Please let me know if there is more information I can provide, thanks again for looking into this issue.

Received.

Thank you.

Hi @hardKOrr,

Thank you so much for all of the information you’ve provided and bearing with us while we investigated.

We’ve actually tracked down the parts of the code which are likely causing your issue. It seems that with the combination of machines, tag sets, tenants, etc. that you have, this portion of the code can slow down the summary page quite dramatically.

This will be going back to the engineers to hopefully architect a solution - but I can’t tell you exactly how long it will be until we see any headway with a solution.

As for the second issue with System.Net.Sockets.SocketException, there is a possibility that it is a symptom of, or related to, the above issue.

At this point, my next steps will be to raise a public issue for visibility and provide as much information as possible to the engineers who will be investigating.

I will touch base soon with a link to the public issue so you can follow along with when to expect the fix.

Regards,

Dane
