Database connection spike to 100+

Setup: Octopus HA connected to SQL Server on AWS RDS

Over the last two days I have seen our SQL Server hit 100+ connections, which has brought down a node. The server and service are still running, but the node shows as offline in the configuration. This has happened three times over the last two days and hadn’t been seen before.

I have temporarily resolved the issue by restarting the affected node, and just now I have bumped the max connections on each node to 200.

But I’m concerned that this is just a band-aid and there is a more serious underlying issue.

I have looked through the logs at the times the problem occurs, but don’t see anything obvious. I have also looked at what was being deployed at that time to see if there is anything in common, but don’t see anything.

Can I have some pointers on how to track down what is occurring?

Hi Jon,

Thanks for getting in touch! I can certainly see why that would be concerning. We’d like to dig a bit more into it with some additional information. Could you confirm:

Is the node that’s going down the SQL server node, or the Octopus node?
Which version of Octopus are you currently running?
Could you also send us through your server logs (C:\Octopus\Logs in standard installations)?
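
In the meantime, one way to narrow down what’s opening the connections is to take a snapshot of the sessions on the RDS instance when a spike happens, grouped by host and application. A rough sketch only; the server, database, and credentials below are placeholders you’d swap for your own:

```python
# Snapshot of who is holding SQL Server connections, grouped by host and application.
# Placeholders: server, database, and credentials need to match your RDS instance.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-rds-endpoint,1433;"
    "DATABASE=Octopus;"
    "UID=your-user;PWD=your-password"
)

query = """
SELECT host_name, program_name, COUNT(*) AS session_count
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY host_name, program_name
ORDER BY session_count DESC;
"""

# Print one line per client host/application, busiest first.
for host, program, count in conn.cursor().execute(query).fetchall():
    print(f"{count:4d}  {host or '?':<30}  {program or '?'}")
```

If one Octopus node (or something unexpected) is holding the bulk of the sessions during a spike, that will show up here straight away.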

I look forward to hearing back and getting to the bottom of this one!

Best regards,

Kenny

Hey Kenny,

The node that goes down is one of the Octopus nodes, but it’s not always the same node.

These are the spikes:

I have sent you the logs via Slack due to the size limitation and potentially sensitive data.

Thanks

Hi Jon,

Thank you kindly for keeping in touch and providing that extra information. I’ve received those log files via Slack, and I’m confident they’ve pointed us in the right direction. Fortunately, it looks like a previously known bug that was fixed in 2019.9.10 LTS (it looks like you’re currently on 2019.9.8 LTS). The bug involved the projects/all endpoint opening unnecessary connections to the database, and you can find additional info in the bug report below.

Would you be able and willing to apply the upgrade to at least 2019.9.10 (preferably the latest) and let me know how much improvement you see as a result? If you need a non-current version, you can find them all here.
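
While you schedule the upgrade, if you want finer-grained data than the RDS charts, a small poller along these lines will log the connection count every minute so any spike can be lined up against what was deploying at the time. This is only a sketch, and the connection details are placeholders:

```python
# Sketch: log the Octopus database's open user-session count once a minute so
# connection spikes can be lined up against deployment times.
# Placeholders: server, database, and credentials need to match your RDS instance.
import time
from datetime import datetime, timezone

import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-rds-endpoint,1433;"
    "DATABASE=Octopus;"
    "UID=your-user;PWD=your-password"
)

conn = pyodbc.connect(CONN_STR)
while True:
    count = conn.cursor().execute(
        "SELECT COUNT(*) FROM sys.dm_exec_sessions WHERE is_user_process = 1"
    ).fetchval()
    print(f"{datetime.now(timezone.utc).isoformat()} connections={count}")
    time.sleep(60)
```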

I hope this helps, and I look forward to hearing back!

Best regards,

Kenny

Hi Kenny,

Upgrading is possible, but it’s not fully automated yet, so it’s a bit of a pain. Maybe this gives me a chance to fully automate it.

Before I do, I want to be more confident that this is the problem. We have been running 2019.9.8 for several months and haven’t seen this issue before, though it is possible that the connections have spiked without ever hitting 100, as I don’t have any alarms on the connection count.
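
To close that gap, I’m looking at adding a CloudWatch alarm on the RDS DatabaseConnections metric, something along these lines (the region, DB instance identifier, SNS topic, and the 80-connection threshold below are placeholders, not our real values):

```python
# Sketch: CloudWatch alarm that fires when the RDS connection count climbs
# towards the limit, instead of finding out when a node falls over.
# Placeholders: region, DB instance identifier, SNS topic ARN, and threshold.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="octopus-rds-connection-spike",
    Namespace="AWS/RDS",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "your-octopus-rds-instance"}],
    Statistic="Maximum",
    Period=60,                      # evaluate every minute
    EvaluationPeriods=1,
    Threshold=80,                   # warn well before the ~100 mark where nodes drop
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],
)
```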

Is there any way to get better confidence that this is the cause? Is it common for this connection bug to only manifest some time after installation?

Connection metrics for the last two weeks:

We also have a second 2019.9.8 instance, which has never spiked either to my knowledge.

Thanks
Jon

Hi Jon,

Thanks for keeping in touch! This specific bug was reported by a user after they had been running their version for some time, though in their case they had a large number of projects, which seemed to be related to how many database connections were being opened.

How many projects do you have on this instance? Is there a big difference in the total number of projects compared to the second instance (the one that hasn’t spiked)? Do you see any performance issues when navigating through them in the web portal? If so, that would most likely point to this specific bug being the one biting you. Even if that’s not the case, I would still say there’s a good chance it’s correlated somehow, and upgrading would be the first thing to try when possible.
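
If it helps to get exact numbers, the project count is available from the REST API on each instance. A quick sketch; the server URLs and API keys are placeholders, and note this hits the same projects/all endpoint from the bug report, so run it once rather than on a schedule:

```python
# Compare project counts between the two Octopus instances via the REST API.
# Placeholders: server URLs and API keys.
import requests

INSTANCES = {
    "failing-instance": ("https://octopus-one.example.com", "API-XXXXXXXX"),
    "second-instance": ("https://octopus-two.example.com", "API-YYYYYYYY"),
}

for name, (url, api_key) in INSTANCES.items():
    resp = requests.get(
        f"{url}/api/projects/all",              # returns the full project list
        headers={"X-Octopus-ApiKey": api_key},
        timeout=60,
    )
    resp.raise_for_status()
    print(f"{name}: {len(resp.json())} projects")
```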

I look forward to hearing back!

Best regards,

Kenny

Hi Kenny,

You’re right. The instance that has started to fail has 1300-ish projects, whereas the second instance is much smaller, with fewer than 500.

As far as performance goes, the failing instance is much slower displaying the dashboard (but I guess that’s because it’s showing more projects?). Navigation after that, such as into a project or process, looks comparable.

Let’s try the upgrade and see.

Thanks for your support

Jon

Hi Jon,

You’re very welcome. Thanks for following up and confirming those details. I’m very interested to hear how much of an impact the upgrade provides! Don’t hesitate to reach out if you have any questions or concerns along the way. :slight_smile:

Best regards,

Kenny

Hi Kenny,

We have the CPU spike issue again.

AWS RDS CPU usage chart for today:

And a closer look at the spikes, in case you can correlate them with log actions:

I will send you the logs via Slack.

Thanks
Jon

P.S. We are now on 2019.12.1.
Each time the CPU hit 100%, I had to stop the Octopus services, watch the connections drop to 0, and then the CPU drop back down.

Hi Jon,

Thanks for following up and giving us a status update. I’m sorry to hear you’ve hit this once again! Unfortunately, I haven’t received these logs. Would you be willing to send them through via email to support@octopus.com? I’ll be able to grab them there and bring this up with my team so we can have a deep dive into what could be causing this issue for you.

Thanks and best regards,

Kenny

I mailed the logs over to support.

Interestingly, overnight, when our deployment numbers and usage are much lower, there was no issue:

Hi Jon,

Thanks for following up and letting me know. I can confirm I’ve received the email with the logs attached. I’ll start looking into those shortly, and we can continue the conversation over there. At first glance, the new usage graph you just sent seems to match up with the timeline of your previously sent screenshots. I’m interested to hear if today is the same story.

Best regards,

Kenny

Yeah, after 20:00, when the usage had fallen off to a few triggered deployments and a few deployments from our NZ team, the system stabilised again.

The system is used 24/7, but its main usage is 09:00-19:00 GMT, so I’m expecting it to happen a few times today between those times. Probably before lunch, if yesterday was anything to go by.

So far it hasn’t maxed out.

It peaked again, but this time corrected itself.

Hi Jon,

Thank you kindly for those helpful updates. I’m still looking into this one and I’ll let you know as soon as we find anything. :slight_smile:

Best regards,

Kenny