RDS Database has recently started filling faster than normal

We have an RDS database that supports our HA cluster.

Generally, CPU and network spikes correlate with our deployment times, and the storage space stays roughly unchanged.

Over the past couple of days, this suddenly changed. We are now seeing huge IO reads and writes, storage space running out, and sustained CPU load. The change is easy to see in the graphs, but I cannot pinpoint what might have caused it. From the metrics, it seems to have started around 2023-05-17 18:50 UTC.

I have an exported ZIP from the Octopus configuration page if needed.

I will also attach screenshots of the RDS metrics.

Thanks




Hi Paul,

Thanks for getting in touch! I think getting a copy of the ZIP is a great first step in diagnosing what could be going wrong here. I’ve authorized your email address to upload files to the following URL.

You mention that this change was sudden; do you know of any major changes to your configuration or setup? Has the Octopus Server been upgraded recently? (If so, which version was it upgraded from?)

Looking forward to hearing from you and getting to the bottom of this.

Best regards,
Daniel

No one on our team can see anything that has changed, apart from some changes to processes and other minor things. There have been no server upgrades or anything like that.

Over the weekend, all of these metrics dropped, and that was during the window when our EC2 HA node count was scaled down for the weekend. I'm curious whether there was some unknown DB optimization going on, or whether one of the EC2 instances that was terminated had a bug that was impacting the DB.

I uploaded the logs.

Hi Paul,

Just jumping in for Daniel here, as he's a member of our Australia-based team and is currently offline.

Would you be able to upload a copy of your Octopus Server logs from your HA nodes to that link as well, especially the logs covering the timeframe when you were seeing the high load?

Were you or your DBA team able to look at what was running on RDS (with sp_who or similar) at the time, to see if there were any long-running or high-frequency tasks?
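
For reference, something along these lines would show what's currently executing (just a sketch against SQL Server's standard DMVs; sp_who2 also works if you prefer the quick overview):

-- Currently executing requests, longest-running first, with their statement text
SELECT r.session_id,
       r.status,
       r.command,
       r.total_elapsed_time / 1000 AS elapsed_seconds,
       r.logical_reads,
       r.writes,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.total_elapsed_time DESC;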

Look forward to hearing from you soon!

I waited until Saturday to look further, and by then the ASG had already scaled down the nodes that were running during that time.

We might just have to close this since I don’t have more details.

Actually, I do have some logs. One node was left over from the old set.

There are some log files from those time periods that are much larger than normal, and I see some odd DB statements in them, but I don't know if they're concerning.
12800 of these:

2023-05-17 21:19:30.1141   3064     70  INFO  NonCloningRawFullTableCache GetTableInternal WhenTableHasChanged "succeeded" after 027ms.
2023-05-17 18:23:07.9373   3064    211  INFO  "Execute reader" took 561ms in transaction "GetReleasesForProject.GetAll|80006f97-0001-fc00-b63f-84710c7967bb|T599": "SELECT *
FROM (
    SELECT Id,Version,Assembled,ProjectId,ProjectVariableSetSnapshotId,ProjectDeploymentProcessSnapshotId,ChannelId,DataVersion,SpaceId,JSON,
    ROW_NUMBER() OVER (ORDER BY [Assembled] DESC) AS RowNum
    FROM [dbo].[Release]
    WHERE (((([SpaceId] in ('Spaces-1')))))
    AND ([ProjectId] = @projectid)
) ALIAS_GENERATED_1
WHERE ([RowNum] >= @_minrow)
AND ([RowNum] <= @_maxrow)
ORDER BY [RowNum]"

In isolation, those aren't too worrying, but if you would like to send over the server log from that node, we can take a broader view of the surrounding activity and see if there are any smoking guns.

Feel free to use the upload link Daniel provided above.
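
If it's easy to grab, a rough per-table size breakdown from the RDS side would also help show where the extra storage is going. Something like the sketch below (against SQL Server's partition stats DMV; adjust as needed) is usually enough to spot a runaway table:

-- Approximate reserved space and row counts per table, largest first
SELECT OBJECT_NAME(object_id) AS table_name,
       SUM(reserved_page_count) * 8 / 1024 AS reserved_mb,
       SUM(CASE WHEN index_id IN (0, 1) THEN row_count ELSE 0 END) AS row_count
FROM sys.dm_db_partition_stats
GROUP BY object_id
ORDER BY reserved_mb DESC;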

The logs are uploaded.

Thanks Paul - we’ll take a look at them, and see if anything stands out.
