Lost SQL connection during deployment

Code version: 2021.3 (Build 12132)

Last night during a scheduled deployment, Octopus seems to have lost SQL connection.

Invalid operation. The connection is closed. System.InvalidOperationException at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ValidateConnectionForExecute(SqlCommand command) at Microsoft.Data.SqlClient.SqlInternalTransaction.Rollback() at Microsoft.Data.SqlClient.SqlTransaction.Dispose(Boolean disposing) at Nevermore.Advanced.ReadTransaction.Dispose() at Octopus.Core.Persistence.Database.QueryExecutorProvider.Dispose() in QueryExecutorProvider.cs:line 138 at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWork.DisposeOurDatabaseTransactionSafely() in UnitOfWork.cs:line 215 at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWork.Dispose() in UnitOfWork.cs:line 229 at Autofac.Core.Disposer.Dispose(Boolean disposing) at Autofac.Util.Disposable.Dispose() at Autofac.Core.Lifetime.LifetimeScope.Dispose(Boolean disposing) at Autofac.Util.Disposable.Dispose() at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.Execute[T](Func`3 action, CancellationToken cancellationToken, String name) in UnitOfWorkExecutor.cs:line 90 at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.PersistPlan(DeploymentPlan plan, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 76 at Octopus.Server.Orchestration.ServerTasks.Deploy.ExecutionTaskController`3.ExecuteBase(ITaskLog taskLogRoot, CancellationToken cancellationToken) in ExecutionTaskController.cs:line 99 at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.Execute(ITaskLog taskLog, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 56 at Octopus.Server.Orchestration.ServerTasks.RunningTask.<>c__DisplayClass31_1.<WorkerTask in RunningTask.cs:line 171 at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func`1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75 at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func`1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75 at Octopus.Server.Orchestration.ServerTasks.RunningTask.WorkerTask(CancellationToken cancellationToken) in RunningTask.cs:line 204

Our DBA noticed one particular query ran an extremely long time and we are wondering if these events were related or possibly a bug?

/*
Missing Index Details from SQLQuery21.sql - DATABASE.Octopus (DOMAIN\USER (225))
The Query Processor estimates that implementing the following index could improve the query cost by 66.9368%.
*/

/*
USE [Octopus]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Release] ([ProjectId],[SpaceId])
INCLUDE ([Version],[Assembled],[ProjectVariableSetSnapshotId],[ProjectDeploymentProcessSnapshotId],[JSON],[ChannelId],[DataVersion])
GO
*/

/*
Missing Index Details from CensusByArea.sql - (local).INCDataVault (DOMAIN\USER (53))
The Query Processor estimates that implementing the following index could improve the query cost by 51.6692%.
*/

/*
USE [Octopus]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Release] ([SpaceId])
INCLUDE ([Version],[Assembled],[ProjectId],[ProjectVariableSetSnapshotId],[ProjectDeploymentProcessSnapshotId],[JSON],[ChannelId])
GO
*/

Hi Todd,

Thanks for getting in touch, and I’m sorry to see you’re hitting this unexpected issue during a scheduled deployment. My guess currently as well is these are related. Would you be able to send us the following information to help us fix this up as quickly as possible?

  • Resulting task log when running a system integrity check (you can run this in the UI under Configuration > Diagnostics). My guess is there will be some missing indexes in the SQL db reported there, and I’ll work out a SQL script to get those re-added if so.
  • Your server log covering the time period that this issue occurred (default location is C:\Octopus\Logs on your Octopus server)

I look forward to hearing back and getting to the bottom of this!

Best regards,

Kenny

Kenny,

What’s the best way to send these logs over?

Hi Todd,

My apologies, I should have mentioned that initially. You can email those to support@octopus.com, or upload them at this secure link.

Best regards,

Kenny

1 Like

Logs have been uploaded. The sql error happened a couple of times during that deployment but here is a timestamp for one of them

20:07:37 Fatal | Invalid operation. The connection is closed.

Hi Todd,

Thank you very much for sending those through. Just wanted to quickly update to let you know I have been having a thorough look through the info sent, and I will let you know immediately once I have anything to pass along. :slight_smile:

I appreciate your patience.

Best regards,

Kenny

1 Like

Hi Todd,

Thanks again for your patience as I was looking into this. To summarize, the system integrity check didn’t report any missing indexes, but rather unexpected ones which look to be the ones implemented from the suggestion from the query processor (from your original message). I doubt those would be causing an issue, but rather this error in your server log I think illuminates the root cause that lines up with the timestamp:

An error occurred while trying to trying fetch the number of active SQL transactions: "Error while executing SQL command in transaction 'SqlTransactionMetricsProducer.UpdateMetrics': The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.
The command being executed was:
SELECT COUNT(*) FROM sys.dm_tran_active_transactions"

This suggests the db doesn’t have trustworthy enabled, so the next step would be to have your DBA enable that and see if the next scheduled deployment succeeds (which I think it will). This is done by running the following SQL (reference Microsoft docs, and before any changes directly in the database it would be worth taking a backup first).

ALTER DATABASE [db] SET TRUSTWORTHY ON;

Could you give that a go and let me know if it resolves this issue?

Best regards,

Kenny

Kenny,

One of my teammates was recently advised to ignore that exact same issue that you pointed out. It’s been in our logs for weeks or more and last night was the first time we had seen any such SQL errors during a deployment.

Hi Todd,

Thank you kindly for your update and further context. I found that conversation so I’m looking through it now, and will likely raise this internally for more eyes. I’ll let you know how I go.

Best regards,

Kenny

Hi Todd,

I read through your teammate’s email chain with us, and brought this up internally. While the error messages were a bit different between then and now, e.g.

VIEW SERVER STATE permission was denied on object 'server', database 'master'. The user does not have permission to perform this action.

and

The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.

the conclusion is this too should be safe to ignore since it’s in the same packet as the sys.dm_tran_active_transactions query. I’ll need to change how we approach this issue.

At the moment I have a sneaking suspicion that you may be hitting a variation of this open bug (this is specific to health checks, however after reading various reports, a couple seem to possibly match your report).

I’m hoping there might be some more details around this original reported error. Would you be happy to upload the task log from that scheduled deployment?

I appreciate your patience, and I look forward to hearing back!

Best regards,

Kenny

1 Like

One more bit of new data. So far this error has surfaced at least once during the last 3 nights. Same project, but also interestingly enough, it’s always the same Octopus node. It happens to be the node that is farthest away from the SQL DB. Another detail that might help, the scheduled deployment project is mainly orchestration that kicks off Deploy A Release steps. I’ll add a couple of the tasks logs.

Hi Todd,

Thanks for following up and for that extra info you’ve noticed. I look forward to having a look at the task logs and hopefully we can get closer to the resolution. :slight_smile:

Best regards,

Kenny

Any ideas about this? We’re still seeing this failure almost every night.

Hi @tfbryan,

Sorry for the delay on this, I am just stepping in while Kenny is offline as part of the AU based team.

I apologize I don’t see the most recent logs in our portal, they might have been automatically purged after a week. Would it be possible to re-upload them here?

I am also wondering if this is an Azure-hosted database? In the past, we’ve seen a default timeout by Azure for longer-running tasks. I’m wondering if this is something you could be running into.

Looking forward to hearing back.

Regards,
Garrett

I re uploaded the logs. Let me know if you need anything else. The DB is not Azure. It’s hosted in our Primary DC and this issue seems to only be occurring on tasks running from the Octo node in our secondary DC.

Hi @tfbryan

Thanks for sending those logs over.
I am not seeing anything particularly telling in the task logs, but we are still reviewing them.

I am wondering if there are any suspicious errors in Event Viewer on that secondary DC that might correspond with the times that these tasks are failing?

It might also be worth us getting your Octopus Server logs as well just to be thorough.

Thanks again,
Garrett

Sorry, I meant to include that server log the first time. It’s been added, an no I don’t see anything going on in the Windows event logs during the time of error.

Hi @tfbryan,

I appreciate the logs. While looking through them, I noticed you seem to be running into this bug here. I suspect this error is probably not causing the issues that started this thread but it might be best to eliminate it and free up resources regardless, the workaround is: Grant the Octopus Deploy server database user the VIEW SERVER STATE database permission.

If possible, could you give that a try and see if it eliminates the error spamming we’re seeing in the server logs?

Additionally, I did see in one of the task logs that we had this error here:

0:07:29 Error | Error while executing SQL command in transaction ‘DeploymentTaskController|ServerTasks-1119407-b8df269f-4ecc-4cc6-bbb4-d66d32c7e238|T263’: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)

This makes me curious to know if you have some kind of load balancer or proxy sitting in between the nodes that might be dropping the connections mid-stream? If so, would it be possible to check with your network team for any errors on that device?

Looking forward to hearing back.

Regards,
Garrett

That’s an interesting find. We have a load-balanced name for the https interface and it currently points directly at the primary data center. The SQL backend is using a MSSQL Availability Group.

I’ll speak to the DBA’s about that VIEW SERVER STATE grant.

1 Like

DBA’s won’t grant permission outside of the Octopus DB. This is the same permission issue that was discussed above. You said this is a bug, is it under development?