Last night during a scheduled deployment, Octopus seems to have lost its SQL connection.
Invalid operation. The connection is closed.
System.InvalidOperationException
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ValidateConnectionForExecute(SqlCommand command)
   at Microsoft.Data.SqlClient.SqlInternalTransaction.Rollback()
   at Microsoft.Data.SqlClient.SqlTransaction.Dispose(Boolean disposing)
   at Nevermore.Advanced.ReadTransaction.Dispose()
   at Octopus.Core.Persistence.Database.QueryExecutorProvider.Dispose() in QueryExecutorProvider.cs:line 138
   at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWork.DisposeOurDatabaseTransactionSafely() in UnitOfWork.cs:line 215
   at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWork.Dispose() in UnitOfWork.cs:line 229
   at Autofac.Core.Disposer.Dispose(Boolean disposing)
   at Autofac.Util.Disposable.Dispose()
   at Autofac.Core.Lifetime.LifetimeScope.Dispose(Boolean disposing)
   at Autofac.Util.Disposable.Dispose()
   at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.Execute[T](Func`3 action, CancellationToken cancellationToken, String name) in UnitOfWorkExecutor.cs:line 90
   at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.PersistPlan(DeploymentPlan plan, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 76
   at Octopus.Server.Orchestration.ServerTasks.Deploy.ExecutionTaskController`3.ExecuteBase(ITaskLog taskLogRoot, CancellationToken cancellationToken) in ExecutionTaskController.cs:line 99
   at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.Execute(ITaskLog taskLog, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 56
   at Octopus.Server.Orchestration.ServerTasks.RunningTask.<>c__DisplayClass31_1.<WorkerTask in RunningTask.cs:line 171
   at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func`1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75
   at Octopus.Server.Orchestration.ServerTasks.RunningTask.WorkerTask(CancellationToken cancellationToken) in RunningTask.cs:line 204
Our DBA noticed one particular query ran for an extremely long time, and we are wondering if these events are related, or possibly a bug?
Missing Index Details from SQLQuery21.sql - DATABASE.Octopus (DOMAIN\USER (225))
The Query Processor estimates that implementing the following index could improve the query cost by 66.9368%.
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Release] ([ProjectId],[SpaceId])
Missing Index Details from CensusByArea.sql - (local).INCDataVault (DOMAIN\USER (53))
The Query Processor estimates that implementing the following index could improve the query cost by 51.6692%.
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Release] ([SpaceId])
Thanks for getting in touch, and I’m sorry to see you’re hitting this unexpected issue during a scheduled deployment. My current guess is that these events are related as well. Would you be able to send us the following information to help us get this fixed up as quickly as possible?
The resulting task log from running a system integrity check (you can run this in the UI under Configuration > Diagnostics). My guess is that some missing indexes in the SQL database will be reported there, and if so, I’ll work out a SQL script to get them re-added.
Your server logs covering the time period when this issue occurred (the default location is C:\Octopus\Logs on your Octopus Server)
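For reference, if the integrity check does report missing indexes, re-adding them would typically look like the suggestions SSMS produced in your original message, with a concrete name filled in. A hypothetical sketch only (the index name below is a placeholder I made up, not a confirmed fix):

```sql
-- Hypothetical example: the name and key columns mirror the Query
-- Processor suggestion from the original message, not a verified fix.
CREATE NONCLUSTERED INDEX [IX_Release_ProjectId_SpaceId]
    ON [dbo].[Release] ([ProjectId], [SpaceId]);

-- To see which indexes currently exist on the table for comparison:
SELECT i.name, i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.Release');
```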
I look forward to hearing back and getting to the bottom of this!
Thank you very much for sending those through. Just a quick update to let you know I have been having a thorough look through the information you sent, and I’ll let you know immediately once I have anything to pass along.
Thanks again for your patience while I looked into this. To summarize: the system integrity check didn’t report any missing indexes, but it did report unexpected ones, which look to be the indexes created from the Query Processor suggestions in your original message. I doubt those are causing an issue; rather, I think this error in your server log, which lines up with the timestamp, illuminates the root cause:
An error occurred while trying to trying fetch the number of active SQL transactions: "Error while executing SQL command in transaction 'SqlTransactionMetricsProducer.UpdateMetrics': The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.
The command being executed was:
SELECT COUNT(*) FROM sys.dm_tran_active_transactions"
This suggests the database doesn’t have the TRUSTWORTHY property enabled, so the next step would be to have your DBA enable it and see whether the next scheduled deployment succeeds (which I think it will). This is done by running the following SQL, replacing [db] with your Octopus database name (see the Microsoft docs for reference; before making any changes directly in the database, it would be worth taking a backup first).
ALTER DATABASE [db] SET TRUSTWORTHY ON;
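If it helps, your DBA can confirm the current setting first before changing anything. A quick sketch (the database name in the filter is a placeholder; substitute your actual Octopus database name):

```sql
-- Check whether TRUSTWORTHY is currently enabled (0 = off, 1 = on).
-- 'Octopus' below is a placeholder for your Octopus database name.
SELECT name, is_trustworthy_on
FROM sys.databases
WHERE name = N'Octopus';
```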
Could you give that a go and let me know if it resolves this issue?
One of my teammates was recently advised to ignore the exact same issue you pointed out. It’s been in our logs for weeks or more, and last night was the first time we had seen any such SQL errors during a deployment.
I read through your teammate’s email chain with us, and brought this up internally. While the error messages were a bit different between then and now, e.g.
VIEW SERVER STATE permission was denied on object 'server', database 'master'. The user does not have permission to perform this action.
The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.
the conclusion is that this too should be safe to ignore, since it comes from the same sys.dm_tran_active_transactions query. Given that, I’ll need to change how we approach this issue.
At the moment I have a sneaking suspicion that you may be hitting a variation of this open bug (the bug is specific to health checks; however, after reading through the various reports, a couple seem to possibly match yours).
I’m hoping there might be some more details around the originally reported error. Would you be happy to upload the task log from that scheduled deployment?
I appreciate your patience, and I look forward to hearing back!
One more bit of new data: so far, this error has surfaced at least once during each of the last 3 nights. It’s the same project each time, and interestingly enough, it’s always the same Octopus node, which happens to be the node farthest from the SQL DB. Another detail that might help: the scheduled deployment project is mainly orchestration that kicks off Deploy a Release steps. I’ll add a couple of the task logs.
I re-uploaded the logs. Let me know if you need anything else. The DB is not Azure; it’s hosted in our primary DC, and this issue only seems to occur on tasks running from the Octopus node in our secondary DC.
I appreciate the logs. While looking through them, I noticed you seem to be running into this bug here. I suspect this error is probably not causing the issues that started this thread, but it might be best to eliminate it and free up resources regardless. The workaround is: grant the Octopus Deploy server database user the VIEW SERVER STATE permission.
If possible, could you give that a try and see if it eliminates the error spamming we’re seeing in the server logs?
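In case it’s useful, VIEW SERVER STATE is a server-level permission, so the grant is run against the master database for the login Octopus Server uses to connect to SQL Server. A sketch (the login name below is a placeholder; substitute your actual Octopus service login):

```sql
-- Run in the master database. Replace the login with the one your
-- Octopus Server instance uses to connect to SQL Server.
USE [master];
GRANT VIEW SERVER STATE TO [DOMAIN\OctopusService];
```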
Additionally, I did see in one of the task logs that we had this error here:
0:07:29 Error | Error while executing SQL command in transaction 'DeploymentTaskController|ServerTasks-1119407-b8df269f-4ecc-4cc6-bbb4-d66d32c7e238|T263': A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)
This makes me curious to know whether you have some kind of load balancer or proxy sitting between the nodes and the SQL server that might be dropping the connections mid-stream. If so, would it be possible to check with your network team for any errors on that device?