Last night during a scheduled deployment, Octopus seems to have lost its SQL connection.
Invalid operation. The connection is closed.
System.InvalidOperationException
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ValidateConnectionForExecute(SqlCommand command)
   at Microsoft.Data.SqlClient.SqlInternalTransaction.Rollback()
   at Microsoft.Data.SqlClient.SqlTransaction.Dispose(Boolean disposing)
   at Nevermore.Advanced.ReadTransaction.Dispose()
   at Octopus.Core.Persistence.Database.QueryExecutorProvider.Dispose() in QueryExecutorProvider.cs:line 138
   at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWork.DisposeOurDatabaseTransactionSafely() in UnitOfWork.cs:line 215
   at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWork.Dispose() in UnitOfWork.cs:line 229
   at Autofac.Core.Disposer.Dispose(Boolean disposing)
   at Autofac.Util.Disposable.Dispose()
   at Autofac.Core.Lifetime.LifetimeScope.Dispose(Boolean disposing)
   at Autofac.Util.Disposable.Dispose()
   at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.Execute[T](Func`3 action, CancellationToken cancellationToken, String name) in UnitOfWorkExecutor.cs:line 90
   at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.PersistPlan(DeploymentPlan plan, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 76
   at Octopus.Server.Orchestration.ServerTasks.Deploy.ExecutionTaskController`3.ExecuteBase(ITaskLog taskLogRoot, CancellationToken cancellationToken) in ExecutionTaskController.cs:line 99
   at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.Execute(ITaskLog taskLog, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 56
   at Octopus.Server.Orchestration.ServerTasks.RunningTask.<>c__DisplayClass31_1.<WorkerTask in RunningTask.cs:line 171
   at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func`1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75
   at Octopus.Server.Orchestration.ServerTasks.RunningTask.WorkerTask(CancellationToken cancellationToken) in RunningTask.cs:line 204
Our DBA noticed one particular query ran for an extremely long time, and we are wondering if these events are related, or possibly a bug?
Missing Index Details from SQLQuery21.sql - DATABASE.Octopus (DOMAIN\USER (225))
The Query Processor estimates that implementing the following index could improve the query cost by 66.9368%.
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Release] ([ProjectId],[SpaceId])
Missing Index Details from CensusByArea.sql - (local).INCDataVault (DOMAIN\USER (53))
The Query Processor estimates that implementing the following index could improve the query cost by 51.6692%.
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Release] ([SpaceId])
Thanks for getting in touch, and I’m sorry to see you’re hitting this unexpected issue during a scheduled deployment. My current guess is that these events are related as well. Would you be able to send us the following information to help us get this fixed up as quickly as possible?
The resulting task log from running a system integrity check (you can run this in the UI under Configuration > Diagnostics). My guess is that some missing indexes in the SQL database will be reported there, and if so, I’ll work out a SQL script to get them re-added.
Your server logs covering the time period when this issue occurred (the default location is C:\Octopus\Logs on your Octopus Server)
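For reference, if the integrity check does report missing indexes, re-adding them would typically look like the suggestions SSMS produced in your original message, with a concrete name filled in. A hypothetical sketch only (the index name below is a placeholder I made up, not a confirmed fix):

```sql
-- Hypothetical example: the name and key columns mirror the Query
-- Processor suggestion from the original message, not a verified fix.
CREATE NONCLUSTERED INDEX [IX_Release_ProjectId_SpaceId]
    ON [dbo].[Release] ([ProjectId], [SpaceId]);

-- To see which indexes currently exist on the table for comparison:
SELECT i.name, i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.Release');
```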
I look forward to hearing back and getting to the bottom of this!
Thank you very much for sending those through. Just a quick update to let you know I have been having a thorough look through the information you sent, and I’ll let you know immediately once I have anything to pass along.
Thanks again for your patience while I looked into this. To summarize: the system integrity check didn’t report any missing indexes, but it did report unexpected ones, which look to be the indexes created from the Query Processor suggestions in your original message. I doubt those are causing an issue; rather, I think this error in your server log, which lines up with the timestamp, illuminates the root cause:
An error occurred while trying to trying fetch the number of active SQL transactions: "Error while executing SQL command in transaction 'SqlTransactionMetricsProducer.UpdateMetrics': The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.
The command being executed was:
SELECT COUNT(*) FROM sys.dm_tran_active_transactions"
This suggests the database doesn’t have the TRUSTWORTHY property enabled, so the next step would be to have your DBA enable it and see whether the next scheduled deployment succeeds (which I think it will). This is done by running the following SQL, replacing [db] with your Octopus database name (see the Microsoft docs for reference; before making any changes directly in the database, it would be worth taking a backup first).
ALTER DATABASE [db] SET TRUSTWORTHY ON;
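If it helps, your DBA can confirm the current setting first before changing anything. A quick sketch (the database name in the filter is a placeholder; substitute your actual Octopus database name):

```sql
-- Check whether TRUSTWORTHY is currently enabled (0 = off, 1 = on).
-- 'Octopus' below is a placeholder for your Octopus database name.
SELECT name, is_trustworthy_on
FROM sys.databases
WHERE name = N'Octopus';
```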
Could you give that a go and let me know if it resolves this issue?
One of my teammates was recently advised to ignore the exact same issue you pointed out. It’s been in our logs for weeks or more, and last night was the first time we had seen any such SQL errors during a deployment.
I read through your teammate’s email chain with us, and brought this up internally. While the error messages were a bit different between then and now, e.g.
VIEW SERVER STATE permission was denied on object 'server', database 'master'. The user does not have permission to perform this action.
The module being executed is not trusted. Either the owner of the database of the module needs to be granted authenticate permission, or the module needs to be digitally signed.
the conclusion is that this too should be safe to ignore, since it comes from the same sys.dm_tran_active_transactions query. Given that, I’ll need to change how we approach this issue.
At the moment I have a sneaking suspicion that you may be hitting a variation of this open bug (the bug is specific to health checks; however, after reading through the various reports, a couple seem to possibly match yours).
I’m hoping there might be some more details around the originally reported error. Would you be happy to upload the task log from that scheduled deployment?
I appreciate your patience, and I look forward to hearing back!
One more bit of new data: so far, this error has surfaced at least once during each of the last 3 nights. It’s the same project each time, and interestingly enough, it’s always the same Octopus node, which happens to be the node farthest from the SQL DB. Another detail that might help: the scheduled deployment project is mainly orchestration that kicks off Deploy a Release steps. I’ll add a couple of the task logs.
I re-uploaded the logs. Let me know if you need anything else. The DB is not Azure; it’s hosted in our primary DC, and this issue only seems to occur on tasks running from the Octopus node in our secondary DC.
I appreciate the logs. While looking through them, I noticed you seem to be running into this bug here. I suspect this error is probably not causing the issues that started this thread, but it might be best to eliminate it and free up resources regardless. The workaround is: grant the Octopus Deploy server database user the VIEW SERVER STATE permission.
If possible, could you give that a try and see if it eliminates the error spamming we’re seeing in the server logs?
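In case it’s useful, VIEW SERVER STATE is a server-level permission, so the grant is run against the master database for the login Octopus Server uses to connect to SQL Server. A sketch (the login name below is a placeholder; substitute your actual Octopus service login):

```sql
-- Run in the master database. Replace the login with the one your
-- Octopus Server instance uses to connect to SQL Server.
USE [master];
GRANT VIEW SERVER STATE TO [DOMAIN\OctopusService];
```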
Additionally, I did see in one of the task logs that we had this error here:
0:07:29 Error | Error while executing SQL command in transaction 'DeploymentTaskController|ServerTasks-1119407-b8df269f-4ecc-4cc6-bbb4-d66d32c7e238|T263': A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)
This makes me curious to know whether you have some kind of load balancer or proxy sitting between the nodes and the SQL server that might be dropping the connections mid-stream. If so, would it be possible to check with your network team for any errors on that device?