Error while executing SQL command in transaction 'DeploymentTaskController

jacobss · 12 September 2022 11:07

Hi we got the following error during a couple deploys.
Can’t find any errors on the SQL Server.

We currently run 2021.3 (Build 12132).

Kind regards
Jacob

Error while executing SQL command in transaction ‘DeploymentTaskController|ServerTasks-1095534-038151d9-ace9-43d4-b0ea-97eee064a9de|T374’: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.
The command being executed was:
DELETE FROM [dbo].[DeploymentRelatedMachine]
WHERE ([DeploymentId] = @deploymentid)
System.Exception
at Nevermore.CommandExecutor.ExecuteNonQuery()
at Nevermore.Advanced.ReadTransaction.ExecuteNonQuery(PreparedCommand preparedCommand)
at Nevermore.Querying.DeleteQueryBuilder1.Delete(DeleteOptions options) at Octopus.Core.RelationalStorage.RelatedDocumentStore.PopulateDeploymentRelatedMachines(IWriteQueryExecutor transaction, Deployment deployment) in RelatedDocumentStore.cs:line 45 at Nevermore.Advanced.WriteTransaction.UpdateAsync[TDocument](TDocument document, UpdateOptions options, CancellationToken cancellationToken) at Octopus.Core.RelationalStorage.RawRelationalTransaction.UpdateAsync[TKey](IId1 document, CancellationToken cancellationToken, UpdateOptions options) in RawRelationalTransaction.cs:line 247
at Octopus.Server.Web.Infrastructure.OctopusQueryExecutor.UpdateAsync[TKey](IId1 instance, CancellationToken cancellationToken, UpdateOptions options) in OctopusQueryExecutor.cs:line 425 at Octopus.Core.Persistence.Database.ProjectPathDecorators.ProjectPathQueryExecutorDecorator.UpdateAsync[TKey](IId1 document, CancellationToken cancellationToken, UpdateOptions options) in ProjectPathQueryExecutorDecorator.cs:line 259
at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentPlanService.SaveStateContainer(Deployment execution, CancellationToken cancellationToken) in DeploymentPlanService.cs:line 115
at Octopus.Server.Orchestration.ServerTasks.Deploy.ExecutionPlanService3.Persist(TExecutionPlan plan, CancellationToken cancellationToken) in ExecutionPlanService.cs:line 61 at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.<>c__DisplayClass6_0.<PersistPlan in DeploymentTaskController.cs:line 74 at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.<>c__DisplayClass3_01.<Execute in UnitOfWorkExecutor.cs:line 87
at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75 at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75
at Octopus.Server.Infrastructure.Orchestration.UnitsOfWork.UnitOfWorkExecutor.Execute[T](Func3 action, CancellationToken cancellationToken, String name) in UnitOfWorkExecutor.cs:line 90 at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.PersistPlan(DeploymentPlan plan, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 76 at Octopus.Server.Orchestration.ServerTasks.Deploy.ExecutionTaskController3.ExecuteBase(ITaskLog taskLogRoot, CancellationToken cancellationToken) in ExecutionTaskController.cs:line 99
at Octopus.Server.Orchestration.ServerTasks.Deploy.DeploymentTaskController.Execute(ITaskLog taskLog, CancellationToken cancellationToken) in DeploymentTaskController.cs:line 56
at Octopus.Server.Orchestration.ServerTasks.RunningTask.<>c__DisplayClass31_1.<WorkerTask in RunningTask.cs:line 171
at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75 at Octopus.Core.Infrastructure.UnitsOfWork.UnitOfWorkExtensionMethods.DoAsync(IUnitOfWork unitOfWork, Func1 action, CancellationToken cancellationToken, String name) in UnitOfWorkExtensionMethods.cs:line 75
at Octopus.Server.Orchestration.ServerTasks.RunningTask.WorkerTask(CancellationToken cancellationToken) in RunningTask.cs:line 204

–Inner Exception–
SQL Error -2 - Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.
Microsoft.Data.SqlClient.SqlException
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction) at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted) at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest) at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)
at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource1 completion, Boolean sendToPipe, Int32 timeout, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String methodName) at Microsoft.Data.SqlClient.SqlCommand.ExecuteNonQuery() at Nevermore.Transient.DbCommandExtensions.<>c__DisplayClass0_0.<ExecuteNonQueryWithRetry>b__0() at Nevermore.Transient.RetryPolicy.ExecuteAction[TResult](Func1 func)
at Nevermore.CommandExecutor.ExecuteNonQuery()

–Inner Exception–
The wait operation timed out.
System.ComponentModel.Win32Exception

jacobss · 12 September 2022 11:13

Ran Check System Integrity, everything passed ok.

clare.martin · 12 September 2022 11:55

Good afternoon @jacobss,

Thank you for contacting Octopus Support and sorry to hear you are seeing some SQL timeout errors on some of your deployments.

If you re-run those deployments do they run successfully or do they fail with the same timeout errors?
Is it multiple projects you are seeing these errors or just one or two?

Are you able to send over your Octopus Server logs please so we can take a look for you. I have created you a secure link to send the log file over here.

Let me know if re-running those tasks works and when you have uploaded the logs so we can take a look.

Kind Regards,

Clare

jacobss · 12 September 2022 12:03

Hi Clare martin
thank you for your swift reply.

Re-run resulted in a fail, waited for an hour and then it ran ok when we tried re-run again.
We will assemble the log files and upload them to you.

clare.martin · 12 September 2022 12:11

Hey @jacobss,

I can have a look at your logs when you send them over, was it just that project that was having the issue, how many deployments were running during that time?

I only ask this because those are both SQL timeouts and if you re-ran the job later on and it worked that suggests there was possibly a lot of load on the SQL server at that time, which usually is what results in these timeout issues.

I would hazard a guess the logs will just show those SQL timeout issues, I know you said there was no SQL errors on the SQL DB but are you able to look at the performance and event viewer on the SQL server at the time of those deployments and see if they server metrics were overloaded or Event viewer shows any timeouts?

Is this still happening randomly or has it now stopped?

Kind Regards,

Clare

jacobss · 12 September 2022 13:07

Thank you. The logs have been uploaded.

We haven’t seen these kind of errors before.
First time was friday last week and now again this morning.
It was not just that project that got the error. Different projects every time.

I had a look at our sql-server but couldnt find any errors in the errorlog.
index defrag was run this weekend (weekly schedule) on the Octopus database.
What I could see on the monitoring tool for sql-server was that during one of todays error there was a long running select (1 minute 40 sec) which was blocking deletes during that time.
The sql server has other databases but the CPU/Memory/disk performance has been normal.

clare.martin · 12 September 2022 14:48

Hey @jacobss,

Thank you for the logs and the SQL server information.

I have taken a good look at the logs and unfortunately nothing pops out at me as to why this is happening so I am going to get this in front of our engineers to see if they can spot anything I have missed. I can see the queries that are timing out are taking ages to perform (hence the timeout).

There are also a few errors in the logs stating - The resource 'ServerTasks-1094397' was not found - but that should not create any SQL timeouts.

One thing they would ideally like to know is what was the long running select query that was blocking the deletes? Are you able to share it here, if it has any sensitive information please feel free to pop it in a notepad doc and upload it to the same secure link I put in the forum post earlier on.

Also, the three different server logs, are they from 3 different nodes (so you have a HA setup?)

I look forward to hearing from you,

Kind Regards,

Clare

clare.martin · 13 September 2022 07:08

Good morning @jacobss,

I have an update from our engineers that I wanted to share with you, they have a few questions for you to answer if you don’t mind so they can continue their investigations:

The engineer has confirmed my suspicions in that you are running 3 nodes presumably in an HA setup.
(Are we able to confirm this is indeed correct).

They have noticed the HTTP POST request to {spaceId}/deployments timed out consistently for you between 1414 and 1426 across the three nodes. The timeout was reported to be because of a query timing out: DELETE FROM [dbo].[DeploymentRelatedMachine] WHERE ([DeploymentId] = @deploymentid)

The engineer would like to know:

How many machines were you trying to deploy as part of the deployment that failed? Has this increased greatly from previous deployments?
What is your SQL Max Connection Pool Size is set to? If we can double it and monitor for repeated errors that would be great.

Lastly I did notice you are on 2021.3.12132, the 2021.3 builds had quite a few performance issues that we ironed out in later releases, are there any plans for you to upgrade at any point as the 2021.3 builds are no longer supported by us.

I look forward to hearing from you,

Kind Regards,

Clare

jacobss · 13 September 2022 07:51

Good morning

Yes, we are running 3 nodes in a HA setup.

1 process deploying to 1 target
1 process deploying to 2 target
1 process deploying to 4 targets

The sql server has unlimited number of concurrent connections.
There is no parameter for max connection pool size in the connections string on the server nodes.

It looks like the deploy went ok, but failed (with the error message above) after the apply retention policy step. During that problem, we couldn’t start new deploys.

We are a bit hesitant to upgrade due to the Configuration as code option in Features disappeared in the latest version. We have it currently disabled in production but in the latest version we have in our test environment (2022.1.2278) there is no possibility to disable it.

jacobss · 13 September 2022 08:08

I found the octopus error and sql-monitoring tool blocking from last friday. I’ve uploaded screenshot of the error to the fileshare. Unfortunately I dont have the exact query with parameters that was run.

clare.martin · 13 September 2022 08:10

Hey @jacobss,

Thank you for those answers I will get those sent to the engineers.

As for the upgrade, you are correct in that the Config as Code feature cannot be disabled in the new versions of Octopus. That does not mean you have to use it though, as long as you don’t configure a project to use it your instance will just run as normal. You have to actually convert a project over to Config as Code to enable that feature, projects do not get migrated automatically.

Initially we had a toggle button to turn it on and off because it was in development so we wanted our users to have to physically flick a switch to get it turned on, similar to developer mode on a lot of software. Now we have ironed out a lot of issues we no longer need the toggle button but users still have to go through the process of converting a project to Config as Code.

I hope that helps alleviate any concerns you may have surrounding that feature, if you upgrade you will have the benefits of all of our bug fixes and performance enhancements and also support in case you run into any bugs, but you will not have to use Config as Code if you do not want to.

Kind Regards,

Clare

jacobss · 19 September 2022 08:26

Hi Clare. Any news from the engineers about the errors we got?

The errors happened (friday at around 16.00 and monday at around 12.00) at the same time as the “Apply retention policies” system task was running (duration of about 20 minutes and runs every 4 hours)

dan_close · 19 September 2022 13:28

Hi @jacobss,

I’m just stepping in for Clare while she’s off for the day. Thanks for sending the info regarding the errors occurring with the retention policy task.

I don’t see any recent updates from our engineering team so I will reach out to them again today. We will be in touch as soon as we have any further information.

Thanks!
Dan

clare.martin · 20 September 2022 07:31

Good morning @jacobss,

I have a quick update from our engineers, The trello card for this issue is near the top of the engineers next up list. A few high priority tasks came up unfortunately. They should be able to get to this one soon.

We will keep you updated as soon as the engineers have something.

Kind Regards,

Clare

clare.martin · 23 September 2022 07:22

Hey @jacobss,

Sorry it has taken a while to get back to you regarding this issue, the engineers have identified you are experiencing an issue with RelatedDocumentStore.PopulateDeploymentRelatedMachines.

This Github issue has more details and also has the versions that were patched.

The engineers said ideally you should install the latest GA version. You should not upgrade to another 2021.3 since we don’t support that branch anymore. We also have issues with the later 2021.3 versions and upgrading them to any 2022.2 and 2022.3 versions since the later versions of 2021.3 were done as a backport for a customer.

I hope that helps, let me know once you are able to upgrade to a 2022.2 or 2022.3 version (one that is past the build number on the github issue) and if this fixes the errors you are experiancing.

Kind Regards,

Clare

jacobss · 23 September 2022 12:20

Thank you for all help. We’ll upgrade to the versions you specified.
Kind regards
Jacob

system · 24 October 2022 12:21

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.