OctopusDeploy service stops when SQL query fails in UpdateDeploymentHistory

petter · 11 December 2015 10:37

On a semi-regular basis, our Octopus 3.1.2 installation fails to execute UpdateDeploymentHistory because of the following error: “Error while executing SQL command: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.”

When this happens, the service stops and doesn’t restart, even though the service is set to retry every minute indefinitely.

Here’s the full stack trace:

2015-12-11 02:38:20.9936 13 ERROR Error when running scheduled task: UpdateDeploymentHistory
System.Exception: Error while executing SQL command: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
The command being executed was:
UpdateDeploymentHistory —> System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. —> System.ComponentModel.Win32Exception: The wait operation timed out
— End of inner exception stack trace —
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action1 wrapCloseInAction) at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData() at System.Data.SqlClient.SqlDataReader.get_MetaData() at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString) at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds) at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
at System.Data.SqlClient.SqlCommand.ExecuteScalar()
at Octopus.Core.RelationalStorage.TransientFaultHandling.IDbCommandExtensions.<>c__DisplayClass11_0.b__0() in Y:\work\refs\tags\3.1.2\source\Octopus.Core\RelationalStorage\TransientFaultHandling\IDbCommandExtensions.cs:line 119
at Octopus.Shared.TransientFaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func) in Y:\work\refs\tags\3.1.2\source\Octopus.Shared\TransientFaultHandling\RetryPolicy.cs:line 215
at Octopus.Core.RelationalStorage.RelationalTransaction.ExecuteScalar[T](String query, CommandParameters args) in Y:\work\refs\tags\3.1.2\source\Octopus.Core\RelationalStorage\RelationalTransaction.cs:line 355
— End of inner exception stack trace —
at Octopus.Core.RelationalStorage.RelationalTransaction.ExecuteScalar[T](String query, CommandParameters args) in Y:\work\refs\tags\3.1.2\source\Octopus.Core\RelationalStorage\RelationalTransaction.cs:line 362
at Octopus.Server.Schedules.UpdateDeploymentHistory.Trigger() in Y:\work\refs\tags\3.1.2\source\Octopus.Server\Schedules\UpdateDeploymentHistory.cs:line 34
at Octopus.Server.Schedules.Scheduler.RunTask(IRunOnASchedule task) in Y:\work\refs\tags\3.1.2\source\Octopus.Server\Schedules\Scheduler.cs:line 125

Damian_Brady · 14 December 2015 03:34

Hi Petter,

Thanks for getting in touch!

There is a stored procedure in the database that populates a DeploymentHistory table. Every 30 seconds, it updates the table with new information, but every 4 hours, it will do a full refresh to update any changes that have occurred with older records.

It sounds like this may be what’s failing. Obviously the service shouldn’t be dying as a result.

Does this 4 hour schedule sound like it might match the error you’re getting?
Are you able to do a count of the records in your DeploymentHistory table in the Octopus database?

Thanks,
Damo

petter · 14 December 2015 07:26

Hi.

Fortunately, it fails much more rarely than that.

The service will be running for a few days, then suddenly one of our
TeamCity builds will fail because the Octopus nuget server fails to reply.
When we remote into the Octopus server, we’ll see that the service stopped
some minutes or hours ago.

Best regards
Petter Brodin

Vanessa_Love · 14 December 2015 07:50

Hi Petter,

Could you please provide your full Octopus Server logs from c:\Octopus\Logs (for defauly installations) and do a count on the DeploymentHistory table in your database.

Just so we can see if there is anything obvious.

Thanks
Vanessa

petter · 14 December 2015 08:50

Hi and thanks for a quick reply!

I’ve attached Friday’s log, where the server went down at 02:38:20 As you
can see, one of our environments is currently down, so there’s a lot of
exceptions when connecting to that, but the crash comes at the
UpdateDeploymentHistory call.

I counted 352 rows in the DeploymentHistory table.

Best regards
Petter Brodin

OctopusServer.2.txt (142 KB)

Damian_Brady · 15 December 2015 02:17

Hi Petter,

Thanks for that extra information. 352 rows shouldn’t pose much of a problem, but unfortunately a timeout could be caused by a lot of things.

A few things that might help me work out what’s going on:

Can you give me some more information about the network architecture there? Is there likely to be a slow connection between the Octopus server and the database server?
Is there anything else that runs on the database server that could be contributing to the timeout?
Can you have a look at the Event Viewer to see if anything happened on either the Octopus server or the Database server at that time?
Can you execute the UpdateDeploymentHistory stored procedure in your Octopus database to see how long it takes to run? This will be perfectly safe - it just updates a reporting table.

I’m having a look at how we handle these timeouts - we definitely shouldn’t be crashing the service if there’s a database timeout!

Thanks,
Damo

petter · 15 December 2015 08:07

Hi.

They’re both virtual machines on the same Hyper-V host. They’re not very
powerful, but we haven’t notice any performance issues or high loads on
them.
There are a few other applications that run on the database server, but
it hasn’t really been under much load.
On the Octopus server there was an info message the exact same time
(02:38:20) the exception was logged that “The Application Experience
service entered the stopped state.”. That service was started 11 minutes
earlier.

There are a few messages form the Octopus service, but they’re a bit after
the service supposedly failed:

03:01:47 - The OctopusDeploy service entered the stopped state.
03:12:46 - A timeout was reached (30000 milliseconds) while waiting for the
OctopusDeploy service to connect.
03:12:46 - The OctopusDeploy service failed to start due to the following
error: The service did not respond to the start or control request in a
timely fashion.

I executed it without error, and it ran in less than a second.

Best regards
Petter Brodin

Damian_Brady · 17 December 2015 00:31

Hi Petter,

Thanks for that information. It’s a hard one to diagnose because it seems quite environmental, and that stack trace seems like a bit of a red herring.

The actual exception is being caused by a call to SQL Server (in this case to run that stored proc on a schedule) that’s failing. We have an exponential backoff policy with 5 attempts at running the command. If that fails, the exception gets thrown up the chain, but again, that’s handled in a try-catch so it shouldn’t in any way bring down the service.

It’s curious that there’s a different service that’s being stopped at the same time. Could there be a reboot or another external application that’s bringing down the service? Automatic updates or something like that?

Thanks,
Damo

petter · 17 December 2015 08:27

Hi.

Sorry that I didn’t reply to this, for some reason I didn’t see that you’d
replied.

I think you’re right that there might be something else going on. The
Octopus server was restarted on Friday morning 03:01:36, about 20 minutes
after the exception from the log. The SQL server did not go down that night
at all, though, and if the Octopus server was restarted, you’d expect the
service to come back up, right? It’s set to automatic startup, but there’s
nothing in the logs between the time it went down (for whatever reason) and
I started it manually.

Best regards
Petter Brodin

Damian_Brady · 17 December 2015 23:40

Hi Petter,

Thanks for the reply.

Is this something you can keep an eye on? It could be that there was an issue with the server that forced a restart.

In terms of starting the service after a reboot, I would have thought it would start automatically, but maybe the way it was shut down meant Windows didn’t want to restart it?

Let me know if you get any more info or if it continues to occur.

Thanks,
Damo