Database timeout

Hi there

We hit an issue this morning where we were getting timeouts while acquiring a connection from the connection pool:

2015-11-30 02:24:30.8218     15  WARN  Outstanding Health tasks were not completed before the next task was due to be scheduled. If this error persists, check the Tasks tab for any running Health tasks, and cancel them manually.
2015-11-30 03:24:34.7722     15  WARN  Outstanding Health tasks were not completed before the next task was due to be scheduled. If this error persists, check the Tasks tab for any running Health tasks, and cancel them manually.
2015-11-30 04:24:38.6222     15  WARN  Outstanding Health tasks were not completed before the next task was due to be scheduled. If this error persists, check the Tasks tab for any running Health tasks, and cancel them manually.
2015-11-30 05:01:06.1549     17 ERROR  An unexpected error occurred while attempting to retrieve and execute a task: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Octopus.Shared.TransientFaultHandling.RetryPolicy.<>c__DisplayClass26_0.<ExecuteAction>b__0() in Y:\work\refs\tags\3.1.2\source\Octopus.Shared\TransientFaultHandling\RetryPolicy.cs:line 172
   at Octopus.Shared.TransientFaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func) in Y:\work\refs\tags\3.1.2\source\Octopus.Shared\TransientFaultHandling\RetryPolicy.cs:line 215
   at Octopus.Core.RelationalStorage.RelationalStore.BeginTransaction(IsolationLevel isolationLevel) in Y:\work\refs\tags\3.1.2\source\Octopus.Core\RelationalStorage\RelationalStore.cs:line 62
   at Octopus.Core.Model.Clustering.OctopusClusterService.GetCurrentNode() in Y:\work\refs\tags\3.1.2\source\Octopus.Core\Model\Clustering\OctopusClusterService.cs:line 40
   at Octopus.Server.Orchestration.TaskQueueLimiter.HasTooManyRunningTasks() in Y:\work\refs\tags\3.1.2\source\Octopus.Server\Orchestration\TaskQueueLimiter.cs:line 29
   at Octopus.Server.Orchestration.TaskQueue.PollForMoreWorkOnBackgroundThread(Object state) in Y:\work\refs\tags\3.1.2\source\Octopus.Server\Orchestration\TaskQueue.cs:line 56

The error was then logged repeatedly, every 20 seconds or so.

This manifested as Octopus just sitting at the “checking credentials” screen and never loading the dashboard.

We’ve bounced the box and it’s back up, so it’s not a critical issue.

We wondered whether it was a database issue, but other servers were able to connect to the SQL Server at the same time.

We are running Octopus 3.1.2.

Can you please investigate?

Hi Matt,

Thanks for getting in touch. I have seen this happen transiently under extreme load, but by the time you see this exception we’ve already attempted to get a connection at least five times with an exponential back-off between each attempt. Taking the diagnostics you’ve performed into account, it looks for all the world like a SqlConnection leak. I’ve been over the codebase a couple of times and found some potential leads, and I’ve raised this GitHub Issue to investigate it further.
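
For anyone else reading along, here is a minimal sketch of the kind of leak I mean. This is hypothetical illustration code, not the actual Octopus source: if a connection isn’t disposed on every code path, it stays checked out of the pool until the garbage collector finalises it, and the pool eventually hits its max size.

using System.Data.SqlClient;

static class ConnectionLeakExample
{
    // Leaky version: if ExecuteScalar throws, Close() is never reached and the
    // connection stays checked out of the pool until the finaliser runs.
    public static int QueryLeaky(string connectionString)
    {
        var connection = new SqlConnection(connectionString);
        connection.Open();
        var command = new SqlCommand("SELECT 1", connection);
        var result = (int)command.ExecuteScalar();
        connection.Close(); // skipped on an exception
        return result;
    }

    // Safe version: 'using' guarantees Dispose on every path, so the connection
    // is returned to the pool even when an exception is thrown.
    public static int QuerySafe(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT 1", connection))
        {
            connection.Open();
            return (int)command.ExecuteScalar();
        }
    }
}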

Thanks for the report! Also, if you see this happen again, there are some diagnostics you could perform on the SQL Server that may help us determine the source if we don’t find it through inspection: https://msdn.microsoft.com/en-us/library/aa175863(SQL.80).aspx
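
In the meantime, one quick check is to count open sessions per host and application via sys.dm_exec_sessions; a steadily growing count from the Octopus Server box would point at a leak. The C# wrapper below is just for illustration (you could equally run the SELECT from Management Studio), and you’ll need VIEW SERVER STATE permission on the SQL Server.

using System;
using System.Data.SqlClient;

static class PoolDiagnostics
{
    // Groups user sessions by the host and application that opened them.
    const string Query = @"
        SELECT host_name, program_name, COUNT(*) AS sessions
        FROM sys.dm_exec_sessions
        WHERE is_user_process = 1
        GROUP BY host_name, program_name
        ORDER BY sessions DESC;";

    public static void Dump(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(Query, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0,-20} {1,-45} {2}",
                        reader.IsDBNull(0) ? "" : reader.GetString(0),
                        reader.IsDBNull(1) ? "" : reader.GetString(1),
                        reader.GetInt32(2));
                }
            }
        }
    }
}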

Hope that helps!
Mike

No further occurrences of this since then (that said, we’re now restarting on a nightly basis due to the NuGet handle leak bug).