Retention policy change hoses server

Today we decided to change our rather lenient retention policy (keep everything forever) to purge several years' worth of releases. When retention was triggered it queued more than 12,000 tasks and everything halted on the server.

I noticed the SQL error below in the logs, which leads me to believe we are hitting a limit of the T-SQL statement (the 2,100-parameter limit might be the one).

I updated dbo.ServerTask to manually cancel the queued tasks and get the server functional again (roughly along the lines of the sketch below), but we're worried the same thing is going to happen when retention kicks in later on.
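
In case it helps anyone else, this is a minimal sketch of that clean-up, assuming the stuck tasks can be identified purely by [State] = 'Queued' (as in the error below) and that 'Canceled' is the state value the server expects for cancelled tasks; both are assumptions, so back up the database and verify against your own data first.

UPDATE dbo.[ServerTask]
SET [State] = 'Canceled'   -- assumed cancellation state; verify before running
WHERE [State] = 'Queued';  -- note: this cancels every queued task, not just the retention ones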

We are on v2018.2.6

Any advice greatly appreciated.

An unexpected error occurred while attempting to retrieve and execute a task: 
Exception occured while executing a reader for
'SELECT * FROM dbo.[ServerTask] with (UPDLOCK,READPAST) WHERE ([State] = 'Queued' 
AND [HasPendingInterruptions] = 0 
AND ([ServerNodeId] IS NULL 
OR [ServerNodeId] = @servernodeid) 
AND [Id] IN (@potentialtasks_0, @potentialtasks_1, @potentialtasks_2, @potentialtasks_3,
 @potentialtasks_4, @potentialtasks_5, @potentialtasks_6, @potentialtasks_7, 
@potentialtasks_8, @potentialtasks_9, @potentialtasks_10, @potentialtasks_11,    
...  
@potentialtasks_12351) 
AND [QueueTime] <= @queuetime ) 
ORDER BY [QueueTime]

Hi, sorry this has affected you.

In 2018.2.6 we (actually me :frowning: ) made some changes to how the task queue is handled, and this many tasks all ready to run at once wasn't something we considered. There are probably two issues here: one is that the task queue should have handled this more gracefully, and the other is that creating a task for each document to delete is probably not right.

We’ve put up an issue for it here https://github.com/OctopusDeploy/Issues/issues/4333 and we’ll get on to fixing it.

To get back up and running, it might be worth trying a rollback to 2018.2.5. There will still be loads of tasks on the queue, but with any luck the server will tick through them and things will return to normal.

Michael


@Michael_Compton Thanks for the follow-up. I just reverted the retention policy back to keeping everything and manually canceled the queued tasks. We are all good for now… we will wait for the fix.

Hi, just letting you know that a fix for the problem you had is shipping in 2018.3.0; check for it in the release notes. (It fixes the task queue problem, not the idea of batching the deletes; that's been moved to another issue.)

The task queue should handle your 12,000 tasks; however, that's still 12,000 tasks, and by default the server might spin up 5 tasks, sleep for 20 seconds, check whether they're done, spin up another 5, and so on. At that rate 12,000 tasks is roughly 2,400 polling cycles, so it could easily take half a day or more to churn through them all!

I'd advise picking a time when nothing else is running, bumping the task cap up to something like 100 (Configuration -> Nodes -> …), letting it churn through the deletes, and then returning the task cap to what it was before.
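
If you want to keep an eye on progress while it churns, a quick count of what's still waiting should do it. This is just a sketch based on the table and column names visible in the error log above (dbo.[ServerTask], [State]); 'Queued' as the state value for tasks that haven't run yet is an assumption.

SELECT COUNT(*) AS QueuedTasks
FROM dbo.[ServerTask]
WHERE [State] = 'Queued';   -- assumed state value for tasks still waiting to run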

Michael

Hi Michael,
Just wanted to report back that the retention task issue is looking good over here. It took a while, but it finally completed the delete tasks.

I'm not sure it is related, but we are seeing frequent timeouts on our /app#/tasks page. The error message that comes back is below:

This is on v2018.3.2

Sorry, I dropped the ball on this one. Are you still getting this error?

Michael

Sorry for the late reply… no, this error went away. This was likely not related to the retention policy stuff here. Thanks for checking in.
