Apply retention policies task failing since upgrade

tystol · 28 May 2017 13:30

Hi,

We just upgraded from 3.11.12 to 3.13.7, and we are now getting regular failures on Apply retention policies (see screenshot).

The following error is occuring within the ‘Apply built-in deployment manifest retention policy’ step:

                | == Failed: Apply retention policies ==

12:58:39 Info | Note that this task only applies Octopus server-side retention policies for releases, events and package retention policies for the built-in package repository. Files on Tentacles will only be deleted when the retention policy is applied at the end of a deployment.
12:59:39 Fatal | Exception occured while executing a reader for SELECT Deployment.* | FROM dbo.VariableSet | INNER JOIN dbo.Deployment ON VariableSet.OwnerId = Deployment.Id | INNER JOIN dbo.ServerTask ON Deployment.TaskId = ServerTask.Id | WHERE ServerTask.[State] IN ('Success', 'Failed', 'TimedOut', 'Canceled') | AND Deployment.Id NOT IN (SELECT Id FROM MultiTenancyDashboard)
| SQL Error -2 - Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
| The wait operation timed out

Any ideas?

Cheers.

MarkSiedle · 29 May 2017 00:32

Hi Tyson,

Thanks for getting in touch and sorry to hear you’ve had problems after upgrading.

To help us track down what’s causing the problem, we’ll need to run some queries against your actual data to see what part of the query is causing the problem in your case.

Could we please get a SQL backup of your Octopus database? If you can upload the .bak file to this upload location, we will be able to investigate further.

Out of curiosity, how long does this SELECT query take if you run it manually in SQL Studio against your Octopus database?

SELECT Deployment.*
FROM dbo.VariableSet
INNER JOIN dbo.Deployment ON VariableSet.OwnerId = Deployment.Id
INNER JOIN dbo.ServerTask ON Deployment.TaskId = ServerTask.Id
WHERE ServerTask.[State] IN ('Success', 'Failed', 'TimedOut', 'Canceled')
AND Deployment.Id NOT IN (SELECT Id FROM MultiTenancyDashboard)

Cheers
Mark

tystol · 29 May 2017 13:39

Hi Mark,

Thanks – just uploaded. Maybe twice actually. Upload seems to complete but then the UI goes back to ‘There are no files in this directory’. Let me know if you didn’t get it.

Wow - That query has been running for 15 minutes and still hasn’t finished!

FYI – the whole reason we upgraded was because the old version started chewing up our EC2 instance cpu credits, as a periodic query on MultiTenancyDashboard was chewing heaps of cpu (and taking like 15 seconds to run). Seems to be better now – still takes 2 seconds to query that view for only 242 rows, but faster than before.

Also FYI, just ran this:
SELECT COUNT() FROM Deployment
SELECT COUNT() FROM [ServerTask]
SELECT COUNT() FROM [Release]
SELECT COUNT() FROM VariableSet

And got: 594, 697, 70, and 561 respectively.

Although hopefully you got our db and can do this yourself now 

1 possible hint – we auto tag tenants with a ‘version’ tag during deploy based on the deployed version, to track by version and apply updates to only tenants on specific versions. We also have a script that runs on deployment and removes empty version tags once it finds no more tenants using it. Possibly we’re polluting your tagging system with this and creating many more rows than expected. No idea, just something we do that I guess may be out of the ordinary.

Good luck!

Tyson.

MarkSiedle · 30 May 2017 02:13

Hi Tyson,

Thanks very much for the database backup. We don’t think there’s a problem with your tagging, but will keep it in mind. The retention policy query seems to be locking up with the subquery to the MultiTenancyDashboard view, so we’ll look into a fix for this asap.

We’ve created a priority GitHub issue here that you can track to know when this has been fixed.

Cheers
Mark

Robert_Wagner · 31 May 2017 09:05

Hi Tyson,

I’ve had a closer look at this issue, and your database backup has been very useful. I updated statistics and rebuilt the indexes. After that the query ran very quickly (< 1 second). I recommend you do those two things. Be aware that these command slow down the server significantly and took about 5 minutes and 20 seconds respectively, so choose a good time to do it.

The best way going forward though is to setup a Database maintenance plan to do this on a regular basis. You can run it immediately to get over this hurdle.
As a side note, when setting it up, don’t choose the shrink database command, it actually decreases performance.

Please let me know how you go,

Robert W

tystol · 31 May 2017 10:46

Thanks Robert. Unfortunately I just tried both those actions (used Ola Hallengren’s tools to rebuild all indexes), and I am still getting the same error within OD.
Interestingly, when I run that failing query within SMSS it now works fine. Yet still times out from within the OD app. I’ve restarted the OD service, but same error after 1 minute of waiting for the query.
The only difference is OD connects to the db via Integrated Security (local system account), but I’m connected to SMSS via the sa user. Not that that should make a diff…

Any more ideas?

Cheers.
Tyson.

tystol · 31 May 2017 10:49

Just read the ticket here: https://github.com/OctopusDeploy/Issues/issues/3531

Sounds like it is a deadlock to me now (as well as a bad db state). The db state is now fixed and the query run’s fine in isolation, but still hits a deadlock when run within OD.

Cheers.
Tyson.

Robert_Wagner · 31 May 2017 21:39

Hi Tyson,

Thanks for trying that. Could you go to Configuration -> Diagnostics -> System Report while the problem is occurring and upload the zip to the above share? I’ll have a further look into this today.

Robert W

tystol · 1 June 2017 03:39

Sure, done.

Robert_Wagner · 2 June 2017 00:07

Hi Tyson,

Thanks for that. It doesn’t look like the problem is a dead lock or it is waiting on another transaction. When the Octopus server starts, the query starts running slowly again even after the server has exited again. It smells a lot like a query plan caching problem. I ran the query when it was slow to completion (19 minutes) and it showed it was producing 180 million rows at some stage in the execution!.

Anyway, I have added some query hints to that query, which keep it consistently fast. That change should be included in 3.14.0 next week.

The report also showed that we kept a transaction open when polling for queued tasks, so the report was doubly useful

Let me know how you go after installing 3.14.0. In the meantime, that part of the cleanup process failing won’t cause other problems.

Robert W

tystol · 2 June 2017 12:00

Ahh, makes sense.
And glad it helped.
Will let you know how 3.14 goes.

Cheers.
Tyson.

tystol · 2 July 2017 19:42

Hey guys,

I upgraded to 3.14.1 a few weeks back, and it’s been fine for a few weeks, but the last few days the sql server process has spiked again (never mentioned this first time around, but this was another side effect - increased CPU usage, enough to burn down our EC2 cpu credits)
.
Running that same query is SMSS is now taking 11s (nowhere near as bad as before, but still not great).

I’m upgrading to latest 3.14.x version now, but can’t see anything in release notes between 3.14.1 and latest that would affect this.

EDIT: Nevermind guys - I stopped the OD service and the cpu usage is still high - something else causing it this time. So looks like 3.14 did resolve this issue for me a few weeks back, thanks for all the help!