Self-Hosted - Tasks stuck waiting in queue after upgrading to LTS v2019.6.0

RT · 2 July 2019 16:38

Today I updated to LTS v2019.6.0, which went fine… mostly. Two deployment targets had trouble upgrading to the latest version of Tentacle; I believe they were trying to upgrade from Tentacle 4.0.1 to 4.0.5. I ended up having to download the Octopus Tentacle 64-bit installer for 4.0.5 and run it manually on the target server. After that it reported in fine. However, at the time of the attempted upgrade there were some deploys queued up to run. Now, after the upgrades, no deployment tasks will run. I have around 9 tasks queued to run on various servers, but they are all queued behind a task that is stuck in the queue with nothing in front of it. The stuck task is set to run on one of the servers that was having problems upgrading. We can’t do any deploys now. Any help would be greatly appreciated.

RT · 2 July 2019 17:36

Additional information - The task that is hanging indefinitely has the following error:
["ServerTasks-43963_6JMVPHU6ZF","INF","2019-07-02T17:27:25.9610577+00:00","Synchronize Community Step Templates","",0]
["ServerTasks-43963_6JMVPHU6ZF","INF","2019-07-02T17:27:25.9766115+00:00","Running community library step templates sync...","",0]
["ServerTasks-43963_6JMVPHU6ZF","VBS","2019-07-02T17:27:26.0077211+00:00","Downloading latest community templates from https://library.octopus.com/api/step-templates...","",0]
["ServerTasks-43963_6JMVPHU6ZF","FAT","2019-07-02T17:27:26.2099358+00:00","The Octopus server failed to connect to our community library. http://g.octopushq.com/CommunityContributedStepTemplatesTroubleshooting","An error occurred while sending the request.\r\nSystem.Net.Http.HttpRequestException: An error occurred while sending the request. ---\u003e System.Net.WebException: The underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel. ---\u003e System.Security.Authentication.AuthenticationException: The remote certificate is invalid according to the validation procedure.\r\n at System.Net.TlsStream.EndWrite(IAsyncResult asyncResult)\r\n at System.Net.ConnectStream.WriteHeadersCallback(IAsyncResult ar)\r\n --- End of inner exception stack trace ---\r\n at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)\r\n at System.Net.Http.HttpClientHandler.GetResponseCallback(IAsyncResult ar)\r\n --- End of inner exception stack trace ---\r\nOctopus.Server version 2019.6.0 (2019.6.0+Branch.tags-2019.6.0.Sha.f9961d59a9d808a7a1abfce979cc1096470eca32)",0]
["ServerTasks-43963_6JMVPHU6ZF","FAT","2019-07-02T17:27:26.2099358+00:00","The community library step templates synchronization failed.\r\nAn error occurred while sending the request.\r\nThe underlying connection was closed: Could not establish trust relationship for the SSL/TLS secure channel.\r\nThe remote certificate is invalid according to the validation procedure.","",0]
["ServerTasks-43963_6JMVPHU6ZF","FIN","2019-07-02T17:27:26.2099358+00:00","Finished","",0]

Justin_Walsh · 2 July 2019 17:41

Hi @RT!

I wonder if this was stemming from our Cloudflare outage this morning, causing the task to get stuck. In order to get you back up and running, you could try restarting the Octopus service on your machine, which will (in most cases) allow tasks to process once again.

Please let me know if this doesn’t work for whatever reason, and we can dig in some more.

RT · 2 July 2019 17:57

No luck. Unfortunately that had no effect. Any other ideas? I also tried toggling the proxy settings for Tentacle to see if that made a difference, but to no avail.

RT · 2 July 2019 18:05

Additional Info-
After the restart, the task logs for the queued tasks have disappeared from C:\Octopus\TaskLogs, and the tasks remain queued.

Justin_Walsh · 2 July 2019 18:13

Hi @RT

Is it possible that you have tasks running on other spaces that are taking up your task slots?

Do you have deployments awaiting manual intervention? One change in 2019.6 (well, 2019.5.8, but it is bundled in the 2019.6 LTS release) was to block deployments when there is a manual intervention waiting for that project in that environment, as referenced in this issue: https://github.com/OctopusDeploy/Issues/issues/4564

Hope to hear from you soon!

RT · 2 July 2019 18:19

Yes, we have around 6k tasks awaiting manual intervention. The way we use Octopus is that our CI pipelines detect new code, build it, upload it to Octopus, and queue up a release in the dashboard. The QA team sees that a deploy is available and approves it when they’re ready for it, which may or may not be immediately. If an application has several changes (and thus several builds and releases queued for deployment), QA will only end up approving the most recent deployment, and the others are essentially abandoned.

That new feature sounds like a terrible thing to implement without also implementing logic to cancel any releases awaiting approval when a newer release for that project is created.

RT · 2 July 2019 18:29

I’ll have to roll back the update, as this is literally a breaking change. I really appreciate your help finding the issue. I would not have found it on my own. I am also very, very disappointed with the team that thought this was a “Feature”. At a minimum this should be an option that can be toggled on or off. I guess we’ll have to stay on an older version of Octopus for now. If Octopus doesn’t release a fix for this, we’ll be looking to replace Octopus with some other solution rather than renewing, which would be extremely disappointing as I have really loved Octopus up to this point.

RT · 2 July 2019 18:42

Well… that, or I’ll have to completely change our process and all of our pipelines to account for this new “Feature”.

Justin_Walsh · 2 July 2019 18:44

Appreciate it, @RT, and I completely understand - I’ll bring this up with our product team ASAP, and hopefully we can get a solution worked out here.

One option here could be relocating your Manual intervention step to the start of your deployment process, and then scoping the step so that it is only required for your Post-Dev environments. That way if someone wants to promote it to QA, they only get the interruption/confirmation when higher up the chain than the dev environment. Not sure if something like this could work, but it’s an option for the short term.

Then you’d just need to use a PowerShell script to clear out the existing interventions, something built off of https://github.com/OctopusDeploy/OctopusDeploy-Api/blob/master/REST/PowerShell/ManualInterventions/ApproveOrAbort.ps1

This is additional workload, and for that, I’m sorry. As I mentioned though, I’ll talk to our team and see if this can be made opt-in/out via a project setting or something similar.

I will circle back around with you tomorrow, after I’ve had a chance to talk to the team (who’re based in Australia).

RT · 2 July 2019 18:51

Thanks. The manual intervention is already the first step of our deployment process (no point in doing a deployment if it’s not approved), and currently the vast majority of our deployments only target QA, not development. I just tried and failed to install an older version of Octopus, so I guess I’m stuck on this new version. Everything I thought I was going to work on this week now has to get pushed back, because it looks like I’ll spend the rest of it updating our projects and build jobs so that we’ll be able to do some sort of deploy going forward, since rolling back just causes the service to crash. Thanks for the resources.

Justin_Walsh · 2 July 2019 18:55

Sorry to hear that you’re having issues rolling back - did you take a backup of your install before you did the upgrade? If so, you should be able to restore it per https://octopus.com/docs/administration/data/backup-and-restore

If you need any specific help with this, please let me know!

RT · 2 July 2019 18:56

No I didn’t. I’d gotten too comfortable with Octopus running smoothly and being the golden child of my toolset. That’s definitely on me.

Justin_Walsh · 2 July 2019 19:00

Any chance your server team took a system-level backup last night or anything that could be used for a rollback?

RT · 2 July 2019 19:12

I’ve reached out to them; they’re going to see if they have a backup from last night. On the chance that they don’t: that PowerShell script needs an interruption ID. Is there a way to use a wildcard somehow instead of specifying 6k+ interruption IDs?

Justin_Walsh · 2 July 2019 19:17

Sadly, I don’t have a script handy that iterates over all the outstanding interruptions, but I’ll try to work on one when time permits. The easiest way of doing this would be to query the Octopus API via the /api/interruptions endpoint, pick out everything that has "IsPending": true, and then action each of those.
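As a rough starting point, the query-and-filter part could look something like the Python sketch below. It is not a tested script. It assumes the endpoint returns a JSON object with an "Items" list whose entries carry "Id" and "IsPending" fields, and that the server accepts an API key in the X-Octopus-ApiKey header. The function names and the sample data are made up for illustration, and only the first page of results is read, so check the real response shape in SwaggerUI before relying on it.

```python
import json
import urllib.request


def pending_interruption_ids(payload):
    """Return the Id of every item in the payload whose IsPending flag is true."""
    # "Items", "Id" and "IsPending" are the assumed field names of the response.
    return [
        item["Id"]
        for item in payload.get("Items", [])
        if item.get("IsPending") is True
    ]


def fetch_interruptions(server_url, api_key):
    """GET <server_url>/api/interruptions and return the decoded JSON body.

    Reads only whatever the server returns for a single request; paging
    through further results is not handled in this sketch.
    """
    request = urllib.request.Request(
        server_url.rstrip("/") + "/api/interruptions",
        headers={"X-Octopus-ApiKey": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical sample shaped like the assumed response, for illustration only.
sample = {
    "Items": [
        {"Id": "Interruptions-1", "IsPending": True},
        {"Id": "Interruptions-2", "IsPending": False},
        {"Id": "Interruptions-3", "IsPending": True},
    ]
}

pending = pending_interruption_ids(sample)
```

Against a real server you would call pending_interruption_ids(fetch_interruptions(url, key)) and feed each resulting ID to something like the ApproveOrAbort.ps1 logic linked earlier in the thread.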

As a long-time user, I’m sure you’re aware, but Octopus comes with a SwaggerUI installed, for easily looking through the documentation and running queries. You can find it at <your_octopus_server>/swaggerui

RT · 2 July 2019 19:25

Thanks for the tips! I really do appreciate them. That’s a good thought, I’ll give that a shot. As to the SwaggerUI, I have seen it, but admittedly calling APIs is an area I am weak at. I have used the Octopus user interface for pretty much everything. At one point I tried to get into working with the API but had to switch tasks.

Justin_Walsh · 3 July 2019 13:47

Hi @RT!

Just a quick note to let you know that we are going to make some changes to allow the older functionality in (likely) the next LTS patch. We’ve raised a GitHub issue here: https://github.com/OctopusDeploy/Issues/issues/4564 where you will be able to find more information and subscribe for email updates about its status.

Hope to have something out in your hands soon!

RT · 3 July 2019 14:05

Wow, that’s great news. Thanks a lot. Reading that made my day.

system · 2 August 2019 14:05

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.