Release crashes when Azure PowerShell script takes more than 30 minutes

I have a problem with Azure PowerShell script execution.
Whenever a script takes longer than 30 minutes, the next step in the release process fails.
I'm currently running the latest version of Octopus.

In Octopus 3.3 the step fails with this error:
The step failed: Activity Restore backup in Staging slot on the Octopus Server failed with error ‘The connection is broken and recovery is not possible. The connection is marked by the server as unrecoverable. No attempt was made to restore the connection.’.

In Octopus 3.10 the step fails with this error (I think under the hood it's the same error as above):
The step failed: Activity Restore backup in Staging slot on the Octopus Server failed with error 'Exception occured while executing a reader for SELECT TOP 1 * FROM dbo.[Account] WHERE ([Id] = @id) ORDER BY [Id]'.

I think this is related to the GitHub issue below.
Some of the most recent comments there say that the issue still occurs for Azure PowerShell even after it was fixed for the SQL Azure functionality.

Hi,

Thanks for getting in touch and apologies for the delay in getting back to you. I suspect you are correct and it is the same underlying error, just surfaced differently now. That issue with the database connections has proven particularly difficult to track down; we have spent many hours over the past months trying to reproduce it.

Could I get some more details about your installation and deployment process for comparison? Is your Octopus Server in a cloud VM or on-premises? Does it have .NET 4.5.1 or greater installed? We added the Idle Connection Resiliency connection string settings back in one of the 3.4 releases, but they only take effect when .NET 4.5.1 or greater is installed on the Octopus Server. I believe the issue has persisted after this, but it's probably worth checking.
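
For reference, the Idle Connection Resiliency settings I'm referring to are the ConnectRetryCount / ConnectRetryInterval connection string keywords that SqlClient added in .NET 4.5.1. A minimal PowerShell sketch of how they appear (the server and database names here are placeholders, not your actual configuration):

    # Minimal sketch only - the server/database values below are placeholders.
    Add-Type -AssemblyName System.Data

    $builder = New-Object System.Data.SqlClient.SqlConnectionStringBuilder
    $builder["Data Source"]          = "yourserver.database.windows.net"
    $builder["Initial Catalog"]      = "OctopusDeploy"
    $builder["ConnectRetryCount"]    = 3     # retry a dropped idle connection up to 3 times
    $builder["ConnectRetryInterval"] = 10    # seconds between retry attempts
    $builder.ConnectionString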

We believe this issue has only occurred in the past when the Octopus database is hosted in SQL Azure; can you confirm that's how your installation is configured?

Is the Azure PowerShell script you are executing doing anything database related?

Would you be able to attach the server log for the error, so I can see the exact stack trace?

Regards
Shannon

Hi,

  • We run Octopus Server in an Azure virtual machine
  • We have .NET 4.5.1 installed
  • Our Octopus database is indeed hosted on Azure SQL
  • The commands that take longer than 30 minutes are Restore-AzureRmWebAppBackup and New-AzureRmWebAppBackup (a simplified sketch of the step is below).
    Anything executed after a 30+ minute run will fail.
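
For completeness, the step is essentially along these lines (the resource group, app name and SAS URL are placeholders, and the parameter names are from the AzureRM module as we use it):

    # Simplified sketch of our deployment step; all names and the SAS URL are placeholders.
    $rg  = "my-resource-group"
    $app = "my-web-app"
    $sas = "https://mystorage.blob.core.windows.net/backups?<sas-token-placeholder>"

    # Back up the web app before deploying - this regularly runs well past 30 minutes.
    New-AzureRmWebAppBackup -ResourceGroupName $rg -Name $app `
        -StorageAccountUrl $sas -BackupName "pre-release-backup"

    # Restore a backup into the Staging slot - also a 30+ minute operation.
    Restore-AzureRmWebAppBackup -ResourceGroupName $rg -Name $app -Slot "Staging" `
        -StorageAccountUrl $sas -BlobName "pre-release-backup.zip" -Overwrite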

best regards,
Thijs

Hi Thijs,

Thank you for the additional information. We have had a couple of similar reports over the past weeks, all with very similar configurations. The original issue is closed in GitHub, but I have opened a new one as this issue clearly needs more investigation.

Sorry I don’t have better news at this point, but please keep an eye on that new GitHub issue for further updates and let me know if I can assist with anything else in the meantime.

Regards
Shannon

Thanks. If it is indeed caused by the SQL connection to SQL Azure (the Octopus database itself), wouldn't it be better to just start a new SQL connection on every Octopus step, or to check whether the existing sessions can still be reused?
It's normal for connections to time out (due to OS or network device settings).
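
Just to illustrate the idea (purely a sketch, not a claim about your actual code; the connection string is a placeholder): open a short-lived connection per operation and dispose of it immediately, instead of relying on one held open across a 30+ minute step.

    # Illustrative only - open, use and dispose a connection per operation.
    Add-Type -AssemblyName System.Data

    function Invoke-ScalarWithFreshConnection {
        param(
            [string]$ConnectionString,  # placeholder, e.g. the Octopus database connection string
            [string]$Query
        )
        $conn = New-Object System.Data.SqlClient.SqlConnection $ConnectionString
        try {
            $conn.Open()              # comes from the ADO.NET pool or is created fresh
            $cmd = $conn.CreateCommand()
            $cmd.CommandText = $Query
            return $cmd.ExecuteScalar()
        }
        finally {
            $conn.Dispose()           # returned to the pool immediately
        }
    }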

Hi Thijs,

I believe we are requesting a new SqlConnection on at least each step, sometimes more frequently. It looks almost as if the connection we get from the ADO.NET connection pool appears fine until we try to use it.
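
To illustrate what I mean (a simplified sketch with a placeholder connection string, not our actual code): Open() succeeds because the pool hands back a cached connection, and only the first command against it reveals that the server has already dropped it.

    # Simplified illustration only - the connection string is a placeholder.
    Add-Type -AssemblyName System.Data
    $connectionString = "<octopus-database-connection-string-placeholder>"

    $conn = New-Object System.Data.SqlClient.SqlConnection $connectionString
    $conn.Open()                      # succeeds: the pool hands back a cached connection
    try {
        $cmd = $conn.CreateCommand()
        $cmd.CommandText = "SELECT 1"
        [void]$cmd.ExecuteScalar()    # the dead connection only surfaces here
    }
    catch [System.Data.SqlClient.SqlException] {
        # One mitigation: discard the suspect pool so the next attempt gets a truly new connection.
        [System.Data.SqlClient.SqlConnection]::ClearPool($conn)
        throw
    }
    finally {
        $conn.Dispose()
    }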

There were some other updates made to transaction handling in 3.11.2, primarily to address performance issues in large HA installations, but they could possibly have an impact here. We are still trying to reproduce the issue, but have been unsuccessful so far. If you're seeing this issue frequently, would you be in a position to try 3.11.2 or later?

Regards
Shannon

Hi Shannon,

We are currently running 3.11.3 and the problem still occurs.

Regards,
Thijs

Hi Thijs,

Thanks for the update, that's good to know. We'll continue our testing on the latest version and see if we can reproduce it.

Regards
Shannon

Thanks. Let me know if you need me to test anything.