Updating HA config to 3.5 seems to cause issues with Nodes

I updated from 3.4.14 to 3.5. Both nodes are now reporting that they are not accessible on the maintenance | Nodes page, but I can access both nodes and the product seems to be working. How do we fix this? In the attached screenshot you can see the URL is the node on the right. I can also log in to the URL via the load balancer in front of the HA pair with no issues. The only interesting lines in the log file look like this:
2016-11-04 02:05:59.0576 9 WARN Append log entry to ScheduledTasks_ProcessAutoDeployments attempt #1 faulted, retrying in 00:00:01: The process cannot access the file ‘\tasklogs\scheduledtasks_processautodeployments.txt’ because it is being used by another process.

This seems to have resolved itself; not sure what changed.

Hi Leslie,

Thanks for getting in touch.

Unfortunately we’ve been unable to replicate this problem, but the error message and screenshots you’ve provided have given us something to investigate regarding file locks.

We’ll do some investigation and report back if we can reproduce this.

Thanks
Mark

I am seeing this same issue with the upgrade to 3.5.1. The nodes are both red, and in the log file I am seeing:
Append log entry to ScheduledTasks_ProcessAutoDeployments attempt #9 faulted, retrying in 00:00:09: The process cannot access the file ‘\imofiles\octopus\tasklogs\scheduledtasks_processautodeployments.txt’ because it is being used by another process.

Not sure if it helps, but the way I did the update was to put the server in maintenance mode, update both servers, and finally take the server out of maintenance mode.

Still seeing this; don’t know if this extra info helps:
’\imofiles\octopus\tasklogs\scheduledtasks_processautodeployments.txt’ because it is being used by another process.

2016-11-09 09:19:51

Error
Failed to append the following log entry:
CorrelationId: ScheduledTasks_ProcessAutoDeployments
Message: Filtered out machine 10.10.108.55 for the auto-deployment of release Releases-12183 (Project:Projects-392 | Environment:Environments-503). Reason:MachineDoesNotBelongToEnvironmentException The machine 10.10.108.55 does not belong to the environment with id Environments-503.
System.IO.IOException: The process cannot access the file ‘\imofiles\octopus\tasklogs\scheduledtasks_processautodeployments.txt’ because it is being used by another process.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, String msgPath, Boolean bFromProxy)
at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share)
at Octopus.Shared.Util.OctopusPhysicalFileSystem.OpenFile(String path, FileMode mode, FileAccess access, FileShare share) in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Shared\Util\OctopusPhysicalFileSystem.cs:line 184
at Octopus.Server.Orchestration.Logging.ServerLogStorage.BuildChain(String rootCorrelationId) in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Server\Orchestration\Logging\ServerLogStorage.cs:line 105
at Octopus.Server.Orchestration.Logging.Processors.Caching.SynchronizedCache`1.GetOrCreate(String id, Func`1 factory) in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Server\Orchestration\Logging\Processors\Caching\SynchronizedCache.cs:line 58
at Octopus.Server.Orchestration.Logging.ServerLogStorage.ProcessLogEntry(String correlationId, ActivityLogEntry entry) in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Server\Orchestration\Logging\ServerLogStorage.cs:line 78
at Octopus.Shared.TransientFaultHandling.RetryPolicy.<>c__DisplayClass26_0.b__0() in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Shared\TransientFaultHandling\RetryPolicy.cs:line 172
at Octopus.Shared.TransientFaultHandling.RetryPolicy.ExecuteAction[TResult](Func`1 func) in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Shared\TransientFaultHandling\RetryPolicy.cs:line 215
at Octopus.Server.Orchestration.Logging.ServerLogStorage.SafelyAppend(String correlationId, ActivityLogEntry entry) in Z:\BuildAgent\work\cd33152532ae115f\source\Octopus.Server\Orchestration\Logging\ServerLogStorage.cs:line 60

Hi Leslie,

So I think we’ve figured out what was causing these warning logs.

When the leader node changes (during maintenance mode, for example), there may be a period of time when the previous leader node still holds locks on certain files (such as the auto-deployment task log file). The warnings are nothing to be concerned about in this case: as soon as the previous leader node either finishes its current task or releases the lock on the file, the new leader will be able to take the file lock it needs and continue operations.

So this is working as intended. It correctly stops a different node from taking on auto-deployment tasks while another node is still processing them.
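Incidentally, the pattern in the warnings (attempt #n faulted, retrying in 00:00:0n) suggests a simple linear back-off retry. A minimal sketch of that behaviour in Python, just to illustrate the idea (the `append` callable and `max_attempts` value are hypothetical, not Octopus’s actual implementation):

```python
import time

def append_with_retry(append, entry, max_attempts=10):
    """Retry an append that may fail while another process holds the file lock.

    Mirrors the linear back-off seen in the server logs: attempt #n is
    followed by a wait of n seconds. `append` is a hypothetical callable
    standing in for the real log-append operation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return append(entry)
        except IOError:
            if attempt == max_attempts:
                raise
            # e.g. "attempt #3 faulted, retrying in 00:00:03"
            print(f"attempt #{attempt} faulted, retrying in 00:00:{attempt:02d}")
            time.sleep(attempt)
```

Once the other node releases the lock, one of these retries succeeds and the warnings stop, which matches what you’re seeing.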

Full steps to reproduce:

  • Create two-node HA instance and enable both nodes
  • Open the OctopusServer.txt log file for both nodes in Notepad++ and also open the scheduledtasks_processautodeployments.txt file from the leader node’s TaskLog folder (so you can monitor the files changing)
  • Create an environment with two cloud regions, cr-001 and cr-002 (you can use whatever deployment targets you want, cloud regions are easy for testing)
  • Disable cr-002, so only cr-001 is enabled
  • Create a project with a deployment process that sleeps for a while (e.g. 5 minutes), for example with this PowerShell script step:
Write-Host "start"
Start-Sleep -s 300
Write-Host "end"
  • Create a project trigger that will auto-deploy when machines are enabled
  • Deploy to your environment (where only cr-001 is currently online and available)
  • When the deployment has succeeded, enable cr-002 and watch the task screen to see when the trigger kicks in and starts an auto-deploy to cr-002
  • While auto-deploy is processing this deployment for 5 minutes, it will take a lock on the scheduledtasks_processautodeployments.txt file for the duration
  • Now go to your nodes screen and drain the leader node (let’s say this is node-1), forcing node-2 to become the leader. Watch this screen until you know the current leader has changed.
  • Node-2 will then start the auto-deploy scheduler, but node-1 still holds a lock on the scheduledtasks_processautodeployments.txt file (until its task has completed or it releases the lock), so node-2’s server logs will fill with warnings every 30 seconds until the file is free.
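The conflict in that last step can be simulated outside Octopus. Here is a rough Python analogue using POSIX advisory locks; the server itself uses Windows FileShare semantics, so this only sketches the principle, with the two file handles standing in for the two nodes:

```python
import fcntl
import os
import tempfile

# A stand-in for the shared task log file.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()

node1 = open(tmp.name, "a")  # the old leader, still mid-task
node2 = open(tmp.name, "a")  # the new leader, trying to take over

fcntl.flock(node1, fcntl.LOCK_EX | fcntl.LOCK_NB)  # node-1 holds the lock

try:
    fcntl.flock(node2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    conflict = False
except BlockingIOError:
    # The analogue of "the file is being used by another process"
    conflict = True

fcntl.flock(node1, fcntl.LOCK_UN)                  # node-1 finishes its task
fcntl.flock(node2, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now node-2 succeeds

node1.close()
node2.close()
os.unlink(tmp.name)
```

The second exclusive lock fails while the first is held and succeeds once it is released, which is exactly the window during which node-2 logs its warnings.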

As for the nodes reporting that they are unavailable in the UI, this is a temporary UI issue that resolves itself once Octopus successfully detects a heartbeat from your nodes. During an upgrade you may just have to wait for both nodes to fully resume operations, after which they will show as normal.

Hope this helps
Mark

On the Nodes tab things look good now, but I am still seeing the warning:

Append log entry to ScheduledTasks_ProcessAutoDeployments attempt #7 faulted, retrying in 00:00:07: The process cannot access the file ‘\imofiles\octopus\tasklogs\scheduledtasks_processautodeployments.txt’ because it is being used by another process.

This happens multiple times a day, and the nodes are not swapping roles.

Hi Leslie,

These shouldn’t be logged as errors, so we’ve made a GitHub issue here to investigate further. We will also look at improving the logging around nodes swapping roles, so we can determine how often this occurs. You can track that issue if you wish to be notified when this has been fixed.

Cheers
Mark