As part of our Octopus HA configuration we have tasklogs and artifacts saved to a shared DFS location. We’ve been experiencing issues today where we are seeing file locks in the octopus log and our nodes are intermittently reporting as being offline in the node configuration screen.
An example error we see in the tasks logs is…
"System.IO.IOException: The process cannot access the file '\\asdfsshare\octopusshare\tasklogs\scheduledtasks_cleanunavailablemachines.txt' because it is being used by another process."
Is there any reason why we would suddenly start seeing these issues? We haven’t made any changes to our Octopus configuration except for dropping one of our SQL boxes.
Thanks for getting in touch! Could you confirm which version of Octopus Server is running on the nodes of your HA cluster?
We shipped a fix for this kind of behaviour in Octopus 3.11.12. I would recommend upgrading; otherwise we will need to dig in further with you and figure out exactly what is going wrong.
From our testing there is no need to be terribly alarmed about this - it’s more an annoying behaviour than anything harmful.
Hope that helps!
Cheers for the response, we will be scheduling an upgrade as soon as possible.
Is there any documentation about how the Leader/Follower election process works within a HA setup? It’s not documented very clearly anywhere and it would be really helpful to understand what’s happening under the hood.
No problems! Please be sure to get back in touch after the upgrade if you see the same behaviour and we’ll get to the bottom of it.
I’ve had a think about it, and a follow-up chat with my team, about adding some documentation around the HA leadership election process. We concluded that at this point the documentation might not be super beneficial: if something goes wrong with HA leadership there’s not much you can do about it, and it’s a bug we’ll have to fix.
Is there something specific you would have been looking for?
The elevator pitch is: Octopus runs a scheduler loop, and each node in the HA cluster heartbeats into the Nodes table in SQL Server on each execution of the loop. If the leader hasn’t been seen for ~1 minute, another node will elect itself as the leader, since a stale heartbeat means the leader’s scheduler loop hasn’t been running and the leader won’t be scheduling its special tasks.
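To make the election rule concrete, here’s a minimal sketch in Python (not the actual Octopus implementation; the function names, the in-memory dict standing in for the Nodes table, and the exact timeout are all assumptions for illustration):

```python
import datetime

# Assumed timeout; the real value inside Octopus Server may differ.
LEADER_TIMEOUT = datetime.timedelta(minutes=1)

def heartbeat(nodes, node_name, now):
    """Each pass of the scheduler loop records this node's last-seen time
    (standing in for the row this node updates in the Nodes table)."""
    nodes[node_name] = now

def maybe_take_leadership(nodes, node_name, leader_name, now):
    """If the leader's heartbeat is stale (~1 minute), this node elects
    itself leader; otherwise the current leader is kept."""
    last_seen = nodes.get(leader_name)
    if last_seen is None or now - last_seen > LEADER_TIMEOUT:
        return node_name  # leader looks dead: take over leadership
    return leader_name    # leader is still heartbeating: no change
```

So a follower only promotes itself when the leader’s row hasn’t been touched for longer than the timeout, which is exactly why a stalled scheduler loop on the leader triggers an election.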
The bug was that a synchronous/foreground task had been added to the loop when it should have been asynchronous/background.
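To illustrate why that matters, here’s a hypothetical sketch (not Octopus code) of how running a slow task inline in the loop starves the heartbeat, while handing it to a background thread keeps the loop fast:

```python
import threading

def scheduler_loop_iteration(heartbeat, slow_task, background=True):
    """One pass of a scheduler loop. Running the slow task inline delays
    the next heartbeat; running it on a background thread lets the loop
    (and therefore the heartbeat) continue immediately."""
    heartbeat()
    if background:
        # Fire-and-forget: the loop returns right away.
        threading.Thread(target=slow_task, daemon=True).start()
    else:
        # Blocks the loop: no further heartbeats until this finishes,
        # which can make the node look dead and trigger an election.
        slow_task()
```

With `background=False`, a task that runs longer than the leader timeout would make an otherwise healthy leader appear offline.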
As part of this fix we also added code which should prevent this from ever happening again, and in the unlikely case it does, we will log a nice descriptive error message.
Hope that helps!