Release showing as success but steps showing as not run

Hi,
We are seeing an issue where the release is showing as success but the steps are still running or not showing as complete.

We upgraded our version yesterday to 2023.1.10235.

The issue was reported before and after the upgrade.

I can send the log, so if you can send me a link to upload.

Kind Regards,
Micheál Power

Hey @mikepower79,

Sorry you are running into this issue, I have seen this before but only when Octopus is in a HA configuration and the instances are using shared storage. I do not think that is the case with you is it?

I have created you a secure link here you can upload the task log to, is it just that one release that is having the issue?

Let us know once the logs are uploaded and I can take a look for you,
Kind Regards,
Clare

Hey @mikepower79,

Sorry for the double post here but a college of mine just pointed out can you check if the deployment has available targets as if it does not it will skip all the steps. It should show that in the task log.

If that is not the case I will await the logs and take a look!

Kind Regards,
Clare

Hi @clare.martin,
Our octopus is a HA instance and using Azure files storage.

Kind Regards,
Micheál Power

@mikepower79,

We have seen it before with shared storage where the deployment completes in the database, which is why you are seeing it as green in the UI, but the logs are still being written to, which is why it can take a bit of time for the UI to catch up. When we see this its usually intermittent and does not happen on all projects.

If you can send us the logs over just so we can check there is nothing else going wrong here but it might just be that it may take a bit of time for the logs to be written to for that release. If that is the case how are the metrics looking for the shared storage do you notice a lag there or is it slower than what it is normally?

I look forward to hearing from you,
Kind Regards,
Clare

Hi @clare.martin,
I have uploaded the deployment log and the system diagnostic report.

Kind Regards,
Micheál Power

Hey @mikepower79,

Thanks for those logs, it does look like they are full logs and the steps that are supposed to run are showing as successful, which does seem to point to the shared storage issue I mentioned.

I did also find this forum post you put up which is the same issue and Justin mentioned shared storage and also this post here which was shared storage related and Justin gave you some tips as to how to pinpoint what the issue was, if you have not followed those already that would be a good start to the troubleshooting process.

I also noticed you put in another request which looks shared storage related so I would definitely take a look at your shared storage and see if the metrics are on par with how they are supposed to look.

Let me know if they look ok but the faults you are putting in at the moment all seem to point to a shared storage issue I am afraid.

Kind Regards,
Clare

1 Like

Hi @clare.martin,
Cant see any issues with our Azure storage.
If it was an issue with our storage catching up with the UI then the deployment would catch up and the steps would show as complete in the UI.
But this is not the case the steps are still showing not complete after a week, the release is showing as complete but steps are not.

I will investigate further but for now our Storage looks good.

Kind Regards,
Micheál Power

Hey @mikepower79,

I took another look at your server logs, specifically for ServerTasks-3366253 as that is the task you are seeing as still showing as not completed and there are a few errors relating to what I think is the task log, this does point to permissions on the storage. I am unsure on what the first error means so I can get that in front of our engineers but it does look like you may need to take a look at your storage permissions as Octopus is struggling to write to task logs:

Access to the path '/taskLogs/servertasks-3366253_xfxfrqu8t9.txt' is denied.

System.ArgumentException: CorrelationId ServerTasks-3366253_XFXFRQU8T9 is not currently active
   at Octopus.Core.Diagnostics.TaskLog.TaskLogFactory.Get(CorrelationId correlationId) in ./source/Octopus.Core/Diagnostics/TaskLog.cs:line 377
   at Octopus.Server.Extensibility.ITSM.Core.Precondition.ChangeControlledDeploymentStartHandlerBase.Handle(DeploymentEvent e, CancellationToken cancellationToken) in ./source/Octopus.Server.Extensibility.ITSM.Core/Precondition/ChangeControlledDeploymentStartHandlerBase.cs:line 53
   at Octopus.Core.Infrastructure.MessageBus.Middleware.UnitOfWorkMiddleware.Invoke(MessageContext context, Func`2 next, CancellationToken cancellationToken) in ./source/Octopus.Core/Infrastructure/MessageBus/Middleware/UnitOfWorkMiddleware.cs:line 29
   at Octopus.Core.Infrastructure.MessageBus.Middleware.UnitOfWorkMiddleware.Invoke(MessageContext context, Func`2 next, CancellationToken cancellationToken) in ./source/Octopus.Core/Infrastructure/MessageBus/Middleware/UnitOfWorkMiddleware.cs:line 29
   at Octopus.Core.Infrastructure.MessageBus.Middleware.LoggingMiddleware.Invoke(MessageContext context, Func`2 next, CancellationToken cancellationToken) in ./source/Octopus.Core/Infrastructure/MessageBus/Middleware/LoggingMiddleware.cs:line 39
   at Octopus.Core.Infrastructure.MessageBus.Middleware.CorrelationIdMiddleware.Invoke(MessageContext context, Func`2 next, CancellationToken cancellationToken) in ./source/Octopus.Core/Infrastructure/MessageBus/Middleware/CorrelationIdMiddleware.cs:line 33
   at Octopus.Core.Infrastructure.MessageBus.PipelineFactory.BuildAndRun[TEvent](Int64 sequenceNumber, TEvent e, IDictionary`2 metadata, String consumerGroupId, Type handlerType, CancellationToken cancellationToken) in ./source/Octopus.Core/Infrastructure/MessageBus/PipelineFactory.cs:line 51
   at Octopus.Core.Infrastructure.MessageBus.MessagePump.Dispatch[TEvent](Int64 sequenceNumber, TEvent e, IDictionary`2 metadata, CancellationToken cancellationToken) in ./source/Octopus.Core/Infrastructure/MessageBus/MessagePump.cs:line 135
2023-05-18 10:26:47.6595      1     74  WARN  An unhandled exception occurred while dispatching an event via the message bus: "CorrelationId ServerTasks-3366253_XFXFRQU8T9 is not currently active"
System.ArgumentException: CorrelationId ServerTasks-3366253_XFXFRQU8T9 is not currently active
"CorrelationId: "ServerTasks-3366253_XFXFRQU8T9/6ec38cf4a5854df198facc13655a1f3a""
"Message: "Enable Maintenance completed"
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3366253_xfxfrqu8t9.txt' is denied.
 ---> System.IO.IOException: Permission denied
   --- End of inner exception stack trace ---
"CorrelationId: "ServerTasks-3366253_XFXFRQU8T9/85fe2e16c7224a33aad3a76cf403d11f/60ec2f6113ff4f64bff995305632c6a8/b9ff3e936ae4446580144d9d6066aa37""
"Message: "Calamari Version: 25.3.3"
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3366253_xfxfrqu8t9.txt' is denied.
 ---> System.IO.IOException: Permission denied
   --- End of inner exception stack trace ---
   at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
   at System.IO.Strategies.BufferedFileStreamStrategy.FlushWrite()
   at System.IO.Strategies.BufferedFileStreamStrategy.Dispose(Boolean disposing)
   at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing)
   at System.IO.StreamWriter.Dispose(Boolean disposing)
   at Octopus.Core.Orchestration.Logging.Processing.AppendToLogFile.InitializeInternal() in ./source/Octopus.Core/Orchestration/Logging/Processing/AppendToLogFile.cs:line 49
   at Octopus.Core.Orchestration.Logging.Processing.AppendToLogFile.Process(CorrelationId correlationId, ActivityLogEntry entry) in ./source/Octopus.Core/Orchestration/Logging/Processing/AppendToLogFile.cs:line 71
   at Nevermore.Transient.RetryPolicy.<>c__DisplayClass26_0.<ExecuteAction>b__0()
   at Nevermore.Transient.RetryPolicy.ExecuteAction[TResult](Func`1 func)
   at Nevermore.Transient.RetryPolicy.ExecuteAction(Action action)
   at Octopus.Server.Orchestration.Logging.ServerLogWriter.DoSafeAppend(ILogEntryProcessor processor, CorrelationId correlationId, ActivityLogEntry entry) in ./source/Octopus.Server/Orchestration/Logging/ServerLogWriter.cs:line 171
2023-05-18 11:18:19.9081      1    138 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3366253_XFXFRQU8T9/85fe2e16c7224a33aad3a76cf403d11f/60ec2f6113ff4f64bff995305632c6a8/b9ff3e936ae4446580144d9d6066aa37""
"Message: "Exit code: 0"
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3366253_xfxfrqu8t9.txt' is denied.
 ---> System.IO.IOException: Permission denied
   --- End of inner exception stack trace ---

You can see here too its failing to write to the logs:

2023-05-18 11:18:19.9081      1    138 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3366253_XFXFRQU8T9/85fe2e16c7224a33aad3a76cf403d11f/60ec2f6113ff4f64bff995305632c6a8/b9ff3e936ae4446580144d9d6066aa37""

Which would show as incomplete in the UI as the database has to get the information from the storage in order to update the steps in the UI. It will show as complete for the deployment as when that completes it updates directly from the database, but the logs themselves for the steps need to be taken from the storage and Octopus is having difficulty accessing them.

I will get the first error looked at by the engineers but in the meantime can you look at the permissions for that folder and ensure the service account Octopus is using for your HA nodes has access to that task logs folder.

I found at least 50 of those access denied errors so I would try and get those resolved and that might fix the issue.

I will update you on what the engineers say about the first error message.

Kind Regards,
Clare

Hi @clare.martin,
Ok will take a look at permissions on our storage.
Thanks for the quick response.

Kind Regards,
Micheál Power

1 Like

Hey @mikepower79,

Just to let you know our engineers got back to me last night and did mention we have another customer seeing this issue (we know their permissions are correct on their shared storage).

I will keep you updated on any discussions but as a precaution just to tripple check I would still check the permissions on the task logs folder just so we can rule that out. If they look ok at least we can say its not down to permissions as it may just be a co-incidence that this other customer has the same issue as you but its not entirely related to what may be the cause of your issue here.

Once I hear back from the engineers I will let you know.

Kind Regards,
Clare

Hey @mikepower79,

Sorry for the double post but I just took another look at the shortcut story the engineers have open and the other customer was getting the same error you are but only the first one -

System.ArgumentException: CorrelationId ServerTasks-3366253_XFXFRQU8T9 is not currently active

They think that is related to service now and Jira but that is what they are investigating, so the other customer is not seeing the tasks UI issue you are, I apologise for mixing that up!

So your issue looks to be permissions related but the engineers are looking into the ‘not currently active’ error.

Kind Regards,
Clare

Hey @mikepower79,

One last thing on that error:

System.ArgumentException: CorrelationId ServerTasks-3366253_XFXFRQU8T9 is not currently active

The engineers mentioned that error is is not impacting the ability to use Octopus. They think it’s caused by how task logs are being resolved when the message pump sends a message but it can be safety ignored.

The engineers will work on seeing what the cause of that issue is but it doesn’t factor into the issues you are seeing in the UI.

Hopefully the permissions check gets you somewhere but reach out if not and we can see if we can dig into this a little more.

Kind Regards,
Clare

Hi @clare.martin,
I have been notified that our storage account has no permissions restrictions.
We seen the issue again at the weekend where tasks were running but on the UI wasnt keeping up.

Kind Regards,
Micheál Power

Hey @mikepower79,

I am sorry we have missed this, I was on PTO for a few days and I think this slipped the net. I have sent this to our engineers to see if there is anything else we could get from you to help diagnose this issue but the task logs are pretty clear in that there are some sort of permission restrictions involved. Though I know you did say those have been checked.

Unfortunately the diagnostic report has been deleted as it gets automatically removed from our secure file site after 6 days. Has this issue happened again since you posted up last time?

Would you be able to send us over another diagnostic report by any chance if it has happened again, are you planning to upgrade to latest 2023.2 server version at all any time soon? Not that I think an upgrade would fix this issue but it would be nice to rule out a specific Octopus version being the cause.

As I mentioned, we only see this when there are either permissions issue on customers shared storage or there are performance issues (lag etc) with the storage itself and since no other customer has mentioned this issue after upgrading to 2023.1.10235 I do think this is environmental, we just have to figure out why its happening.

I am sorry this is taking a while to diagnose, if it has happened again here is the Secure File Site link you can send the report to and I can get that sent to our engineers, if you could also get us a HAR file when it happens so we have that to hand too.

Kind Regards,
Clare

Hi @clare.martin,
We have been notified that there are over 100 deployments in this situation.
We have tried to restart the tentacle service but still the deployment is succeeded but the child steps are showing as still running.

I have attached the server logs in the link and a HAR file.

Kind Regards,
Micheal Power

Hey @mikepower79,

Thanks for the logs and I am sorry this is such a headache for you at the moment, I can get this in front of our engineers but I am afraid we are still seeing access denied errors in your logs for the task logs and until that gets fixed there is not much we can do really as this is external so we cant fix any code or bug that is present, no other customers have reported this issue on 2023.1.10235 so I do not think this is a bug.

2023-06-16 02:42:37.4836      1     62 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3409467_S4WQ4ZJ8K8/1b2c3ef4ffcd44fc9797bc588c458065/9eb8dc6211a542a8b1a99915afa002f3/768eaec6a1f649debf528559d6bcea6d""
"Message: "Executing Validate Octopus release format (type Run a Script) on EUXXX60-ITXXXX89 "
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3409467_s4wq4zj8k8.txt' is denied.
 ---> System.IO.IOException: Permission denied
2023-06-16 02:42:59.4925      1     60 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3409460_GSYEL5L9N8/3ee7bed8c96a445aa54c4d5983b4f142/399d25deaf5f425389abdb185580d70e""
"Message: "Exit code: 0"
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3409460_gsyel5l9n8.txt' is denied.
 ---> System.IO.IOException: Permission denied
2023-06-16 02:44:59.5255      1     46 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3409467_S4WQ4ZJ8K8/3f11685b01ec41d3bec7e64fa6850f5c/08c4f34dea3545f39a0f332082ae95df""
"Message: "Performing variable substitution on 'D:\Octopus\```\Work\20230616024452-3409467-23826\Script.ps1'"
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3409467_s4wq4zj8k8.txt' is denied.
 ---> System.IO.IOException: Permission denied
2023-06-16 04:22:37.6543      1     50 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3409553_U4LSLMUJ2F/c4a959488aba469da5e70dbd387e974d/78ccfb36fff14e898ac1645c0cb7f1a4""
"Message: "Acquiring isolation mutex PackageRetention with NoIsolation in ServerTasks-3409553"
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3409553_u4lslmuj2f.txt' is denied.
 ---> System.IO.IOException: Permission denied
2023-06-16 04:34:53.9498      1    163 ERROR  Failed to append the following log entry:"
"CorrelationId: "ServerTasks-3409553_U4LSLMUJ2F/98767a07ba6f4d648566752989685c01""
"Message: "The step failed: Activity Deploying list of projects on EUXXX60-ITXXXX89 failed with error 'The remote script failed with exit code 1'."
System.UnauthorizedAccessException: Access to the path '/taskLogs/servertasks-3409553_u4lslmuj2f.txt' is denied.
 ---> System.IO.IOException: Permission denied

There are over 90 of those error messages in your latest server log, they are all different task logs and because Octopus does not have permissions to edit the task logs that is why you are seeing it in the UI where it completes the task (because that is a database function not a file share function) but then the UI shows the steps are still not complete because that relies on the fileshare for Octopus to write to the task log and send that back to the UI.

I can see if the engineers have a screenshot of the permissions they have for one of our cloud instances but usually its the Octopus server service account that needs full control to the Tasks Logs folder on the file share, I am sure you have seen the HA documentation but this bit describes what you need to do for the shared storage.

Does this happen on every project, you mentioned there are hundreds but that would be the case if there was a permissions issue.

If this does not happen on every project, if you have a few nodes and they have different service accounts that are running the Octopus Server service it may be that one nodes service account has not been granted access to the file share.

You mentioned you don’t have any access restrictions on the file share and if that is the case it might be worth trying to put the nodes service account in the file share and give it full control. I have seen it before where in Windows you need to actually put an account in the security of that folder for the account to be able to access it so it may be that you need to give explicit permissions here.

You can see what node is deploying what project using the dev tools on your browser so if you go to deploy a project you know is not having the UI issue and grab the node information, then you deploy a project that has the issue and they are different nodes you can look at the node with the issue and see what service account it is running under, you can then make sure that is added to the file share.

I would look at all of your nodes too and ensure they are all using the same service account for Octopus just so it makes it easier permissions wise for the file share, if one of the accounts is different it makes it harder to fault find on the nodes when this sort of thing happens.

Let me know if this happens on all of the deployments or just a few and let me know if you manage to pinpoint the node that’s the issue and add that to the shared storage, I would look at all the nodes and make sure they are all using the same service account and I would also try explicitly adding that account to the file share Task Log folder to see if that alleviates this issue.

It might be worth getting us a screenshot of the Task Log folder for your file share if possible (just the permissions) so we can take a look and see if we can advise there. Regarding the engineers, I will send the information we have to them but as mentioned we can only advise on permissions at this point because we cannot control that, the only error we have to go off at the moment is there are permission issues on your file share which all line up with the issues you are seeing in the UI.

I am really hoping we can get to the bottom of this as I can completely see this being a huge pain point for you so I am sorry this is happening.

Let me know how you get on with what I have suggested and if you can get us a screenshot of the permissions (and possibly the Octopus server service - just for the account that’s running the nodes) that would be helpful.

Kind Regards,
Clare

@mikepower79,

One more thing, sorry I know my last response was long, I just took a look at the HAR file and there are no errors (which there would not be if the task log was unable to be written to) the UI will just show what the task logs have in them (since they have things in them the HAR file throws no errors) and the deployment is successful (because the DB shows it is) so no errors would be present in the HR file. So that further points to an issue writing to the task logs rather than the Octopus backend code I am afraid.

At least it did rule something untoward happening though that we have not accounted for.

Kind Regards,
Clare

Hi @clare.martin,
I want to send you a on our Azure file share logs, can you send a link to upload.
It looks like one .txt file is trying to be accessed over and over again in the space of 20 minutes. I just want to verify if this is normal behaviour.

We are currently investigating the permission denied errors with Microsoft.

Kind Regards,
Micheal Power

Hey @mikepower79,

Of course we will take a look at that for you, here is the secure link I have generated for you, let me know when you have uploaded those logs and we will take a look.

Good to note you have brought Microsoft into the equation too, hopefully they should be able to help. I got our lead engineer to take a look at our file share for cloud customers the other day but they are setup slightly differently permission wise so there was no clear screenshot we could get which would mimic an on-prem instance with HA and send it to you so you had it to hand I am afraid.

We are keen to getting to the bottom of this for you as its a bit odd its just started happening as you have been running fine permissions wise for awhile based on your previous tickets.

I look forward to taking a look at the logs.

Kind Regards,
Clare