Runbook gets canceled on a long task

I have an Octopus Runbook with a step that clones a repository from a network address I give it. Every time I give the Runbook a larger repository to clone, it fails with the following message:

“This task was canceled. The task was executing, but the Octopus Server process that was executing the task was terminated unexpectedly, or the Octopus server computer shut down, potentially leaving this task in an inconsistent state. Please check the Octopus server log files for reasons for the termination.”

The Runbook starts the cloning process, but is unable to ever finish it before the task is terminated. Any ideas what might cause this?

Hi @rcmtsbhahr

I’m sorry you’ve hit this issue. I’m not sure what is causing this but I’m looking forward to finding out.

“The Runbook starts the cloning process…” - Would you be able to send over the JSON file of the Runbook?


Can you also send through the Raw Tasklogs for that failing task?

You can send all your files using this secure link.

Regards,

Dane

Hi,

I sent the files.

Hi @rcmtsbhahr,

The logs you’ve sent through don’t show any errors, which is unusual.

Where do you see the error you mentioned before?

"This task was canceled. The task was executing, but the Octopus Server process that was executing the task was terminated unexpectedly, or the Octopus server computer shut down, potentially leaving this task in an inconsistent state. Please check the Octopus server log files for reasons for the termination.”

I was assuming it would appear in the Tasklogs you sent through?
I’m assuming it’s the ‘git clone’ command that is failing?

Can you run the Git Clone command outside of the Octopus ecosystem on the target?
This will determine if it’s Octopus that is causing the issue or some other part of the pipeline.

If it is SSH that is terminating the connection, that might explain why the logs don’t appear in Octopus. Perhaps you can look at your SSH logs for a termination reason.
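For reference, one way to capture that SSH termination reason client-side is to run the clone by hand with SSH debug output enabled. This is a sketch; the repository URL is a placeholder, not the actual address from this thread:

```shell
# Tell Git to run SSH with full debug output (-vvv) for the clone.
export GIT_SSH_COMMAND="ssh -vvv"

# Then run the real clone on the target, capturing SSH's stderr, e.g.:
# git clone ssh://git@bitbucket.example.com/proj/repo.git 2> ssh-debug.log
```

The end of `ssh-debug.log` should then show why the connection was closed, if SSH is the one dropping it.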

Also, and this is a guess because we don’t have all of the logs available and a lot of the information in the supplied files is redacted: are you creating the Azure target and then attempting to use it as the clone target immediately?

If that’s the case, when you are testing outside of Octopus, I would test running the entire pipeline, as it may be that the deployment target is not ready to write data to that location. Again, this is an absolute guess, as there isn’t much context without seeing the actual error and being able to correlate it with the timestamps.

If you could provide some more context about where you are seeing the error as well as testing outside of Octopus, we should be in a better position to investigate.

Regards,

The git clone is currently the only command in the task that is failing.

The Git Clone command works outside of the Octopus ecosystem on this target repository, and the Git Clone command works inside the Octopus ecosystem on this Runbook when the repository I provide is smaller in size. The issue appears only when I try to give it a larger (~1.5 GB) repository to clone. The task is able to start cloning the repository, but seems to always fail midway through. I provided a picture of where the error appears.

The Azure part is for authentication with our Azure Kubernetes Clusters and is very unlikely the cause for this issue.

I will check our SSH logs tomorrow. Hopefully the logs will provide further information about the issue at hand.

Hi @rcmtsbhahr,

Thank you for all of the information. We have some engineers looking at this issue. As we try to reproduce this, we may reach out to you for further clarification.

The SSH logs may hold some very relevant information so if you could get those logs, it would be much appreciated.

At this point in time, we still don’t know if it’s a size limit or a timeout limit, but we will continue investigating and will keep you up to date as we get more information.

Regards,

I sent task logs for a successful and a failed run with SSH verbosity on. Our server-side SSH logs are unfortunately held by a third party, and I cannot provide them to you.

I did some testing and noticed a couple things about the cloning:

  • I can get a successful clone if I provide a “--depth” value to the clone, but a shallow clone unfortunately does not serve the purpose of this Runbook and cannot be the solution here.
  • I split the clone into git init, git fetch, and git pull. The result was the same: the task failed during the fetch phase. The fetch stopped at the exact same spot as the git clone command:
    “debug1: Sending env LC_ALL = C.UTF-8”.
  • Raising the ssh_config values “ServerAliveInterval” and “ServerAliveCountMax” did not affect the cloning.
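For reference, the split I used can be sketched roughly like this. To keep the sketch self-contained it uses a throwaway local repository as the remote (and a final checkout in place of the pull); in the real test, “origin” points at the actual ssh:// URL:

```shell
# Stand in for the real remote with a throwaway local repository.
src=$(mktemp -d)
git -C "$src" init -q -b main
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Decompose the clone so the failing phase can be isolated.
work=$(mktemp -d)
git -C "$work" init -q -b main
git -C "$work" remote add origin "$src"
git -C "$work" fetch -q origin                 # the phase that stalls on the ~1.5 GB repo
git -C "$work" checkout -q -b main origin/main # materialize the working tree
```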

I will most likely try out cloning through HTTP next to see what happens. Hopefully these logs will provide some help to the troubleshooting.

Hi @rcmtsbhahr,

Thank you. There is a lot of helpful information here.

We have spent some of this morning digging through the issues and log files, and it seems like Stack Overflow may have answered our questions.

One of the interesting things that we have found in your logs, is this line:

05:25:27   Verbose  |       "azureActiveDirectoryPassword": {
05:25:27   Verbose  |       "Type": "String",
05:25:27   Verbose  |       "DependsOn": nullllGITLOW_SPEED_LIMIT for longe

Although this log line is malformed, which is quite odd, it does reference the SPEED_LIMIT argument from the top answer in the Stack Overflow link I sent through. I wonder if there is a correlation between a low SPEED_LIMIT and the failure of the Git commands.

The second thing to try is also from the same Stack Overflow post. Setting curl to verbose may help with diagnosing the issue, and capping max_requests at something like 16 should reduce the chance that the transfer speed drops below the SPEED_LIMIT minimum.

I would really like to see if purely using HTTP fixes the issue. If not, I would like to see whether curl’s verbose output provides any more information. Finally, if neither of these gives us any obvious improvement or direction, try the rest of the arguments in that first answer of the Stack Overflow post.
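For reference, the knobs above map to real Git settings. A sketch, with the clone URL as a placeholder (not the actual repository from this thread):

```shell
# Disable Git's low-speed abort so a slow-but-alive transfer isn't killed:
# http.lowSpeedLimit is in bytes/sec (0 = never abort on speed),
# http.lowSpeedTime is the window in seconds.
git config --global http.lowSpeedLimit 0
git config --global http.lowSpeedTime 999999

# Then retry the clone over HTTP(S) with verbose curl output and capped
# request pipelining, e.g.:
# GIT_CURL_VERBOSE=1 GIT_HTTP_MAX_REQUESTS=16 \
#     git clone https://bitbucket.example.com/scm/proj/repo.git
```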

I am sorry that I can’t provide an answer more scoped to this problem, but I think that after performing the steps outlined above, we should be able to fix the issue, or at least identify what is going wrong.

Regards,

Hi,

Setting these parameters did not have any effect on the outcome:

  • git config --global http.postBuffer 524288000
  • GIT_CURL_VERBOSE=1
  • GIT_HTTP_MAX_REQUESTS=16

I also tried cloning over HTTP and it did not work. I have the task log if you want to check it out.

Hi @rcmtsbhahr,

I’m just stepping in for Dane while he is away. Could you send us the latest task log at your earliest convenience? Your original secure upload link may have expired, so I have sent you an updated link through DM.

I look forward to hearing back, and please let me know if you have any questions for us.

Thanks!
Dan

I sent the file.

Thanks @rcmtsbhahr,

Looks like the Bitbucket connection tries to connect via TLS 1.2, and when that fails, the connection then tries to use HTTP/1.1 but assumes there is a valid certificate. If it is connecting via HTTP only, no certificate would be necessary, so you get odd values like this:

07:56:23   Error    |       < Expires: Tue, 01 Jan 1980 00:00:00 GMT

Can you enable TLS 1.2 for your Bitbucket connection? I think this may allow a successful connection.
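On the client side, one way to pin Git’s HTTPS transport to TLS 1.2 is the http.sslVersion setting; whether your Bitbucket server accepts it is server configuration, which is an assumption here:

```shell
# Tell Git's HTTPS transport to negotiate TLS 1.2 as the minimum protocol
# version for https:// remotes (has no effect on ssh:// remotes).
git config --global http.sslVersion tlsv1.2
```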

I will raise this with our engineers, because it seems like this should fail with a “TLS 1.2 not successful” error rather than retrying the way it does.

Let me know how it goes.

Regards,

I doubt the issue is related to TLS 1.2, because when I run this image in a separate Docker container, the connection to the Bitbucket server happens exactly the same way as it does in this Octopus container, and it connects just fine using HTTP/1.1. I think the underlying issue here is the same as when connecting via SSH: the Runbook just gets canceled at some point during the cloning process without giving any indication why.

Hi @rcmtsbhahr,

Thanks for getting back in touch.
I’ve conveyed your concerns to our engineers who will review the issue you’re experiencing further.

They advised previously that they’re looking at improving performance in this area; however, I would like to await their assessment to see if we can find a better solution for your specific case, or if it is, unfortunately, a performance bottleneck.

I’ll be in touch once I have feedback from them but please let me know if you have any questions or concerns in the meantime.

Kind Regards,
Adam

Hi,

I managed to solve the issue. The Runbook cancellation was caused by the Kubernetes nodes’ ephemeral-storage limit being too low for the Pods to handle the larger repositories.
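For anyone hitting the same symptom: in our case this corresponds to the Pod’s ephemeral-storage requests and limits. A minimal illustrative sketch (the names, image, and sizes below are placeholders, not our actual configuration; depending on your setup, the fix may instead be larger node disks):

```yaml
# Illustrative Pod spec: give the container running the clone enough
# ephemeral storage for the ~1.5 GB repository plus working space.
apiVersion: v1
kind: Pod
metadata:
  name: runbook-worker            # hypothetical name
spec:
  containers:
    - name: clone-step
      image: example/worker:latest  # placeholder image
      resources:
        requests:
          ephemeral-storage: "2Gi"
        limits:
          ephemeral-storage: "4Gi"
```

When a Pod exceeds its ephemeral-storage limit, Kubernetes evicts it, which from the outside looks like the task being terminated without an error in the task log.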

Hey @rcmtsbhahr,

Brilliant news that you managed to fix the issue, and thank you for posting the resolution. I will pass this over to our engineers so they don’t look into it further. It is also another thing we can suggest customers check if they are having this issue and are running Kubernetes clusters.

If you need any help in the future, please reach out.

Kind Regards,

Clare