Large packages causing deployment failures

reliability
(jwdean) #1

Raising this topic again for a bit more clarity. This thread describes exactly what we’re encountering along with a possible solution:

Our use case is the deployment of a 200MB file to a remote site with multiple tentacles and a relatively low bandwidth network connection. Concurrent transfers effectively saturate the network connection, and the deployment fails once a transfer exceeds the 2 minute Process Step timeout.
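A rough back-of-the-envelope sketch of why concurrency matters here. The 200MB package and the 2 minute timeout come from the post; the 20 Mbit/s link speed, the four concurrent tentacles, and the assumption that the link is shared equally are hypothetical figures chosen only for illustration.

```python
def transfer_seconds(package_mb: float, link_mbps: float, concurrent: int) -> float:
    """Estimate seconds per transfer when `concurrent` transfers share a link equally."""
    if link_mbps <= 0 or concurrent < 1:
        raise ValueError("link_mbps must be positive and concurrent must be >= 1")
    per_transfer_mbps = link_mbps / concurrent
    return package_mb * 8 / per_transfer_mbps  # megabytes -> megabits


TIMEOUT_SECONDS = 120   # the 2 minute timeout mentioned in the post
PACKAGE_MB = 200        # package size from the post
LINK_MBPS = 20          # hypothetical site bandwidth

for tentacles in (1, 4):
    seconds = transfer_seconds(PACKAGE_MB, LINK_MBPS, tentacles)
    verdict = "exceeds" if seconds > TIMEOUT_SECONDS else "fits within"
    print(f"{tentacles} concurrent: {seconds:.0f}s, {verdict} the {TIMEOUT_SECONDS}s timeout")
```

With these assumed numbers a single transfer finishes inside the timeout, while four concurrent transfers each take several times longer than it allows.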

The conclusion of that thread is to adjust this timeout to extend the time allowed to deploy the package:

<add key="Halibut.PollingRequestMaximumMessageProcessingTimeout" value="01:00:00" />
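For anyone scripting this change across servers, here is a minimal sketch of adding or updating that key programmatically. It assumes the setting lives in a .NET-style appSettings section under a configuration root; the thread does not name the file, so the sample document and the helper name set_app_setting are hypothetical.

```python
import xml.etree.ElementTree as ET


def set_app_setting(config_xml: str, key: str, value: str) -> str:
    """Add or update <add key=... value=.../> under <appSettings> and return the new XML."""
    root = ET.fromstring(config_xml)
    app_settings = root.find("appSettings")
    if app_settings is None:
        app_settings = ET.SubElement(root, "appSettings")
    for node in app_settings.findall("add"):
        if node.get("key") == key:
            node.set("value", value)  # key already present: overwrite its value
            break
    else:
        ET.SubElement(app_settings, "add", {"key": key, "value": value})
    return ET.tostring(root, encoding="unicode")


# Hypothetical, minimal stand-in for the real configuration file.
sample = "<configuration><appSettings /></configuration>"
updated = set_app_setting(
    sample,
    "Halibut.PollingRequestMaximumMessageProcessingTimeout",
    "01:00:00",
)
print(updated)
```

ElementTree drops XML comments and the declaration when it re-serialises, so back up the original file before writing anything like this over it.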

Since this seems to be a system-wide change, I need to understand its impact on other Octopus tasks before making it. What else is affected by this change? Will the timeout be multiplied by offline tentacles? During regular operations we expect there may be hundreds of offline tentacles. Are there any more granular controls available, or perhaps a different way to address this now?

Thank you,
Jeff

(Shannon Lewis) #3

Hi Jeff,

Thanks for reaching out. We have had some work going on in this space recently and I’m just trying to get some more information for you from the team.

We’ve seen similar things happening from servers in Azure, for example, where package transfers longer than 2 minutes cause the connections to drop. You mentioned a low bandwidth connection; are you on managed infrastructure?

From your description you have multiple tentacles on the other side of this low bandwidth connection. In the past we’ve talked about something we’ve dubbed an “Edge server”, which would essentially be a package proxy sitting on the tentacle side of the connection. Octopus would push the package to it once, and the tentacles would then source the package from it over their local network. Does something like that sound like it would resolve the issue you’re experiencing?

Any further insight you can provide into your scenario and configuration would be greatly appreciated.

Regards
Shannon

(jwdean) #4

Thanks Shannon,

We manage our own Octopus servers.

We have hundreds of tenants, each representing one customer. Each customer provides a connection based on what their ISPs can provide and what is appropriate for their business. With locations scattered around the world we deal with varying degrees of bandwidth and connection quality. Fundamentally we have little control over those connections, other than that they are expected to be always-on broadband.

There is an edge server concept built into WSUS called Branch Cache that we investigated. There was no compelling reason for us to implement it, since the delivery model for WSUS is lazy and doesn’t have this type of failure mode. For Octopus, it’s an interesting idea, but I’m not sure it addresses the underlying issue.

Pros:

  1. More efficient use of available bandwidth. Delivering large updates to our fleet is a choke point we manage on other systems.

  2. Having Tentacles source packages from a system on site should improve update performance at bandwidth-limited sites.

Cons/Questions/Thoughts:

  1. Since the timeouts still exist, package size and bandwidth restrictions may still cause delivery to the edge itself to fail.

  2. Would the edge server be a true proxy for all tentacle communication between the site and server? Or only for package delivery?

  3. How would it recover when the Edge server is unavailable, or becomes unavailable mid-deployment?

  4. Scale issues may exist as packages collect on a single remote system. In our case the Tentacles are mostly inexpensive and unique physical systems with limited storage.

  5. Perhaps more interesting would be an edge server that is a… deployment agent… where deployments to related tentacles are offloaded from the primary Octopus server to the edge.

What we expected is that release steps (especially the package deployment step) would tolerate slow transfers and recover from any short connection disruptions. This package is only 200MB, but we have update packages in the works that will be closer to 500 MB.
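To make that expectation concrete, here is a minimal sketch of a transfer that resumes from the last delivered offset instead of restarting after a dropped connection. It is not how Octopus or Halibut transfers packages; the function send_with_resume, the FlakyReceiver class, and the tiny chunk size are all hypothetical and exist only to illustrate the behaviour.

```python
from typing import Callable


def send_with_resume(data: bytes, send_chunk: Callable[[int, bytes], None],
                     chunk_size: int = 4, max_retries: int = 5) -> int:
    """Send data chunk by chunk, retrying a failed chunk in place. Returns the retry count."""
    offset = 0
    retries = 0
    while offset < len(data):
        chunk = data[offset:offset + chunk_size]
        try:
            send_chunk(offset, chunk)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue  # retry the same offset; delivered bytes are not resent
        offset += len(chunk)
    return retries


class FlakyReceiver:
    """Simulated remote end that drops the connection on selected calls."""

    def __init__(self, fail_on_calls):
        self.buffer = bytearray()
        self.calls = 0
        self.fail_on_calls = set(fail_on_calls)

    def receive(self, offset: int, chunk: bytes) -> None:
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError("simulated short disruption")
        if offset != len(self.buffer):
            raise ValueError("chunk arrived at an unexpected offset")
        self.buffer.extend(chunk)


receiver = FlakyReceiver(fail_on_calls={2, 3})
payload = b"0123456789ABCDEF"
retry_count = send_with_resume(payload, receiver.receive)
print(retry_count, bytes(receiver.buffer) == payload)
```

The point of the design is that a disruption costs only the chunk in flight, so a slow link affects total duration without forcing the whole package to be sent again.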

One point I accidentally left off my original post is that we also have trouble with Tentacles after this timeout error. At that point they show as offline. Even manual health checks failed until we restarted their Octopus service. I presume that’s related, but I do not have more information other than that it happened regularly.

Yours,
Jeff

(Shannon Lewis) #5

Thanks for the additional info.

In regard to your original question, I don’t believe changing that Halibut timeout would have any adverse effects. We are looking into what we might be able to do to give more granular control for this type of scenario in the future.

(tgillitzer) #6

@Shannon_Lewis An interesting tidbit: the Halibut.PollingRequestMaximumMessageProcessingTimeout setting only affects the first attempt to download the package. The subsequent retries time out after 2 minutes (even using the default setting).

Here is a cleaned-up log:

A request was sent to a polling endpoint, the polling endpoint collected it but did not respond in the allowed time (00:10:00), so the request timed out.
File upload failed. Retry attempt 1 of 5…
Beginning streaming transfer of file.nupkg
A request was sent to a polling endpoint, but the polling endpoint did not collect the request within the allowed time (00:02:00), so the request timed out.
File upload failed. Retry attempt 2 of 5…
Beginning streaming transfer of file.nupkg

The error that gets reported out of the task is the last one, which makes it look like there is still a 2 minute timeout.

(David Young) #7

Thanks for investigating that and reaching out again, tgillitzer.

It sounds like you’ve isolated a real issue, so I want to let you know we’re working on reproducing it, and will give you an update soon on our findings, along with options for fixes or workarounds.

Let me know if you have any more ideas or questions.

Regards,
David.

(David Young) #11

Hi tgillitzer,

I’d like to update you on our findings after reviewing this.

We’ve run some tests with polling tentacles on a slow network using Octopus Server 3.3.6, the version you mentioned running, and have found that the allowed time in the message you are seeing is actually controlled by the Halibut.PollingRequestQueueTimeout setting.

On our systems, timeouts controlled using that setting do continue to apply during retries, and I can’t find anything in the code that looks like it would be resetting to default.
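As a small aid for reading logs like the one quoted earlier in the thread, here is a sketch that classifies the two timeout messages and pulls out the allowed time. The mapping of the "did not collect" message to Halibut.PollingRequestQueueTimeout follows the finding above; the mapping of the "collected it but did not respond" message to Halibut.PollingRequestMaximumMessageProcessingTimeout is an assumption based on the setting name, and the function classify_timeout is hypothetical.

```python
import re
from typing import Optional, Tuple

# Message wording is taken from the log quoted in this thread.
NOT_COLLECTED = re.compile(
    r"did not collect the request within the allowed time \((\d{2}:\d{2}:\d{2})\)")
NOT_RESPONDED = re.compile(
    r"collected it but did not respond in the allowed time \((\d{2}:\d{2}:\d{2})\)")


def classify_timeout(line: str) -> Optional[Tuple[str, str]]:
    """Return (likely setting, allowed time) for a timeout line, or None for other lines."""
    match = NOT_COLLECTED.search(line)
    if match:
        return ("Halibut.PollingRequestQueueTimeout", match.group(1))
    match = NOT_RESPONDED.search(line)
    if match:
        # Assumed mapping; the thread does not confirm it explicitly.
        return ("Halibut.PollingRequestMaximumMessageProcessingTimeout", match.group(1))
    return None


log = [
    "A request was sent to a polling endpoint, the polling endpoint collected it "
    "but did not respond in the allowed time (00:10:00), so the request timed out.",
    "File upload failed. Retry attempt 1 of 5…",
    "A request was sent to a polling endpoint, but the polling endpoint did not "
    "collect the request within the allowed time (00:02:00), so the request timed out.",
]
for entry in log:
    print(classify_timeout(entry))
```

Read this way, the two timeouts in the log are separate limits, which would explain why raising only one of them leaves the 2 minute message in place.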

A lot of refinements have been made to Server/Tentacle communication since 3.3.6, so I recommend staying up to date with the LTS release schedule, which is also where we focus our resources in providing ongoing support.

Thank you for reporting this, and helping us make Octopus better, even though we can’t always find the answer.

Regards,
David Young.
