Intermittent failure while acquiring a package on a Tentacle from an external feed

Hi Team,

We are facing intermittent failures while acquiring packages during deployments in a few environments that have 50+ deployment targets. There are 26 packages acquired on these machines. During some deployments, package acquisition fails for a random package, while the same package downloads successfully on other machines in the same deployment.

Error:

Downloading NuGet package ABC v1.0.0 from feed: 'http://abc:8081/xxx/xxx/xxx/'
Failed to download package ABC v1.0.0 from feed: 'http://abc:8081/xxx/xxx/xxx/'
Could not find Zip file Directory at the end of the file. File may be corrupted.
SharpCompress.Common.ArchiveException
at SharpCompress.Common.Zip.SeekableZipHeaderFactory.SeekBackToHeader(Stream stream, BinaryReader reader, UInt32 headerSignature)
at SharpCompress.Common.Zip.SeekableZipHeaderFactory.d__3.MoveNext()
at SharpCompress.Archives.Zip.ZipArchive.d__16.MoveNext()
at SharpCompress.LazyReadOnlyCollection`1.LazyLoader.MoveNext()
at Calamari.Integration.Packages.NuGet.LocalNuGetPackage.ReadMetadata(String filePath)
at System.Lazy`1.CreateValue()
at System.Lazy`1.LazyInitValue()
at Calamari.Integration.Packages.PackageName.FromFile(String path)
at Calamari.Integration.Packages.Download.NuGetPackageDownloader.DownloadPackage(String packageId, IVersion version, Uri feedUri, ICredentials feedCredentials, String cacheDirectory, Int32 maxDownloadAttempts, TimeSpan downloadAttemptBackoff)
at Calamari.Integration.Packages.Download.NuGetPackageDownloader.DownloadPackage(String packageId, IVersion version, String feedId, Uri feedUri, ICredentials feedCredentials, Boolean forcePackageDownload, Int32 maxDownloadAttempts, TimeSpan downloadAttemptBackoff)
at Calamari.Integration.Packages.Download.PackageDownloaderStrategy.DownloadPackage(String packageId, IVersion version, String feedId, Uri feedUri, FeedType feedType, ICredentials feedCredentials, Boolean forcePackageDownload, Int32 maxDownloadAttempts, TimeSpan downloadAttemptBackoff)
at Calamari.Commands.DownloadPackageCommand.Execute(String[] commandLineArguments)

We have noticed that when these errors occur, the package file is present on the machine where the failure was reported, but its size is incorrect and the file is corrupt. To fix the error, we have to manually delete the file and retry the deployment. We have found idle-timeout errors in our feed endpoint logs, though there is no network contention or IOPS issue. There should be better error handling here instead of a runtime error, perhaps logic to delete the corrupt package and retry (as is done for a normal timeout when fetching a package from the feed endpoint). We would also like some help understanding the idle timeouts if possible. Ideally, if a package file has not been downloaded completely, its file name should be different to indicate the download is incomplete; that is not the case in these errors.
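To illustrate the delete-and-retry behaviour being requested: a truncated NuGet package is just a zip file whose End of Central Directory record is missing, which is exactly the condition the SharpCompress error reports. A minimal sketch in Python (not Octopus code; the `download(path)` callable is a placeholder for whatever performs the actual feed request) could validate the archive after each attempt and delete the corrupt file before retrying:

```python
import os
import time

EOCD_SIGNATURE = b"PK\x05\x06"  # End of Central Directory record marker

def zip_has_central_directory(path, tail_bytes=66000):
    """Return True if the file ends with a readable EOCD record.

    SharpCompress raises "Could not find Zip file Directory at the end
    of the file" when this record is missing, which is what a truncated
    download looks like. The EOCD sits within the last ~65KB of a valid
    zip (22 bytes plus an optional comment), so scanning the tail is enough.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - tail_bytes))
        return EOCD_SIGNATURE in f.read()

def download_with_retry(download, path, max_attempts=3, backoff_seconds=5):
    """Run the caller-supplied `download(path)`, deleting the file and
    retrying whenever the result fails the EOCD check."""
    for attempt in range(1, max_attempts + 1):
        download(path)
        if zip_has_central_directory(path):
            return path
        os.remove(path)  # discard the truncated file before retrying
        if attempt < max_attempts:
            time.sleep(backoff_seconds)
    raise IOError("Package at %s was truncated after %d attempts" % (path, max_attempts))
```

The key point is that the corrupt file never survives a failed attempt, so a manual delete-and-redeploy is no longer needed.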

Server details:
Octopus version: 2020.2.15
Instance type: C5.2xlarge
number of nodes in cluster: 1
task cap: 50
Octopus.Acquire.MaxParallelism : 50
Octopus.Action.MaxParallelism : 100

Previous case we raised for the same issue:

Our final package repository behind the cache/forwarder machine has been changed from ProGet to Nexus. We are also using Nexus on our forwarder machine (as we were even when the final repository was ProGet).

Thanks and Regards,
Devan

Hi Devan,

Thank you for contacting Octopus Support.

I have reviewed the previous ticket regarding this issue.

Just to make sure I understand the current situation correctly: you are no longer using ProGet and are now using Nexus for your external feeds, and you are seeing the same intermittent issue with this new feed. Is that correct?

Is there a particular package that seems to fail more than others or any commonality? For example, is it only packages over a certain size that seem to fail?

Let me know at your earliest convenience.

Regards,
Donny

Hey Donny,

Usually the package size is greater than 100 MB. We have multiple such packages, and the failure occurs randomly for 1-2 servers out of 50+, while the rest download perfectly. As soon as we delete the corrupt file, the re-acquire succeeds. I suspect the number of requests to the feed within a short time period is the culprit. With max parallel package acquisition set to 100, failures were frequent; after reducing it to 50, the frequency has dropped.
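For context on what this setting controls: Octopus.Acquire.MaxParallelism caps how many package acquisitions run against the feed concurrently, which is why lowering it from 100 to 50 softened the request burst at the cost of deployment time. A minimal sketch of that throttling pattern (illustrative only; the `download` callable is hypothetical, not Octopus code):

```python
import threading

def acquire_packages(download, packages, max_parallelism=50):
    """Run `download(pkg)` for every package, but allow at most
    `max_parallelism` downloads to be in flight at once. Lowering the cap
    trades longer acquisition time for a gentler burst on the feed."""
    gate = threading.Semaphore(max_parallelism)
    errors = []

    def worker(pkg):
        with gate:  # blocks until a download slot is free
            try:
                download(pkg)
            except Exception as exc:
                errors.append((pkg, exc))

    threads = [threading.Thread(target=worker, args=(p,)) for p in packages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```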

Regards,
Devan

Hi Devan,

Thank you for the quick response.

Have you tried reducing Octopus.Acquire.MaxParallelism any lower than 50? If not, would it be possible to try this setting at 25 to see if failures are further reduced or eliminated?

Let me know your thoughts on this.

Regards,
Donny

Hi Donny,

Setting it to 50 has already increased deployment time by 4-5 minutes. Lowering it to 25 would add another 4-5 minutes, which is significant for a production deployment. After this deployment we have more projects to deploy, and this is an almost daily routine.

Regards,
Devan

Hi Devan,

Thank you for getting back to me.

Apologies in advance, I have a few more questions:

  • Are the tentacles all on separate machines or are there machines with multiple tentacles installed?
  • How many deployments are typically running on the Octopus Server at one time?
  • Can you provide recent Octopus Server logs and representative raw task logs from a recent deployment?

You may upload files via the following secure link:

I look forward to hearing back from you.

Regards,
Donny

Hi Donny,

I have uploaded the raw deployment log for one such scenario. Unfortunately, I don’t have the Octopus Server logs for that day.

Regards,
Devan

Hi @d.jain,

Thank you for getting back to me and providing the task logs.

I did some digging and was able to find an internal pull request that appears to address this specific issue. It looks like the PR stemmed from a stress test performed on one of our Octopus Cloud instances back in August of last year. Normally we would have a GitHub issue logged for this that I could share with you; however, there is not one in this case, and our pull requests are not publicly visible.

The root of the issue appears to be a timing problem where near-simultaneous downloads attempt to pull a still-downloading package from the cache rather than from the source. The fix was applied to all versions 2020.5 and later.
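The usual remedy for this class of race, and one consistent with Devan's observation that a partially downloaded file should carry a different name, is to download to a uniquely named temporary file and atomically rename it into the cache only once it is complete. A hedged sketch of that pattern (the `fetch` callable is hypothetical; this is not the actual Calamari implementation):

```python
import os
import tempfile

def cache_package_atomically(fetch, cache_dir, package_file_name):
    """Write the download to a uniquely named ".downloading" temp file,
    then atomically rename it into the cache. Concurrent acquisitions
    either see no cached file (and download for themselves) or see a
    complete one; a half-written package is never visible under its
    final name."""
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, package_file_name)
    if os.path.exists(final_path):
        return final_path  # a complete copy is already cached
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".downloading")
    try:
        with os.fdopen(fd, "wb") as tmp:
            fetch(tmp)  # stream the package body into the temp file
        os.replace(tmp_path, final_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a truncated file behind
        raise
    return final_path
```

The temp file is created in the cache directory itself so the rename stays on one filesystem, which is what makes it atomic.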

Does this description line up with what you have been seeing? If so, an upgrade should resolve it.

Let me know your thoughts at your earliest convenience.

Regards,
Donny