1.6.2.1722: Extremely slow "Finding package" step, plus timeouts when downloading package

Rodolfo_Grave · 6 January 2014 16:46

Hi.

Problem:
Octopus is taking 10 minutes to “Find package” and then another 11 minutes to upload the package to the agent.

Context:

The nuget repository is local (localhost:444) and it’s a NuGet Gallery website.
The package is 76Mb in size.
If I use a browser (Chrome) either from that server (localhost) or other PCs in the network, I can download the package by going to the api query and it takes literally 1 sec.

I’ll be testing more combinations but it would be a great help if you could come up with possible causes for this, and ideally possible fixes.

Relevant fragment from the deployment log illustrating the issue:

Downloading packages from NuGet servers
2014-01-06 16:03:58 INFO Making a list of packages to download
2014-01-06 16:24:04 INFO All packages have been downloaded
Download package Xyz 6.0.0.4057 from NuGet feed: OctopusNuGetRepo
2014-01-06 16:03:58 DEBUG Checking package cache for package Xyz 6.0.0.4057
2014-01-06 16:03:58 INFO Downloading NuGet package Xyz 6.0.0.4057 from feed: https://octopus:444/api/v2/
2014-01-06 16:03:58 DEBUG Downloaded packages will be stored in: E:\Octopus\Data\PackageCache\feeds-33
2014-01-06 16:03:58 DEBUG Finding package (attempt 1 of 5)
2014-01-06 16:13:20 DEBUG Found package Trayport.ClearingHub version 6.0.0.4057
2014-01-06 16:13:20 DEBUG Downloading to: E:\Octopus\Data\PackageCache\feeds-33\Xyz.6.0.0.4057_6BCE9900ECFC714696307CC423081B16.nupkg
2014-01-06 16:13:46 DEBUG SHA1 hash of package is: abfc8f0ce787b74a31715ed9eced77145c0d39be
2014-01-06 16:13:46 DEBUG Determining machines that need the package
Upload to machine http://OurIntegrationEnvironment:10933/
2014-01-06 16:13:46 DEBUG Uploading package Xyz 6.0.0.4057 to tentacle http://OurIntegrationEnvironment:10933/
2014-01-06 16:24:04 INFO Package Xyz 6.0.0.4057 successfully uploaded to tentacle http://OurIntegrationEnvironment:10933/

Nicholas_Blumhardt · 6 January 2014 23:09

Thanks for the additional detail. It is strange that the “finding package” step is slow, and so is the “upload” to the target machine - these are using different technologies, and one is local while the other sounds like it is crossing the network. A bit of a long shot, but do you have any antivirus software running on the Octopus Server? If so, testing this out with AV disabled may rule it out.

Did this problem steadily get worse over time, or has it just appeared suddenly?

Many thanks!
Nick

Nicholas_Blumhardt · 8 January 2014 00:20

To add to the questions above; while the deployment is taking place, can you please check the CPU, RAM network and disc usage of all processes on the machine to see if there’s any activity?

Knowing which of these is active (is there compute or I/O going on and in which processes) will help us narrow this down.

Regards,
Nick

rodolfograve · 8 January 2014 11:44

Hi.

No, we don’t have any antivirus. Worse thing is the problem is not consistent. Some times it happens and some times everything is really fast as we expect it to be.

Without being able to query the releases is very hard to tell when has it happened in the past, but it’s now become much more annoying in the last couple of weeks.

As for what happens in the server during one of these affected deployments:

During the “Finding package step” network usage went up. It stayed up through the whole 3 mins it took it to complete the step.

Memory and CPU were pretty much flat.

Does it give you any clues? We’re now in the process of discarding this is a problem with our NuGet gallery. In any case, the 2nd step “Uploading package to tentacle” cannot be affected by any NuGet issues.

Can you please give us a quick explanation of what goes on behind the scenes in these 2 steps. My guess is:

1- Octopus uses the NuGet api to get the package, by simply accessing an URL like this: https://nuget/api/v2/package/PACKAGE_NAME/PACKAGE_VERSION

2- After downloading the package using the previous method, how is it trying to upload the package to the tentacle?

This information will greatly help us track down what’s going on.

Cheers

Nicholas_Blumhardt · 9 January 2014 01:06

Hi Rodolfo,

Thanks for the extra info. I might try looping Paul in on this (he’s on vacation currently) as he has had more experience with the various NuGet servers and access quirks.

For 1), we use the NuGet.Core built-in FindPackage() method:

public static IPackage FindPackage(this IPackageRepository repository, string packageId, SemanticVersion version, bool allowPrereleaseVersions, bool allowUnlisted)

I believe this is direct to the API as you suggest.

Is the SQL Server used by NuGet gallery on the same machine, and is it shared with any other applications? It might be an interesting angle to consider.

Regarding 2), we chunk the file into 64k pieces and send it via WCF; it generally performs pretty well so I don’t think the chunking/send on a local TCP connection is likely to be the root cause either.

I’ll ask around for more info.

Regards,
Nick

rodolfograve · 9 January 2014 13:58

Hi.

I’ll wait for more information from you about how things work internally. In the meantime we found there are several WARNING entries like this one in the Windows Event Log, from Octopus:

2014-01-09 13:46:44,685 [8] WARN Octopus [(null)] - Outstanding Health tasks were not completed before the next task was due was scheduled. If this error persists, please post details on http://help.octopusdeploy.com.

Any chance having too many environments (we have a lot, some of them unavailable) is causing Octopus to be somehow blocked?

Nicholas_Blumhardt · 9 January 2014 23:53

Interesting - there’s definitely a possibility of that - no specific code path I’m aware of, but a plausible cause to investigate. Octopus 1.6 uses the .NET thread pool extensively - which may be exhausted if say, 30 machines are blocking on health check requests.

Is the number of machines in disabled environments in this order of magnitude? Are you able to disable the machines via the UI and re-test?

Next time a slowdown happens it would be great if you can check whether a Health Check is running, or recently finished.

Cheers,
Nick

rodolfograve · 14 January 2014 10:49

Hi.

Sorry for the delay. I’ve tried to delete as many tentacles as we could but the issue is still there. Is there a way to tell Octopus not to check health unless we manually ask for it?

I’m pretty much convinced this is the issue, because we’re seeing other weird behaviour related to health checks, like reporting out-of-date tentacles when they aren’t, and slow processing of health checks.

Nicholas_Blumhardt · 15 January 2014 06:06

Hi Rodolfo,

This still doesn’t sound quite right - if the disconnected Tentacles are disabled or deleted, they shouldn’t interfere with health checks this way.

It is possible there’s something else going on - it is another long shot, but RavenDB/ESENT fragmentation or corruption might be worth checking.

Resetting RavenDB indexes is probably the first thing to try. Make sure you take a backup of the Octopus database first. The instructions for an index reset are at: http://octopusdeploy.com/blog/resetting-raven-indexes – it can take a little while for the server to warm back up again after doing this.

If resetting the indexes doesn’t help, or if any errors are reported in that process, repairing and/or defragmenting ESENT may help, but let’s try indexes first.

Another avenue to explore, if you have the option, is to install Octopus Server on a new machine and restore a backup into it, switching off the old machine in the process. This could rule out any hardware/OS/network issues impacting the first machine.

Incidentally, are you in a position to upgrade to Octopus 2.0? We’ve done pretty extensive work in all these areas in the new version - if it is feasible to migrate we’re here to help with that process.

Best regards,
Nick

Rodolfo_Grave · 15 January 2014 09:50

Hi.

I’ve just rest the indexes and unfortunately there’s no improvement. The “Finding package” step is still taking several minutes.

About Octopus 2.0: we’re anxiously waiting for it to be released in order to upgrade. Unfortunately we cannot take it while it’s still in RC. Is there any expected date for the final release? If it’s going to be soon then we might as well drop this investigation and just live with the issue for a week.

I made a quick search about repairing/defragmenting ESENT but couldn’t find anything. Can you please send me the instructions?

Cheers,
Rodolfo

Nicholas_Blumhardt · 16 January 2014 01:07

2.0 is very close, we’ve been using end-of-January as our working date for an RTW, but the exact date is dependent on when we get an RC that we’re sufficiently happy with. There are still a few small issues being worked on (you can track our progress at https://github.com/octopusdeploy/issues/issues).

ESENTUTL is included in Windows, there are some details on its use here: http://technet.microsoft.com/en-us/library/hh875546.aspx

Since you’re able to upgrade in the near future though I’d consider holding off on this, and perhaps install the latest RC on a test server instead.

We can help to work through any migration issues and get you ready for 2.0 migration as soon as it is available.

rodolfograve · 10 February 2014 12:56

Hi. Just for future reference, we upgraded to v2 and this issue dissapeared.

Cheers,
Rodolfo