Concurrent Octo.exe deploy-release calls misbehaving

Octopus Server version 3.1.2.0
Tentacle version 3.1.6.0
Octo.exe version 3.0.18.71

I have 4 environments with n machines in each. Each machine is an EC2 instance that runs a bootstrap script on instantiation, registering itself with the Octopus server and using Octo.exe deploy-release to trigger an initial deployment. The bootstrap script makes a call to api/dashboard to get the last release that was deployed to that particular environment, and specifies itself as the single machine to target so as not to affect any other machines currently running in the environment.

./octo.exe deploy-release --server $server --apiKey $api --project $project --specificmachines $machinesToTarget --deployto $env --version $latestReleaseForEnv
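
For reference, the lookup that produces $latestReleaseForEnv works roughly like this (a simplified sketch of what the script does, reusing the same $server/$api/$project/$env values as the call above, with no error handling):

$headers   = @{ "X-Octopus-ApiKey" = $api }
$dashboard = Invoke-RestMethod -Uri "$server/api/dashboard" -Headers $headers

# Resolve the project and environment IDs from the dashboard payload
$projectId = ($dashboard.Projects     | Where-Object { $_.Name -eq $project }).Id
$envId     = ($dashboard.Environments | Where-Object { $_.Name -eq $env }).Id

# The dashboard items hold the most recent deployment per project/environment pair
$latestReleaseForEnv = ($dashboard.Items |
    Where-Object { $_.ProjectId -eq $projectId -and $_.EnvironmentId -eq $envId } |
    Select-Object -First 1).ReleaseVersion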

I also have a tentacle running on the central Octopus server which runs scripts that check the health of the currently registered machines in the environment and delete any unhealthy ones. These two actions are the first two steps of the project's deployment process. This tentacle is also referenced in the --specificmachines value on the octo.exe call above.

This whole configuration is an effort to support immutable infrastructure, self-healing and auto-scaling with all the added orchestration control and niceties that Octopus Deploy offers out of the box.

When writing the bootstrap script and testing it on a single box, this works just fine. The script runs, the machines get health checked and the unhealthy ones deleted, and then the correct release gets deployed to the new machine. The load balancers then successfully ping the application and bring the new machine into the pool. “I win!” I thought… somewhat prematurely…

For the next round of testing I ran a single machine in each environment, so 4 machines across 4 separate environments. This is where I ran into trouble. When terminating all 4 instances at once and watching the self-healing mechanism bring up 4 new instances, I was met with mixed results. Sometimes 2 of the 4 instances would run their scripts successfully, other times 3 of 4. It was not consistently the same instances, and the order I terminated them in didn’t seem to make a difference. Each time an instance failed its bootstrapping I would RDP into the offending instance and trigger the bootstrap script manually. The instance would then bootstrap correctly and trigger the desired deployment.

The logs in the failing instances showed the following:

Handshaking with Octopus server: http://xxx.xxx.xxx.xxx:8081
Handshake successful. Octopus version: 3.1.2; API version: 3.0.0
Authenticated as: me <>
Finding project: MyProject
Finding release: 2015.12.553
Release '2015.12.553' of project 'MyProject' cannot be deployed to environment 'test' because the environment is not in the list of environments that this release can be deployed to. This may be because a) the environment does not exist, b) the name is misspelled, c) you don't have permission to deploy to this environment, or d) the environment is not in the list of environments defined by the project group.
Exit code: -1

This environment is definitely on the list of environments that the specified release can be deployed to… It works with exactly the same params when I trigger it manually.

I am assuming this has to do with the concurrent calls to the Octopus server not being queued correctly.

The tentacle running on the server also fails to health check and delete the instances consistently. The server logs at C:\Octopus\Logs show the timeouts happening three times when the health check fires correctly, but there are no errors to show the failed deletion of the machines. I get this popping up in the TaskLogs…

["ServerTasks-3051_V48S7ZB4Q3/3af6a9c2366242d9b9bea32960d3e7e2","INF","2015-12-01T21:44:08.9060500+00:00","This Tentacle is currently busy performing a task that cannot be run in conjunction with any other task. Please wait…","",0]

Is this being treated as a fail in the project’s deployment process?

Does Octopus support this sort of concurrent use of octo.exe? Does it have a queuing mechanism for this?

Hi,

Thanks for the questions and detailed information.

When you are deploying to the self-healing instances and different machines are failing/succeeding, are you deploying a release that had previously been deployed to all of the environments? My immediate thought was lifecycle restrictions. If your project lifecycle defines an environment order (e.g. Dev > Test > Staging > Prod) and your instances are coming online and deploying out of order, you would get that error message.
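
If you want to rule that out, you can dump the phase order the project's lifecycle enforces straight from the API. A rough sketch (reusing the $server/$api/$project values from your octo.exe call; the phases list environment IDs rather than names, and there is no error handling):

$headers   = @{ "X-Octopus-ApiKey" = $api }
$proj      = (Invoke-RestMethod -Uri "$server/api/projects/all" -Headers $headers) |
             Where-Object { $_.Name -eq $project }
$lifecycle = Invoke-RestMethod -Uri "$server/api/lifecycles/$($proj.LifecycleId)" -Headers $headers

# Print each phase and the environments it targets, in promotion order
$lifecycle.Phases | ForEach-Object {
    $targets = @($_.AutomaticDeploymentTargets) + @($_.OptionalDeploymentTargets)
    "{0}: {1}" -f $_.Name, ($targets -join ", ")
}

If the environment the new instance is deploying to sits in a later phase than one that has not been satisfied yet, you would see exactly the error from your log.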

Re: the Tentacle running on your server, the message in your task logs is displayed because the Tentacle semaphore is locked by another task. Each Tentacle can run one task at a time; it will wait for the other task to finish and then resume. That should not fail the deployment process.

I am a little confused about the second problem. Are the health checks incorrectly failing and the Tentacles then being deleted, or is the delete not working? And are you getting the busy message on your server Tentacle or your autoscale Tentacles?

Cheers,
Shane

Hi Shane,

Thanks for getting back to me. Initially I overlooked the case where the release referenced in the octo.exe call is not yet valid for the environment due to the lifecycle and prerequisite deployments. I’m hitting the dashboard API to get the most recently deployed release for the specific environment to avoid that case. I was familiar with the error, so I was surprised to see it when I had logs showing that the release referenced in the octo.exe call was definitely valid and had been deployed to that environment before. The bootstrap script is designed to deploy the most recent release that has previously been deployed to the environment - its job is to maintain the existing state during scaling and healing, not to decide which releases we want deployed.

The server tentacle is the one that runs the health check and delete scripts. It seems to successfully health check but then fail to delete the unhealthy instances when a few of these bootstrap scripts are running concurrently. The server tentacle is the one that logs the busy message.

I have, as a temporary solution, put random sleeps in the script to avoid these concurrency issues, but as you can imagine that adds seconds, even minutes, to our pipeline and is unsatisfactory.

I’m keen to help out in getting to the bottom of this so if there is any further information or examples that could help you out just ask.

Cheers,

Hal

Hi Hal,

Yes, adding sleep in scripts is less than ideal.

Would you be able to provide your bootstrap script and a screenshot or export of your deployment process so I can accurately reproduce on my end and figure out why the concurrency issues are popping up?

Cheers,
Shane

Hi Shane,

Sorry for the belated reply. It’s been rather hectic and somehow the notification of your reply slipped through the net.

I have attached the bootstrapping script. Beware, it is very much a POC, and I am a JavaScript guy, so there are more than a few bits that would make real PowerShell people wince, I’m sure. Hopefully it helps you get to the bottom of this though.

Fair warning - there is some AWS-specific code in there that does a hacky restart of the EC2Config service to pick up the EB environment variables. This is due to a connectivity issue we were having with our scheduled tasks on startup. Please ignore it, although I felt I shouldn’t omit it, for completeness.

Thanks for looking into this,

Hal

regTentacle.ps1 (6 KB)

Hi Hal,

I have an environment set up to provision a bunch of Tentacles and deploy the latest release for a project to them using the same method as the script you have provided. It works great.

I suspect the issues you are running in to are related to the Tentacle running on the Server. Would it be possible to send through the health check / delete script that is running on the server?

Is the server script firing off a health check and then waiting for the result? The deployment might be blocking the health check. You could try adding a variable to your projects called “OctopusBypassDeploymentMutex” and setting the value to “true”. This will allow multiple tasks to access a Tentacle at the same time.
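
If you would rather set that variable with a script than through the Variables tab in the portal, something along these lines should do it (a rough sketch only, again reusing $server/$api/$project and skipping error handling):

$headers = @{ "X-Octopus-ApiKey" = $api }
$proj    = (Invoke-RestMethod -Uri "$server/api/projects/all" -Headers $headers) |
           Where-Object { $_.Name -eq $project }

# Fetch the project's variable set, append the bypass variable, and save it back
$variables = Invoke-RestMethod -Uri "$server/api/variables/$($proj.VariableSetId)" -Headers $headers
$variables.Variables += @{ Name = "OctopusBypassDeploymentMutex"; Value = "true"; Scope = @{} }
Invoke-RestMethod -Uri "$server/api/variables/$($variables.Id)" -Headers $headers -Method Put `
    -Body ($variables | ConvertTo-Json -Depth 10) -ContentType "application/json"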

Cheers,
Shane

Hi Shane,

I had been switched onto the front-end team until yesterday, hence my absence… I’m back working on the DevOps end of things now, so I’ll be looking into this soon. I am currently trying to get the 3.3beta001 release working for ASP.NET zips, so the built-in script running in that new release may prove to be the fix… Will be in touch.

Cheers,

Hal

Here is the health check script…

octo-health-checker.ps1 (570 Bytes)

Here is the unhealthy machine removal script that runs after it…

octo-remove-nuked-machines.ps1 (898 Bytes)

Hi Hal,

Good to hear you are trying out the 3.3 beta. There is a new step type that allows running a script server-side which I think will help your cause.

I am trying out your scripts. I can provision and deploy Tentacles, kick off a health check, delete offline Tentacles and deploy the latest release of a project to those Tentacles. When it runs I am seeing a bunch of health checks kick off and then the deployments get queued to each machine. Only one deployment happens at a time.

From your description of the problem it sounds like all of your deployments are happening at once?

Regarding machines not being deleted, the health check is asynchronous and the health check script does not wait for it to finish before proceeding to the delete step. Could this be why machines are not being deleted?
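
For the delete to be reliable, the removal script would need to block on the health check task before it looks at machine statuses. A sketch of that pattern (not your attached scripts; it assumes the environment ID is already in $envId and only looks at the first page of machines):

$headers = @{ "X-Octopus-ApiKey" = $api }

# Queue a health check task for the environment and capture it
$body = @{ Name = "Health"; Description = "Health check before cleanup"; Arguments = @{ EnvironmentId = $envId } }
$task = Invoke-RestMethod -Uri "$server/api/tasks" -Headers $headers -Method Post `
        -Body ($body | ConvertTo-Json -Depth 5) -ContentType "application/json"

# Block until the health check has actually finished
do {
    Start-Sleep -Seconds 5
    $task = Invoke-RestMethod -Uri "$server/api/tasks/$($task.Id)" -Headers $headers
} until ($task.IsCompleted)

# Only then remove the machines the health check marked as offline
$machines = Invoke-RestMethod -Uri "$server/api/environments/$envId/machines" -Headers $headers
$machines.Items | Where-Object { $_.Status -eq "Offline" } | ForEach-Object {
    Invoke-RestMethod -Uri "$server/api/machines/$($_.Id)" -Headers $headers -Method Delete
}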

I really like your scripts by the way.

Cheers,
Shane

Ah yes,

The health check being async and not blocking for its results may be the cause of a few of my woes. Thanks for spotting that.

I have been using the new 3.3 ‘run script on server’ functionality and I will run a stability test shortly with multiple concurrent deployments.

I’ll let you know how I get on.

Cheers,
Hal