Unhealthy dead instances preventing deploy

Mike_Carlisle · 27 December 2013 12:32

Hi,

We are using Octopus Deploy 2.0 to deploy to AWS EC2 instances using listening mode. Each instance registers on launch and invokes a push of the latest release to itself using Octo.exe. There is an issue that if the environment contains references to terminated (unhealthy) EC2 instances, the deployment just hangs on the first step of downloading packages, as it can’t find the instances.

We have a script to clear out orphaned instances, but it doesn’t get that far. Is there a way around this? Shouldn’t unhealthy instances be ignored by default, or just queue up tasks without blocking?

The only workaround I can think of is to run the clear orphans script from the bootstrapper EC2 instance itself.
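
For anyone following along, the core of a clear-orphans step can be sketched roughly as below. This is only an illustration in Python: the `registered` mapping (machine name to EC2 instance ID) and the `find_orphans` helper are hypothetical, and fetching the registered machines from Octopus or the running instances from AWS is deliberately left out.

```python
def find_orphans(registered, running_instance_ids):
    """Return the names of registered machines whose EC2 instance is gone.

    `registered` maps a machine name to the EC2 instance ID it was launched
    as. This bookkeeping is an assumption; the thread does not show how the
    real script tracks it.
    """
    running = set(running_instance_ids)
    return sorted(
        name
        for name, instance_id in registered.items()
        if instance_id not in running
    )


# Hypothetical data: three registered machines, one of which was terminated.
registered = {
    "web-1": "i-aaa111",
    "web-2": "i-bbb222",
    "web-3": "i-ccc333",
}
running_now = ["i-aaa111", "i-ccc333"]

orphans = find_orphans(registered, running_now)
print(orphans)  # ['web-2']
```

The machines returned here are the ones that would need removing from the environment before the new instance asks for a deployment.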

Thanks,
Mike

Nicholas_Blumhardt · 30 December 2013 01:47

Thanks for getting in touch. I think your workaround is probably the only option right now, but I’ve added a ticket to try rectifying this by allowing individual machines to be the targets of a deployment: https://github.com/OctopusDeploy/Issues/issues/414

Mike_Carlisle · 30 December 2013 02:03

Thanks, being able to target one machine could also be useful, but we would want to avoid creating ad-hoc releases to do this that would break our versioning. It would work if we could push an existing release to a specific machine/role.

Another way to solve would be to allow users to run a script on the Octopus server itself prior to a deployment. This way we could do any prep - like removing dead machines.

This has really come about because Tentacles failing their health check grind everything to a halt. Shouldn’t the health check serve a purpose, like taking unresponsive Tentacles out of service?

Nicholas_Blumhardt · 30 December 2013 02:51

Deploying to a selection of machines would create a new deployment for an existing release - should be pretty close to what you do currently.

Skipping unresponsive tentacles is an interesting idea; it has more implications that we’d need to consider though, so it isn’t as likely to happen in the short term. Definitely a logical thing to investigate though.

Mike_Carlisle · 31 December 2013 01:13

Hi Nicholas,

I’ve come across another related problem. In our setup, when new EC2 instances come online they invoke a push of a release using Octo.exe so they get any relevant updates.

In version 1.6 I’m pretty sure machines with the same version already installed would just skip the release (unless the force flag is used). In 2.0 I’m seeing all machines updating, even though they already have the same version installed and without specifying the force flag. Is this a bug?

Thanks,
Mike

Nicholas_Blumhardt · 31 December 2013 02:41

Hi Mike,

This was a deliberate change because the inverse behaviour generated a lot of surprises for people, too.

We’re realising now that this change is problematic in your situation; I think this bumps up the priority of #414 linked above.

Regards,
Nick

Mike_Carlisle · 31 December 2013 10:12

Hi,

Changing the behaviour seems a really bad idea, and completely breaks deployment into cloud-based environments. There is now no way to update a single instance without redeploying everything to all machines. One of our packages wipes and restores our database, so this is a show stopper for us.

Why would anyone want to redeploy the same versioned package to a machine that already has the package on it? If so, the force flag enabled this and was quite intuitive. Now we would need some kind of skipalreadyinstalled flag, but that’s pretty dirty.

I don’t really want to go back to 1.6; is there any chance of an urgent fix for this? I really need a way to ignore a step if the package version is already installed.
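
To make the request concrete, the skip-unless-forced rule described above could be sketched like this in Python. It is an illustration of the behaviour as recalled from 1.6, not Octopus code; the `should_install` helper and the `installed` inventory are hypothetical.

```python
def should_install(installed_version, release_version, force=False):
    """Decide whether a package step should run on one machine."""
    if force:
        return True  # an explicit force always redeploys
    # Otherwise only install when the machine does not already have it.
    return installed_version != release_version


# Hypothetical inventory: machine name -> version currently installed
# (None for a freshly launched instance with nothing installed yet).
installed = {"web-1": "1.4.2", "web-2": "1.4.2", "web-new": None}
release = "1.4.2"

targets = [
    name
    for name, version in installed.items()
    if should_install(version, release)
]
print(targets)  # ['web-new']
```

With this rule only the new instance receives the release, and a step such as the database wipe and restore would not run again on machines that are already up to date.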

Thanks,
Mike

Nicholas_Blumhardt · 31 December 2013 21:33

Hi Mike, thanks for the follow-up. The “unhealthy instance” problem also makes the redeployment approach to scaling cloud deployments problematic, so hopefully we can get to a solution that works better all-round in 2.0.

Here’s how work-in-progress on #414 looks:

octo.exe deploy-release (release details) --specificMachines=machines-123,machines-124

This can be used to update a single instance without redeploying everywhere, and also gets around the problem of unhealthy machines preventing the deployment moving forward.

We can get a quicker turnaround on this than changing redeployment behaviour (code is written) and I can get a build to you shortly if you want to try it out. I’m not keen to push an unwieldy solution though, so if this doesn’t meet your needs or isn’t practical please let me know.
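
A bootstrapping script could assemble that command along these lines. This is a minimal Python sketch: `build_deploy_command` is a hypothetical helper, the `--specificMachines` flag is taken from the work-in-progress form shown above and may change, and the release details are left for the caller to supply since they are not spelled out here.

```python
def build_deploy_command(release_args, machine_ids):
    """Build an octo.exe argument list that targets specific machines.

    `release_args` holds whatever arguments identify the release; they are
    passed through untouched because the thread does not list them.
    """
    if not machine_ids:
        raise ValueError("at least one machine ID is required")
    return [
        "octo.exe",
        "deploy-release",
        *release_args,
        "--specificMachines=" + ",".join(machine_ids),
    ]


# Placeholder: fill in the release details for your own project.
release_args = []

command = build_deploy_command(release_args, ["machines-123", "machines-124"])
print(" ".join(command))
```

Keeping the command as a list rather than one string avoids quoting problems if it is later handed to something like `subprocess.run`.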

Mike_Carlisle · 31 December 2013 21:47

Hi Nick,

Thanks, that looks promising. I’m still concerned that redeploying the same release could have undesired side effects that weren’t there in 1.6 (without force), but it’s moving forward and allows us to support self-healing/auto scaling AWS instances.

It would be great to get a build as soon as you have one available. I know it’s bad timing with New Year on the horizon so don’t overdo it! We can wait a couple of days.

Thanks,
Mike

Nicholas_Blumhardt · 31 December 2013 21:53

Sounds good - Happy New Year, we’ve beaten you to it down here

Mike_Carlisle · 1 January 2014 13:59

Happy new year to you too! Forgot you guys are ahead!

Mike_Carlisle · 2 January 2014 17:39

Hi Nick,

I just tried out the fix for #414. I ran into a problem running the following command:

tentacle.exe register-with --name="xxx" --instance="xxx" --server="https://octopusserver/2.0" --apiKey="xxx" --environment="xxx" --publicHostname="xxxx" --comms-style TentaclePassive --console --role="role1" --role="role2"

Resulted in:

2014-01-02 17:21:42,935 INFO 14 DeploymentLog - System.NullReferenceException: Object reference not set to an instance of an object.
at Octopus.Tentacle.Commands.RegisterMachineCommand.Start() in c:\TeamCity\buildAgent\work\1116bd9da9e239fd\source\Octopus.Tentacle\Commands\RegisterMachineCommand.cs:
2014-01-02 17:21:42,935 INFO 14 DeploymentLog - line 93
at Octopus.Shared.Startup.ConsoleHost.Run(Action`1 start, Action shutdown) in c:\TeamCity\buildAgent\work\1116bd9da9e239fd\source\Octopus.Shared\Startup\ConsoleHost.cs:line 34

Any chance of a fix or ideas what I might be doing wrong? This is the same code that was working with the previous version.

Thanks,
Mike

Mike_Carlisle · 2 January 2014 18:14

Ah, this seems to relate to #434. Apparently there’s a new command that now needs to be run:

Tentacle.exe new-certificate

Mike_Carlisle · 2 January 2014 20:01

OK added that but can’t get anything to work anymore. Error is:

2014-01-02 19:50:24,351 INFO 10 DeploymentLog - Octopus Deploy: Tentacle version 2.0.8.977

2014-01-02 19:50:24,351 INFO 10 DeploymentLog -

2014-01-02 19:50:27,580 INFO 10 DeploymentLog - Adding 1 trusted Octopus servers
2014-01-02 19:50:27,580 INFO 6 DeploymentLog -

2014-01-02 19:50:28,500 INFO 6 DeploymentLog - Cannot start service from the command line or a debugger. A Windows Service must first be installed (using installutil.exe) and then started with the ServerExplorer, Windows Services Administrative tool or the NET START command.

2014-01-02 19:50:28,610 INFO 6 DeploymentLog - Octopus Deploy: Tentacle version 2.0.8.977

2014-01-02 19:50:28,610 INFO 6 DeploymentLog -

2014-01-02 19:50:34,116 INFO 6 DeploymentLog - -------------------------------------------------------------------------------

2014-01-02 19:50:34,116 INFO 10 DeploymentLog - A fatal exception occurred

2014-01-02 19:50:34,163 INFO 10 DeploymentLog - System.NullReferenceException: Object reference not set to an instance of an object.
at Octopus.Tentacle.Commands.RegisterMachineCommand.Start() in c:\TeamCity\buildAgent\work\1116bd9da9e239fd\source\Octopus.Tentacle\Commands\RegisterMachineCommand.cs:
2014-01-02 19:50:34,163 INFO 10 DeploymentLog - line 93
at Octopus.Shared.Startup.ConsoleHost.Run(Action`1 start, Action shutdown) in c:\TeamCity\buildAgent\work\1116bd9da9e239fd\source\Octopus.Shared\Startup\ConsoleHost.cs:line 34

2014-01-02 19:50:34,397 INFO 10 DeploymentLog - Octopus Deploy: Tentacle version 2.0.8.977

Mike_Carlisle · 2 January 2014 20:28

I think I was just missing --console on the new-certificate command.

All good now
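
For reference, the bootstrap order that ended up working can be sketched as below. This Python snippet only builds the two command lines and does not run them. It assumes the certificate has to exist before `register-with` is called, which is how the errors above read; `bootstrap_commands` is a hypothetical helper and the `xxx` values are the same placeholders used earlier in the thread.

```python
def bootstrap_commands(register_args):
    """Return the Tentacle commands in the order they are assumed to run."""
    return [
        # Generate the certificate first, with --console as noted above.
        ["Tentacle.exe", "new-certificate", "--console"],
        # Then register the machine with the Octopus server.
        ["Tentacle.exe", "register-with", *register_args],
    ]


# Arguments copied from the register-with command earlier in the thread.
register_args = [
    "--name=xxx",
    "--instance=xxx",
    "--server=https://octopusserver/2.0",
    "--apiKey=xxx",
    "--environment=xxx",
    "--publicHostname=xxxx",
    "--comms-style",
    "TentaclePassive",
    "--console",
    "--role=role1",
    "--role=role2",
]

for command in bootstrap_commands(register_args):
    print(" ".join(command))
```

On the instance itself each list could be passed to `subprocess.run(command, check=True)` so that a failed certificate step stops the registration from being attempted.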

Nicholas_Blumhardt · 2 January 2014 22:15

Ah yes, sorry about that one. I went back to the “downloads” page and added a note about this last night; there are a few “gotchas” in there that we might be able to smooth out.