AWS auto-scale best practices

Hi all,

I’ve recently started using Octopus Deploy in an AWS EC2 environment and have some questions for the community on approaches to certain issues. To contribute, I’ve also listed the approach I’m currently using (or plan to use) for each issue, although I make no promises that they are the best approaches. If you have a better approach you would like to share, or advice regarding one of mine, I would very much appreciate hearing your feedback.

Automatic Tentacle Registration on Auto Scale:
I basically used the script at http://octopusdeploy.com/blog/auto-provision-ec2-instances-with-tentacle-installed. The script is passed to the newly launched instance via the “user data” specified in the auto-scale group’s launch config. I made some minor changes so that the name used in the registration is the EC2 instance ID rather than the hostname, and I added an automatic download/install of the octopustools.zip file. I also added some functionality so that the environment/role/project associated with the instance is pulled from AWS instance tags; that way the same script can be used for all systems, with only the instance tags changing.
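For reference, the registration portion of my user-data script looks roughly like the sketch below. It assumes the AWS Tools for PowerShell and Tentacle are already installed, and the server URL, API key, region and tag names (OctopusEnvironment/OctopusRole/OctopusProject) are placeholders for my actual values.

    $instanceId = Invoke-RestMethod "http://169.254.169.254/latest/meta-data/instance-id"

    # Pull the environment/role/project from the instance tags so one script works for all systems.
    $tags = Get-EC2Tag -Region "us-east-1" -Filter @{ Name = "resource-id"; Values = $instanceId }
    $environment = ($tags | Where-Object { $_.Key -eq "OctopusEnvironment" }).Value
    $role        = ($tags | Where-Object { $_.Key -eq "OctopusRole" }).Value
    $project     = ($tags | Where-Object { $_.Key -eq "OctopusProject" }).Value

    # Register the Tentacle under the EC2 instance ID rather than the hostname.
    & "C:\Program Files\Octopus Deploy\Tentacle\Tentacle.exe" register-with `
        --instance "Tentacle" `
        --server "https://octopus.example.com" `
        --apiKey "API-XXXXXXXXXXXXXXXX" `
        --name $instanceId `
        --environment $environment `
        --role $role `
        --force `
        --console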

Automatic Tentacle Deregistration on Scale-down/Termination:
Most of the approaches I saw suggested a clean-up PowerShell script that runs prior to a deployment and removes any orphaned Tentacles. I don’t think this is a good approach, as a temporary period of network downtime between Octopus and a Tentacle could cause Octopus to remove the Tentacle even though the associated instance is still in AWS and receiving traffic. This would cause code on the affected instance to become stale. My thought was to leverage the notification functionality built into auto-scale groups. When a termination or scale-down event occurs, the notification is recorded in an SQS queue. A PowerShell script running periodically from Task Scheduler on the Octopus server then reviews this queue, deregisters the affected instances, and removes the entries from the queue. My concern with this approach is that there may still be edge cases where AWS and Octopus don’t agree on what the environment looks like.
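For what it’s worth, the scheduled clean-up script I have in mind looks something like the sketch below. It assumes the auto-scale group notifies an SNS topic that is subscribed to the SQS queue; the queue URL, Octopus URL/API key and API routes are placeholders (the exact routes may differ by Octopus version).

    $queueUrl   = "https://sqs.us-east-1.amazonaws.com/123456789012/asg-terminations"
    $octopusUrl = "https://octopus.example.com"
    $headers    = @{ "X-Octopus-ApiKey" = "API-XXXXXXXXXXXXXXXX" }

    foreach ($msg in (Receive-SQSMessage -QueueUrl $queueUrl -MessageCount 10)) {
        # The SQS body is an SNS envelope; the inner Message is the auto-scale
        # notification, which includes the EC2InstanceId being terminated.
        $notification = (ConvertFrom-Json $msg.Body).Message | ConvertFrom-Json
        $instanceId   = $notification.EC2InstanceId

        if ($instanceId) {
            # Tentacles were registered using the instance ID as their name,
            # so look the machine up by name and delete it via the Octopus API.
            $machines = Invoke-RestMethod "$octopusUrl/api/machines/all" -Headers $headers
            $machine  = $machines | Where-Object { $_.Name -eq $instanceId }
            if ($machine) {
                Invoke-RestMethod "$octopusUrl/api/machines/$($machine.Id)" -Method Delete -Headers $headers
            }
        }

        # Only drop the queue entry once the notification has been handled.
        Remove-SQSMessage -QueueUrl $queueUrl -ReceiptHandle $msg.ReceiptHandle -Force
    }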

Automatic Deploy of Software on Auto Registered Tentacles:
The same “user data” PowerShell script that performs the Tentacle registration also initiates a deploy of the latest software. This idea was mostly taken from the script posted at http://www.codeproject.com/Articles/719801/AWS-Deployment-With-Octopus-Deploy. One of my concerns with this approach is that a rapid scale-up event could be slow, as the deployments happen serially, one instance at a time. An alternate method might be to remove the instance-specific deploy and simply do a full deploy to any Tentacles that don’t have the latest code. The issue with that is it could be more fragile, as a single unreachable Tentacle causes the full deployment to fail. Does anyone have an approach that addresses both of these?
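The deploy step at the end of my user-data script is roughly the following. Again, just a sketch: $environment, $project and $instanceId come from the registration step above, and the Octo.exe path, server URL and API key are placeholders.

    # Deploy the latest release of the project, limited to the machine that was just registered.
    & "C:\Tools\OctopusTools\Octo.exe" deploy-release `
        --server "https://octopus.example.com" `
        --apiKey "API-XXXXXXXXXXXXXXXX" `
        --project $project `
        --deployto $environment `
        --version latest `
        --specificmachines $instanceId `
        --waitfordeployment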

Blue/Green Testing:
I’m interested to hear how people are doing this in large AWS environments with Octopus Deploy. Initially I was thinking I could have two stacks and switch between them with DNS, but this can be very expensive if the environment is large and you operate in 2N mode for very long. One approach I was bouncing around was the following:

  • One auto-scale group for Green and one for Blue, each with its own load balancer
  • By default the non-active group is configured with min/max instances of 1
  • When a production deploy is initiated from Octopus, a PowerShell script runs that does the following (sketched after this list):
    - changes the min for the group to the active group’s current number of servers plus some fudge factor
    - sets the max to match the active group’s max
    - waits for all instances to launch
    - performs the DNS switch
    - waits for manual review
    - changes the min/max of the newly non-active group to 1
    - waits for instances to terminate/deregister

One thing that isn’t clear to me is how I would go about initiating the PowerShell script above only during a full-site deployment versus the single-host deploy that happens during Tentacle registration. It also seems like I will need a blue/green tag associated with the instances and a deployment specific to that role (i.e. webserver-green, webserver-blue). This all seems very messy; does anyone have thoughts on how to bring this piece together in a modern development environment?
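To make the above more concrete, the promotion script I have in mind would look something like this sketch. The group names and fudge factor are made up, it assumes the AWS Tools for PowerShell, and the DNS switch and manual review are left as placeholders.

    $activeGroup   = "web-blue"
    $incomingGroup = "web-green"
    $fudge         = 1

    # Scale the incoming group up to match the active group plus a fudge factor.
    $active = Get-ASAutoScalingGroup -AutoScalingGroupName $activeGroup
    $target = $active.Instances.Count + $fudge
    Update-ASAutoScalingGroup -AutoScalingGroupName $incomingGroup `
        -MinSize $target -DesiredCapacity $target -MaxSize $active.MaxSize

    # Wait until every instance in the incoming group reports InService.
    do {
        Start-Sleep -Seconds 30
        $incoming  = Get-ASAutoScalingGroup -AutoScalingGroupName $incomingGroup
        $inService = @($incoming.Instances | Where-Object { [string]$_.LifecycleState -eq "InService" }).Count
    } until ($inService -ge $target)

    # <perform the DNS switch here, then wait for manual review>

    # Shrink the now non-active group back down to a single instance.
    Update-ASAutoScalingGroup -AutoScalingGroupName $activeGroup `
        -MinSize 1 -DesiredCapacity 1 -MaxSize 1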

Hi!

These are really good questions, and to be honest I don’t think Octopus currently has all the answers here.

As far as the deregistration goes, what should the source of truth be? If the SQS notification were used as a trigger to query the current EC2 instances and get a view of what the “real” state of the environment should be, would that help?
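Something along these lines is what I was picturing; just a sketch, with the Octopus URL/API key as placeholders and the API routes from memory (they may differ between versions):

    $octopusUrl = "https://octopus.example.com"
    $headers    = @{ "X-Octopus-ApiKey" = "API-XXXXXXXXXXXXXXXX" }

    # Treat the running EC2 instances as the source of truth...
    $running = (Get-EC2Instance -Filter @{ Name = "instance-state-name"; Values = "running" }).Instances.InstanceId

    # ...and remove any Octopus machine whose name (the instance ID) is no longer running.
    $machines = Invoke-RestMethod "$octopusUrl/api/machines/all" -Headers $headers
    foreach ($machine in ($machines | Where-Object { $running -notcontains $_.Name })) {
        Invoke-RestMethod "$octopusUrl/api/machines/$($machine.Id)" -Method Delete -Headers $headers
    }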

The Blue/Green deployment model is usually approached by having different environments (i.e., a Blue environment vs. a Green environment) rather than per-instance tags. Running scripts on a full-site deploy vs. a partial deploy could be handled by having the script run only on a specific machine, and not including that machine in the deployment (via the “deploy to specific machines” feature).

Those are just some thoughts. I’d love to hear more about what we should be doing to handle this better!

Paul

I’ve been doing a bit more thinking about this.

Blue/green deployment with AWS is a big topic in itself (regardless of how you use Octopus). For a single server you can use an EIP, but for multiple servers the only solutions seem to be removing/adding machines from an ELB, or making DNS changes, neither of which gives you much control over exactly when the switch happens.

Elastic Beanstalk can also perform the CNAME swap for you:

http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html

There’s an article on Octopus with Elastic Beanstalk here that might help:

http://www.codeproject.com/Articles/719801/AWS-Deployment-With-Octopus-Deploy

Paul

There’s also a discussion here about blue/green deployments with AWS:

The main issue is that there doesn’t seem to be a way to do a blue/green swap that is both immediate and without downtime; ELBs and CNAMEs both introduce delays.

Paul