Ansible EC2 Auto Scaling Tutorial
We use Ansible to manage application deployments to EC2 with Auto Scaling. It is particularly well suited because it integrates easily with existing processes such as CI, enabling rapid development of a continuous deployment pipeline. One crucial feature is its ability to orchestrate a rolling (zero-downtime) deploy by terminating and replacing instances in batches. Because our EC2 deployments are automated, rollback capability is important; for this we maintain a short history of Amazon Machine Images (AMIs) and Launch Configurations associated with a particular Auto Scaling Group (ASG). If you need to roll back to a particular version of your application, you simply associate the ASG with the last known-good Launch Configuration and replace all of its instances.
Our normal workflow for auto scaling deployments starts with an Ansible playbook which runs through the deploy lifecycle. Each step along the way is represented by a role and applied in order, keeping the main playbook lean and configurable. Depending on our client's requirements, that playbook might be triggered in a number of ways such as the final step in a continuous integration build, or on demand via Hubot in a Slack/Flowdock/IRC chat.
In this post we'll walk through each stage of the build and deployment process, and use Ansible to perform all the work. The goal is to build our entire environment from scratch, save for a few manually created resources at the outset.
We'll be using EC2-Classic for these examples, although they can be trivially adapted for VPC. Start by creating an EC2 Security Group for your application, taking care to open the necessary application ports in addition to TCP/22 for SSH.
Add a new keypair for SSH access to your instances. You can either create a new private/public keypair or upload your existing SSH public key.
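Both of these prerequisites can also be created with Ansible if you'd rather avoid the console. Here's a minimal sketch using the ec2_group and ec2_key modules; the group name, open ports and public key path below are assumptions, so adjust them for your application:

```yaml
# prerequisites.yml -- hypothetical one-off playbook for the manual resources
- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - name: Create security group for the web application
      ec2_group:
        name: YOUR_SECURITY_GROUP
        description: Web application security group
        region: us-east-1
        rules:
          - proto: tcp
            from_port: 22
            to_port: 22
            cidr_ip: 0.0.0.0/0
          - proto: tcp
            from_port: 80
            to_port: 80
            cidr_ip: 0.0.0.0/0

    - name: Upload an existing SSH public key as the keypair
      ec2_key:
        name: YOUR_KEYPAIR
        region: us-east-1
        key_material: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
```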
You may optionally register and host a domain name with AWS Route 53. If you do, we'll point the domain at your application later on, so that you don't have to browse to it using an automatically assigned AWS hostname.
Ansible uses Boto for AWS interactions, so you'll need that installed on your control host. We're also going to make some use of the AWS CLI tools, so get those too. Your platform may differ, but the following will work for most platforms:
```
pip install boto awscli
```
We also assume Ansible 1.9.x; on Ubuntu you can get that from the Ansible PPA:
```
add-apt-repository ppa:ansible/ansible
apt-get update
apt-get install ansible
```
You should place your AWS access/secret keys into ~/.aws/credentials:

```
# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```
We'll be using the ec2.py dynamic inventory script for Ansible so we can address our EC2 instances by various attributes instead of hard coding hostnames into an inventory file. It's not included with the Ubuntu distribution(s) of Ansible, so we'll grab it from GitHub. Place ec2.py and ec2.ini into
/etc/ansible/inventory (creating that directory if it doesn't exist, and making sure ec2.py is executable). Then modify /etc/ansible/ansible.cfg to use that directory as the inventory source:

```
# /etc/ansible/ansible.cfg
[defaults]
inventory = /etc/ansible/inventory
```
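To see what this buys us: ec2.py exposes inventory groups derived from instance attributes (region, security group, tags and so on), so plays can target instances by attribute rather than by hostname. A purely illustrative example, assuming instances tagged Name=ami-build exist (as they will later in this tutorial):

```yaml
# illustrative only: target every instance whose Name tag is "ami-build"
- hosts: "tag_Name_ami-build"
  remote_user: ubuntu
  tasks:
    - name: Test connectivity to the tagged instances
      ping:
```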
A prerequisite to setting up an application for auto scaling involves building an AMI containing your working application, which will be used to launch new instances to meet demand. We'll start by launching a new instance onto which we can deploy our application. Create the following files:
```yaml
---
# group_vars/all.yml
region: us-east-1
zone: us-east-1a
keypair: YOUR_KEYPAIR
security_groups: YOUR_SECURITY_GROUP
instance_type: m3.medium
name: ami-build    # tag and host group used for the temporary build instance
asg_name: webapp   # used later when naming the CloudWatch alarms
min_size: 2        # Auto Scaling Group size limits, used later on
max_size: 10
volumes:
  - device_name: /dev/sda1
    device_type: gp2
    volume_size: 20
    delete_on_termination: true
```
```yaml
---
# deploy.yml
- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch
```
```yaml
---
# roles/launch/tasks/main.yml
- name: Search for the latest Ubuntu 14.04 AMI
  ec2_ami_find:
    region: "{{ region }}"
    name: "ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-*"
    owner: 099720109477
    sort: name
    sort_order: descending
    sort_end: 1
    no_result_action: fail
  register: ami_result

- name: Launch new instance
  ec2:
    region: "{{ region }}"
    keypair: "{{ keypair }}"
    zone: "{{ zone }}"
    group: "{{ security_groups }}"
    image: "{{ ami_result.results[0].ami_id }}"
    instance_type: "{{ instance_type }}"
    instance_tags:
      Name: "{{ name }}"
    volumes: "{{ volumes }}"
    wait: yes
  register: ec2

- name: Add new instances to host group
  add_host:
    name: "{{ item.public_dns_name }}"
    groups: "{{ name }}"
    ec2_id: "{{ item.id }}"
  with_items: ec2.instances

- name: Wait for instance to boot
  wait_for:
    host: "{{ item.public_dns_name }}"
    port: 22
    delay: 30
    timeout: 300
    state: started
  with_items: ec2.instances
```
The ec2_ami_find module is a new addition to Ansible 2.0 and has not been backported to 1.9, so we'll need to fetch this module from GitHub and place it into the library/ directory relative to deploy.yml.

Run the playbook with ansible-playbook deploy.yml -vv and a new instance will be launched. You'll see it in the AWS Web Console and you should be able to SSH to it.
Now we'll use Ansible to deploy our application and start it. We'll deploy a sample Node.js web application, the source code of which is kept in a public git repository. Ansible is going to clone and checkout our application at a desired revision on the target instance and configure it to start on boot, in addition to setting up a web server.
```yaml
---
# deploy.yml
- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx
```
```yaml
---
# roles/deploy/tasks/main.yml
- name: Install git
  apt:
    pkg: git
    state: present
  sudo: yes

- name: Create www directory
  file:
    path: /srv/www
    owner: ubuntu
    group: ubuntu
    state: directory
  sudo: yes

- name: Clone repository
  git:
    repo: "https://github.com/atplanet/hello-world-express-app.git"
    dest: /srv/www/webapp
    version: master

- name: Install upstart script
  copy:
    src: upstart.conf
    dest: /etc/init/webapp.conf
  sudo: yes

- name: Enable and start the application
  service:
    name: webapp
    enabled: yes
    state: restarted
  sudo: yes
```
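Note that the upstart job shown next execs node, so the Node.js runtime has to be present on the instance before the service is started. A minimal sketch of the extra tasks you could add to roles/deploy/tasks/main.yml for this, assuming the stock Ubuntu 14.04 archive packages (nodejs, nodejs-legacy for the /usr/bin/node symlink, and npm) are sufficient for your application:

```yaml
# hypothetical additional tasks for roles/deploy/tasks/main.yml
- name: Install Node.js runtime and npm
  apt:
    pkg: "{{ item }}"
    state: present
  with_items:
    - nodejs
    - nodejs-legacy   # provides /usr/bin/node on Ubuntu 14.04
    - npm
  sudo: yes

- name: Install application dependencies, if the app declares any
  npm:
    path: /srv/www/webapp
```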
```
# roles/deploy/files/upstart.conf
description "Sample Node.js app"
author "Tom Bamford"

start on (local-filesystems and net-device-up)
stop on runlevel [06]

env IP="127.0.0.1"
env NODE_ENV="production"

setuid ubuntu

respawn
exec node /srv/www/webapp/app.js
```
```yaml
---
# roles/nginx/tasks/main.yml
- name: Install Nginx
  apt:
    pkg: nginx
    state: present
  sudo: yes

- name: Configure Nginx
  copy:
    src: nginx.conf
    dest: /etc/nginx/sites-enabled/default
  sudo: yes

- name: Enable and start Nginx
  service:
    name: nginx
    enabled: yes
    state: restarted
  sudo: yes
```
```
# roles/nginx/files/nginx.conf
server {
    listen 80 default_server;

    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}
```
Running the playbook again will launch another instance, install some useful packages, deploy our application and set up Nginx as our web server. If you browse to the newest instance at its hostname, as reported in the output of ansible-playbook, you should see a "Hello World" page.
Now that the application is deployed and running, we can use the newly launched instance to build an AMI. Create the build-ami role and amend deploy.yml to invoke it.
```yaml
---
# deploy.yml
- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami
```
```yaml
---
# roles/build-ami/tasks/main.yml
- name: Create AMI
  ec2_ami:
    region: "{{ region }}"
    instance_id: "{{ ec2_id }}"
    name: "webapp-{{ ansible_date_time.iso8601 | regex_replace('[^a-zA-Z0-9]', '-') }}"
    wait: yes
    state: present
  register: ami
```
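Optionally, you can tag the resulting image to make the AMI history easier to read when rolling back or cleaning up later. This is not part of the tutorial's build-ami role; it's a sketch using the ec2_tag module, and the tag keys are arbitrary choices:

```yaml
# optional extra task for roles/build-ami/tasks/main.yml
- name: Tag the new AMI with its source instance
  ec2_tag:
    region: "{{ region }}"
    resource: "{{ ami.image_id }}"
    state: present
    tags:
      Name: webapp
      source_instance: "{{ ec2_id }}"
```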
You'll probably have noticed by now that each time the playbook is run, Ansible launches a new instance. At this rate we'll keep accumulating instances we don't need, so we'll add a new play to locate any existing build instances at the start of the run, and a role to terminate them once the new AMI has been built.
```yaml
---
# deploy.yml
- name: Find existing instance(s)
  hosts: "tag_Name_ami-build"
  gather_facts: false
  tags: find
  tasks:
    - group_by: key=old-ami-build

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami

- hosts: old-ami-build
  connection: local
  gather_facts: no
  roles:
    - terminate
```
```yaml
---
# roles/terminate/tasks/main.yml
- name: Terminate old instance(s)
  ec2:
    instance_ids: "{{ ec2_id }}"
    region: "{{ region }}"
    state: absent
    wait: yes
```
Our AMI is built, so now we'll want to create a new Launch Configuration to describe the new instances that should be launched from this AMI. We'll create another role to handle that.
```yaml
---
# deploy.yml
- name: Find existing instance(s)
  hosts: "tag_Name_ami-build"
  gather_facts: false
  tags: find
  tasks:
    - group_by: key=old-ami-build

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami
    - create-launch-configuration

- hosts: old-ami-build
  connection: local
  gather_facts: no
  roles:
    - terminate
```
```yaml
---
# roles/create-launch-configuration/tasks/main.yml
- name: Create Launch Configuration
  ec2_lc:
    region: "{{ region }}"
    name: "webapp-{{ ansible_date_time.iso8601 | regex_replace('[^a-zA-Z0-9]', '-') }}"
    image_id: "{{ ami.image_id }}"
    key_name: "{{ keypair }}"
    instance_type: "{{ instance_type }}"
    security_groups: "{{ security_groups }}"
    volumes: "{{ volumes }}"
    instance_monitoring: yes
  register: lc_result   # used later to point the Auto Scaling Group at this Launch Configuration
```
Clients will connect to an Elastic Load Balancer which will distribute incoming requests among the instances we have launched into our upcoming Auto Scaling Group. Again we'll create another role to handle the management of the ELB, and apply it from our playbook.
```yaml
---
# deploy.yml
- name: Find existing instance(s)
  hosts: "tag_Name_ami-build"
  gather_facts: false
  tags: find
  tasks:
    - group_by: key=old-ami-build

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami
    - create-launch-configuration
    - load-balancer

- hosts: old-ami-build
  connection: local
  gather_facts: no
  roles:
    - terminate
```
```yaml
---
# roles/load-balancer/tasks/main.yml
- name: Configure Elastic Load Balancer
  ec2_elb_lb:
    region: "{{ region }}"
    name: webapp
    state: present
    zones: "{{ zone }}"
    connection_draining_timeout: 60
    listeners:
      - protocol: http
        load_balancer_port: 80
        instance_port: 80
    health_check:
      ping_protocol: http
      ping_port: 80
      ping_path: "/"
      response_timeout: 10
      interval: 30
      unhealthy_threshold: 6
      healthy_threshold: 2
  register: elb_result
```
We'll create an Auto Scaling Group and configure it to use the Launch Configuration we previously created. Within the boundaries that we define, AWS will launch instances into the ASG dynamically based on the current load across all instances. Equally when the load drops, some instances will be terminated accordingly. Exactly how many instances are launched or terminated is defined in one or more scaling policies, which are also created and linked to the ASG.
```yaml
---
# deploy.yml
- name: Find existing instance(s)
  hosts: "tag_Name_ami-build"
  gather_facts: false
  tags: find
  tasks:
    - group_by: key=old-ami-build

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami
    - create-launch-configuration
    - load-balancer
    - auto-scaling

- hosts: old-ami-build
  connection: local
  gather_facts: no
  roles:
    - terminate
```
```yaml
---
# roles/auto-scaling/tasks/main.yml
- name: Retrieve current Auto Scaling Group properties
  command: "aws --region {{ region }} autoscaling describe-auto-scaling-groups --auto-scaling-group-names webapp"
  register: asg_properties_result

- name: Set asg_properties variable from JSON output if the Auto Scaling Group already exists
  set_fact:
    asg_properties: "{{ (asg_properties_result.stdout | from_json).AutoScalingGroups[0] }}"
  when: (asg_properties_result.stdout | from_json).AutoScalingGroups | count

- name: Configure Auto Scaling Group and perform rolling deploy
  ec2_asg:
    region: "{{ region }}"
    name: webapp
    launch_config_name: "{{ lc_result.name }}"   # the Launch Configuration registered by the previous role
    availability_zones: "{{ zone }}"
    health_check_type: ELB
    health_check_period: 300
    desired_capacity: "{{ asg_properties.DesiredCapacity | default(2) }}"
    replace_all_instances: yes
    replace_batch_size: "{{ (asg_properties.DesiredCapacity | default(2) / 4) | round(0, 'ceil') | int }}"
    min_size: "{{ min_size }}"
    max_size: "{{ max_size }}"
    load_balancers:
      - webapp
    state: present
  register: asg_result

- name: Configure Scaling Policies
  ec2_scaling_policy:
    region: "{{ region }}"
    name: "{{ item.name }}"
    asg_name: webapp
    state: present
    adjustment_type: "{{ item.adjustment_type }}"
    min_adjustment_step: "{{ item.min_adjustment_step }}"
    scaling_adjustment: "{{ item.scaling_adjustment }}"
    cooldown: "{{ item.cooldown }}"
  with_items:
    # one policy to add an instance and one to remove an instance;
    # the adjustment and cooldown values are representative and can be tuned
    - name: "Increase Group Size"
      adjustment_type: "ChangeInCapacity"
      scaling_adjustment: +1
      min_adjustment_step: 1
      cooldown: 180
    - name: "Decrease Group Size"
      adjustment_type: "ChangeInCapacity"
      scaling_adjustment: -1
      min_adjustment_step: 1
      cooldown: 300
  register: sp_result

- name: Determine Metric Alarm configuration
  set_fact:
    metric_alarms:
      - name: "{{ asg_name }}-ScaleUp"
        comparison: ">="
        threshold: 50.0
        alarm_actions:
          - "{{ sp_result.results[0].arn }}"
      - name: "{{ asg_name }}-ScaleDown"
        comparison: "<="
        threshold: 20.0
        alarm_actions:
          - "{{ sp_result.results[1].arn }}"

- name: Configure Metric Alarms and link to Scaling Policies
  ec2_metric_alarm:
    region: "{{ region }}"
    name: "{{ item.name }}"
    state: present
    metric: "CPUUtilization"
    namespace: "AWS/EC2"
    statistic: "Average"
    comparison: "{{ item.comparison }}"
    threshold: "{{ item.threshold }}"
    period: 60
    evaluation_periods: 5
    unit: "Percent"
    dimensions:
      AutoScalingGroupName: "{{ asg_name }}"
    alarm_actions: "{{ item.alarm_actions }}"
  with_items: metric_alarms
  when: max_size > 1
  register: ma_result
```
There's more going on here too. We not only configure our ASG and scaling policies, but also create CloudWatch metric alarms to measure the load across our instances, and associate them with the corresponding scaling policies to complete our configuration.
Here we have configured our CloudWatch alarms to trigger based on aggregate CPU usage within our auto scaling group. When the average CPU utilization exceeds 50% across your instances for 5 consecutive samples taken every 60 seconds (i.e. 5 minutes), a scaling event will be triggered that launches a new instance to relieve the load. A corresponding CloudWatch alarm also triggers a scaling event to terminate an instance from the auto scaling group when the average CPU utilization drops below 20% across your instances for the same sample period.
The minimum and maximum sizes for the Auto Scaling Group are set to 2 and 10 respectively. It's important to get these values right for your application workload. You don't want to be under-resourced for early peaks in traffic, and for redundancy it's a good idea to always have at least 2 instances in service. Equally, you probably want your application to scale for peak periods, but perhaps not beyond a safety limit, in case a massive burst of traffic results in escalating costs.
Particularly important to note here is how we configure the ec2_asg module to perform rolling deploys. First, we determine how many instances the ASG currently has running and use this to specify our desired_capacity and to calculate a suitable replace_batch_size. The replace_all_instances option specifies that all currently running instances should be replaced, in batches of replace_batch_size, by new instances using the new Launch Configuration. Together, this ensures that the capacity of our ASG is not adversely affected during the deploy and allows us to safely deploy at any time, whether we are currently running 5 or 5000 instances! For example, with 10 instances in service the batch size works out as ceil(10 / 4) = 3, so instances are replaced three at a time. Of course, the more instances you have running, the longer the entire process will take; you may wish to increase replace_batch_size if you are consistently running more instances.
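If you want to see what the batch-size expression evaluates to for a given capacity, a throwaway play like the following will print it; the desired value of 10 here is just an example:

```yaml
# batch-size.yml -- illustrative only
- hosts: localhost
  connection: local
  gather_facts: no
  vars:
    desired: 10
  tasks:
    - name: Show the rolling replace batch size for the example capacity
      debug:
        msg: "Batch size: {{ (desired | int / 4) | round(0, 'ceil') | int }}"
```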
If you have a domain name, or subdomain, set up with AWS Route 53, you can have Ansible update its DNS record to point at the load balancer in front of your Auto Scaling Group. Define a domain variable (for example in group_vars/all.yml) and add a dns role.
```yaml
---
# deploy.yml
- name: Find existing instance(s)
  hosts: "tag_Name_ami-build"
  gather_facts: false
  tags: find
  tasks:
    - group_by: key=old-ami-build

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami
    - create-launch-configuration
    - load-balancer
    - auto-scaling
    - dns

- hosts: old-ami-build
  connection: local
  gather_facts: no
  roles:
    - terminate
```
```yaml
---
# roles/dns/tasks/main.yml
- name: Update DNS
  route53:
    command: create
    overwrite: yes
    zone: "{{ domain }}"
    record: "www.{{ domain }}"
    type: CNAME
    ttl: 300
    value: "{{ elb_result.elb.dns_name }}"
```
Whilst we have already configured Ansible to terminate the old instances used for building AMIs, we will still accumulate Launch Configurations and AMIs each time we invoke the deploy.yml playbook. This might not appear to be much of a problem at the outset (financial costs aside), but it will soon become an issue due to service limits imposed by AWS. At the time of writing, the relevant limit was 100 Launch Configurations per region; once it is reached, no more can be created and our playbook will start to fail.
Note that whilst you can request increased limits per region for your account, in our experience sometimes these requests are refused on the grounds that AWS would prefer for you to clean up your cruft instead of relying on perpetual service limit increases.
Leaving unused resources lying around is not good practice in any case, and we certainly don't want to be paying for them unnecessarily. To fix this, we'll use the ec2_ami_find and ec2_ami modules to delete the older AMIs, and a quick and dirty (but effective) hand-rolled module to discard old Launch Configurations.
```yaml
---
# deploy.yml
- name: Find existing instance(s)
  hosts: "tag_Name_ami-build"
  gather_facts: false
  tags: find
  tasks:
    - group_by: key=old-ami-build

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - launch

- hosts: ami-build
  remote_user: ubuntu
  roles:
    - deploy
    - nginx

- hosts: ami-build
  connection: local
  gather_facts: no
  roles:
    - build-ami
    - create-launch-configuration
    - load-balancer
    - auto-scaling
    - dns

- hosts: localhost
  connection: local
  gather_facts: no
  roles:
    - delete-old-amis
    - delete-old-launch-configurations

- hosts: old-ami-build
  connection: local
  gather_facts: no
  roles:
    - terminate
```
```yaml
---
# roles/delete-old-amis/tasks/main.yml
- name: Find old AMIs
  ec2_ami_find:
    region: "{{ region }}"
    owner: self
    name: "webapp-*"
    sort: name
    sort_end: -10
  register: old_ami_result

- name: Delete old AMIs and their snapshots
  ec2_ami:
    region: "{{ region }}"
    image_id: "{{ item.ami_id }}"
    delete_snapshot: yes
    state: absent
  with_items: old_ami_result.results
  ignore_errors: yes
```
```yaml
---
# roles/delete-old-launch-configurations/tasks/main.yml
- name: Find old Launch Configurations
  lc_find:
    region: "{{ region }}"
    name_regex: "webapp-.*"
    sort: yes
    sort_end: -10
  register: old_lc_result

- name: Delete old Launch Configurations
  ec2_lc:
    region: "{{ region }}"
    name: "{{ item.name }}"
    state: absent
  with_items: old_lc_result.results
  ignore_errors: yes
```
```python
#!/usr/bin/python
# roles/delete-old-launch-configurations/library/lc_find.py

import json
import re
import subprocess


def main():
    argument_spec = ec2_argument_spec()
    argument_spec.update(dict(
        region=dict(required=True, aliases=['aws_region', 'ec2_region']),
        name_regex=dict(required=False),
        sort=dict(required=False, default=None, type='bool'),
        sort_order=dict(required=False, default='ascending', choices=['ascending', 'descending']),
        sort_start=dict(required=False),
        sort_end=dict(required=False),
    ))
    module = AnsibleModule(argument_spec=argument_spec)

    name_regex = module.params.get('name_regex')
    sort = module.params.get('sort')
    sort_order = module.params.get('sort_order')
    sort_start = module.params.get('sort_start')
    sort_end = module.params.get('sort_end')

    # Shell out to the AWS CLI to list all Launch Configurations in the region
    lc_cmd_result = subprocess.check_output(["aws", "autoscaling", "describe-launch-configurations",
                                             "--region", module.params.get('region')])
    lc_result = json.loads(lc_cmd_result)

    results = []
    for lc in lc_result['LaunchConfigurations']:
        data = {
            'arn': lc["LaunchConfigurationARN"],
            'name': lc["LaunchConfigurationName"],
        }
        results.append(data)

    # Optionally filter by name, sort, then slice the result list
    if name_regex:
        regex = re.compile(name_regex)
        results = [result for result in results if regex.match(result['name'])]
    if sort:
        results.sort(key=lambda e: e['name'], reverse=(sort_order == 'descending'))

    try:
        if sort and sort_start and sort_end:
            results = results[int(sort_start):int(sort_end)]
        elif sort and sort_start:
            results = results[int(sort_start):]
        elif sort and sort_end:
            results = results[:int(sort_end)]
    except (TypeError, ValueError):
        module.fail_json(msg="Please supply numeric values for sort_start and/or sort_end")

    module.exit_json(results=results)


from ansible.module_utils.basic import *
from ansible.module_utils.ec2 import *

if __name__ == '__main__':
    main()
```
When these roles are used together, Ansible will maintain a history of 10 AMIs and 10 Launch Configurations prior to the latest one of each. This will provide our rollback capability; in the event that you wish to roll back to an earlier deployed version of your application, you can update the active Launch Configuration in your Auto Scaling Group settings and replace your instances by terminating them in batches. Auto Scaling will start up new instances with your specified launch configuration in order to fulfill the desired instance count.
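The rollback itself can also be expressed as a small playbook: point the ASG at a previous Launch Configuration and let ec2_asg replace the instances in batches. This is not part of the tutorial's playbook, just a minimal sketch that assumes the variables from group_vars/all.yml are available and that rollback_lc_name (the name of a known-good Launch Configuration) is an extra variable you pass in yourself:

```yaml
# rollback.yml -- hypothetical sketch
- hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - name: Roll the Auto Scaling Group back to a previous Launch Configuration
      ec2_asg:
        region: "{{ region }}"
        name: webapp
        launch_config_name: "{{ rollback_lc_name }}"
        availability_zones: "{{ zone }}"
        health_check_type: ELB
        health_check_period: 300
        min_size: "{{ min_size }}"
        max_size: "{{ max_size }}"
        replace_all_instances: yes
        load_balancers:
          - webapp
        state: present
```

You might invoke it with something like ansible-playbook rollback.yml --extra-vars "rollback_lc_name=NAME_OF_A_PREVIOUS_LC", after looking up the Launch Configuration you want to return to.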
Now that we have a completed playbook to handle deployments of our application to EC2 Auto Scaling, all that remains is to hook it up to your existing systems to invoke it whenever you want a new deploy to occur. We'll cover that in a later blog post.
All the code from this article is available on GitHub.