Validating Ansible playbooks with Docker and Serverspec

I’m currently working with a client to bring in DevOps practices, including infrastructure automation and continuous delivery, both using Ansible.

Why Ansible?

Ansible is a great choice for an organisation new to this sort of automation thanks to its easy learning curve and agentless architecture. Whereas alternatives such as Chef and Puppet require the installation of agents running as root on each target machine, Ansible needs nothing more than to be able to establish an ssh session as a user with appropriate permissions and a functioning Python installation on the target machine. Ansible can even bootstrap the installation of Python using plain shell commands if required.
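
For example, the raw module sends a plain shell command over ssh with no Python needed on the target, so a bootstrap play might look something like this (an illustrative sketch, not taken from our playbooks; the group name and the Debian-style package manager are assumptions):

- hosts: new_servers
  gather_facts: no   # fact gathering itself needs Python, so defer it
  tasks:
    # raw runs a plain shell command over ssh, so no Python is needed on the target yet
    - raw: test -e /usr/bin/python || (apt-get update && apt-get install -y python)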

Permissions

Ansible relies heavily on dynamically generated Python scripts running on the target machine to perform actions. Core Ansible doesn't include any permissions model, so the only mechanisms available to define the actions it can perform on a given target are standard Unix account access controls: file permissions, group membership and, where root is needed, sudoers. But because Ansible modules run as dynamic Python scripts, it turns out they can't in practice be controlled through sudoers. The only options are to grant permission to run what amounts to arbitrary Python scripts as root (which gives no protection at all) or to grant no permission to elevate to root, which makes some actions impossible. The commercial Ansible Tower product does layer on a permissions model, but if you're not ready to pay for that and still need to control permissions then unfortunately the only effective way to allow actions to run as root is to fall back to lower-level scripting. So instead of

- yum: name=httpd
  become: yes

we use,

- command: sudo yum install -y httpd

which is certainly less elegant, but tolerable. In other cases things become much uglier, especially when manipulating root-owned files, where we would normally reach for modules such as lineinfile.
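
For instance, where unrestricted become rights would let us manage a root-owned file declaratively, a locked-down sudoers forces an explicit command instead (a sketch; the host entry below is invented for illustration):

# With unrestricted become we could simply write:
# - lineinfile: dest=/etc/hosts line="10.0.0.5 vault-server"
#   become: yes

# With restricted sudoers we fall back to a command that must exactly match an allowed entry,
# and have to hand-roll the idempotence that lineinfile gives us for free:
- command: sudo sh -c "grep -q vault-server /etc/hosts || echo '10.0.0.5 vault-server' >> /etc/hosts"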

In these cases we lose some of the power and much of the elegance of Ansible, which is a shame. It also means you have an unpleasantly tight coupling between the commands allowed by sudoers and the playbooks. This is something we’re still actively investigating to try to find a workable solution and we may yet find we need to start using Tower.

Validating playbooks

Ansible will abort with an error if a task step fails, but how can we verify that a successful run has brought the target machine into the expected state? Careful use of Ansible commands can verify some aspects, but if you want independent verification then the default choice is Test Kitchen, which has a pluggable architecture allowing it to work with a variety of configuration management systems (including Chef, Puppet and Ansible) against a choice of what it calls drivers (including Vagrant, AWS and Docker), using a variety of test frameworks. Unfortunately the kitchen-ansible gem, which provides the integration with Ansible, didn't suit our needs.

What’s wrong with kitchen-ansible?

We wanted to run a routine like this:

  1. Create a clean test target (e.g. VM)
  2. Apply the Ansible playbook to the test target
  3. Verify the state of the test target is as expected
  4. Apply the Ansible playbook again, verifying there are no errors
  5. Verify the state is still as expected, proving the playbook is idempotent

And kitchen-ansible does support this, but not quite in the way you might imagine. Rather than running Ansible and the tests from outside the test target, kitchen-ansible installs Ansible and the test framework into the target and runs each from within, pointed at localhost. This has a couple of major drawbacks. Firstly, the installation takes significant time. We chose Serverspec as our test framework, which is Ruby-based, and installing Ruby can be quite time-consuming. Secondly, and more problematically, installing these frameworks means the test target is polluted and cannot be trusted to behave the same way a clean system would. This concern is much smaller with Chef or Puppet, since both already require a Ruby-based agent on the target system, so the extra pollution and installation overhead are minimal; for Ansible, though, this model doesn't work well.

So what’s the alternative?

This flawed testing model is not a necessary outcome of using Test Kitchen: it would be quite possible to create an alternative to kitchen-ansible which works differently. We started simple and rolled our own scripts, though this could grow into an alternative Test Kitchen plugin. We chose Docker as our test bed because of its blisteringly fast startup time, and I had enough experience with Docker to be fairly confident that a container could be constructed which captured all the essential elements of a VM. We ended up with this test flow:

  1. Create a clean test target by building a Docker image which includes an SSH server and running it
  2. Apply the Ansible playbook to the test target through an SSH port exposed from the container
  3. Verify the state of the test target is as expected, again over an SSH session
  4. Apply the Ansible playbook again, verifying there are no errors
  5. Verify the state is still as expected, proving the playbook is idempotent
  6. Remove the Docker container if all previous steps completed successfully

The Docker setup

For a valid test our Ansible scripts need to be able to connect to the container in the same way they would to a VM and they need to be able to install and start services. This means the container must do two things that would normally be considered anti-patterns: run an ssh service and run an init system. After some trial and error we ended up with two layers: a base image which contained common setup,

FROM centos:centos6

# Disable PAM or ssh fails with GSS errors.
RUN yum install -y openssh-server sudo && \
 sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config && \
 sed -ri 's/#UsePAM no/UsePAM no/g' /etc/ssh/sshd_config

# Need to allow sudo without a tty for ansible to successfully connect
RUN sed -i 's/\(Defaults\s*requiretty\)/#\1/' /etc/sudoers

# Add ansible user, which will be used to run ansible
RUN useradd -m ansible && \
 echo ansible | passwd ansible --stdin && \
 printf "Cmnd_Alias ANSIBLE_CMDS = /usr/bin/yum install *\nansible ALL=(ALL) NOPASSWD:ANSIBLE_CMDS\n" > /etc/sudoers.d/ansible

# Set up ssh keys for ansible user

RUN mkdir -p /home/ansible/.ssh/ && \
 chown ansible:ansible /home/ansible/.ssh/ && \
 chmod 0700 /home/ansible/.ssh

COPY id_rsa.pub /home/ansible/.ssh/authorized_keys

RUN chown ansible:ansible /home/ansible/.ssh/authorized_keys && \
 chmod 0600 /home/ansible/.ssh/authorized_keys

# Add serverspec user, which will be used to run serverspec
RUN useradd -m serverspec && \
 echo serverspec | passwd serverspec --stdin && \
 printf "serverspec ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/serverspec

# Set up ssh keys for serverspec user

RUN mkdir -p /home/serverspec/.ssh/ && \
 chown serverspec:serverspec /home/serverspec/.ssh/ && \
 chmod 0700 /home/serverspec/.ssh

COPY id_rsa.pub /home/serverspec/.ssh/authorized_keys

RUN chown serverspec:serverspec /home/serverspec/.ssh/authorized_keys && \
 chmod 0600 /home/serverspec/.ssh/authorized_keys

# expose port 22 for ssh
EXPOSE 22

# init is our root process since we need to allow ansible to start services
CMD ["/sbin/init"]

and a playbook-specific layer, e.g. for testing a playbook which installs Hashicorp Vault,

FROM ansible-test

# Additional permissions for ansible user - this is obviously brittle
RUN printf "Cmnd_Alias ANSIBLE_VAULT_CMDS = /usr/sbin/setcap cap_ipc_lock=+ep /u01/app/vault/vault, /bin/cp -n --preserve=mode /u01/app/vault/staging/vault /etc/init.d/, /sbin/chkconfig --add vault, /sbin/service vault start\nansible ALL=(ALL) NOPASSWD:SETENV:ANSIBLE_VAULT_CMDS " > /etc/sudoers.d/vault

# Set up vault user
RUN useradd -m vault && \
 echo vault | passwd vault --stdin && \
 mkdir -p /u01/app/vault/

# Set up ssh keys for vault user

RUN mkdir -p /home/vault/.ssh/ && \
 chown vault:vault /home/vault/.ssh/ && \
 chmod 0700 /home/vault/.ssh

COPY id_rsa.pub /home/vault/.ssh/authorized_keys

RUN chown vault:vault /home/vault/.ssh/authorized_keys && \
 chmod 0600 /home/vault/.ssh/authorized_keys

# Create install directory

RUN mkdir -p /app/vault && \
 chown vault:vault /app/vault
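
One piece of supporting plumbing not shown above: the id_rsa.pub copied into each image is the public half of a throwaway keypair kept in the Docker build context, whose private key is used from the host by both Ansible and Serverspec. Something like the following would produce it and build the base image (a sketch; the filenames are assumptions):

# Generate a passwordless keypair in the build context (assumed to be the repo root);
# id_rsa stays on the host, id_rsa.pub is baked into the images as authorized_keys
ssh-keygen -t rsa -N '' -f id_rsa

# Build and tag the base image so the playbook-specific layer can use "FROM ansible-test"
docker build -t ansible-test -f docker/Dockerfile-base .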

Apart from the problem, discussed above, of needing to specialise the sudo rights for the specific playbook, this worked beautifully. There were some complexities, but these mostly turned out to be a product either of a corporate requirement to make outgoing HTTP connections via a proxy (requiring some extra setup in the Dockerfile, not shown here) or of the ansible_spec gem, which lets Serverspec reuse the Ansible inventory and playbook structure: its parsing of playbooks is a little picky and its error handling somewhat minimal.

What was this like in use?

The most basic scenario of a single clean run of Ansible with verification looks like this:

# remove old container if present (e.g. if previous run failed part way), ignoring errors
docker rm -f test-target || true

docker build -f docker/Dockerfile-test-target -t test-target .

docker run -d --name test-target -p 2022:22 test-target

# Wait until ssh from container is responding
until ssh ansible@localhost -p 2022 -q -C 'echo'; do echo 'Waiting for container to respond to ssh' ; sleep 1; done

ansible-playbook -i environments/hosts-local-docker main.yml

TARGET_USER=serverspec rake serverspec:install-vault

# remove container if we get this far (i.e. if there are no errors previously)
docker rm -f test-target
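
The environments/hosts-local-docker inventory referenced above just needs to point Ansible at the container's forwarded ssh port. Something along these lines would do it (a sketch; the group name is an assumption and the real file isn't shown):

# Hypothetical inventory for the Docker test target
# (on Ansible versions before 2.0 the variables are ansible_ssh_port and ansible_ssh_user)
[test_targets]
localhost ansible_port=2022 ansible_user=ansible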

We managed to achieve much faster feedback on playbook changes than before, drastically reducing playbook development time. Docker’s caching when building images means that, even if for simplicity we perform a build before each run, the container is up and running in less than a second. For the installation and configuration of a single node Hashicorp Vault instance, the test cycle time is around 12 seconds to start the container, run Ansible against it, validate with Serverspec tests and destroy the container again. This is significantly quicker than we could have achieved if using a VM as the test target.

The only time we noticed any functional difference between running in Docker and targeting a VM was caused by a specific Vault requirement to disable swapping of the vault process, to prevent unencrypted secrets from being persisted. This is achieved with an Ansible step that runs setcap cap_ipc_lock=+ep on the vault binary. In the VM that was all we had to do, but with Docker we also had to grant the container the corresponding capability by running it with --cap-add IPC_LOCK. This kind of leaky abstraction seemed to be rare, but it is the price you pay for the much faster validation feedback.
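
In practice that just means adding the capability to the docker run command shown earlier:

docker run -d --name test-target --cap-add IPC_LOCK -p 2022:22 test-target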

Our workflow is now:

  1. Quick and regular feedback using Docker while developing
  2. Further validation by running on a clean Vagrant box, which is slower but a more realistic test
  3. Once happy, apply to the ultimate target.

What next?

Further refinement would include creating an alternative Ansible plugin for Test Kitchen to replace our shell scripts, and investigating the possibility of extracting some more of the boilerplate which ansible_spec generates into the plugin.
