Chef object versioning aligned with deployed applications



Warning: Long post ahead.

I have a situation that I could use some input on. First, let me briefly describe the development process that our infrastructure supports.

There are four non-production environments that our internal applications progress through: delivery, integration (build servers), QA (internal), and UAT (partners). Once a release has passed through UAT and is approved, it is scheduled for release into production. As developers finish features, they get merged into a branch called ‘develop’. Develop always runs in the delivery and integration environments. At the end of an iteration, we create a release branch, based on develop, in all the projects that get deployed (currently about 15). This release branch then gets deployed to QA and UAT, where further testing happens. If problems are found, patches are applied directly to the release branch and eventually get merged back into develop. When a release is ready to go to production, it gets merged into the master branch, tagged with the release number, and deployed to the production environment.
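
To summarize the flow above, here is a small Python sketch of which branch normally runs where. The `release/<number>` branch naming and the use of the bare release number as the production tag are assumptions for illustration only; the environment names are written in lowercase.

```python
def branch_for_environment(app_environment, release):
    """Return the git ref that normally runs in an application environment."""
    if app_environment in ("delivery", "integration"):
        # Develop always runs in delivery and integration.
        return "develop"
    if app_environment in ("qa", "uat"):
        # Hypothetical naming scheme for the per-iteration release branch.
        return "release/" + release
    if app_environment == "production":
        # Assumed: the tag on master is just the release number.
        return release
    raise ValueError("unknown application environment: " + app_environment)


if __name__ == "__main__":
    for env in ("delivery", "integration", "qa", "uat", "production"):
        print(env, "->", branch_for_environment(env, "3.0"))
```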

We use Chef to manage not only the supporting applications (db servers, web servers, build tools, etc.) but also the configuration files the applications need. For example, we have Chef recipes that create unicorn configs, mongoid configs, etc. Our node files store not only the run list but also information about which of our internal applications are going to be deployed onto each node. When it’s time to deploy, for each application we ask Chef ‘What node does this application get deployed to for this environment?’, and then feed the answer into Capistrano. This allows us to move things around as we need to without having to update the applications themselves. It also allows us to test deployments on Vagrant VMs, which is very helpful.
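
Roughly, the lookup amounts to something like the following Python sketch. It runs against an in-memory list of node dictionaries rather than a real Chef server, and the attribute names (`app_environment`, `deployed_apps`) as well as the node and application names are hypothetical, picked only for illustration.

```python
# Hypothetical node data; in practice this would come from the Chef server.
NODES = [
    {"name": "uat-web-01", "app_environment": "uat",
     "deployed_apps": ["storefront", "api"]},
    {"name": "uat-worker-01", "app_environment": "uat",
     "deployed_apps": ["jobs"]},
    {"name": "qa-web-01", "app_environment": "qa",
     "deployed_apps": ["storefront", "api", "jobs"]},
]


def nodes_for_app(nodes, app, app_environment):
    """Return the sorted names of nodes that host `app` in `app_environment`."""
    return sorted(
        node["name"]
        for node in nodes
        if node.get("app_environment") == app_environment
        and app in node.get("deployed_apps", [])
    )


if __name__ == "__main__":
    print(nodes_for_app(NODES, "api", "uat"))
```

The resulting list of node names is what would be handed to Capistrano as the deployment targets.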

Now the issue that I’m facing.

We recently hit a situation in which the unicorn configs were changed for a release (say 3.0) and got deployed to all non-production environments. Then we needed to redeploy an older version (say 2.0) of our product to UAT to test a hot fix going to production. The deploy of the older code went fine, but the unicorn configs were incompatible with it, and we had problems with the applications until we figured out what was going on. Since 2.0 was only going to be in UAT temporarily, my solution to the problem was to disable chef-client and manually change the unicorn configs.

This situation got me thinking about being able to reproduce an environment at any given time or roll back to a particular state for some reason, like to create a mirror of production for partner testing (which is actually a chore I have to do soon). Currently we use one Chef server in non-production and one in production, but they are not always in sync. Changes to Chef objects in non-production will eventually go into production, but there can be a lag of several weeks, depending on the production release schedule. This means that I go long stretches without being able to easily mirror the production environment, except by restoring a backup (which isn’t a great solution because of data).

I’ve been thinking about possible solutions, and one possibility is to cut branches of all Chef objects to coincide with the release branches of the applications. This would require a separate Chef server for each environment, but I could drop everything, switch to the correct branch, reload everything, and then deploy the applications. This would allow me to easily mirror any environment/version at any time. I’ve talked to a couple of people about this, and the only real hang-up is the separate Chef servers. Does anyone know of a way to accomplish a closer relationship between the infrastructure and the applications it supports using a single Chef server? I’ve heard that Private Chef’s multi-tenancy might work with logical/virtual Chef servers. Does anyone know if that is the case?
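
As a rough illustration of the branch-per-release idea, the Python sketch below only builds the commands for the "switch branch and reload" part; it does not run anything. The `release/<number>` branch name and the per-environment knife config path are assumptions, the commands are meant to be run from the root of the Chef repository, and the "drop everything" step plus the reloading of roles, environments, and data bags are left out.

```python
def reload_plan(release, app_environment):
    """Build (but do not run) the commands to load one release's Chef objects."""
    branch = "release/" + release  # hypothetical branch naming
    # Hypothetical knife config pointing at that environment's Chef server.
    knife_config = ".chef/knife-%s.rb" % app_environment
    return [
        # Switch the Chef repository to the branch matching the app release.
        ["git", "checkout", branch],
        # Upload every cookbook on that branch to the selected Chef server.
        ["knife", "cookbook", "upload", "--all", "--config", knife_config],
    ]


if __name__ == "__main__":
    for command in reload_plan("2.0", "uat"):
        print(" ".join(command))
```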

One other piece of information that might be helpful is that our application environments (delivery, integration, etc.) do not coincide with chef_environments. We use more generic chef_environments like infrastructure, development, and production.

Your input is very much appreciated. Sorry for such a long post.