App deployments with Chef

Ohai,

I know this is a perennial bugaboo, but I’m curious how folks are handling internally-developed app deployments with Chef these days, and specifically if anyone has found a good solution that achieves all of the following:

  1. When provisioning a new node, Chef can deploy the app(s) that should be on that node. knife ec2 server create … should be the only command I need to run to go from “no server” to “it’s added to the LB and handling requests.” In other words, you don’t need to provision with Chef, then deploy with Capistrano, etc.

  2. When running chef-client on nodes that already have some version of the app(s) running, make sure that they all run the same version and upgrade simultaneously and atomically (or as close to that as possible). So, for example, if you were storing a git rev in a data bag as the “current version I want deployed” and then one node kicked off its regular chef-client run and upgraded to that, the other nodes running that app would then still be on an old version of that code. That shouldn’t happen, but it would with the simplest use of the Chef deploy resource and the splayed, regular interval chef-client runs.

  3. Don’t lose the ability to run chef-client automatically at regular intervals on all nodes. This is important. I have better things to do than running Chef code manually on servers when things change (and running that detection algorithm in my brain) and then cleaning up whatever bit rot breakage occurred since the last chef-client run hours/days/weeks/months ago.

It seems the main tension here is that #1 requires Chef to be able to kick off a deploy to a single node, whereas #2 requires Chef either to never deploy code to nodes that already have some version of it (and thus require a separate manual process to do simultaneous, atomic upgrades) OR to kick off an orchestrated run across all nodes that need to be upgraded (which really isn't Chef's forte, AFAICT).

Something like Capistrano is handy because multi-server orchestration is its bread and butter, but getting Chef and Capistrano to communicate and share data is… tricky.

If the deploy resource could be configured to only deploy when no version of the app was present (covering #1), or when a default-false parameter were set to true by a manual knife ssh run covering all nodes that run that app, that could work. Does that seem reasonable?
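Roughly, a sketch of what I mean (the repo URL, attribute, and data bag names here are hypothetical):

    # Hypothetical sketch of the guard around the deploy resource
    app_dir = '/srv/myapp'
    release = data_bag_item('apps', 'myapp')['revision'] # "current version I want deployed"

    deploy app_dir do
      repo 'git@github.com:example/myapp.git' # hypothetical repo
      revision release
      # Deploy on first convergence (no version present yet), or when a
      # default-false attribute has been flipped for this run, e.g. via
      #   knife ssh 'role:myapp' 'sudo chef-client -j /etc/chef/force_deploy.json'
      # where the JSON file sets {"myapp": {"force_deploy": true}}.
      only_if do
        !::File.exist?("#{app_dir}/current") || node['myapp']['force_deploy']
      end
    end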

  • Wes

If you're using EC2, I'd look at customizing an AMI for Elastic Beanstalk
and using EB to manage your app deployments. YMMV, of course.

--
~~ StormeRider ~~

"Every world needs its heroes [...] They inspire us to be better than we
are. And they protect from the darkness that's just around the corner."

(from Smallville Season 6x1: "Zod")

On why I hate the phrase "that's so lame"... http://bit.ly/Ps3uSS


On Thursday, October 10, 2013 at 2:25 PM, Wes Morgan wrote:

  2. When running chef-client on nodes that already have some version of the app(s) running, make sure that they all run the same version and upgrade simultaneously and atomically (or as close to that as possible). So, for example, if you were storing a git rev in a data bag as the "current version I want deployed" and then one node kicked off its regular chef-client run and upgraded to that, the other nodes running that app would then still be on an old version of that code. That shouldn't happen, but it would with the simplest use of the Chef deploy resource and the splayed, regular interval chef-client runs.

The only way I've sanely been able to handle this is to do breaking deployments in two parts, so that the old version and the new one CAN coexist in production for some period of time. For DB migrations, for example, this involves:

  1. Creating a release which adds the new columns to the database and writes to both the new and the old columns, while still reading from the old columns.
  2. Waiting for the deployment to become consistent across all nodes.
  3. Creating a second deployment which reads from the new columns and drops the old columns.
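A minimal sketch of step 1, assuming a Rails-style ActiveRecord migration (table and column names hypothetical):

    # Step 1 (hypothetical): add the new column and backfill it. The app
    # release that ships alongside this writes to BOTH full_name and the
    # old name column, but still reads only from the old one.
    class AddFullNameToUsers < ActiveRecord::Migration
      def up
        add_column :users, :full_name, :string
        execute "UPDATE users SET full_name = name" # backfill from the old column
      end

      def down
        remove_column :users, :full_name
      end
    end

Step 3 is then the mirror image: switch reads over to full_name and drop the old column once every node is on the new release.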

Graham

We’ve accomplished something pretty close to this.

For #1, we’re using Rackspace, so we create a template with chef-client pre-installed. We have a script that runs to deploy the node based on the template, pushes a chef config file and keys, sets up the basic node properties in chef including assigning roles, and then triggers the initial chef run. I suppose we could do the same with knife but the script works better for our purposes.

For #2, our code is packaged into Debian packages and uploaded to our private Debian repository, and the package version is saved as an attribute in the environment. Each node installs the version that's in the environment assigned to it. Upgrading means bumping the version in the environment, and within 15 minutes all our nodes are upgraded.
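As a sketch, that pattern looks roughly like this (the environment, attribute, and package names are hypothetical):

    # environments/production.rb (hypothetical)
    name 'production'
    description 'Production'
    default_attributes(
      'myapp' => { 'version' => '1.4.2-1' } # bump this to roll out an upgrade
    )

    # in the deploy recipe: install exactly the version pinned in the environment
    package 'myapp' do
      version node['myapp']['version']
      action :install
    end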

For #3, chef-client runs every 15 minutes regardless. We only shut it down temporarily if we have to make an emergency change (and this requires wearing the hat of shame) that needs to be in place before the code or configuration change can make its way through the build process (about 3 hours including the full test suite), after which Chef is re-enabled to roll out the upgrade.

cheers
mike


Michael Hart
Arctic Wolf Networks
M: 226.388.4773


We like RunDeck for deployment orchestration, though obviously we are
interested in trying pushy. (Or push. Or whatever it's called this week.)
In most cases our RunDeck jobs trigger a Chef run on a specific scope of
Chef nodes within an environment, in series or in parallel ... though we
have some folks using a deployment pattern where RunDeck updates a data bag
before or after running Chef, and/or triggers some other post-deploy
process.

Could you do all that in a Chef run? It's probably possible but I'm not
sure a Chef handler running on the node is going to be the right place for
every service validation mechanism. You end up potentially adding many
more gem dependencies into Chef's Ruby install when you do that, which is
perhaps not desirable in production.

So maybe what you do is write a lightweight Chef report handler that hits a
web service saying, "This node's just finished deploying version Y of
Product X and needs to be validated." That web service could be, say, a CI
service or something else capable of running tests against a parameterized
endpoint and making pretty graphs of the result.
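A minimal sketch of such a handler (the service URL and attribute names are hypothetical); it would be registered in client.rb with report_handlers:

    # deploy_notify.rb -- hypothetical Chef report handler
    require 'chef/handler'
    require 'net/http'
    require 'json'

    class DeployNotify < Chef::Handler
      def report
        return unless run_status.success?
        uri = URI('https://validator.example.com/deployments') # hypothetical service
        req = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
        req.body = { 'node'    => node.name,
                     'product' => node['myapp']['name'],
                     'version' => node['myapp']['version'] }.to_json
        Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
      end
    end

    # in client.rb:
    #   report_handlers << DeployNotify.new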

Then that thing would be responsible for deciding your app was ready to
go into the production pool on the LB. (Maybe it hits the load balancer's
API directly to put the node into service, maybe it sets a "validated" flag
on the Chef node object or in a data bag somewhere and Chef takes care of
it?)

And all this assumes that you don't have to do any destructive DB
migrations. (You should follow Graham's advice and then you won't need to
solve for that scenario :) )


Hi,

I found Chef very rewarding for your #1. I don't see any need for
Capistrano, Fabric, whatever...

For your #2, if I understand you correctly, you mean something like:
you have your in-house web app. Historically it was deployed by
hand/Capistrano to several nodes and worked well. Now you add nodes
doing the same tasks that are completely Chef-managed.
I have that scenario here too, and it's a pain in the a**. But I got
very good results by simply scrapping those nodes after deploying
matching nodes of category #1.
tl;dr: Replace the #2 nodes with #1 nodes.

#3: When you write the recipes to deploy your app(s), include an
attribute for the version. Your production environment then has
versions/revisions pinned env-wide, while your staging/testing areas
don't have version/revision constraints set. So when your CI testing
turns green and the manual checking in staging also turns out well, you
can update the attributes in the production environment and have the
nodes pick up the new version within your normal interval.
When you have things like load balancers, don't just have them search
for nodes with the needed roles in the environment; also check that the
last time each node finished a chef-client run isn't too long ago. Then
the nodes where the update failed will only get a limited number of
requests, and once you've finished looking into them, they will be
picked up again by the load balancers...
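A rough sketch of that check inside a load balancer recipe (role name, template, and freshness threshold are all hypothetical); ohai_time is the timestamp Ohai records on the node object during each run:

    # Only pool backends whose last chef-client run was recent.
    max_age = 3600 # seconds; hypothetical threshold
    backends = search(:node, "role:foo_app AND chef_environment:#{node.chef_environment}")
    fresh_backends = backends.select do |n|
      n['ohai_time'] && (Time.now.to_i - n['ohai_time'].to_i) < max_age
    end

    template '/etc/haproxy/haproxy.cfg' do
      source 'haproxy.cfg.erb'
      variables(backends: fresh_backends)
      notifies :reload, 'service[haproxy]' # assumes a service[haproxy] resource elsewhere
    end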

Maybe that helps,

Arnold


Just wanted to throw in capistrano-chef (https://github.com/gofullstack/capistrano-chef), Capistrano extensions for Chef integration.

I have never used it and don't know if it would help for your specific
case, but you might find it useful...

Cheers, Torben


On 10/10/13 12:25 PM, Wes Morgan wrote:

It seems the main tension here is that #1 requires Chef to be able to kick off a deploy to a single node, whereas #2 requires Chef either to never deploy code to nodes that already have some version of it (and thus require a separate manual process to do simultaneous, atomic upgrades) OR to kick off an orchestrated run across all nodes that need to be upgraded (which really isn't Chef's forte, AFAICT).

You can always do something like this for simple orchestration:

knife ssh 'role:foo_app' 'sudo chef-client -o "role[foo_app]"'

That'll log in to every server that has role[foo_app] and run that role.
As long as orthogonal stuff like user accounts and ntp and dns is in
'role[base]', which is applied separately, you only run the
deployment for your foo_app code and whatever it depends upon. If it's
just one cookbook you need to run, then you can do that as well, or you
can use a role cookbook instead of a real role, etc. (And if you have a
mega-role that rolls up all the base cookbooks applied to your host,
then you probably have to use one of those approaches as well...
there's more than one way to do it...)

AFAIK, pushy is just a better and more scalable way of doing this, where
you can require N hosts to be up and M to succeed, get back good
reporting, etc., and it makes it easier to chain actions as long as the
prior ones succeed... I don't see why you can't start with knife ssh
and override run lists, though...

Here's how we handle deployments:

  • we pre-build EC2 images with all software that will need to be installed (but set to not start) to reduce our dependency on 3rd parties and time to production

  • EC2 nodes are provisioned via auto scaling and bootstrapped with a user data script that (among other things) updates client.rb with the appropriate Chef server, role, and environment, then kicks off a chef-client run

  • the deploy recipe searches a data bag to determine the correct build to deploy using the role as the key

  • builds are stored in S3 so we use s3_file to compare the remote build to the one already deployed, if one has been

  • if there's a new build, we have two scenarios

    • for public-facing deploys, the recipe uses a db as a lock mechanism to determine whether any other instances with the same role are deploying; if not, it inserts a row as a lock and starts the deploy. If there is an existing row/lock, the deploy is delayed until the next chef-client run (see the sketch below)
    • for non-public-facing deploys, the recipe just deploys
  • each deploy updates the db once it's done so we can query what build the instance is running. I have a ticket to monitor deploys and warn when/if we're in an inconsistent state after an acceptable amount of time.

  • chef-client runs every 30 minutes
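A rough sketch of that lock, assuming a MySQL table with a UNIQUE key on role (all names here are hypothetical):

    # Hypothetical deploy-lock sketch. Assumes a deploy_locks table with a
    # UNIQUE constraint on the role column, and the mysql2 gem in Chef's Ruby.
    require 'mysql2'

    db = Mysql2::Client.new(host: 'ops-db.example.com', username: 'deployer',
                            password: node['deploy']['db_password'],
                            database: 'ops')
    role = node['deploy']['role']

    begin
      # The UNIQUE key makes this INSERT fail if another node holds the lock.
      db.query("INSERT INTO deploy_locks (role, node, locked_at)
                VALUES ('#{db.escape(role)}', '#{db.escape(node.name)}', NOW())")
      have_lock = true
    rescue Mysql2::Error
      have_lock = false # someone else is deploying; retry on the next chef-client run
    end

    if have_lock
      begin
        # ... fetch the build from S3 and deploy it here ...
      ensure
        db.query("DELETE FROM deploy_locks WHERE role = '#{db.escape(role)}'")
      end
    end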

cjs
