Simultaneous software upgrade on multiple nodes


#1

Hi
Let’s imagine we have 50 servers running a distributed web application, and
we need to upgrade the software on all of them. The new software requires
downtime during the upgrade.
If the upgrade is performed simultaneously on all servers, the application
will be completely unavailable for that time. How can this be avoided?

My suggestion is to use one of the following ways:

  1. Manually run chef-client with knife on blocks of 10 servers, one block
    after another.
  2. Create a more complex cookbook that searches for how many servers are
    being upgraded at the moment and skips the software upgrade recipe if there
    are already 10 servers in the upgrade queue (a Chef attribute). At the end
    of the upgrade recipe, a notification starts the chef run on the next 10
    nodes if this was the last node of the block to finish (found by searching
    for node[:upgrade][:in_process]). In this case there is no manual work:
    just change the role version and run chef-client on all 50 nodes; all the
    upgrade logic lives in the recipes.
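
The bookkeeping behind option 2 could be sketched roughly as follows. This is a plain-Python simulation, not Chef recipe code: the `nodes` dict stands in for the result of a node search, the `in_process` flag mirrors node[:upgrade][:in_process] from the list above, and `MAX_IN_PROCESS`, `may_start_upgrade`, `next_batch`, and the `upgraded` flag are hypothetical names chosen for illustration.

```python
MAX_IN_PROCESS = 10  # assumed limit, taken from the "10 servers" in the post


def count_in_process(nodes):
    """Count nodes whose upgrade flag is set (stand-in for a Chef search)."""
    return sum(1 for attrs in nodes.values() if attrs.get("in_process"))


def may_start_upgrade(nodes, limit=MAX_IN_PROCESS):
    """A node may begin upgrading only while fewer than `limit` are upgrading."""
    return count_in_process(nodes) < limit


def next_batch(nodes, limit=MAX_IN_PROCESS):
    """Pick idle, not-yet-upgraded nodes to fill the free upgrade slots."""
    free = limit - count_in_process(nodes)
    waiting = sorted(
        name for name, attrs in nodes.items()
        if not attrs.get("in_process") and not attrs.get("upgraded")
    )
    return waiting[:max(free, 0)]


# Simulated fleet: 50 nodes, none upgraded yet (node names are made up).
nodes = {f"web{i:02d}": {"in_process": False, "upgraded": False}
         for i in range(1, 51)}

# First wave: ten nodes claim a slot, after which the gate closes.
for name in next_batch(nodes):
    nodes[name]["in_process"] = True
```

A real implementation would also have to cope with several nodes checking the count at the same moment, which this single-process sketch ignores.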

Which way is better? Are there other good ways to perform a partial
(rolling) upgrade?


Best regards,
Koldaev Anton


#2

I usually go the first way for simplicity: I keep a set of environments (production1…n) and assign nodes to them in a sensible fashion. Cookbooks are always pinned to a version, so I can just bump the version one environment at a time.
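
A staged bump like this could be sketched as below. It is only an illustration under assumptions: the cookbook name `myapp`, the version numbers, and the helper names are made up, and the dict follows the general shape of a Chef environment JSON document with a `cookbook_versions` pin.

```python
import json


def environment_json(name, cookbook, version):
    """Build an environment document pinning one cookbook to an exact version."""
    return {
        "name": name,
        "json_class": "Chef::Environment",
        "chef_type": "environment",
        "cookbook_versions": {cookbook: f"= {version}"},
    }


def staged_rollout(env_names, old, new):
    """Yield every environment's pin after each step, bumping one env at a time."""
    pins = {name: old for name in env_names}
    for name in env_names:
        pins[name] = new
        yield dict(pins)


envs = ["production1", "production2", "production3"]
steps = list(staged_rollout(envs, "1.0.0", "1.1.0"))

# Document for the first environment after its bump.
print(json.dumps(environment_json("production1", "myapp", "1.1.0"), indent=2))
```

Each generated document could be saved to a file and uploaded with something like `knife environment from file production1.json`, then verified before moving on to the next environment.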

I have never needed your second approach, but it’s an interesting idea and shouldn’t be too hard to implement. Still, I like the total control of the first one.


#3

On Oct 22, 2011, at 3:10 AM, Anton Koldaev wrote:

Let’s imagine we have 50 servers running a distributed web application, and we need to upgrade the software on all of them. The new software requires downtime during the upgrade.
If the upgrade is performed simultaneously on all servers, the application will be completely unavailable for that time. How can this be avoided?

I don’t have personal experience with it, but I’ve heard other Chef experts talk about using “rundeck” to handle the orchestration of things like this. I would be very interested to hear your thoughts on this software.


Brad Knowles bknowles@ihiji.com
SAGE Level IV, Chef Level 0.0.1


#4

If you use AWS EC2 and Trac, I wrote a Trac plugin that does multi-node serial orchestration for push-button deployments:

http://trac-hacks.org/wiki/CloudPlugin

It’s really just a thin web UI wrapper around pychef and boto.

  • Rob


#5

I’d prefer knife ssh with -C 1 (or 5, or whatever block size I need).


#6

I wrote a knife plugin called “batch” (gem install knife-batch) that is
basically knife ssh, with the ability to specify how many servers to
operate on at once and a sleep of however long you want between
batches.

knife batch "role:foo" "command" -B 10 -W 10 (runs the command on 10 servers
at once, with a wait of 10 seconds between batches)

It’s pretty handy in cases where the -C option to knife ssh won’t fit your
needs.
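
For readers who want to see the batch-and-wait idea spelled out, here is a minimal sketch in Python. It is not the plugin’s code: `rolling_run`, `batches`, and `knife_ssh_argv` are hypothetical helpers, the default runner simply shells out to plain `knife ssh` with a `name:... OR name:...` search query, and relying on the exit status of knife to detect remote failures is an assumption that may not hold for every knife version.

```python
import subprocess
import time


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def knife_ssh_argv(node_names, command):
    """Build a `knife ssh` call targeting the named nodes via a search query."""
    query = " OR ".join(f"name:{n}" for n in node_names)
    return ["knife", "ssh", query, command]


def run_with_knife(node_names, command):
    """Default runner: shell out to knife; raises if knife exits non-zero."""
    subprocess.run(knife_ssh_argv(node_names, command), check=True)


def rolling_run(node_names, command, batch_size, wait_seconds,
                run=run_with_knife, sleep=time.sleep):
    """Run `command` batch by batch, pausing between batches (not after the last)."""
    chunks = batches(node_names, batch_size)
    for index, chunk in enumerate(chunks):
        run(chunk, command)  # an exception here stops the rollout
        if index < len(chunks) - 1:
            sleep(wait_seconds)
    return len(chunks)
```

Calling `rolling_run(node_names, "sudo chef-client", 10, 10)` would then work through the list ten nodes at a time. Passing `run` and `sleep` as parameters keeps the batching logic testable without a Chef server.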

  • Ian


#7

I haven’t tried it myself, but saw this a while back and it looked promising:

It basically does rolling restarts, using a data bag to implement locking.

KC
