Simultaneous software upgrade on multiple nodes

Anton_Koldaev · October 22, 2011, 8:10am

Hi
Let’s imagine we have 50 servers running a distributed web application, and
we need to upgrade the software on all of them. The new software requires
downtime during the upgrade.
If the upgrade is performed simultaneously on all servers, the application
will be completely unavailable for that time. How can we avoid this?

My suggestion is to use one of the following approaches:

1. Manually run chef-client with knife on blocks of 10 servers, one block
   after another.
2. Create a more complex cookbook that searches for how many servers are
   being upgraded at the moment and skips the software upgrade recipe if
   there are already 10 servers in the upgrade queue (a Chef attribute). At
   the end of the upgrade recipe, a notification would start a chef run on
   the next 10 nodes if this node was the last one being upgraded (found by
   searching for node[:upgrade][:in_process]). In this case there is no
   manual work: just change the role version and run chef-client on all 50
   nodes; all the upgrade logic lives in the recipes.
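
The gating logic of the second approach can be sketched in plain Python, outside Chef. This is only an illustration under assumptions: the `Node` class, the function names, and the in-memory list are hypothetical stand-ins for a Chef search on node[:upgrade][:in_process]. A real search index is not updated instantly, so a check-then-set like this could let more than 10 nodes through in practice.

```python
from dataclasses import dataclass

MAX_IN_PROGRESS = 10  # block size from the post


@dataclass
class Node:
    name: str
    in_process: bool = False  # stands in for node[:upgrade][:in_process]
    upgraded: bool = False


def may_start_upgrade(nodes, limit=MAX_IN_PROGRESS):
    """Return True if fewer than `limit` nodes are currently upgrading."""
    return sum(1 for n in nodes if n.in_process) < limit


def start_upgrade(node, nodes, limit=MAX_IN_PROGRESS):
    """Mark `node` as upgrading if a slot is free; report whether it started."""
    if node.upgraded or node.in_process or not may_start_upgrade(nodes, limit):
        return False
    node.in_process = True
    return True


def finish_upgrade(node):
    """Clear the flag once the (simulated) upgrade recipe has completed."""
    node.in_process = False
    node.upgraded = True


def run_round(nodes, limit=MAX_IN_PROGRESS):
    """Simulate one chef-client run on every node; return names that upgraded."""
    started = [n for n in nodes if start_upgrade(n, nodes, limit)]
    for n in started:
        finish_upgrade(n)
    return [n.name for n in started]


# Hypothetical fleet of 50 nodes, upgraded over repeated rounds.
nodes = [Node(f"web{i:02d}") for i in range(1, 51)]
rounds = []
while not all(n.upgraded for n in nodes):
    rounds.append(run_round(nodes))
print([len(r) for r in rounds])
```

In this simulation each round stands for "run chef-client on all 50 nodes": only the nodes that find a free slot upgrade, the rest skip the recipe and try again on the next run.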

Which way is better? Or are there other good ways to perform a partial
upgrade?

–
Best regards,
Koldaev Anton

Andrea_Campi · October 22, 2011, 8:28am

I usually go the first way for simplicity: I keep a set of environments (production1..n) and assign nodes to them in a sensible fashion. Cookbooks are always pinned to a version, so I can just bump the version one environment at a time.

I have never needed your second approach, but it's an interesting idea and shouldn't be too hard to implement. Still, I like the total control of the first one.
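
A rough sketch of the environment-by-environment rollout, in Python with no Chef calls. The environment names follow the production1..n pattern above; the cookbook name `myapp`, the version strings, and the `healthy` check are hypothetical, and a plain dict stands in for the environments' cookbook pins.

```python
def rolling_bump(environments, cookbook, new_version, healthy):
    """Pin `cookbook` to `new_version` one environment at a time.

    `environments` maps an environment name to its cookbook pins.
    `healthy` is a caller-supplied check run after each bump; the rollout
    stops at the first environment that fails it.
    Returns the list of environments that were bumped.
    """
    bumped = []
    # sorted() is lexicographic, which is fine for this small example.
    for name in sorted(environments):
        environments[name][cookbook] = new_version
        bumped.append(name)
        if not healthy(name):
            break  # leave the remaining environments on the old version
    return bumped


# Hypothetical data: three environments, all pinned to 1.0.0.
envs = {f"production{i}": {"myapp": "1.0.0"} for i in range(1, 4)}

# Pretend production2 fails its check after the bump.
done = rolling_bump(envs, "myapp", "1.1.0", healthy=lambda env: env != "production2")
print(done)
print(envs["production3"]["myapp"])
```

Stopping at the first failed check is what gives the "total control": the environments that were not reached yet keep serving on the old pinned version.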

Brad_Knowles · October 22, 2011, 1:55pm

I don't have personal experience with it, but I've heard other Chef experts talk about using "rundeck" to handle the orchestration of things like this. I would be very interested to hear your thoughts on this software.

--
Brad Knowles bknowles@ihiji.com
SAGE Level IV, Chef Level 0.0.1

Rob_Guttman · October 22, 2011, 8:22pm

If you use AWS EC2 and Trac, I wrote a Trac plugin that does multi-node serial orchestration for push-button deployments:

http://trac-hacks.org/wiki/CloudPlugin

It's really just a thin web UI wrapper around pychef and boto.

Rob

Ranjib_Dey · October 24, 2011, 4:51am

I prefer knife ssh with -C 1 (or 5, or whatever block size I need).
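
The idea of a concurrency cap can be sketched in Python with a thread pool. This only illustrates the concept and is not how knife ssh is implemented: `upgrade` is a hypothetical stand-in for whatever runs chef-client on one host, and the host names are made up.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 5  # plays the role of the -C value

active = 0
peak = 0
lock = threading.Lock()


def upgrade(host):
    """Hypothetical per-host job; records how many jobs run at the same time."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # placeholder for the real work
    with lock:
        active -= 1
    return host


hosts = [f"web{i:02d}" for i in range(1, 21)]

# The pool never runs more than CONCURRENCY jobs at once; a new host starts
# as soon as any running one finishes.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    finished = list(pool.map(upgrade, hosts))

print(len(finished), peak <= CONCURRENCY)
```

Note that a cap like this is a sliding window rather than strict blocks: there is no pause between groups, which may matter if the upgraded servers need time to warm up.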

ianmmeyer · November 11, 2011, 12:16am

I wrote a knife plugin called "batch" (gem install knife-batch). It is
basically knife ssh with the ability to specify how many servers to
operate on at once, with a sleep of whatever length you want between
batches.

knife batch "role:foo" "command" -B 10 -W 10

This runs the command on 10 servers at once, with a wait of 10 seconds
between batches.

It's pretty handy in cases where the -C option to knife ssh won't fit your
needs.
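
The batch-and-wait pattern can be sketched in Python as follows. This is an illustration, not the plugin's code: `run_in_batches` and its parameters are hypothetical names, hosts within a batch are processed one after another here rather than in parallel, and skipping the pause after the final batch is an assumption of the sketch.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(hosts, run, batch_size=10, wait=10, sleep=time.sleep):
    """Call `run` on each host, one batch at a time, pausing between batches."""
    results = []
    groups = list(batches(hosts, batch_size))
    for index, group in enumerate(groups):
        results.append([run(host) for host in group])
        if index < len(groups) - 1:
            sleep(wait)  # no pause after the final batch
    return results


# Hypothetical host list; str.upper stands in for the real per-host command.
hosts = [f"web{i:02d}" for i in range(1, 26)]
pauses = []  # record the requested waits instead of actually sleeping
out = run_in_batches(hosts, run=str.upper, batch_size=10, wait=10, sleep=pauses.append)
print([len(group) for group in out], pauses)
```

Passing `sleep` as a parameter keeps the sketch quick to try; with the default `time.sleep` it would really wait 10 seconds between batches.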

Ian

KC_Braunschweig · November 11, 2011, 12:30am

I haven't tried it myself, but saw this a while back and it looked promising:

It basically does rolling restarts, using a data bag to implement locking.

KC
