High availability for Chef

Hi

As we move the configuration of more critical components to our Chef server, we need to implement some sort of high-availability solution for it.

Does anyone have experience with this?

a) Is replicating CouchDB and the Solr index viable? Or

b) Is it simpler to just re-create the database from the source of the cookbooks, roles, and data bags? In that case, how do we make sure that we don't have to re-register each node?


On Mon, Jul 4, 2011 at 7:30 AM, le.huy@ingdirect.es wrote:

As we move the configuration of more critical components to our Chef
server, we need to implement some sort of high-availability solution for
it.

Does anyone have experience with this?

a) Is replicating CouchDB and the Solr index viable? Or

It depends on what level of HA you require. For both CouchDB and Solr,
you can handle this with DRBD and passive failover for what is usually
sub-second takeover on failure. This is also good for things like
cookbook uploads (put /var/chef on the DRBD drives).

You can also make each component HA on its own, using the mechanisms
the upstream projects recommend: replication, in both of the cases you
list above.
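
As a rough illustration of the application-level route for CouchDB, the
sketch below builds a request for CouchDB's `_replicate` endpoint that
would start continuous replication to a standby. The host names, the port,
and the database name `chef` are assumptions made for this example, not
values from this thread. Solr replication is configured separately and is
not shown.

```python
import json
import urllib.request


def build_replication_request(primary, standby, db="chef"):
    """Build (but do not send) a POST to CouchDB's /_replicate endpoint.

    `primary` and `standby` are base URLs such as "http://chef-a:5984".
    The host names and the database name are assumptions for this sketch.
    """
    body = {
        "source": db,                 # local database on the primary
        "target": f"{standby}/{db}",  # same database name on the standby
        "continuous": True,           # keep replicating as documents change
    }
    return urllib.request.Request(
        url=f"{primary}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_replication_request(
        "http://chef-a.example.com:5984",  # hypothetical primary
        "http://chef-b.example.com:5984",  # hypothetical standby
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually start replication: urllib.request.urlopen(req)
```

The request is only built here, so the sketch can be inspected without a
running CouchDB. By default the target database has to exist on the
standby before replication starts.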

b) Is it simpler to just re-create the database from the source of the
cookbooks, roles, and data bags? In that case, how do we make sure that
we don't have to re-register each node?

You likely want to keep an active replica (either with app
replication or disk-level replication) for your failover. This helps
in DR, but that's what backups are for.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, Chief Product Officer
T: (206) 619-7151 E: adam@opscode.com

You may also (in addition to the HA setup) want to check out Spiceweasel (GitHub: mattray/spiceweasel), which generates Chef knife commands from a simple JSON or YAML file. It lets you reference all of your nodes, cookbooks, roles, etc. in a YAML file and bulk load from it.

Ant

I'd be interested in people's experience with the capacity of various
Chef server configurations. I'm working on the HA setup as well.
Active/passive failover seems relatively straightforward but the next
question is when that won't be good enough. Anyone have numbers they'd
be willing to post? Haven't started testing yet to see what the first
bottleneck will be. Hoping that it'll scale well enough to handle our
needs (only a couple thousand nodes) so that we won't have to do much
more than split off the data components on separate nodes. Not
terribly interested in mimicking the full multi-tier stack I imagine
the platform runs on if I can avoid it.

KC

On Tue, Jul 5, 2011 at 11:51 AM, KC Braunschweig
kcbraunschweig@gmail.com wrote:

I'd be interested in people's experience with the capacity of various
Chef server configurations. I'm working on the HA setup as well.
Active/passive failover seems relatively straightforward but the next
question is when that won't be good enough. Anyone have numbers they'd
be willing to post? Haven't started testing yet to see what the first
bottleneck will be. Hoping that it'll scale well enough to handle our
needs (only a couple thousand nodes) so that we won't have to do much
more than split off the data components on separate nodes. Not
terribly interested in mimicking the full multi-tier stack I imagine
the platform runs on if I can avoid it.

For a couple of thousand nodes, assuming you have control over the
hardware, you can scale active/passive vertically.

You need to be aware of, and control:

  1. The frequency of convergence on the chef-clients (how often they
    check in, and with what splay). This affects the number of API
    workers you run (we recommend running them on Unicorn) and the
    configuration of the upstream proxy (we use nginx). This is often
    bound by memory, although CPU can be a pain point.

  2. Make sure you are compacting the CouchDB database and the views on
    a regular basis.

  3. Solr may need tuning, specifically memory/heap/permgen space.

  4. RabbitMQ really only does reliable HA by persisting to disk.
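
To make the first two points concrete, here is a back-of-the-envelope
sketch. It estimates how many chef-client runs overlap on average, assuming
each client finishes a run and then sleeps for the interval plus a
uniformly random splay, and it builds requests for CouchDB's compaction
endpoints. All of the numbers, the host, the database name `chef`, and the
design document name are assumptions for illustration, not measurements or
values from this thread.

```python
import urllib.request


def average_concurrent_runs(nodes, interval_s, splay_s, run_duration_s):
    """Estimate how many chef-client runs are in progress on average.

    Assumes each client runs for `run_duration_s`, then sleeps `interval_s`
    plus a uniform random delay in [0, splay_s]. Each node is then busy for
    run_duration_s out of every cycle, so the expected concurrency is
    nodes * run_duration_s / cycle length.
    """
    cycle_s = run_duration_s + interval_s + splay_s / 2
    return nodes * run_duration_s / cycle_s


def build_compaction_requests(couch_url, db="chef", design_docs=()):
    """Build (but do not send) POSTs that compact a database and its views.

    Uses CouchDB's /{db}/_compact and /{db}/_compact/{design-doc} endpoints.
    The database and design document names are assumptions for this sketch.
    """
    paths = [f"/{db}/_compact"]
    paths += [f"/{db}/_compact/{name}" for name in design_docs]
    return [
        urllib.request.Request(
            url=couch_url + path,
            data=b"",
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for path in paths
    ]


if __name__ == "__main__":
    # Hypothetical fleet: 2000 nodes, 30 min interval, 5 min splay, 50 s runs.
    print(average_concurrent_runs(2000, 1800, 300, 50))
    for req in build_compaction_requests("http://localhost:5984",
                                         design_docs=["nodes"]):
        print(req.get_method(), req.full_url)
        # To actually trigger compaction: urllib.request.urlopen(req)
```

The concurrency figure is only a starting point for choosing a worker
count: a client does not talk to the API for the whole run, so real load
per worker should be measured. The compaction requests could be sent from
a scheduled job during a quiet period.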

Best,
Adam
