Idiom for adding a node to a Cluster


#1

I’m wondering what the chef idioms are for a certain problem that comes
up a lot when expanding a cluster. Let’s say I have some kind of
persistence store and I want to enable replication, or add a new node
with replication to an already running cluster. The replication will
communicate on some custom protocol, but in order to work, I have to
move stateful data, like db logs or whatever, from the master to a new
node. The master is “the master right now”, so it needs to be
dynamically discovered, and accessed via rsync or scp, say, to pull the
files down. I’m thinking for this I should just provision every cluster
node with a fixed static public/private key.


#2

I currently have a node attribute identifying the cluster each node is part
of. Every X hours a backup/snapshot is made, and a new node auto-detects and
imports the latest snapshot for its cluster by name
(foocluster-201310151830.tar.gz, for example), which includes all the info it
needs to join and catch up to the cluster.
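
Roughly it looks like this (the backup mount point and attribute names here are just illustrative, not the real cookbook):

    # Sketch: find the newest snapshot for this node's cluster and unpack it
    # once, before the service is brought up.
    cluster = node['mycluster']['name']                       # e.g. 'foocluster'
    latest  = Dir.glob("/mnt/backups/#{cluster}-*.tar.gz").sort.last

    execute 'import-latest-cluster-snapshot' do
      command "tar -xzf #{latest} -C /var/lib/mydb"
      creates '/var/lib/mydb/.restored'   # only import once
      not_if  { latest.nil? }
    end

    file '/var/lib/mydb/.restored' do
      action :create_if_missing
    end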

Hope that helps.

PS - this is really easy on AWS et al., a bit trickier on bare metal.

Graham



#3

I like it. I hadn’t thought of pushing the state data to some external location. That solves all the complexities around finding the master or dealing with down/hung nodes.

Can this process be made generic enough, though, to be fit for use in a community Opscode cookbook? Where can the “backup location” be defaulted to? It’d be nice if the chef server could always be used as a file server.



#4

I think it is not a recommended practice to make the Chef server a file server
and upload/download huge files as part of cookbooks.
I can think of two solutions:

  1. use an FTP server to procure files during each chef-client run
     (the remote_file resource can be used to fetch the archives; see the
     sketch after this list)
  2. use torrents to move files - I don’t know if there is a provider for
     that already
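
For the FTP option, something like this is what I mean (the URL attribute is just an example and would be set by the user):

    # Sketch: fetch the snapshot from an FTP/HTTP location with remote_file,
    # skipping quietly if the user never set a location.
    snapshot_url = node['mycluster']['snapshot_url']

    unless snapshot_url.nil? || snapshot_url.empty?
      remote_file '/var/cache/mycluster-snapshot.tar.gz' do
        source snapshot_url
        mode '0640'
      end
    end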

Regards
Aniket



#5

That makes sense, I suppose, but the problem is that there’s nothing that can be used as a reasonable default for the file drop location. The FTP solution certainly works, but how do you implement it in a community cookbook?



#6

Hmm, right! I think the only control the user would have is to provide the
FTP URL in an attribute.



#7

If you’re using something with autoclustering, adding node addresses to config files and rolling restarts is safe to do with chef. Use role/tag search to find nodes and populate host lists, notify service restart when the config file is changed, and bob’s your uncle. We do this for elasticsearch and hazelcast.
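
For the search-and-template part, the shape is roughly this (role, template and service names here are made up):

    # Sketch: build the host list from a role search, render it into the
    # config, and let the changed template trigger the restart.
    es_hosts = search(:node, "role:elasticsearch AND chef_environment:#{node.chef_environment}")
                 .map { |n| n['ipaddress'] }.sort

    template '/etc/elasticsearch/elasticsearch.yml' do
      source 'elasticsearch.yml.erb'
      variables(cluster_hosts: es_hosts)
      notifies :restart, 'service[elasticsearch]', :delayed
    end

    service 'elasticsearch' do
      action [:enable, :start]
    end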

If you’re setting up slaves/replicas, you can probably set up a run-once resource to bootstrap the server from a backup, authenticate itself with the master, and turn on replication. We did this for free-ipa (ldap).
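
The run-once piece can be as dumb as a guarded execute with a marker file (script names are made up):

    # Sketch: bootstrap the replica exactly once; the marker file keeps it
    # from running again on later chef-client runs.
    execute 'bootstrap-replica' do
      command '/usr/local/bin/restore_from_backup.sh && ' \
              '/usr/local/bin/enable_replication.sh'
      creates '/var/lib/mydb/.bootstrapped'
    end

    file '/var/lib/mydb/.bootstrapped' do
      action :create_if_missing
    end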

If you need something that needs stonith-style singletons, doesn’t handle split-brain on its own, &c, you need automation designed for that. Pacemaker & corosync are old school, things built on zookeeper, doozer or etcd are what the cool kids are doing. Everything I’ve heard of actually being in production does this out of band from chef, typically with a command-and-control tool like capistrano, fabric, mcollective, rundeck, &c. We use this approach for most things, notably mysql.

If you’re looking for a magic bullet, etcd-chef <https://github.com/coderanger/etcd-chef> has that hard-consistency in its data store and supports triggers on config changes, so (if you’re daring) that might meet your needs perfectly. I’m hoping to spike some work on it as soon as I migrate my entire company into Rackspace Chicago. But I doubt I’ll be doing a production master failover via etcd-chef in the immediate future.

In a broader sense I think our industry’s terms for clusters are lacking, and our tools suffer for it.

~j
info janitor @ simply measured



#8

How do you handle rolling restarts in elasticsearch? Do you guard against
restarts if the cluster is in the yellow state?



#9

With a splay I would just assume that the servers (this may be making an ass out of all of us) will not all restart at once, and also, only_if { curl_the_status_page_to_ensure_its_green } on the resource to update the configs, maybe?
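
Something along these lines is what I'm imagining, with the elasticsearch health check standing in for curl_the_status_page_to_ensure_its_green (paths are made up):

    # Sketch: only touch the config (and therefore fire the restart) when the
    # cluster is currently green.
    template '/etc/elasticsearch/elasticsearch.yml' do
      source 'elasticsearch.yml.erb'
      only_if %q{curl -fsS http://localhost:9200/_cluster/health | grep -q '"status":"green"'}
      notifies :restart, 'service[elasticsearch]', :delayed
    end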


Graham Christensen



#10

I think I’ve arrived at the point of your 2nd paragraph. It really just
comes down to how an opscode community cookbook sets a reasonable
default for the backup location. It’s easy enough to have the master set
up a cron to do a local backup and then copy those files up to this
location.

The problem is: where? I see three options:

  1. rsync backup to the chef server. It exists. Otherwise, yuck
  2. provision a node explicitly for this purpose. Also, yuck
  3. Use one of the db nodes for this purpose. Also yuck

Which one sucks least and would be accepted in a pull request? Or is
there another way? My assumption is anybody doing this for real would
immediately override the backup location with a “real” location that
doesn’t suck.



#11

I’ve been through this sort of thing quite a few times now, and tend to
embrace the Unix model for cookbooks more and more. That is, have
cookbooks that do one small thing, do it well, and are only loosely
coupled (if at all) to anything else in Chef.

So for what you are trying, Bryan, I think I’d end up with something like
this (assuming that your ‘cluster’ is a database cluster or some other type
of datastore):

  • A cookbook that manages the data layer itself (for example, Postgres)

    • This cookbook includes a recipe for making local backups/dumps of data
      in some universally understood format (for example, .tgz)
    • Local backup directory is an overridable attribute
  • A cookbook that manages backups

    • Likely creates a cronjob
    • Knows about a few types of remote data storage (NFS/FTP/Joyent
      Manta/Amazon S3/etc)
    • Remote backup location and protocol are overridable attributes (rough
      attribute sketch after this list)
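
The attribute surface might look something like this (cookbook and attribute names are illustrative):

    # mydb cookbook: where local dumps land
    default['mydb']['backup']['local_dir'] = '/var/backups/mydb'

    # backup cookbook: where and how dumps get shipped
    default['backup']['remote']['protocol'] = 'rsync'  # or 'ftp', 's3', 'manta', ...
    default['backup']['remote']['location'] = nil      # no sane default; the user overrides this
    default['backup']['schedule']           = '0 */6 * * *'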

By separating concerns, I’ve found that my infrastructure is much less
brittle and individual components can be improved without breaking anything
else (Unix model/service-orientedness).

(As an aside, I love using Manta for backups because I get to use all the
traditional Unix tools inside my object store (for things like
md5sum/gzcat/log analysis, etc.), and the CLI interfaces are super
lightweight and easy to employ inside of cookbooks (npm install manta).)

Blake



#12

That’s very reasonable – clearly an improvement. The postgres recipe
that sets up the replication will need to depend on the backup piece,
since we have to bootstrap a slave from the backup data, but I can
definitely see that piece being reusable quite beyond just databases.

BUT – it still just relocates my fundamental problem, because what
location would a backup cookbook use for its default file drop location?


#13

I think that’s pretty arbitrary, provided it’s attribute-driven. As long
as that is true (the attribute can even be blank), the user can
override.
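
In recipe terms, something like this (attribute and path names are illustrative):

    # Sketch: if the user never set a destination, just skip the upload step
    # instead of inventing a default.
    remote = node['backup']['remote']['location']

    execute 'ship-backup' do
      command "rsync -az /var/backups/mydb/ #{remote}"
      not_if  { remote.nil? || remote.to_s.empty? }
    end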



#14

Yeah, you probably want the entire chef run to fail if the cluster isn’t healthy. Just ignoring the restart action on the service is probably going to hide the error. Of course, you should have monitoring…
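
Something like this would make the failure loud instead of silent (the URL is illustrative, for the elasticsearch case):

    # Sketch: abort the whole chef-client run when the cluster is not green.
    ruby_block 'assert-cluster-green' do
      block do
        require 'net/http'
        require 'json'
        health = JSON.parse(Net::HTTP.get(URI('http://localhost:9200/_cluster/health')))
        raise "cluster status is #{health['status']}, refusing to continue" unless health['status'] == 'green'
      end
    end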



#15

If you can convince the community to agree on a One True Way to backup, I will buy you many beverages.

Say we’re talking about mysql, you’ve got issues:

creating the backup

  • mysqldump?
  • mysqlhotcopy?
  • percona xtrabackup?
  • filesystem (xfs, zfs) snapshot?

backup archive style

  • differential?
  • full?

frequency and rotation

transfer protocol & storage

  • fibre channel (lol)
  • iscsi
  • rsync
  • ftp
  • s3/swift

I can’t get my team to agree on what the best way is, much less the internet. So anyway, let me know when I can buy you those beverages.



#16

I feel your pain about agreeing on that stuff.

The Unix model allows us to ignore the whole mess by writing cookbooks that
only do very specific, well-defined things. So rather than one extremely
prescriptive backup cookbook/recipe, we can either have one cookbook per
backup tool (what I’m doing now) or one recipe per backup tool in a sort of
meta-backup-cookbook.
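
The meta-cookbook version could be as small as this (attribute and recipe names are illustrative):

    # Sketch: the default recipe just dispatches to whichever backup tool the
    # user picked via an attribute.
    backup_tool = node['backup']['tool']   # e.g. 'mysqldump', 'xtrabackup', 'fs_snapshot'

    include_recipe "backup::#{backup_tool}" unless backup_tool.nil?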
