Simultaneous node edits


#1

Hello.

I’d like to know how people avoid collisions when two simultaneous node edits take place. Some very common (and painful) examples from my practice:

  1. Two sysadmins more or less simultaneously issue ‘knife node edit’ on the same node. Whichever of them saves first will lose their edits when the second admin saves;
  2. You are sure to lose your edits if you try ‘knife node edit’ while chef-client is running on that node. And you can’t predict this even if you are the only person who manages nodes, because chef-client may run as a daemon or via cron.

I would be grateful for any advice or ideas.

–Thanks in advance, Daniil.


#2

We avoid it by abstaining: we don’t do individual node edits, ever. The
advantage of this is that nearly all of our chef server can be recreated by
an upload and client resync, with the exception of the client keys.

On Wed, Jul 1, 2015 at 3:23 PM, Daniil S daniil_sb@yahoo.com wrote:



#3

On Jul 1, 2015, at 12:23 PM, Daniil S daniil_sb@yahoo.com wrote:


The Chef API is full of these race conditions. Generally the best approach is to not use knife node edit and when you need to do some kind of bulk update, have a “talking stick” negotiated out of band (usually in a chat room). For non-node objects this is “fixed” because the authoritative source of truth is source control and the second user would get a merge conflict.

Node data should be set once at bootstrap and then never changed. If you need to bring up a new node, do that instead.
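For concreteness, “set once at bootstrap” usually means handing chef-client a first-boot JSON file via its -j/--json-attributes option: the node’s run_list and any node-specific attributes come from that file exactly once, and nobody edits the node afterwards. The file contents below are a made-up illustration:

```json
{
  "run_list": ["role[base]", "recipe[myapp]"],
  "myapp": { "port": 8080 }
}
```

A provisioning tool would write this file and run chef-client -j /etc/chef/first-boot.json on the node’s first boot.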

–Noah


#4

To complement this idea, treat nodes as Phoenix servers, i.e. if you need to change the run_list, kill the node and re-create it from scratch (this may involve backing up/detaching data prior to destruction and restoring/attaching it as part of bootstrap, in the case of data nodes, including databases, file-sharing nodes and so on).

That’s a hard path to set up, but it gives you a great disaster recovery plan in the end.

The other way is to ‘stage’ edits in scripts triggered at the end of a node’s run with a run handler…

Feel free to ask about whichever way sounds best to you, with some details on your current way of doing things; I don’t wish to write a novel about each option :wink:

On 1 Jul 2015 at 21:27, Noah Kantrowitz noah@coderanger.net wrote:



#5

RFC #045 (https://github.com/chef/chef-rfc/blob/master/rfc045-node_state_separation.md) is the closest thing we have to a plan for addressing node editing conflicts, but it still won’t help you in the case of two admins doing a knife node edit. It does address a lot of the use cases of #2.

#1 might be addressed better via something like a knife plugin that converted the task the admins were doing into something more similar to knife node run_list add 'role[foo]'. By getting it onto the command line you narrow the race window between reading the old value and writing the new value. You could also make it a bit more declarative/idempotent/convergent, so that two admins running it didn’t result in duplicated edits (unlike knife node run_list add).
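A minimal sketch of the idempotent flavor of that idea, using plain Ruby on a run_list represented as an array of strings (no Chef API involved; a real knife plugin would read the node object, apply this, and save it back):

```ruby
# Idempotent run-list add: appending an item that is already present is
# a no-op, so two admins issuing the same command don't duplicate it.
def run_list_add(run_list, item)
  run_list.include?(item) ? run_list : run_list + [item]
end

run_list = ["recipe[base]"]
run_list = run_list_add(run_list, "role[foo]")
run_list = run_list_add(run_list, "role[foo]")  # second admin, same edit: unchanged
```

Because the operation converges to the same result no matter how many times it runs, the window where two concurrent edits can corrupt the run_list shrinks to the single read-modify-write on the server.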

On 7/1/15 12:26 PM, Brian Hatfield wrote:


#6

On Wednesday, July 1, 2015 at 3:08 PM, Lamont Granquist wrote:

Policyfiles mitigate the problem by moving the really contentious part (the run_list) out of the node and into a different object which is shared between nodes. If you’re making heavy use of node-specific attributes it won’t help, but I’d recommend avoiding those as much as possible anyway.
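For concreteness, a minimal Policyfile sketch (the cookbook and policy names here are invented; this is the standard Policyfile.rb DSL, compiled with `chef install` and uploaded with `chef push`, so the run_list lives in source control and a shared policy object rather than in each node):

```ruby
# Policyfile.rb -- hypothetical example
name "webserver"
default_source :supermarket
run_list "recipe[base]", "recipe[myapp]"
cookbook "myapp", path: "cookbooks/myapp"
```

Two people changing the run_list now collide in git, where the conflict is visible and mergeable, instead of silently clobbering each other on the Chef server.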


Daniel DeLeo


#7

Hi,

This thread is interesting since I am trying to introduce a new usage
pattern with bare metal servers and chef-provisioning.

I have one Chef server and whenever we buy a new set of racks in one of our
datacenters, servers boot via PXE and automatically register in Chef in a
specific “firstboot” role.

For now we assign nodes manually with one node per file in a git repo and
#1 is avoided by using git to solve conflicts. #2 is not too bad since we
can re-sync from git if a Chef run happened during a modification.

I would like to stop managing Chef nodes as files and instead use the new ChefDK provision command with a special driver that would “pick” a node from the firstboot pool (so basically my “cloud” provider is the pool of firstboot nodes in Chef). Without concurrent access to Chef provisioning this seems doable: to allocate a node I can “tag” a firstboot node and delete it once the machine is ready.

But how do I do this with concurrent access? It seems almost impossible. And the way things are going, Policyfiles will tend towards a separate git repo and provisioning cookbook per policy, all sharing the same pool of firstboot nodes (for now I don’t use Policyfiles).

I wish I could have a way to “lock” a node or something like that.

Maxime
On Jul 2, 2015 1:51 AM, “Daniel DeLeo” dan@kallistec.com wrote:



#8

On 07/01/2015 11:40 PM, Maxime Brugidou wrote:


The way to do this is to make sure only one agent on your network can move the node between states. A simple design would be to have the node responsible for publishing that it’s done with firstboot by tagging itself, so that the node.save at the end of the run publishes the write. Then write a simple web endpoint which is your API to ‘allocate’ a new firstboot’ed node. By centralizing it you don’t have to worry about race conditions between multiple clients all trying to grab the same node at the same time. You can then write command line tools that talk to the endpoint you wrote to get a node, rather than wanting a distributed lock that the CLI commands can grab on the node object itself. If you’ve already got etcd or something similar that you’re using internally, you could probably use that instead.
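A rough sketch of the heart of such a centralized allocator, assuming the pool of firstboot node names has already been discovered (all names here are invented; a real version would sit behind the web endpoint and query the Chef server for nodes tagged as done with firstboot):

```ruby
# Single-process allocator: because every allocation goes through one
# mutex in one service, two concurrent clients can never be handed the
# same firstboot node -- no distributed lock required.
class FirstbootAllocator
  def initialize(pool)
    @pool = pool.dup   # e.g. ["node-01", "node-02"], from a Chef search
    @lock = Mutex.new
  end

  # Atomically removes and returns one node name, or nil when the pool
  # is exhausted.
  def allocate
    @lock.synchronize { @pool.shift }
  end
end

allocator = FirstbootAllocator.new(["node-01", "node-02"])
allocator.allocate  # => "node-01"
allocator.allocate  # => "node-02"
allocator.allocate  # => nil, pool exhausted
```

The CLI tools then just POST to the endpoint wrapping this object, and the endpoint is the only writer that moves a node out of the firstboot state.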


#9

On Mon, 06 Jul 2015 10:20:15 -0700
Lamont Granquist lamont@chef.io wrote:


Hello

Has anyone analysed a lock-free, optimistic approach using the ‘If-Match:’ HTTP header on the write stage?

The scenario would look like this:

  • Every object (node, role, environment) would have some token
    (could be a timestamp, or any other value changed on each edit)
  • When the user invokes ‘knife node edit’, the token is sent to the client
    (possibly in an HTTP header)
  • While the user edits the object, the value is stored somewhere
  • When the user sends the write API call to the server, it sends an ‘If-Match’
    header with the value received in the first call
    • If the token matches the current one, the object is updated
    • If the token does not match the current one, the update is rejected.

That won’t solve all the problems, but it will fix many of them with
(I suppose) less work and fewer changes. Such behaviour would also be a
non-breaking change.
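The steps above can be sketched as a small in-memory simulation (plain Ruby, no Chef or HTTP involved; class and method names are invented for illustration):

```ruby
# If-Match optimistic concurrency in miniature: each successful save
# bumps the object's token, and a write is accepted only when the
# caller presents the token it originally read (an HTTP server would
# answer 412 Precondition Failed otherwise).
class VersionedStore
  PreconditionFailed = Class.new(StandardError)

  def initialize(data)
    @data = data
    @token = 1
  end

  # GET: the object plus its current token (like an ETag header).
  def read
    [@data, @token]
  end

  # PUT with If-Match: rejected unless the token is still current.
  def write(new_data, if_match:)
    raise PreconditionFailed unless if_match == @token
    @data = new_data
    @token += 1
  end
end

store = VersionedStore.new("run_list" => ["role[a]"])
_, admin1_token = store.read
_, admin2_token = store.read

store.write({ "run_list" => ["role[a]", "role[b]"] }, if_match: admin1_token)
begin
  # Admin 2's stale write is rejected instead of silently clobbering admin 1's.
  store.write({ "run_list" => ["role[c]"] }, if_match: admin2_token)
rescue VersionedStore::PreconditionFailed
  # re-read the object, re-apply the edit, and retry
end
```

The losing admin gets an explicit failure and can re-edit, which is exactly the behaviour missing from knife node edit today.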

Regards,

Rafał Trójniak
WEB : http://trojniak.net/
m@il : rafal@trojniak.net
Jid : rafal@trojniak.net
GPG key-ID : 9A9A9E98
ABC8 83DF E717 6B76 CE49
BAFD 4F6F 854F 9A9A 9E98


#10

On Jul 7, 2015, at 2:33 PM, Rafał Trójniak rafal@trojniak.net wrote:


This was discussed way back at the first community summit, but no one has written the code. I’m sure it would be accepted if someone sent in a patch though.

–Noah