Changes to run_list (sometimes) won't stick!

I’m being bitten by a race-condition when altering the run-list for a node (using either the Web UI, or “knife”). Sometimes, the changes just don’t stick!

In particular, they don’t stick if I make the change while “chef-client” is running on the node. The problem appears to be that chef-client saves the Node object at the end of its run, reverting the run-list to its previous state.

Someone logged a bug about this a couple of weeks ago:

https://tickets.opscode.com/browse/CHEF-1812

but the problem appears to have existed for a while. I’m a little surprised that it hasn’t been reported as an issue before now! I guess most people aren’t in the habit of messing with run-lists that much … at this stage, we’re doing so mainly in automated tests for our provisioning scripts.

It seems weird to me that chef-client attempts to update the node’s own “configuration” data, rather than just its “status”. Should there be a better separation between the two in Chef::Node?

Can anyone suggest how I might (a) work around the problem, or (b) fix it, in Chef::Node or chef-client?

Is it likely that changes to other Node attributes would be similarly affected, i.e. reverted at the end of each chef-client run?


cheers,
Mike Williams

I'm the original reporter (though not reflected in JIRA; Nuo filed it on my
behalf) and yeah - it's a pretty ugly bug. I've had similar troubles with
data bags and roles edited using knife - I'll often edit a databag just to
view it in vim (I know, bad habit) and then forget about it, make a change
in another terminal to the same databag, and then eventually quit the first,
overwriting my changes. I believe the client needs to have some form of
locking or revision control to really make this problem go away - optimistic
locking wouldn't be too difficult to retrofit, but something with vector
clocks and an archived history would allow for more intelligent merges of
conflicting data than "last one wins."
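
For what it's worth, here's a rough sketch (in Ruby, against a reasonably current Chef::Node API) of what client-side optimistic locking could look like. The `_rev` attribute and the conflict check are entirely hypothetical - nothing like this exists in Chef today, and a real implementation would need the server to enforce it:

    # A hypothetical sketch of optimistic locking - NOT current Chef behaviour.
    # It assumes a made-up `_rev` normal attribute that every writer bumps on save.
    require 'chef/node'

    class Chef
      class Node
        def save_with_optimistic_lock
          server_copy = Chef::Node.load(name)
          if server_copy.normal['_rev'] != normal['_rev']
            raise "#{name} changed on the server since it was loaded; reload and retry"
          end
          normal['_rev'] = (normal['_rev'] || 0) + 1
          save
        end
      end
    end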

I've been working around it for now by extending the client run interval
(reducing the likelihood of a race condition) or just eliminating the
chef-client's daemon process entirely. Certain critical nodes I only run
chef in an attended fashion - thereby ensuring that I get a consistent
state. Unfortunately, not the greatest workaround. :frowning:
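
(For reference, the "widen the window" half of that workaround is just a couple of settings in client.rb - the numbers below are purely illustrative, not recommendations:)

    # /etc/chef/client.rb -- illustrative values only
    interval 3600   # seconds between daemonized chef-client runs
    splay    600    # random delay of up to this many seconds added to each run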

-Paul

On 04/11/2010, at 16:14, Paul Paradise wrote:

I believe the client needs to have some form of locking or revision control to really make this problem go away - optimistic locking wouldn't be too difficult to retrofit, but something with vector clocks and an archived history would allow for more intelligent merges of conflicting data than "last one wins."

How about the idea of teasing apart node "config" data from "status" data? The server would be the source of truth for the "config", and the node itself the source of truth for its "state".

Something like optimistic locking would still be desirable to prevent concurrent changes to the "config", but at least the node would be able to continue to communicate its "state" to the server.
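
To make that concrete, here's a very rough sketch (in Ruby, and definitely not part of Chef) of treating the server as the source of truth for the run-list: before the client saves the node, re-read the server's copy and adopt whatever run-list it has, so only "state" flows from node to server. Everything here beyond the documented Chef::Node API is my own invention:

    # A hypothetical monkey-patch - NOT shipped Chef code.
    require 'net/http'
    require 'chef/node'

    class Chef
      class Node
        alias_method :save_without_run_list_refresh, :save

        def save
          begin
            server_copy = Chef::Node.load(name)
            # Adopt whatever run-list the server currently has.
            run_list.run_list_items.replace(server_copy.run_list.run_list_items)
          rescue Net::HTTPServerException
            # The node may not exist on the server yet (first run); save as-is.
          end
          save_without_run_list_refresh
        end
      end
    end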

I've been working around it for now by extending the client run interval (reducing the likelihood of a race condition) or just eliminating the chef-client's daemon process entirely. Certain critical nodes I only run chef in an attended fashion - thereby ensuring that I get a consistent state. Unfortunately, not the greatest workaround. :frowning:

Thanks Paul. Not ideal, as you say. But it's nice to know it's a real problem, and not (just) my own stupidity.

--
cheers,
Mike Williams

Hi all,

I have a problem related to this topic. I have created a stackoverflow question [1]. Sorry for cross-posting:

Is there any method to get mutual exclusion in a chef node?

For example, if a process updates a node while chef-client is running, chef-client will overwrite the node data:

  1. chef-client gets the node data (state 1)
  2. Process A gets the node data (state 1)
  3. Process A updates the node data locally (state 2)
  4. Process A saves the node data (state 2)
  5. chef-client updates the node data locally (state 2*)
  6. chef-client saves the node data, which does not contain the changes from process A (state 2); chef-client overwrites the node data (state 2*)
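
The same sequence can be reproduced directly with the Chef Ruby API. A minimal sketch (the node name, the run-list change and the configuration path below are made up):

    # Reproducing the overwrite with two stale copies of the same node.
    require 'chef'

    Chef::Config.from_file('/etc/chef/client.rb')   # assumed configuration

    # Steps 1-2: both readers load the same copy (state 1).
    copy_for_chef_client = Chef::Node.load('node01.example.com')
    copy_for_process_a   = Chef::Node.load('node01.example.com')

    # Steps 3-4: process A changes and saves the node (state 2).
    copy_for_process_a.run_list << 'recipe[ntp]'
    copy_for_process_a.save

    # Steps 5-6: chef-client saves its stale copy, silently discarding
    # process A's change (state 2*).
    copy_for_chef_client.save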

We need this external modification because we have a nice UI on top of the Chef server to manage a lot of computers remotely, displaying them as a tree (similar to LDAP). An administrator can update the values used by the recipes from there. This project is open source: https://github.com/gecos-team/

Although we had a semaphore system, we have detected that with two or more simultaneous requests we can still have a concurrency problem: most of the time the system works, but sometimes it does not.

Please, could you suggest a solution?

References

  1. http://stackoverflow.com/questions/33419695/is-there-any-method-to-get-mutual-exclusion-in-a-chef-node

The best method I know of is having a cookbook for this; you can base it on whatever you wish (SCM, web page, DB) and update the run list accordingly.

As it’s a cookbook that changes the run list, there’s no race condition, because you’re not touching the node object from outside the node.
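
A minimal sketch of that approach, assuming (purely for illustration) that the desired run list is dropped onto the node as a JSON file by whatever external system you prefer:

    # cookbooks/run_list_sync/recipes/default.rb -- illustrative only.
    # /etc/chef/desired_run_list.json is a hypothetical file containing
    # something like ["role[base]", "recipe[ntp]"].
    require 'json'

    desired = JSON.parse(File.read('/etc/chef/desired_run_list.json'))

    ruby_block 'sync run list from local file' do
      block do
        node.run_list.run_list_items.clear
        desired.each { |item| node.run_list << item }
      end
      not_if { node.run_list.run_list_items.map(&:to_s) == desired }
    end

The change lands on the node’s own copy of the node object, so the save at the end of the run persists it instead of fighting it; the new run list takes effect on the next run.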

For the rest of the objects, I’m against a lock mechanism or any automatic ‘merge’ within Chef itself. Work with a proper SCM, handle the conflicts, and use something like Jenkins to update the Chef server with the latest commit of the data bags, environments, etc.
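
For example (a sketch only - the repository path and job wiring are up to you, and `knife upload` needs Chef 11+ or the knife-essentials gem):

    # Rakefile -- a minimal CI task (e.g. run by Jenkins after each commit)
    # that pushes the repository copy of these objects to the Chef server.
    task :push_to_chef_server do
      sh 'cd /srv/chef-repo && knife upload data_bags environments roles'
    end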

I have created a document with a lot of information about our problem:

Thanks!!

Sincerely,