Why are an environment's override_attributes not set until chef-client completes successfully?

Been trying to diagnose why something is happening in our environment,
and can’t seem to figure it out.

Here are some commands showing the issue:

root@o1r3.iad1 16:52:06:/opt/chef-repo# knife node show c5r3.int.iad1.attcompute.com
Node Name: c5r3.int.iad1.attcompute.com
Environment: production
FQDN: c5r3.int.iad1.attcompute.com
IP: 192.168.112.143
Run List: role[iad1], role[openstack-identity]
Roles: booted
Recipes: apt, ohai, chef-client::cron, users::sysadmins, openssh,
sudo, reboot-handler, networking, raid, sol
Platform: ubuntu 12.04
Tags:
root@o1r3.iad1 16:52:11:/opt/chef-repo# grep -n2 "admin_user" environments/production.json
98- "demo"
99- ],
100: "admin_user": "ksadmin",
101- "users": {
102- "ksadmin": {
root@o1r3.iad1 16:52:16:/opt/chef-repo# knife node show c5r3.int.iad1.attcompute.com -Fj | grep admin_user
root@o1r3.iad1 16:52:28:/opt/chef-repo#

From above, you can see that:

a) The node in question has the production environment
b) The production environment has the keystone:admin_user attribute set
to “ksadmin”, not “admin”

Unfortunately, when running chef-client on the node above, the override
"ksadmin" value set in the environment’s override_attributes does not
get used. Instead, the recipe’s default value of “admin” gets used,
which results in a failure.
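
The expectation described here (an environment override beating a cookbook default) can be sketched as a toy model in Python. This is only a simplification and not Chef's actual merge code: the four-level `PRECEDENCE` tuple, the `deep_merge` and `resolve` helpers, and the `region` key are assumptions made for illustration.

```python
import copy

# Lowest to highest precedence. A simplification of Chef's attribute levels.
PRECEDENCE = ("default", "normal", "override", "automatic")


def deep_merge(base, extra):
    """Return a new dict with `extra` merged over `base`, recursing into nested dicts."""
    result = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def resolve(levels):
    """Merge all levels in precedence order, so higher levels win on conflicts."""
    merged = {}
    for level in PRECEDENCE:
        merged = deep_merge(merged, levels.get(level, {}))
    return merged


levels = {
    # Cookbook default ("region" is a hypothetical extra key).
    "default": {"keystone": {"admin_user": "admin", "region": "iad1"}},
    # Environment override_attributes.
    "override": {"keystone": {"admin_user": "ksadmin"}},
}

node_attrs = resolve(levels)
print(node_attrs["keystone"]["admin_user"])  # ksadmin
print(node_attrs["keystone"]["region"])      # iad1
```

Under this model the recipe would see "ksadmin", which is the behaviour expected above.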

Here is the output of chef-client on the node:

http://paste.openstack.org/show/30418/

and here is the code that is calling the above:

http://paste.openstack.org/show/30419/

Why doesn’t a node’s environment override_attributes get merged to the
node’s attribute collection before chef-client runs? Why would
convergence need to occur before a node’s environment attributes are set
in the node’s attributes collection?

A more general question would be: data is data, why on Earth do Chef
searches return different data about a node depending on whether
chef-client has run successfully on a node or not? I can understand this
behaviour for automatic attributes from Ohai, but it does not make much
sense for any other attributes, IMHO.

Best,
-jay

On Monday, February 4, 2013 at 9:02 AM, Jay Pipes wrote:

> Why doesn't a node's environment override_attributes get merged to the
> node's attribute collection before chef-client runs? Why would
> convergence need to occur before a node's environment attributes are set
> in the node's attributes collection?

Most likely, you're seeing this:
http://docs.opscode.com/breaking_changes_chef_11.html#role-and-environment-attribute-changes
Chef 11 should work like you expect.

> A more general question would be: data is data, why on Earth do Chef
> searches return different data about a node depending on whether
> chef-client has run successfully on a node or not? I can understand this
> behaviour for automatic attributes from Ohai, but it does not make much
> sense for any other attributes, IMHO.

The chef-client run builds up attributes from environments, roles, and
cookbooks. Since you can change these at any time you like, Chef doesn't
save the node data until the Chef run has completed successfully. For
example, you could populate attributes from a ruby_block resource as the
very last step of a Chef run, and depend on those values being present
for search. If Chef saved and indexed your node data without these, your
nodes would disappear and reappear in searches.

That said, you can manually save the node data with node.save in a recipe at any time you like.
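
The save-on-success behaviour and the explicit save described above can be illustrated with a small toy model in Python. It is a sketch only, not Chef's implementation: `FakeServer`, `Node`, `run_client`, the step functions, and the `cluster` attribute are all hypothetical names invented for the example.

```python
class FakeServer:
    """Stands in for the server-side search index (hypothetical)."""

    def __init__(self):
        self.index = {}

    def search(self, key, value):
        # Only nodes whose data was saved can ever match a search.
        return sorted(n for n, attrs in self.index.items() if attrs.get(key) == value)


class Node:
    def __init__(self, name, server):
        self.name = name
        self.server = server
        self.attributes = {}

    def save(self):
        # Explicit save: push the current attributes to the index right now.
        self.server.index[self.name] = dict(self.attributes)


def run_client(node, steps):
    """Run each step in order; save the node only if every step succeeds."""
    try:
        for step in steps:
            step(node)
    except RuntimeError:
        return False  # failed run: nothing is saved implicitly
    node.save()
    return True


def set_cluster(node):
    node.attributes["cluster"] = "galera-a"


def failing_step(node):
    raise RuntimeError("package install failed")


def save_early(node):
    node.save()


server = FakeServer()

run_client(Node("db1", server), [set_cluster, failing_step])
print(server.search("cluster", "galera-a"))  # []

run_client(Node("db2", server), [set_cluster, save_early, failing_step])
print(server.search("cluster", "galera-a"))  # ['db2']

run_client(Node("db3", server), [set_cluster])
print(server.search("cluster", "galera-a"))  # ['db2', 'db3']
```

In this model db1 never appears in search because its run failed before the implicit save, while db2 appears despite failing because it saved explicitly midway through.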

> Best,
> -jay

HTH,

Dan DeLeo

On 02/04/2013 12:18 PM, Daniel DeLeo wrote:

> On Monday, February 4, 2013 at 9:02 AM, Jay Pipes wrote:
>
>> Why doesn't a node's environment override_attributes get merged to the
>> node's attribute collection before chef-client runs? Why would
>> convergence need to occur before a node's environment attributes are set
>> in the node's attributes collection?
>
> Most likely, you're seeing
> this: http://docs.opscode.com/breaking_changes_chef_11.html#role-and-environment-attribute-changes
> Chef 11 should work like you expect.

Yeah, unfortunately the unexpected release of Chef 11 last night borked
our deployment as cookbooks suddenly just started failing. John Dewey
posted to the list about it.

>> A more general question would be: data is data, why on Earth do Chef
>> searches return different data about a node depending on whether
>> chef-client has run successfully on a node or not? I can understand this
>> behaviour for automatic attributes from Ohai, but it does not make much
>> sense for any other attributes, IMHO.
>
> The chef client run builds up attributes from environments, roles, and
> cookbooks. Since you can change these at any time you like, Chef doesn't
> save the node data until the chef run has completed successfully. For
> example, you could populate attributes from a ruby_block resource as the
> very last step of a Chef run, and depend on those values being present
> for search. If Chef saves and indexes your node data without these, your
> nodes would disappear and reappear in searches.

This doesn't make a whole lot of functional sense to me. Like I said,
data is data. It doesn't magically change after a chef-client run (other
than automatic attributes like I mention above). It's just confusing to
have attribute values either be available or not available depending on
whether a chef-client run has completed successfully.

> That said, you can manually save the node data with node.save in a
> recipe at any time you like.

Yes, we ended up having to use node.save in a Galera cluster cookbook
we were using when we saw that Chef searches were entirely
non-deterministic if you were relying on attributes that would only be
set if the chef-client run had succeeded. It's a major design flaw, IMHO.

-jay

On 2/4/13 9:27 AM, "Jay Pipes" jaypipes@gmail.com wrote:

> Yes, we ended up having to use node.save in a Galera cluster cookbook
> we were using when we saw that Chef searches were entirely
> non-deterministic if you were relying on attributes that would only be
> set if the chef-client run had succeeded. It's a major design flaw, IMHO.

One man's design flaw is another man's safety feature. :)

(i.e., if chef didn't succeed, how do you know the system is correct, and
that you should rely on it?)

Love,
Adam

On 02/04/2013 01:41 PM, Adam Jacob wrote:

> On 2/4/13 9:27 AM, "Jay Pipes" jaypipes@gmail.com wrote:
>
>> Yes, we ended up having to use node.save in a Galera cluster cookbook
>> we were using when we saw that Chef searches were entirely
>> non-deterministic if you were relying on attributes that would only be
>> set if the chef-client run had succeeded. It's a major design flaw, IMHO.
>
> One man's design flaw is another man's safety feature. :)

Chef seems to have a lot of safety features.

> (i.e., if chef didn't succeed, how do you know the system is correct, and
> that you should rely on it?)

This is a silly statement. The purpose of an environment override
attribute is to describe the intended state of a system belonging to
that environment. Why on Earth would chef-client succeeding or not
succeeding change anything related to the intended state of a system
belonging to an environment?

It just doesn't make sense.

-jay

I would not want a new server whose deployment failed to show up in the
search, and I'm confused why you would...
(I'm also using Galera clusters.)
If one of the new DB nodes didn't deploy properly, and it saved, the
cookbook that updates the load balancer (runs somewhere else) would add
this failed machine to the pool.
If the machine failed to deploy, who knows how far it got... maybe MySQL is
running with half of the tables and broken sync? Yikes!

I would definitely call this a safety feature that I rely on (and I'm not
sure what you mean by "a lot of safety features")

On 2/4/13 10:47 AM, "Jay Pipes" jaypipes@gmail.com wrote:

> > (i.e., if chef didn't succeed, how do you know the system is correct, and
> > that you should rely on it?)
>
> This is a silly statement. The purpose of an environment override
> attribute is to describe the intended state of a system belonging to
> that environment. Why on Earth would chef-client succeeding or not
> succeeding change anything related to the intended state of a system
> belonging to an environment?
>
> It just doesn't make sense.

Contrast this with another feature of Chef, which says that a system is
correct when Chef has completed a run successfully. You can absolutely
control when node attributes appear as part of a Chef run - you do it
through calling node.save. By default, though, we assume that the entire
run list was required in order for you to feel comfortable relying on how
the machine behaves. I get that it does not match your use case, and that
it causes you frustration, at least somewhat from being surprised at the
behavior. I'm sorry for that, as we certainly didn't sit down and write
Chef with the explicit purpose of causing you pain.

I, the rest of Opscode, and everyone else on the list (I hope) want
nothing more than for you to be happy and successful - not just with Chef,
but with the entire scope of what you need to be happy in your job. I'm
sorry if my response was too short, and didn't do a good enough job of
conveying that.

Let me know how I can help,
Adam

I think there are two important things here:

  • Data that describes a system, but has nothing to do with the system's
    state
  • "Control data" -- or data points that indicate the current state of a
    node at a particular point in time

Data that describes a system might be something like "cluster name". The
value of this data doesn't change regardless of whether a node is
running chef-client, completed chef-client successfully, or whether it's
April 1st. Such data, IMHO, should not be mixed with control data, which
would necessarily change with the aforementioned events.

Unfortunately, this data is mixed together with control data points,
and because of this, if you try to query for the former type of data
(such as "get me all the nodes in Chef server with this cluster name")
using Chef search, the results returned are determined by the state of
the last chef-client run on the node. And this is what is, again, IMHO,
a design flaw.

Finally, I'd like to note that we found the cause of our issue below,
and it actually didn't have to do with node.save or any of that. It had
to do with the environment JSON file in question having two sections
named "keystone", and the latter section was overwriting the former
same-named section, essentially deleting the values set in the former
section.

Now if only there was a safety feature that notified us of such a problem!
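
For reference, a check of this kind can be sketched in a few lines of Python using the standard json module's object_pairs_hook. This is a minimal sketch, not part of Chef or knife: the `find_duplicate_keys` helper and the cut-down `SAMPLE` environment (including its `glance` and `debug` entries) are hypothetical and only mirror the shape of the problem described above.

```python
import json


def find_duplicate_keys(text):
    """Return keys that appear more than once within the same JSON object."""
    duplicates = []

    def check_pairs(pairs):
        seen = set()
        for key, _value in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        # Keep the json module's default behaviour: the last occurrence wins.
        return dict(pairs)

    json.loads(text, object_pairs_hook=check_pairs)
    return duplicates


# Hypothetical, cut-down environment file with two "keystone" sections.
SAMPLE = """
{
  "name": "production",
  "override_attributes": {
    "keystone": {"admin_user": "ksadmin"},
    "glance": {"verbose": true},
    "keystone": {"debug": false}
  }
}
"""

print(find_duplicate_keys(SAMPLE))  # ['keystone']

# A plain parse silently keeps only the later section, so admin_user is gone.
merged = json.loads(SAMPLE)
print(merged["override_attributes"]["keystone"])  # {'debug': False}
```

Running something like this over environments/production.json before uploading would flag the repeated section instead of letting the later one silently replace the earlier one.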

-jay

On 02/04/2013 02:01 PM, Jesse Campbell wrote:

> I would not want a new server whose deployment failed to show up in the
> search, and I'm confused why you would...
> (I'm also using Galera clusters.)
> If one of the new DB nodes didn't deploy properly, and it saved, the
> cookbook that updates the load balancer (runs somewhere else) would add
> this failed machine to the pool.
> If the machine failed to deploy, who knows how far it got... maybe MySQL
> is running with half of the tables and broken sync? Yikes!
>
> I would definitely call this a safety feature that I rely on (and I'm
> not sure what you mean by "a lot of safety features")

But if you do that (post a node's data back to the server even if the
convergence fails), you are making a failed node's data available to other
nodes, and they might use it to build further configs, which is wrong; it's
a wrong assumption. When I query the Chef server for an attribute (say
mysql_server_version), I assume that the node has successfully converged.
The fact that this attribute is available proves this (unless you do
something fishy). But if we follow your logic, we have to do an additional
check beyond search just to ensure that it's actually a working node,
don't we?
