Node attributes ignored and defaults used only once. Pointers to debug

Hi!

I’m using a hosted chef server (managed by chef.io). Once, and only once, a recipe used the default attributes instead of the node attributes. I can’t understand why that happened, and since it ended up causing downtime, I want to understand the cause and how to prevent it.

The default attributes are defined like this in /attributes/default.rb:

  ...

And the template that uses these attributes, /templates/default/percona.cnf.erb, looks like this:

<% node['asd_mysql']['settings'].sort.each do |key, value| %>
<% next unless value -%>
<%= key %><%=
 case value
 when TrueClass then ''
 else " = #{value}"
 end
%>
<% end %>

Before the chef run that reset the attributes to their default values, this template was rendered with the node’s attribute values in the recipe. It had worked that way for years, and no change was made to the recipe. Then, on a single chef run, the node’s attributes were changed to their default values.

This happened using Chef client: 11.14.2

After this problematic chef run, the node’s attributes were set to the default values in the recipe. Before it, the node used its own configured attributes, which differed from the default values in the recipe.

I looked through the chef code and found commit e9f303b9f288c03baee9d8b40cca58838ff3c3a4, merged in a newer version than the one I’m running, that might be related. But I’m not sure that code path is hit when looking up the node attributes. If it is, maybe the chef server returned a 5xx (or there was some network error), the client did not handle the error correctly, and it ended up using the default attributes? If that is the case, updating the chef client to that version might help prevent it.

Does anyone know whether that commit is involved in the call chain for looking up the node attributes, or have any other clue about where to look to debug an issue that happened only once (we have stopped chef on the machine for now)?

Thanks a lot!
Rodrigo

It’s quite hard to spot what went wrong with what you’ve given us…

We have no clue which cookbook you’re using. You mention attributes set in the recipe, which would override the attribute file at the same precedence level because recipes are evaluated later, but you give no clue about how they are set at all…

Your attribute file should not compile as is: there’s no precedence level before the attribute hash, so we can’t rely on it either.

It’s not clear whether the chef client version changed before this bug appeared, and last but not least, chef 11.14 is quite old now.

All in all, there’s no reason a chef server would make this happen unless someone edited the node attributes inside the chef server (this would set the attributes at normal level which has precedence over default level).
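
To make the precedence point concrete, here is a minimal plain-Ruby sketch. It is a simplified model of two precedence levels, not Chef’s actual attribute implementation; the deep_merge helper and the node-level socket path are assumptions made up for illustration.

```ruby
# Simplified model of two Chef precedence levels (default < normal).
# NOT Chef's implementation; it only illustrates the merge order.
def deep_merge(low, high)
  low.merge(high) do |_key, low_val, high_val|
    if low_val.is_a?(Hash) && high_val.is_a?(Hash)
      deep_merge(low_val, high_val)
    else
      high_val # the higher-precedence level wins on conflicts
    end
  end
end

# Values an attributes file would set at the default level.
default_attrs = {
  'our_mysql' => {
    'settings' => { 'user' => 'mysql', 'socket' => '/var/lib/mysql/mysql.sock' }
  }
}

# Values stored on the node object (hypothetical sample value).
normal_attrs = {
  'our_mysql' => { 'settings' => { 'socket' => '/data/mysql/mysql.sock' } }
}

with_node_attrs    = deep_merge(default_attrs, normal_attrs)
without_node_attrs = deep_merge(default_attrs, {}) # normal level emptied

puts with_node_attrs['our_mysql']['settings']['socket']
puts without_node_attrs['our_mysql']['settings']['socket']
```

With the normal level present, the node’s own socket path is used; once the normal level is empty, the same lookup silently falls back to the default.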

Could you link to the cookbook used and give some more details (even in a long post) about your setup?

Edit after re-reading: could the node object have been “wiped” from the chef server?

Thanks for the answer!

The cookbook is this: https://github.com/nomadium/mysql, branch COOK-4689. It is basically upstream mysql (https://github.com/chef-cookbooks/mysql) with a patch from a former coworker.

We then have our own mysql cookbook, which depends on that one, and its server recipe does:

our_mysql_service 'default' do
  action :create
end

Just that. It’s that simple.

The attributes file sets some mysql configs, like:

default['our_mysql']['settings']['user']                           = 'mysql'
default['our_mysql']['settings']['default-storage-engine']         = 'InnoDB'
default['our_mysql']['settings']['socket']                         = '/var/lib/mysql/mysql.sock'
default['our_mysql']['settings']['pid-file']                       = '/var/lib/mysql/mysql.pid'

And the template that uses them looks like this:

[mysqld]
<% node['our_mysql']['settings'].sort.each do |key, value| %>
<% next unless value -%>
<%= key %><%=
 case value
 when TrueClass then ''
 else " = #{value}"
 end
%>
<% end %>

We haven’t changed the chef client version in years, and it always worked. Just one time, the default attributes were used in the template above instead of the ones defined on the node (as with knife node edit …).

After this run, the node attributes (as seen with knife edit …) were not there. And I can see in the chef server management console that chef was running periodically and exited successfully before this. After this run, we stopped chef on that server and fixed the config manually.

And no, the node is present in the chef.io interface, and the node name hasn’t changed.
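
To show what the template does with whatever settings hash it receives, here is a small standalone sketch that renders it with Ruby’s stdlib ERB. Chef has its own rendering pipeline, so this is only an approximation; the skip-name-resolve and log-bin entries are hypothetical, added to exercise the true and false branches.

```ruby
require 'erb'

# Same template text as above. The quoted heredoc keeps #{value} literal
# so that ERB, not the heredoc, evaluates it.
template = <<~'TPL'
  [mysqld]
  <% node['our_mysql']['settings'].sort.each do |key, value| %>
  <% next unless value -%>
  <%= key %><%=
   case value
   when TrueClass then ''
   else " = #{value}"
   end
  %>
  <% end %>
TPL

# Plain hash standing in for the node object (sample values).
node = {
  'our_mysql' => {
    'settings' => {
      'user' => 'mysql',
      'default-storage-engine' => 'InnoDB',
      'skip-name-resolve' => true, # hypothetical: true renders the bare key
      'log-bin' => false           # hypothetical: falsy values are skipped
    }
  }
}

# trim_mode '-' is needed so that the "-%>" tag in the template parses.
rendered = ERB.new(template, trim_mode: '-').result_with_hash(node: node)

# Drop the blank lines left behind by the control tags.
lines = rendered.lines.map(&:strip).reject(&:empty?)
puts lines
```

The template has no notion of where the values came from: it renders the merged settings hash, so if the node-level values are missing, the defaults are written out without any error.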

Do you have any idea on how to further debug this? Or what could have happened?

I can’t think of any explanation other than the chef server, because of a network problem, an intermittent bug, or something similar, returning a response that made the chef client think there were no node attributes.

Thanks again!
Rodrigo

What I mean is: could something in your infrastructure have wiped the node object on the chef server? If so, the node would have rebuilt it on the next run with default values, since those were the only values known.

From what you describe, a bad script could have wiped the normal-level attributes, or someone made a mistake.

As you’re using hosted chef, opening a ticket sounds like the only way to find out what happened in your organization, since you don’t have the server logs to investigate who acted on this node before the failure.
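
As a rough illustration of that scenario, here is a toy simulation in plain Ruby. FakeServer, client_run, the node name db1 and the sample paths are all hypothetical; this models the idea of “load the node, recreate it if missing”, not the real chef client code.

```ruby
# Toy stand-in for the server-side node store (hypothetical).
class FakeServer
  def initialize
    @nodes = {}
  end

  def save(name, normal_attrs)
    @nodes[name] = normal_attrs
  end

  def delete(name)
    @nodes.delete(name)
  end

  # Returns the stored normal attributes, or nil if the node object is gone.
  def fetch(name)
    @nodes[name]
  end
end

# One simulated client run: load the node, recreate it empty if missing,
# then compute the effective settings (normal wins over default).
def client_run(server, name, defaults)
  normal = server.fetch(name)
  if normal.nil?
    normal = {} # rebuilt from scratch: no normal attributes survive
    server.save(name, normal)
  end
  defaults.merge(normal)
end

defaults = { 'socket' => '/var/lib/mysql/mysql.sock' }
server = FakeServer.new
server.save('db1', { 'socket' => '/data/mysql/mysql.sock' })

before = client_run(server, 'db1', defaults)
server.delete('db1') # something wipes the node object
after = client_run(server, 'db1', defaults)

puts before['socket']
puts after['socket']
```

After the simulated wipe, the node exists again under the same name, but the run converges on the defaults.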

@Tensibai ohh, I see. No, we have reviewed everything and nothing changes the attributes. I thought of that too, that maybe it happens in some rarely used error path, but the recipe is trivial and that doesn’t seem to be the case.

Thanks a lot for your time and tips. I’ve opened a ticket with our hosted chef server provider. Thanks again!

The mistake could have happened entirely outside of a chef run; for example, a bad knife exec command run on a too large search could be the root cause.

Can you elaborate on how “a too large search” could cause the node object to get removed? That’s somewhat concerning.

The search in itself is OK. It could have been a knife exec command aimed at certain nodes through a search whose terms also matched this specific node.

It could be any number of human errors; this one is just a guess.

I have no idea about your workflows or your team; I just see no way the server would have wiped the attributes by itself.
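
A toy simulation of that kind of mistake, in plain Ruby rather than real knife exec code. The node names, roles, and the bulk_edit helper are all made up for illustration.

```ruby
# Toy inventory: node name => role and normal attributes (all hypothetical).
nodes = {
  'db1' => {
    'role' => 'mysql',
    'normal' => { 'our_mysql' => { 'settings' => { 'socket' => '/data/mysql/mysql.sock' } } }
  },
  'web1' => {
    'role' => 'web',
    'normal' => { 'our_mysql' => { 'settings' => {} } }
  }
}

# Stand-in for a search-driven bulk edit: every node the matcher selects
# gets its our_mysql normal attributes removed.
def bulk_edit(nodes, &matcher)
  matched = nodes.select { |_name, node| matcher.call(node) }
  matched.each_value { |node| node['normal'].delete('our_mysql') }
  matched.keys
end

# Intended: clean up stray mysql attributes on web nodes only.
# Actual: the predicate matches every node that has the attribute at all.
touched = bulk_edit(nodes) { |node| node['normal'].key?('our_mysql') }

puts touched.inspect
puts nodes['db1']['normal'].inspect
```

The search itself did nothing wrong; the edit applied to its results simply reached one node more than intended.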

Ok, that clarifies things for me and reassures me on that front… thanks :slight_smile: