File created from template has broken (binary) and incorrect data in it after Chef run


#1

Hi, Chefs –

After a Chef run on one of my machines last night, a text file on the
system, created from a template, had binary data in it. On the next Chef
run, everything was fine. I’m wondering if this is something that anyone
else has experienced. (This is also filed as
http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a wide
net since it’s not an easily searchable issue.)

Basic details:
Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12

Many more details: read on!

I have a resource definition called “nrpe_service”. It’s used to define
services for NRPE (Nagios Remote Plugin framework). In this case, it’s
defined like this:
nrpe_service “check_couch_two_way_replication” do
command "check_couch_replication.py --database ARG1 --source --dest"
end

The definition does some sanity checking (making sure that needed
attributes are defined, etc), then does this if the sanity check works out:
template “#{node[:nrpe][:confdir]}/#{params[:name]}.cfg” do
cookbook "nagios"
source "nrpe_service.cfg.erb"
owner "root"
group "root"
mode "0644"
variables({:params => params})
end

So far, so good.

The template itself is pretty simple (though, wow, what was I thinking when
I put some of the modification logic in the template instead of the
resource definiton? That was silly…):
<% if @params[:command][0] == ‘/’ %>
<% cmd = @params[:command] %>
<% else %>
<% cmd = “#{@node[:nagios][:plugin_path]}/#{@params[:command]}” %>
<% end %>
command[<%= @params[:name] -%>]=<%= cmd %>

The resulting service definition looks like this:
[imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg
command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py
–database ARG1 --source --dest
[imarlier nrpe.d]$

This has been in place, and has been working just fine, for months now.
Then last night, chef-client ran (a scheduled run) and the file turned into
this:
[imarlier nrpe.d] cat check_couch_two_way_replication.cfg.chef-20120604182416 command[1?en way_replication]=/usr/lib64/nagios/plugins/["uid", 10003] [imarlier nrpe.d]

This is…wrong.

The next time that Chef ran, everything went back to how it should be.
There were no alterations to the Chef server at this time, and no local
modifications on the client either.

It’s interesting to me that the things that appear to be corrupted are
parameter values that should be derived entirely locally – they’re
elements of the resource definition, and not parameters that are being
pulled down from the server or anything like that.

Has anyone seen similar behavior? Given that this isn’t something that
seems to happen with any frequency – and it’s not something that I can
reproduce at will – anyone have ideas as to what I might do to debug
this? Are there any known data corruption bugs in Chef? In Ruby? I’m
kind of at a loss, but seeing something like this is pretty terrifying.

Thanks,

  • Ian


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


#2

Please show full debug logs for the period and/or the contents of the
Chef File Backup Cache and/or logging statements regarding moving
files into the Backup Cache

This is pretty heavy stuff… reminds me of a bug long ago, on a planet
far away in another configuration management universe.

Regards,

AJ

On 6 June 2012 02:13, Ian Marlier imarlier@brightcove.com wrote:

Hi, Chefs –

After a Chef run on one of my machines last night, a text file on the
system, created from a template, had binary data in it. On the next Chef
run, everything was fine. I’m wondering if this is something that anyone
else has experienced. (This is also filed as
http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a wide net
since it’s not an easily searchable issue.)

Basic details:
Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12

Many more details: read on!

I have a resource definition called “nrpe_service”. It’s used to define
services for NRPE (Nagios Remote Plugin framework). In this case, it’s
defined like this:
nrpe_service “check_couch_two_way_replication” do
command "check_couch_replication.py --database ARG1 --source --dest"
end

The definition does some sanity checking (making sure that needed attributes
are defined, etc), then does this if the sanity check works out:
template “#{node[:nrpe][:confdir]}/#{params[:name]}.cfg” do
cookbook "nagios"
source "nrpe_service.cfg.erb"
owner "root"
group "root"
mode "0644"
variables({:params => params})
end

So far, so good.

The template itself is pretty simple (though, wow, what was I thinking when
I put some of the modification logic in the template instead of the resource
definiton? That was silly…):
<% if @params[:command][0] == ‘/’ %>
<% cmd = @params[:command] %>
<% else %>
<% cmd = “#{@node[:nagios][:plugin_path]}/#{@params[:command]}” %>
<% end %>
command[<%= @params[:name] -%>]=<%= cmd %>

The resulting service definition looks like this:
[imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg
command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py
–database ARG1 --source --dest
[imarlier nrpe.d]$

This has been in place, and has been working just fine, for months now.
Then last night, chef-client ran (a scheduled run) and the file turned into
this:
[imarlier nrpe.d] cat check_couch_two_way_replication.cfg.chef-20120604182416 command[1?en way_replication]=/usr/lib64/nagios/plugins/["uid", 10003] [imarlier nrpe.d]

This is…wrong.

The next time that Chef ran, everything went back to how it should be.
There were no alterations to the Chef server at this time, and no local
modifications on the client either.

It’s interesting to me that the things that appear to be corrupted are
parameter values that should be derived entirely locally – they’re elements
of the resource definition, and not parameters that are being pulled down
from the server or anything like that.

Has anyone seen similar behavior? Given that this isn’t something that
seems to happen with any frequency – and it’s not something that I can
reproduce at will – anyone have ideas as to what I might do to debug this?
Are there any known data corruption bugs in Chef? In Ruby? I’m kind of at
a loss, but seeing something like this is pretty terrifying.

Thanks,

  • Ian


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


#3

I don’t have debug logs for the time that this issue happened – I don’t
run with debug logging enabled for the most part (though if this happens
again, I might start!).

Here’s the INFO level logs from the period where the issue happened,
trimmed a bit. I’ve included the log lines from initialization (showing
that no changed cookbooks/recipes were downloaded), and the log lines that
show the operations on this particular resource. The actions at 17:53 are
a good copy of the file being moved out of the way and a broken copy being
put into place. The actions at 18:24 are the broken file being moved out
of the way, and a good copy being put back in place.

[Mon, 04 Jun 2012 17:52:56 -0700] INFO: *** Chef 0.10.10 ***
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List expands to [chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client, users::root, users::ubuntu, openssh,
dns::hosts, dns::resolv.conf, dns::nsupdate, hostname, motd, bash,
yum::client, apt::brightcove, apt, python, diamond, zip, sudoers,
nagios::nrpe, nagios::send_nsca, ntpd, genders, sysstat, syslog, xml,
postfix, tmpreaper, xen::guest, mysql::default, mysql::nrpe,
pacemaker::default, mysql::pacemaker]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Starting Chef Run for fijidb01.fal
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 17:53:06 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef,
chef-client, chef_handler, couchdb, daemontools, dell, diamond, dns,
erlang, fiji, genders, git, gluster, haproxy, hostname,
ios_resque_endpoint, java, logrotate, modprobe, mongodb, motd, mysql,
nagios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn,
pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow,
runit, stunnel, sudoers, sysctl, syslog, sysstat, tmpreaper, ucspi-tcp,
unicorn, users, varnish, xen, xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)

[Mon, 04 Jun 2012 17:53:24 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
backed up to
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
mode changed to 644
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
updated content

[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List expands to [chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client,
users::root, users::ubuntu, openssh, dns::hosts, dns::resolv.conf,
dns::nsupdate, hostname, motd, bash, yum::client, apt::brightcove, apt,
python, diamond, zip, sudoers, nagios::nrpe, nagios::send_nsca, ntpd,
genders, sysstat, syslog, xml, postfix, tmpreaper, xen::guest,
mysql::default, mysql::nrpe, pacemaker::default, mysql::pacemaker][Mon, 04
Jun 2012 18:23:39 -0700] INFO: Starting Chef Run for fijidb01.fal[Mon, 04
Jun 2012 18:23:39 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef,
chef-client, chef_handle
r, couchdb, daemontools, dell, diamond, dns, erlang, fiji, genders, git,
gluster, haproxy, hostname, ios_resque_endpoint, java, logrotate, modprobe,
mongodb, motd, mysql, na
gios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn,
pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow,
runit, stunnel, sudoers, sysct
l, syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen,
xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)

[Mon, 04 Jun 2012 18:24:16 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
backed up to
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
mode changed to 644
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
updated content
[Mon, 04 Jun 2012 18:54:52 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 19:56:16 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:27:07 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:58:04 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)

Copies of this file that exist in the Chef file backup cache:
[imarlier ~] find /var/chef/backup/ -type f -name "check_couch_two_way*" -ls 11108655 4 -rw-r--r-- 1 root root 82 Jun 4 17:53 /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416 11108654 4 -rw-r--r-- 1 root root 128 Mar 17 13:19 /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324 [imarlier ~]

The ticket that I created at
http://tickets.opscode.com/browse/CHEF-3179has copies of the broken
file (as moved out of the way at 18:24) and the
good file that was created at 18:24. I can also upload a copy of the file
that existed prior to the 17:53 chef-client run, but, well:

[imarlier ~] diff /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324 /usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg [imarlier ~]

I’m like you in having flashbacks to other bugs in the config management
world, which is why I think that this is a pretty critical issue to track
down. This is really way too reminiscent of
1010http://projects.puppetlabs.com/issues/1010.
There appears to be some scenario that leads to data corruption within the
context of a chef-client run, and it appears to be independent of
chef-server. If the data that was being corrupted was node attribute data,
that would be one thing. But this is resource attributes, and should have
been set at the time that the template resource was created and added to
the run list. Shoot, the template knew it’s name when the resource was
created (because it put the broken file at the right path), but it didn’t
know it’s name when it wrote the template contents (because the command
name is @params[:name])!

On Tue, Jun 5, 2012 at 6:26 PM, AJ Christensen aj@junglist.gen.nz wrote:

Please show full debug logs for the period and/or the contents of the
Chef File Backup Cache and/or logging statements regarding moving
files into the Backup Cache

This is pretty heavy stuff… reminds me of a bug long ago, on a planet
far away in another configuration management universe.

Regards,

AJ

On 6 June 2012 02:13, Ian Marlier imarlier@brightcove.com wrote:

Hi, Chefs –

After a Chef run on one of my machines last night, a text file on the
system, created from a template, had binary data in it. On the next Chef
run, everything was fine. I’m wondering if this is something that anyone
else has experienced. (This is also filed as
http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a
wide net
since it’s not an easily searchable issue.)

Basic details:
Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12

Many more details: read on!

I have a resource definition called “nrpe_service”. It’s used to define
services for NRPE (Nagios Remote Plugin framework). In this case, it’s
defined like this:
nrpe_service “check_couch_two_way_replication” do
command "check_couch_replication.py --database ARG1 --source --dest"
end

The definition does some sanity checking (making sure that needed
attributes
are defined, etc), then does this if the sanity check works out:
template “#{node[:nrpe][:confdir]}/#{params[:name]}.cfg” do
cookbook "nagios"
source "nrpe_service.cfg.erb"
owner "root"
group "root"
mode "0644"
variables({:params => params})
end

So far, so good.

The template itself is pretty simple (though, wow, what was I thinking
when
I put some of the modification logic in the template instead of the
resource
definiton? That was silly…):
<% if @params[:command][0] == ‘/’ %>
<% cmd = @params[:command] %>
<% else %>
<% cmd = “#{@node[:nagios][:plugin_path]}/#{@params[:command]}” %>
<% end %>
command[<%= @params[:name] -%>]=<%= cmd %>

The resulting service definition looks like this:
[imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg

command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py

–database ARG1 --source --dest
[imarlier nrpe.d]$

This has been in place, and has been working just fine, for months now.
Then last night, chef-client ran (a scheduled run) and the file turned
into
this:
[imarlier nrpe.d] cat check_couch_two_way_replication.cfg.chef-20120604182416 command[1?en way_replication]=/usr/lib64/nagios/plugins/["uid", 10003] [imarlier nrpe.d]

This is…wrong.

The next time that Chef ran, everything went back to how it should be.
There were no alterations to the Chef server at this time, and no local
modifications on the client either.

It’s interesting to me that the things that appear to be corrupted are
parameter values that should be derived entirely locally – they’re
elements
of the resource definition, and not parameters that are being pulled down
from the server or anything like that.

Has anyone seen similar behavior? Given that this isn’t something that
seems to happen with any frequency – and it’s not something that I can
reproduce at will – anyone have ideas as to what I might do to debug
this?
Are there any known data corruption bugs in Chef? In Ruby? I’m kind of
at
a loss, but seeing something like this is pretty terrifying.

Thanks,

  • Ian


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


#4

I’ve allocated Heavy Water engineering resources in attempt to
reproduce and track this down. One of our installations saw this
earlier today.

Please do provide any update or additional information, especially
anyone else reading this who is able to reproduce or has seen similar.

Thankfully the diagnostics will be much easier than #1010.

Cheers,

–AJ

On 6 June 2012 13:24, Ian Marlier imarlier@brightcove.com wrote:

I don’t have debug logs for the time that this issue happened – I don’t run
with debug logging enabled for the most part (though if this happens again,
I might start!).

Here’s the INFO level logs from the period where the issue happened, trimmed
a bit. I’ve included the log lines from initialization (showing that no
changed cookbooks/recipes were downloaded), and the log lines that show the
operations on this particular resource. The actions at 17:53 are a good
copy of the file being moved out of the way and a broken copy being put into
place. The actions at 18:24 are the broken file being moved out of the way,
and a good copy being put back in place.

[Mon, 04 Jun 2012 17:52:56 -0700] INFO: *** Chef 0.10.10 ***
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List expands to [chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client, users::root, users::ubuntu, openssh,
dns::hosts, dns::resolv.conf, dns::nsupdate, hostname, motd, bash,
yum::client, apt::brightcove, apt, python, diamond, zip, sudoers,
nagios::nrpe, nagios::send_nsca, ntpd, genders, sysstat, syslog, xml,
postfix, tmpreaper, xen::guest, mysql::default, mysql::nrpe,
pacemaker::default, mysql::pacemaker]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Starting Chef Run for fijidb01.fal
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 17:53:06 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef,
chef-client, chef_handler, couchdb, daemontools, dell, diamond, dns, erlang,
fiji, genders, git, gluster, haproxy, hostname, ios_resque_endpoint, java,
logrotate, modprobe, mongodb, motd, mysql, nagios, newrelic, nginx, node,
ntpd, openldap, openssh, openssl, openvpn, pacemaker, postfix, python,
rabbitmq_chef, redis, resque, ruby-shadow, runit, stunnel, sudoers, sysctl,
syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen, xinetd,
xml, yum, zip, zlib]
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)

[Mon, 04 Jun 2012 17:53:24 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
backed up to
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
mode changed to 644
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
updated content

[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List expands to [chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client,
users::root, users::ubuntu, openssh, dns::hosts, dns::resolv.conf,
dns::nsupdate, hostname, motd, bash, yum::client, apt::brightcove, apt,
python, diamond, zip, sudoers, nagios::nrpe, nagios::send_nsca, ntpd,
genders, sysstat, syslog, xml, postfix, tmpreaper, xen::guest,
mysql::default, mysql::nrpe, pacemaker::default, mysql::pacemaker][Mon, 04
Jun 2012 18:23:39 -0700] INFO: Starting Chef Run for fijidb01.fal[Mon, 04
Jun 2012 18:23:39 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef,
chef-client, chef_handle
r, couchdb, daemontools, dell, diamond, dns, erlang, fiji, genders, git,
gluster, haproxy, hostname, ios_resque_endpoint, java, logrotate, modprobe,
mongodb, motd, mysql, na
gios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn,
pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow,
runit, stunnel, sudoers, sysct
l, syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen,
xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)

[Mon, 04 Jun 2012 18:24:16 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
backed up to
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
mode changed to 644
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
updated content
[Mon, 04 Jun 2012 18:54:52 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 19:56:16 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:27:07 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:58:04 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)

Copies of this file that exist in the Chef file backup cache:
[imarlier ~] find /var/chef/backup/ -type f -name "check_couch_two_way*" -ls 11108655 4 -rw-r--r-- 1 root root 82 Jun 4 17:53 /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416 11108654 4 -rw-r--r-- 1 root root 128 Mar 17 13:19 /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324 [imarlier ~]

The ticket that I created at http://tickets.opscode.com/browse/CHEF-3179 has
copies of the broken file (as moved out of the way at 18:24) and the good
file that was created at 18:24. I can also upload a copy of the file that
existed prior to the 17:53 chef-client run, but, well:

[imarlier ~] diff /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324 /usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg [imarlier ~]

I’m like you in having flashbacks to other bugs in the config management
world, which is why I think that this is a pretty critical issue to track
down. This is really way too reminiscent of 1010. There appears to be some
scenario that leads to data corruption within the context of a chef-client
run, and it appears to be independent of chef-server. If the data that was
being corrupted was node attribute data, that would be one thing. But this
is resource attributes, and should have been set at the time that the
template resource was created and added to the run list. Shoot, the
template knew it’s name when the resource was created (because it put the
broken file at the right path), but it didn’t know it’s name when it wrote
the template contents
(because the command name is @params[:name])!

On Tue, Jun 5, 2012 at 6:26 PM, AJ Christensen aj@junglist.gen.nz wrote:

Please show full debug logs for the period and/or the contents of the
Chef File Backup Cache and/or logging statements regarding moving
files into the Backup Cache

This is pretty heavy stuff… reminds me of a bug long ago, on a planet
far away in another configuration management universe.

Regards,

AJ

On 6 June 2012 02:13, Ian Marlier imarlier@brightcove.com wrote:

Hi, Chefs –

After a Chef run on one of my machines last night, a text file on the
system, created from a template, had binary data in it. On the next
Chef
run, everything was fine. I’m wondering if this is something that
anyone
else has experienced. (This is also filed as
http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a wide
net
since it’s not an easily searchable issue.)

Basic details:
Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12

Many more details: read on!

I have a resource definition called “nrpe_service”. It’s used to define
services for NRPE (Nagios Remote Plugin framework). In this case, it’s
defined like this:
nrpe_service “check_couch_two_way_replication” do
command "check_couch_replication.py --database ARG1 --source --dest"
end

The definition does some sanity checking (making sure that needed
attributes
are defined, etc), then does this if the sanity check works out:
template “#{node[:nrpe][:confdir]}/#{params[:name]}.cfg” do
cookbook "nagios"
source "nrpe_service.cfg.erb"
owner "root"
group "root"
mode "0644"
variables({:params => params})
end

So far, so good.

The template itself is pretty simple (though, wow, what was I thinking
when
I put some of the modification logic in the template instead of the
resource
definiton? That was silly…):
<% if @params[:command][0] == ‘/’ %>
<% cmd = @params[:command] %>
<% else %>
<% cmd = “#{@node[:nagios][:plugin_path]}/#{@params[:command]}” %>
<% end %>
command[<%= @params[:name] -%>]=<%= cmd %>

The resulting service definition looks like this:
[imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg

command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py
–database ARG1 --source --dest
[imarlier nrpe.d]$

This has been in place, and has been working just fine, for months now.
Then last night, chef-client ran (a scheduled run) and the file turned
into
this:
[imarlier nrpe.d] cat check_couch_two_way_replication.cfg.chef-20120604182416 command[1?en way_replication]=/usr/lib64/nagios/plugins/["uid", 10003] [imarlier nrpe.d]

This is…wrong.

The next time that Chef ran, everything went back to how it should be.
There were no alterations to the Chef server at this time, and no local
modifications on the client either.

It’s interesting to me that the things that appear to be corrupted are
parameter values that should be derived entirely locally – they’re
elements
of the resource definition, and not parameters that are being pulled
down
from the server or anything like that.

Has anyone seen similar behavior? Given that this isn’t something that
seems to happen with any frequency – and it’s not something that I can
reproduce at will – anyone have ideas as to what I might do to debug
this?
Are there any known data corruption bugs in Chef? In Ruby? I’m kind of
at
a loss, but seeing something like this is pretty terrifying.

Thanks,

  • Ian


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


#5

Thanks, AJ – I’ll let you know if we see this again, or if there’s any
other update. Please do contribute to the Jira ticket as well, if you’ve
got information that you can contribute that might help to track this
down. I’m also happy to contribute myself or folks on my team to help
track this down, but for the moment we don’t really have a starting place.

On Tue, Jun 5, 2012 at 10:59 PM, AJ Christensen aj@junglist.gen.nz wrote:

I’ve allocated Heavy Water engineering resources in attempt to
reproduce and track this down. One of our installations saw this
earlier today.

Please do provide any update or additional information, especially
anyone else reading this who is able to reproduce or has seen similar.

Thankfully the diagnostics will be much easier than #1010.

Cheers,

–AJ

On 6 June 2012 13:24, Ian Marlier imarlier@brightcove.com wrote:

I don’t have debug logs for the time that this issue happened – I don’t
run
with debug logging enabled for the most part (though if this happens
again,
I might start!).

Here’s the INFO level logs from the period where the issue happened,
trimmed
a bit. I’ve included the log lines from initialization (showing that no
changed cookbooks/recipes were downloaded), and the log lines that show
the
operations on this particular resource. The actions at 17:53 are a good
copy of the file being moved out of the way and a broken copy being put
into
place. The actions at 18:24 are the broken file being moved out of the
way,
and a good copy being put back in place.

[Mon, 04 Jun 2012 17:52:56 -0700] INFO: *** Chef 0.10.10 ***
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List expands to
[chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client, users::root, users::ubuntu, openssh,
dns::hosts, dns::resolv.conf, dns::nsupdate, hostname, motd, bash,
yum::client, apt::brightcove, apt, python, diamond, zip, sudoers,
nagios::nrpe, nagios::send_nsca, ntpd, genders, sysstat, syslog, xml,
postfix, tmpreaper, xen::guest, mysql::default, mysql::nrpe,
pacemaker::default, mysql::pacemaker]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Starting Chef Run for
fijidb01.fal
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 17:53:06 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler,
chef,
chef-client, chef_handler, couchdb, daemontools, dell, diamond, dns,
erlang,
fiji, genders, git, gluster, haproxy, hostname, ios_resque_endpoint,
java,
logrotate, modprobe, mongodb, motd, mysql, nagios, newrelic, nginx, node,
ntpd, openldap, openssh, openssl, openvpn, pacemaker, postfix, python,
rabbitmq_chef, redis, resque, ruby-shadow, runit, stunnel, sudoers,
sysctl,
syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen,
xinetd,
xml, yum, zip, zlib]
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight
provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight
provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight
provider
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight
resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight
resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight
resource
already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)

[Mon, 04 Jun 2012 17:53:24 -0700] INFO: Processing

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

backed up to

/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324

[Mon, 04 Jun 2012 17:53:24 -0700] INFO:

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

mode changed to 644
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

updated content

[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List expands to
[chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client,
users::root, users::ubuntu, openssh, dns::hosts, dns::resolv.conf,
dns::nsupdate, hostname, motd, bash, yum::client, apt::brightcove, apt,
python, diamond, zip, sudoers, nagios::nrpe, nagios::send_nsca, ntpd,
genders, sysstat, syslog, xml, postfix, tmpreaper, xen::guest,
mysql::default, mysql::nrpe, pacemaker::default, mysql::pacemaker][Mon,
04
Jun 2012 18:23:39 -0700] INFO: Starting Chef Run for fijidb01.fal[Mon, 04
Jun 2012 18:23:39 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler,
chef,
chef-client, chef_handle
r, couchdb, daemontools, dell, diamond, dns, erlang, fiji, genders, git,
gluster, haproxy, hostname, ios_resque_endpoint, java, logrotate,
modprobe,
mongodb, motd, mysql, na
gios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn,
pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow,
runit, stunnel, sudoers, sysct
l, syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen,
xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight
provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight
provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight
provider
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
provider already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight
resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight
resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight
resource
already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
resource already initialized – overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)

[Mon, 04 Jun 2012 18:24:16 -0700] INFO: Processing

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

backed up to

/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416

[Mon, 04 Jun 2012 18:24:16 -0700] INFO:

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

mode changed to 644
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

updated content
[Mon, 04 Jun 2012 18:54:52 -0700] INFO: Processing

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 19:56:16 -0700] INFO: Processing

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:27:07 -0700] INFO: Processing

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:58:04 -0700] INFO: Processing

template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]

action create (couchdb::nrpe line 16)

Copies of this file that exist in the Chef file backup cache:
[imarlier ~]$ find /var/chef/backup/ -type f -name “check_couch_two_way*”
-ls
11108655 4 -rw-r–r-- 1 root root 82 Jun 4 17:53

/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416

11108654 4 -rw-r–r-- 1 root root 128 Mar 17 13:19

/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324

[imarlier ~]$

The ticket that I created at http://tickets.opscode.com/browse/CHEF-3179has
copies of the broken file (as moved out of the way at 18:24) and the good
file that was created at 18:24. I can also upload a copy of the file
that
existed prior to the 17:53 chef-client run, but, well:

[imarlier ~]$ diff

/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324

/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg
[imarlier ~]$

I’m like you in having flashbacks to other bugs in the config management
world, which is why I think that this is a pretty critical issue to track
down. This is really way too reminiscent of 1010. There appears to be
some
scenario that leads to data corruption within the context of a
chef-client
run, and it appears to be independent of chef-server. If the data that
was
being corrupted was node attribute data, that would be one thing. But
this
is resource attributes, and should have been set at the time that the
template resource was created and added to the run list. Shoot, the
template knew it’s name when the resource was created (because it put the
broken file at the right path), but it didn’t know it’s name when it
wrote
the template contents
(because the command name is @params[:name])!

On Tue, Jun 5, 2012 at 6:26 PM, AJ Christensen aj@junglist.gen.nz
wrote:

Please show full debug logs for the period and/or the contents of the
Chef File Backup Cache and/or logging statements regarding moving
files into the Backup Cache

This is pretty heavy stuff… reminds me of a bug long ago, on a planet
far away in another configuration management universe.

Regards,

AJ

On 6 June 2012 02:13, Ian Marlier imarlier@brightcove.com wrote:

Hi, Chefs –

After a Chef run on one of my machines last night, a text file on the
system, created from a template, had binary data in it. On the next
Chef
run, everything was fine. I’m wondering if this is something that
anyone
else has experienced. (This is also filed as
http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a
wide

net
since it’s not an easily searchable issue.)

Basic details:
Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12

Many more details: read on!

I have a resource definition called “nrpe_service”. It’s used to
define

services for NRPE (Nagios Remote Plugin framework). In this case,
it’s

defined like this:
nrpe_service “check_couch_two_way_replication” do
command “check_couch_replication.py --database ARG1 --source
–dest”

end

The definition does some sanity checking (making sure that needed
attributes
are defined, etc), then does this if the sanity check works out:
template “#{node[:nrpe][:confdir]}/#{params[:name]}.cfg” do
cookbook "nagios"
source "nrpe_service.cfg.erb"
owner "root"
group "root"
mode "0644"
variables({:params => params})
end

So far, so good.

The template itself is pretty simple (though, wow, what was I thinking
when
I put some of the modification logic in the template instead of the
resource
definiton? That was silly…):
<% if @params[:command][0] == ‘/’ %>
<% cmd = @params[:command] %>
<% else %>
<% cmd = “#{@node[:nagios][:plugin_path]}/#{@params[:command]}” %>
<% end %>
command[<%= @params[:name] -%>]=<%= cmd %>

The resulting service definition looks like this:
[imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg

command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py

–database ARG1 --source --dest
[imarlier nrpe.d]$

This has been in place, and has been working just fine, for months
now.

Then last night, chef-client ran (a scheduled run) and the file turned
into
this:
[imarlier nrpe.d] cat check_couch_two_way_replication.cfg.chef-20120604182416 command[1?en way_replication]=/usr/lib64/nagios/plugins/["uid", 10003] [imarlier nrpe.d]

This is…wrong.

The next time that Chef ran, everything went back to how it should be.
There were no alterations to the Chef server at this time, and no
local

modifications on the client either.

It’s interesting to me that the things that appear to be corrupted are
parameter values that should be derived entirely locally – they’re
elements
of the resource definition, and not parameters that are being pulled
down
from the server or anything like that.

Has anyone seen similar behavior? Given that this isn’t something
that

seems to happen with any frequency – and it’s not something that I
can

reproduce at will – anyone have ideas as to what I might do to debug
this?
Are there any known data corruption bugs in Chef? In Ruby? I’m kind
of

at
a loss, but seeing something like this is pretty terrifying.

Thanks,

  • Ian


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


#6

On Tue, Jun 5, 2012 at 10:59 PM, AJ Christensen aj@junglist.gen.nz wrote:

I’ve allocated Heavy Water engineering resources in attempt to
reproduce and track this down. One of our installations saw this
earlier today.

Can you comment on CHEF-3179 as to what you saw and your environment?

Bryan


#7

Ian,

I’d like to suggest a possible explanation: You (or someone) is modifying
the params object after declaring the resource. This is a potential
explanation because there are no guarantees about what does or does not
happen to objects that something else owns, and the things that can happen
or not happen are likely to change for reasons that you cannot predict.

Properly speaking, you are of course never allowed to hold onto objects
which you don’t own, let alone with the assumption that they don’t change.
This isn’t a constraint with Chef in particular. It’s a principle of proper
software design in general. In this case, you are holding onto the
paramsobject by assigning it in the resource’s variables attribute.
To solve this
issue, make a copy of the params object, preferably one holding just the
slices of data you actually need for the template resource, and assign the
copy in the resource’s variables attribute.

Cheers,
Jay

On Wed, Jun 6, 2012 at 11:37 AM, Bryan McLellan btm@loftninjas.org wrote:

On Tue, Jun 5, 2012 at 10:59 PM, AJ Christensen aj@junglist.gen.nz
wrote:

I’ve allocated Heavy Water engineering resources in attempt to
reproduce and track this down. One of our installations saw this
earlier today.

Can you comment on CHEF-3179 as to what you saw and your environment?

Bryan


#8

Interesting – can you give an example of what you mean when you suggest
making a copy of the params object and assigning the copy? I don’t really
understand what that translates into as far as code…

  • Ian

On Wed, Jun 6, 2012 at 4:41 PM, Jay Feldblum y_feldblum@yahoo.com wrote:

Ian,

I’d like to suggest a possible explanation: You (or someone) is modifying
the params object after declaring the resource. This is a potential
explanation because there are no guarantees about what does or does not
happen to objects that something else owns, and the things that can happen
or not happen are likely to change for reasons that you cannot predict.

Properly speaking, you are of course never allowed to hold onto objects
which you don’t own, let alone with the assumption that they don’t change.
This isn’t a constraint with Chef in particular. It’s a principle of proper
software design in general. In this case, you are holding onto the *params

  • object by assigning it in the resource’s variables attribute. To solve
    this issue, make a copy of the params object, preferably one holding
    just the slices of data you actually need for the template resource, and
    assign the copy in the resource’s variables attribute.

Cheers,
Jay

On Wed, Jun 6, 2012 at 11:37 AM, Bryan McLellan btm@loftninjas.orgwrote:

On Tue, Jun 5, 2012 at 10:59 PM, AJ Christensen aj@junglist.gen.nz
wrote:

I’ve allocated Heavy Water engineering resources in attempt to
reproduce and track this down. One of our installations saw this
earlier today.

Can you comment on CHEF-3179 as to what you saw and your environment?

Bryan


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com


#9

Ian,

Here is the original:

    template "#{node[:nrpe][:confdir]}/#{params[:name]}.cfg" do
        cookbook "nagios"
        source "nrpe_service.cfg.erb"
        owner "root"
        group "root"
        mode "0644"
        variables({:params => params})
    end

Here is the version with the issue I noted corrected, and with a measure of
extra caution thrown in:

    template "#{node[:nrpe][:confdir]}/#{params[:name]}.cfg" do
        p = Mash.new
        p["name"] = params["name"].dup
        p["command"] = params["command"].dup

        cookbook "nagios"
        source "nrpe_service.cfg.erb"
        owner "root"
        group "root"
        mode "0644"
        variables({:params => p})
    end

Cheers,
Jay

On Wed, Jun 6, 2012 at 4:54 PM, Ian Marlier imarlier@brightcove.com wrote:

Interesting – can you give an example of what you mean when you suggest
making a copy of the params object and assigning the copy? I don’t really
understand what that translates into as far as code…

  • Ian

On Wed, Jun 6, 2012 at 4:41 PM, Jay Feldblum y_feldblum@yahoo.com wrote:

Ian,

I’d like to suggest a possible explanation: You (or someone) is modifying
the params object after declaring the resource. This is a potential
explanation because there are no guarantees about what does or does not
happen to objects that something else owns, and the things that can happen
or not happen are likely to change for reasons that you cannot predict.

Properly speaking, you are of course never allowed to hold onto
objects which you don’t own, let alone with the assumption that they don’t
change. This isn’t a constraint with Chef in particular. It’s a principle
of proper software design in general. In this case, you are holding onto
the params object by assigning it in the resource’s variables
attribute. To solve this issue, make a copy of the params object,
preferably one holding just the slices of data you actually need for the
template resource, and assign the copy in the resource’s variables
attribute.

Cheers,
Jay

On Wed, Jun 6, 2012 at 11:37 AM, Bryan McLellan btm@loftninjas.orgwrote:

On Tue, Jun 5, 2012 at 10:59 PM, AJ Christensen aj@junglist.gen.nz
wrote:

I’ve allocated Heavy Water engineering resources in attempt to
reproduce and track this down. One of our installations saw this
earlier today.

Can you comment on CHEF-3179 as to what you saw and your environment?

Bryan


Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com