I don't have debug logs for the time that this issue happened -- I don't
run with debug logging enabled for the most part (though if this happens
again, I might start!).
Here's the INFO level logs from the period where the issue happened,
trimmed a bit. I've included the log lines from initialization (showing
that no changed cookbooks/recipes were downloaded), and the log lines that
show the operations on this particular resource. The actions at 17:53 are
a good copy of the file being moved out of the way and a broken copy being
put into place. The actions at 18:24 are the broken file being moved out
of the way, and a good copy being put back in place.
[Mon, 04 Jun 2012 17:52:56 -0700] INFO: *** Chef 0.10.10 ***
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List expands to [chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client, users::root, users::ubuntu, openssh,
dns::hosts, dns::resolv.conf, dns::nsupdate, hostname, motd, bash,
yum::client, apt::brightcove, apt, python, diamond, zip, sudoers,
nagios::nrpe, nagios::send_nsca, ntpd, genders, sysstat, syslog, xml,
postfix, tmpreaper, xen::guest, mysql::default, mysql::nrpe,
pacemaker::default, mysql::pacemaker]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Starting Chef Run for fijidb01.fal
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 17:53:06 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef,
chef-client, chef_handler, couchdb, daemontools, dell, diamond, dns,
erlang, fiji, genders, git, gluster, haproxy, hostname,
ios_resque_endpoint, java, logrotate, modprobe, mongodb, motd, mysql,
nagios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn,
pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow,
runit, stunnel, sudoers, sysctl, syslog, sysstat, tmpreaper, ucspi-tcp,
unicorn, users, varnish, xen, xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight
resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight
resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight
resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)
[Mon, 04 Jun 2012 17:53:24 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
backed up to
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
mode changed to 644
[Mon, 04 Jun 2012 17:53:24 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
updated content
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List is [role[falkland],
role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List expands to [chef_handler,
chef_handler::register_handlers, chef-client::service, bc_ec2,
bc_ec2::pullkeys, openldap::client,
users::root, users::ubuntu, openssh, dns::hosts, dns::resolv.conf,
dns::nsupdate, hostname, motd, bash, yum::client, apt::brightcove, apt,
python, diamond, zip, sudoers, nagios::nrpe, nagios::send_nsca, ntpd,
genders, sysstat, syslog, xml, postfix, tmpreaper, xen::guest,
mysql::default, mysql::nrpe, pacemaker::default, mysql::pacemaker][Mon, 04
Jun 2012 18:23:39 -0700] INFO: Starting Chef Run for fijidb01.fal[Mon, 04
Jun 2012 18:23:39 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Loading cookbooks [apache2, apt,
aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef,
chef-client, chef_handle
r, couchdb, daemontools, dell, diamond, dns, erlang, fiji, genders, git,
gluster, haproxy, hostname, ios_resque_endpoint, java, logrotate, modprobe,
mongodb, motd, mysql, na
gios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn,
pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow,
runit, stunnel, sudoers, sysct
l, syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen,
xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight provider
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight
resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight
resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight resource
already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight
resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Chef Handlers will be at:
/var/chef/handlers
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Processing
remote_directory[/var/chef/handlers] action create (chef_handler::default
line 23)
[Mon, 04 Jun 2012 18:24:16 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
backed up to
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
mode changed to 644
[Mon, 04 Jun 2012 18:24:16 -0700] INFO:
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
updated content
[Mon, 04 Jun 2012 18:54:52 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 19:56:16 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:27:07 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:58:04 -0700] INFO: Processing
template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg]
action create (couchdb::nrpe line 16)
Copies of this file that exist in the Chef file backup cache:
[imarlier ~]$ find /var/chef/backup/ -type f -name "check_couch_two_way*"
-ls
11108655 4 -rw-r--r-- 1 root root 82 Jun 4 17:53
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416
11108654 4 -rw-r--r-- 1 root root 128 Mar 17 13:19
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
[imarlier ~]$
The ticket that I created at
http://tickets.opscode.com/browse/CHEF-3179has copies of the broken
file (as moved out of the way at 18:24) and the
good file that was created at 18:24. I can also upload a copy of the file
that existed prior to the 17:53 chef-client run, but, well:
[imarlier ~]$ diff
/var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg
[imarlier ~]$
I'm like you in having flashbacks to other bugs in the config management
world, which is why I think that this is a pretty critical issue to track
down. This is really way too reminiscent of
1010http://projects.puppetlabs.com/issues/1010.
There appears to be some scenario that leads to data corruption within the
context of a chef-client run, and it appears to be independent of
chef-server. If the data that was being corrupted was node attribute data,
that would be one thing. But this is resource attributes, and should have
been set at the time that the template resource was created and added to
the run list. Shoot, the template knew it's name when the resource was
created (because it put the broken file at the right path), but it didn't
know it's name when it wrote the template contents (because the command
name is @params[:name])!
On Tue, Jun 5, 2012 at 6:26 PM, AJ Christensen aj@junglist.gen.nz wrote:
Please show full debug logs for the period and/or the contents of the
Chef File Backup Cache and/or logging statements regarding moving
files into the Backup Cache
This is pretty heavy stuff.. reminds me of a bug long ago, on a planet
far away in another configuration management universe.
Regards,
AJ
On 6 June 2012 02:13, Ian Marlier imarlier@brightcove.com wrote:
Hi, Chefs --
After a Chef run on one of my machines last night, a text file on the
system, created from a template, had binary data in it. On the next Chef
run, everything was fine. I'm wondering if this is something that anyone
else has experienced. (This is also filed as
http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a
wide net
since it's not an easily searchable issue.)
Basic details:
Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12
Many more details: read on!
I have a resource definition called "nrpe_service". It's used to define
services for NRPE (Nagios Remote Plugin framework). In this case, it's
defined like this:
nrpe_service "check_couch_two_way_replication" do
command "check_couch_replication.py --database $ARG1$ --source --dest"
end
The definition does some sanity checking (making sure that needed
attributes
are defined, etc), then does this if the sanity check works out:
template "#{node[:nrpe][:confdir]}/#{params[:name]}.cfg" do
cookbook "nagios"
source "nrpe_service.cfg.erb"
owner "root"
group "root"
mode "0644"
variables({:params => params})
end
So far, so good.
The template itself is pretty simple (though, wow, what was I thinking
when
I put some of the modification logic in the template instead of the
resource
definiton? That was silly...):
<% if @params[:command][0] == '/' %>
<% cmd = @params[:command] %>
<% else %>
<% cmd = "#{@node[:nagios][:plugin_path]}/#{@params[:command]}" %>
<% end %>
command[<%= @params[:name] -%>]=<%= cmd %>
The resulting service definition looks like this:
[imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg
command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py
--database $ARG1$ --source --dest
[imarlier nrpe.d]$
This has been in place, and has been working just fine, for months now.
Then last night, chef-client ran (a scheduled run) and the file turned
into
this:
[imarlier nrpe.d]$ cat
check_couch_two_way_replication.cfg.chef-20120604182416
command[1?en
way_replication]=/usr/lib64/nagios/plugins/["uid", 10003]
[imarlier nrpe.d]$
This is...wrong.
The next time that Chef ran, everything went back to how it should be.
There were no alterations to the Chef server at this time, and no local
modifications on the client either.
It's interesting to me that the things that appear to be corrupted are
parameter values that should be derived entirely locally -- they're
elements
of the resource definition, and not parameters that are being pulled down
from the server or anything like that.
Has anyone seen similar behavior? Given that this isn't something that
seems to happen with any frequency -- and it's not something that I can
reproduce at will -- anyone have ideas as to what I might do to debug
this?
Are there any known data corruption bugs in Chef? In Ruby? I'm kind of
at
a loss, but seeing something like this is pretty terrifying.
Thanks,
--
Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com
--
Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
imarlier@brightcove.com