Bizarre inodes on RHEL6 after chef-client run



I have come across a rather large (and repeatable) problem trying to use
Chef on RHEL6 systems. The same setup works fine on Ubuntu, SLES11, and
our own BastionLinux (a Fedora variant). I strongly suspect a kernel bug
in RHEL6, but would very much like to know if anyone else has come
across it.

To illustrate the issue, we’re using the SNMP recipe to write
/etc/snmp/snmpd.conf, start snmpd, then attempt a walk. Unfortunately,
after the chef-client run, /etc/snmp/snmpd.conf has a suspiciously large
inode number and the snmpd daemon, whilst up, isn’t responding.

So - prior to the chef-client run (all good):

[root@rheltest snmp]# ls -il
total 8
132819 -rw-r--r--. 1 root root 2024 Apr 26 10:58 snmpd.conf
132810 -rw-r--r--. 1 root root 220 Nov 16 01:07 snmptrapd.conf

[root@rheltest snmp]# snmpwalk -v1 -cpublic 127.0.0.1 system
SNMPv2-MIB::sysDescr.0 = STRING: Linux rheltest.drives.xxxx 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1415) 0:00:14.15
SNMPv2-MIB::sysContact.0 = STRING: Root <root@localhost>
SNMPv2-MIB::sysName.0 = STRING: rheltest.drives.xxxx

Now modify snmpd.conf to force an overwrite on the next chef run ...

[root@rheltest snmp]# vi snmpd.conf
[root@rheltest snmp]# /etc/rc.d/init.d/snmpd restart
Stopping snmpd: [ OK ]
Starting snmpd: [ OK ]
[root@rheltest snmp]# chef-client
[Thu, 26 Apr 2012 11:00:08 +1000] INFO: *** Chef 0.10.8 ***
[Thu, 26 Apr 2012 11:00:08 +1000] INFO: Run List is [role[myxxx-debug]]
[Thu, 26 Apr 2012 11:00:08 +1000] INFO: Run List expands to [myxxx]
[Thu, 26 Apr 2012 11:00:08 +1000] INFO: Starting Chef Run for rheltest.drives.xxxxx
[Thu, 26 Apr 2012 11:00:08 +1000] INFO: Running start handlers
[Thu, 26 Apr 2012 11:00:08 +1000] INFO: Start handlers complete.
[Thu, 26 Apr 2012 11:00:09 +1000] INFO: Loading cookbooks [perl, snmp]
[Thu, 26 Apr 2012 11:00:09 +1000] INFO: Processing package[net-snmp] action install (snmp::default line 21)
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: Processing package[net-snmp-utils] action install (snmp::default line 21)
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: Processing service[snmpd] action start (snmp::default line 34)
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: Processing service[snmpd] action enable (snmp::default line 34)
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: Processing template[/etc/snmp/snmpd.conf] action create (snmp::default line 38)
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: template[/etc/snmp/snmpd.conf] backed up to /var/chef/backup/etc/snmp/snmpd.conf.chef-20120426110010
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: template[/etc/snmp/snmpd.conf] mode changed to 644
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: template[/etc/snmp/snmpd.conf] updated content
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: template[/etc/snmp/snmpd.conf] sending restart action to service[snmpd] (delayed)
[Thu, 26 Apr 2012 11:00:10 +1000] INFO: Processing service[snmpd] action restart (snmp::default line 34)
[Thu, 26 Apr 2012 11:00:12 +1000] INFO: service[snmpd] restarted
[Thu, 26 Apr 2012 11:00:12 +1000] INFO: Chef Run complete in 3.802594 seconds
[Thu, 26 Apr 2012 11:00:12 +1000] INFO: Running report handlers
[Thu, 26 Apr 2012 11:00:12 +1000] INFO: Report handlers complete
[root@rheltest snmp]#
[root@rheltest snmp]#

[root@rheltest snmp]# ls -il
total 8
544050 -rw-r--r--. 1 root root 2024 Apr 26 11:00 snmpd.conf
132810 -rw-r--r--. 1 root root 220 Nov 16 01:07 snmptrapd.conf

[root@rheltest snmp]# snmpwalk -v1 -cpublic 127.0.0.1 system
Timeout: No Response from 127.0.0.1

The inode number assigned should be something around 132xxx, not
544xxx. The file itself is still readable and appears to behave
completely normally except within SysV init scripts (starting snmpd
from the command line works). If I copy/move the file such that it gets
a 132xxx inode, then everything works fine out of init.d, all with
exactly the same file content. I also have similar issues with the
OpenLDAP recipes and SysV startup, so I expect this to be quite
widespread.
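
To make the before/after comparison repeatable without eyeballing `ls -il`, a small helper along these lines could record the inode numbers. This is only a sketch: `inode_report` is a hypothetical name, and the two paths are simply the files shown in the listings above.

```ruby
# Return [inode, octal permission bits, path] for each path that exists,
# roughly the columns of interest from `ls -il`.
def inode_report(paths)
  paths.select { |p| File.exist?(p) }.map do |p|
    st = File.stat(p)
    [st.ino, format('%o', st.mode & 0777), p]
  end
end

# Paths taken from the listing above; missing files are silently skipped.
files = ['/etc/snmp/snmpd.conf', '/etc/snmp/snmptrapd.conf']
inode_report(files).each do |ino, mode, path|
  puts "#{ino} #{mode} #{path}"
end
```

Running it once before and once after chef-client gives two listings that can be diffed directly.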

Chef itself is entirely innocuous with its template generation: it
simply does a pure Ruby File.open('w+') and FileUtils.mv, which I expect
rely on stdio and whatever the underlying filesystem implementation is
(in our case, ext4). It is quite difficult to point at Chef or Ruby as
being at fault, as it’s all so simple at this level.
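
As a minimal sketch of that write-then-move pattern (not Chef's actual code; the `replace_via_move` helper and the `.staging` suffix are made up for illustration), the following writes new content to a separate file and moves it over the target, reporting the target's inode number before and after. Because the staging file is a distinct file, the target takes over the staging file's inode number after the move, so a changed inode number is what this step produces on any kernel.

```ruby
require 'fileutils'
require 'tmpdir'

# Write new content to a staging file, then move it over the target.
# Returns the target's inode number before and after the replacement.
def replace_via_move(target, content)
  before = File.stat(target).ino
  staging = "#{target}.staging" # hypothetical staging name
  File.open(staging, 'w+') { |f| f.write(content) }
  FileUtils.mv(staging, target)
  [before, File.stat(target).ino]
end

# Demonstrate in a throwaway directory rather than in /etc/snmp.
Dir.mktmpdir do |dir|
  target = File.join(dir, 'snmpd.conf')
  File.open(target, 'w') { |f| f.write("old content\n") }
  before, after = replace_via_move(target, "new content\n")
  puts "inode before: #{before}, after: #{after}"
end
```

Where the staging file lives matters: if it sits on a different filesystem from the target, FileUtils.mv falls back to copy-and-delete rather than a rename.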

The chef/gems stack is exactly the same on our unaffected BastionLinux
as on RHEL6: chef 0.10.8
(see http://linux.last-bastion.net/LBN/up2date/cloud/13 for full repo details).

The version of Ruby RHEL6 ships is 1.8.7.352. I actually compiled and
deployed BastionLinux’s Ruby on the RHEL box (1.8.7.334 but with
rb-readline support for irb command history), and still have the issue -
with everything but the kernel identical. The RHEL6 kernel version is
2.6.32-220, whereas we’re at 2.6.34.8-68 on BastionLinux.

I’d love to know if anyone else has experienced similar issues as it
basically makes chef useless for RHEL6 deployments. I am happy to raise
this with Red Hat, but RHEL6 is hardly beta, and I really want to make
sure I’m not missing something …

Alan