Moved node to new chef server, no longer converges


#1

Ohai Chefs,

We are in the process of moving several nodes from older hosted Chef to Enterprise Chef servers. I’ve moved several alpha (test) nodes, but they no longer converge. Here are the steps I followed.

  1. Noted the IP address of the node
  2. Deleted the node from the old chef server
  3. Via ssh, visited the node and cleaned out the /etc/chef directory
  4. Ran knife bootstrap
  5. Visited the node and confirmed via ps -ef | grep chef-client that chef-client was running with the proper interval and splay.

After waiting out the 15-minute interval, the node fails to converge with these messages in /var/log/chef/client.log:

[2014-07-16T13:38:50-05:00] INFO: *** Chef 11.12.2 ***
[2014-07-16T13:38:50-05:00] INFO: Chef-client pid: 7550
[2014-07-16T13:38:50-05:00] FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
[2014-07-16T13:38:50-05:00] ERROR: Unable to determine node name: configure node_name or configure the system’s hostname and fqdn
[2014-07-16T13:38:50-05:00] ERROR: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)
[2014-07-16T13:38:50-05:00] ERROR: Sleeping for 900 seconds before trying again
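For context, the error means chef-client could not come up with a name for the node. Below is a simplified Ruby sketch of that kind of lookup, assuming the order is an explicit node_name from client.rb, then Ohai's fqdn, then Ohai's hostname; the real Chef 11.12 code may consult other attributes as well. resolve_node_name, the local CannotDetermineNodeName class, and the web01.example.com names are hypothetical stand-ins, not Chef's API or Mark's hosts.

```ruby
# Hypothetical stand-in for Chef's exception class of the same name.
class CannotDetermineNodeName < StandardError; end

# Simplified model of the node-name lookup (assumed order, not Chef's code):
# an explicit node_name from client.rb wins, then Ohai's fqdn, then hostname.
def resolve_node_name(config, ohai)
  name = config[:node_name] || ohai[:fqdn] || ohai[:hostname]
  unless name
    raise CannotDetermineNodeName,
          "Unable to determine node name: configure node_name or " \
          "configure the system's hostname and fqdn"
  end
  name
end

# Ohai data as a login shell might see it: the FQDN is found.
puts resolve_node_name({}, { fqdn: "web01.example.com", hostname: "web01" })

# Ohai data with no name attributes at all: the lookup fails.
begin
  resolve_node_name({}, {})
rescue CannotDetermineNodeName => e
  puts "FATAL: #{e.message}"
end

# A node_name pinned in client.rb never consults Ohai.
puts resolve_node_name({ node_name: "web01.example.com" }, {})
```

Under this model, a run that works from a shell but fails under the daemon would suggest the daemon's Ohai data is missing both names, and pinning node_name in client.rb would bypass the lookup.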

And these messages in /var/chef/cache/chef-stacktrace.out:

Generated at 2014-07-16 13:38:50 -0500
Chef::Exceptions::CannotDetermineNodeName: Unable to determine node name: configure node_name or configure the system’s hostname and fqdn
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/client.rb:299:in `node_name'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/client.rb:313:in `register'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/client.rb:416:in `do_run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/client.rb:213:in `block in run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/client.rb:207:in `fork'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/client.rb:207:in `run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/application.rb:217:in `run_chef_client'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/application/client.rb:328:in `block in run_application'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/application/client.rb:317:in `loop'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/application/client.rb:317:in `run_application'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/lib/chef/application.rb:67:in `run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.12.2/bin/chef-client:26:in `<top (required)>'
/usr/bin/chef-client:23:in `load'
/usr/bin/chef-client:23:in `<main>'

I’ve checked hostname -f, and it returns the FQDN. Running chef-client manually on the node succeeds; only the background (daemonized) process fails. For what it’s worth, here is the output from ps -ef | grep chef-client:

root 2662 1 0 Jul11 ? 00:00:13 /opt/chef/embedded/bin/ruby /usr/bin/chef-client -d -c /etc/chef/client.rb -L /var/log/chef/client.log -P /var/run/chef/client.pid -i 900 -s 180
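The ps line above carries the daemon's options. As a quick sketch, Ruby's standard OptionParser can pull them apart, assuming the usual chef-client meanings: -d daemonize, -c config file, -L log file, -P pid file, -i interval in seconds, -s splay in seconds. The option-hash keys are illustrative names only.

```ruby
require "optparse"

# The daemon's arguments, copied from the ps output above.
argv = %w[-d -c /etc/chef/client.rb -L /var/log/chef/client.log
          -P /var/run/chef/client.pid -i 900 -s 180]

opts = {}
OptionParser.new do |o|
  o.on("-d") { opts[:daemonize] = true }                # run in the background
  o.on("-c PATH") { |v| opts[:config] = v }             # client.rb location
  o.on("-L PATH") { |v| opts[:logfile] = v }            # log file
  o.on("-P PATH") { |v| opts[:pidfile] = v }            # pid file
  o.on("-i SECONDS", Integer) { |v| opts[:interval] = v } # time between runs
  o.on("-s SECONDS", Integer) { |v| opts[:splay] = v }    # random extra delay cap
end.parse!(argv)

puts "config:   #{opts[:config]}"
puts "pidfile:  #{opts[:pidfile]}"
puts "interval: #{opts[:interval]} s (#{opts[:interval] / 60} min)"
puts "splay:    up to #{opts[:splay]} s"
```

The -i 900 value is the 15-minute interval waited out above, and it matches the "Sleeping for 900 seconds" line in client.log.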

I’m stumped.

What am I not seeing?

Thanks,
Mark


#2

Can you increase the log level to debug in client.rb? That should show the failing ohai output. You can also try running ohai manually.
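A minimal sketch of that client.rb change, assuming the daemon is restarted afterward so it rereads the file. Running ohai by hand (for example ohai fqdn or ohai hostname) shows what Ohai reports outside the daemon for comparison.

```ruby
# /etc/chef/client.rb (excerpt). Assumed troubleshooting change:
# raise verbosity so the next daemon run logs Ohai and node-name detail.
log_level :debug

# Existing settings (chef_server_url, validation key, etc.) stay as they are.
```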

(In reply to Mark Nichols, Wednesday, July 16, 2014 at 11:53 AM.)


#3

Very strange that a manual chef-client run succeeds but the chef-client
daemon does not.

This is silly, but did you kill the old chef-client process before
bootstrapping? Also, have you tried rebooting the host?
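One way to check the first question is to see whether the PID in the daemon's pid file (the -P path in Mark's ps output, /var/run/chef/client.pid) belongs to a live process. The sketch below uses signal 0, which tests for a process without actually signaling it. daemon_alive? is a hypothetical helper, and a pid file only records the last daemon that wrote it, so ps -ef | grep chef-client remains the broader check for a stray older daemon.

```ruby
require "tmpdir"

# Hypothetical helper: true if the PID recorded in +pidfile+ is a live process.
def daemon_alive?(pidfile)
  pid = Integer(File.read(pidfile).strip)
  Process.kill(0, pid) # signal 0 sends nothing; it only checks the PID exists
  true
rescue Errno::ENOENT, Errno::EACCES, ArgumentError, Errno::ESRCH
  false # missing/unreadable pid file, non-numeric contents, or no such process
rescue Errno::EPERM
  true  # process exists but is owned by another user
end

# Path taken from the -P flag in the ps output earlier in the thread.
puts "chef-client daemon alive? #{daemon_alive?('/var/run/chef/client.pid')}"

# Self-check with a throwaway pid file holding this script's own PID.
Dir.mktmpdir do |dir|
  path = File.join(dir, "client.pid")
  File.write(path, "#{Process.pid}\n")
  puts "own PID alive? #{daemon_alive?(path)}"
end
```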

(In reply to Mark Nichols, chef@zanshin.net, Wed, Jul 16, 2014 at 11:53 AM.)


Best regards, Dmitriy V.


#4

On Jul 16, 2014, at 7:21 PM, DV vindimy@gmail.com wrote:

Very strange that manual chef-client run succeeds but chef-client daemon does not.

This is silly, but did you kill the old chef-client process before bootstrapping? Also, have you tried rebooting the host?

I did kill the old chef-client process prior to the new bootstrap command. I haven’t tried rebooting the host. That would be, ah, complicated. These are legacy (i.e., pre-Chef) servers that aren’t easily rebooted.

I’ll dig further with tcpdump and with debug logging turned on for chef-client.

— Mark