Edge case bug when switching Chef servers via bootstrap

I've run into a very interesting edge case which temporarily breaks Chef when switching Chef servers using a bootstrap process.

A little background, to explain... for security reasons, my company is splitting our networks into Prod and Non-Prod spaces, which won't be able to talk to each other. I'm preparing our infrastructure by creating a non-prod Chef server which matches most of our existing prod Chef server.

The cookbooks, recipes, roles, etc. are identical between the two systems (yay for Infrastructure as Code - just push up the data!). So telling (for example) a QA node to switch over from the prod server to the non-prod one involves:

  • Removing the node from the prod server (knife client delete NODE --profile prod, knife node delete NODE --profile prod).
  • Bootstrapping the node from the non-prod server (knife bootstrap NODE AUTH HERE --profile nonprod).
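
For anyone who wants to script those two steps, here is a rough sketch rather than a tested tool. It assumes knife profiles named "prod" and "nonprod" as above; the function names, the DRY_RUN switch, and the node name in the usage note are made up for illustration, and whatever auth options you normally pass to knife bootstrap go after the node name.

```shell
#!/bin/sh
# Sketch only: move one node from the prod Chef server to the non-prod one.
# Assumes knife profiles named "prod" and "nonprod" (hypothetical helper names).
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

switch_chef_server() {
  node=$1
  shift
  # Step 1: remove the client and node objects from the prod server.
  run knife client delete "$node" --profile prod --yes || return 1
  run knife node delete "$node" --profile prod --yes || return 1
  # Step 2: bootstrap against non-prod; the remaining arguments are
  # passed through untouched as your bootstrap auth options.
  run knife bootstrap "$node" "$@" --profile nonprod
}
```

Running `DRY_RUN=1 switch_chef_server qa-node-01` (a placeholder node name) prints the three knife commands without touching either server, which is a cheap way to check the arguments first.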

Simple and clean. It works just fine on Windows, but because the Chef client runs as a long-lived daemon on Linux, it causes issues there.

After the bootstrap, the old chef-client daemon is still running with the old client.rb configuration it loaded at startup, so it is still pointing at the wrong server.

All I have to do to fix this is bounce the chef-client service, but it would be nice if the bootstrap process were able to handle this. It can already tell that there's an existing Chef installation, so it should be a simple matter of triggering a daemon restart at the end of the run.

Hi Pahanda,

Wouldn't appending "--bootstrap-preinstall-command stop chef-client service" to your bootstrap (assuming this is the correct service manager syntax) work in the interim?

If it's an unattended bootstrap, you may also be able to augment a user data script to check for and stop a chef-client configured as a service - https://docs.chef.io/install_bootstrap/#bootstrapping-with-user-data
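
As a rough idea of what that check could look like in a user data script, here is a minimal sketch. It assumes a systemd host and a unit named chef-client; the function name is made up for illustration, and other init systems would need their own equivalent.

```shell
#!/bin/sh
# Sketch only: stop an already-running chef-client service before the
# node is re-bootstrapped. Assumes systemd and a unit called "chef-client".

stop_chef_client_if_running() {
  # Nothing to do on hosts without systemctl.
  command -v systemctl >/dev/null 2>&1 || return 0
  # is-active exits 0 only when the unit is currently running.
  if systemctl is-active --quiet chef-client; then
    systemctl stop chef-client
  fi
}
```

Calling `stop_chef_client_if_running` near the top of the user data script, before the Chef install step, should leave nodes without the service untouched.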

I appreciate the bootstrap process could discover whether there is a chef-client service running in the background and handle this correctly on a per-platform basis. This sounds like a feature request you could raise at https://chef-software.ideas.aha.io where it may garner other voices with the same interests.

Good point. Yes, "--bootstrap-preinstall-command systemctl stop chef-client" would do the trick (and I'll be using this, going forward, so thanks for the tip!).
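
For reference, this is roughly how I'd wrap it. Treat it as a sketch: the function name and DRY_RUN switch are made up, the "nonprod" profile is the one from my first post, and it assumes the service is managed by systemd. Any auth options go after the node name.

```shell
#!/bin/sh
# Sketch only: bootstrap against non-prod and stop the old daemon first.
# Assumes a systemd-managed chef-client and a knife profile named "nonprod".

bootstrap_nonprod() {
  node=$1
  shift
  # Rebuild the argument list as the full knife command line.
  set -- knife bootstrap "$node" "$@" --profile nonprod \
    --bootstrap-preinstall-command "systemctl stop chef-client"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print one argument per line so the quoting is easy to check.
    printf '%s\n' "$@"
  else
    "$@"
  fi
}
```

With DRY_RUN=1 the last printed line should be the whole preinstall command as a single argument, which confirms it is being passed to knife in one piece.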

I wasn't sure if this would qualify as a bug or a feature request, because it's a very narrow edge case, but it definitely has undesired consequences.

Another possible solution: you can use the chef_client_updater cookbook (https://github.com/chef-cookbooks/chef_client_updater). It handles things like stopping the running service.

We are actually using that, but unfortunately, it didn't handle the service management (at least, not by default).

I'm interested to see what the new "native" service handlers in v16 can do, although unfortunately, there seem to be some serious gem/cookbook conflicts with our existing setup, so we're sticking to v15.10.12 for now.

If you're interested in giving the cookbook another go, take a moment to read the README's "init system caveats" section. It covers what you'd need to know for the service to transition smoothly.