Systemd-journald.service restart kills chef-client on EL 7

Hello,

I have a recipes and cookbooks that manage systemd services (NetworkManager, systemd-journald).

Since upgrading to RHEL 7.2 latest, performing a systemctl restart on either of these during the course of a chef-client run causes the chef-client service to abuptly exit and requires a manual restart of the chef-client service to get the node converging again. Affects both virtualization hosts and guests.

Is this a known issue? Is this an issue that is resolved in a newer version of the chef client package?

  • RHEL 7.2 (kernel-3.10.0-327.4.5.el7) systemd 219-19
  • chef 12.0.3 package

Error output:

[root@host001 ~]# systemctl status chef-client
● chef-client.service - Chef Client daemon
   Loaded: loaded (/etc/systemd/system/chef-client.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-02-05 00:47:10 UTC; 1 day 15h ago
 Main PID: 20633 (code=exited, status=1/FAILURE)

Feb 05 00:47:10 host001 chef-client[20633]: [2016-02-05T00:47:10+00:00] INFO: Processing cookbook_file[journald.conf] action create (linux-core::default line 438)
Feb 05 00:47:10 host001 chef-client[20633]: [2016-02-05T00:47:10+00:00] INFO: cookbook_file[journald.conf] backed up to /var/chef/backup/etc/systemd/journald.conf.chef-20...04710.069193
Feb 05 00:47:10 host001 chef-client[20633]: [2016-02-05T00:47:10+00:00] INFO: cookbook_file[journald.conf] updated file contents /etc/systemd/journald.conf
Feb 05 00:47:10 host001 chef-client[20633]: [2016-02-05T00:47:10+00:00] INFO: cookbook_file[journald.conf] sending restart action to service[systemd-journald] (immediate)
Feb 05 00:47:10 host001 chef-client[20633]: [2016-02-05T00:47:10+00:00] INFO: Processing service[systemd-journald] action restart (linux-core::default line 451)
Feb 05 00:47:10 host001 systemd[1]: chef-client.service: main process exited, code=exited, status=1/FAILURE
Feb 05 00:47:10 host001 systemd[1]: Unit chef-client.service entered failed state.
Feb 05 00:47:10 host001 systemd[1]: chef-client.service failed.

Update: 12.6.0 has the same issue. Investigating systemd 219 changes

Edit: Testing on a newly-kicked 7.1 image, this seems to affect systemd 208 as well.

service 'systemd-journald.service' do
  action :restart
end

Causes chef-client running in the system.slice to bomb out with exit status ‘1’. Any way to get more useful information from chef-client?

This smells like a ruby+systemd bug, but I’m still unsure. From more testing, it seems that the stdout pipe in use gets clobbered when a systemd dbus service restarts, causing chef-client to exit ungracefully.

I’ve resolved the issue by setting log_location from STDOUT to /var/log/chef/client.log in /etc/chef/client.rb.

The cookbook managing chef-client piece is ‘chef-client 4.3.2’, using the default systemd parameters.

/etc/systemd/system/chef-client.service:

[Unit]
Description=Chef Client daemon
After=network.target auditd.service

[Service]
EnvironmentFile=/etc/sysconfig/chef-client
ExecStart=/usr/bin/chef-client -c $CONFIG -i $INTERVAL -s $SPLAY $OPTIONS
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

/etc/chef/client.rb:

chef_server_url "https://chefserver.int.net"
client_fork true
log_level :info
log_location "/var/log/chef/client.log"
trusted_certs_dir "/etc/chef/trusted_certs"
validation_client_name "chef-validator"
verify_api_cert true
node_name "host001"

Also, FWIW, a forked chef-client run from the daemon doesn’t pick up the new log_location settings until the chef-client service is restarted, based on my testing.

Actually, setting the logfile in client.rb masked the issue. After restarting systemd-journald, the service stays up but all future chef-client runs spawned via the service fail.

I’ve opened an issue report against the chef-client cookbook.

Can anyone else validate this issue?

weird issue… almost like the STDOUT file descriptor gets mucked up somehow during the restart. probably this issue? it sounds like either ruby or chef-client itself is not handling SIGPIPE correctly? i’ve updated the bug you filed with a link to the relevant discussion.

in the meantime, i’d think you should at least be able to avoid the need for manual intervention by telling systemd to always restart the chef-client service with the Restart=always or Restart=on-failure via a drop-in unit

e.g.

/etc/systemd/system/chef-client.service.d/always-restart.conf

[Service]
Restart=always

if you’re using the systemd cookbook for unit management, instead of a cookbook file or template and a notify to systemctl daemon-reload execute resource, the above can be achieved with something like:

systemd_service 'restart-chef-client' do
  drop_in true
  override 'chef-client'
  restart 'always'
end

obviously not an ideal solution, but maybe enough to be going on with while the main issue’s worked on.

so i haven’t tested this, but it looks like you may actually be able to configure systemd-journald to serialize it’s state and reclaim the FDs after a restart!

following up on the fact that the previously linked bug is marked resolved-fixed, it appears that an option has been added to allow a service (such as systemd-journald) to preserve it’s file-descriptors across a restart in version 219 (available in EL 7.2). this is supported in systemd-journald, but disabled by default. i believe it’s this option.

i have no idea what a reasonable maximum number of FDs for your systemd-journald to store in the service manager might be, but you should be able to get a rough estimate by counting the number of files under /proc/PID/fd/, where PID is systemd-journald’s pid (my systems seem to average around 30).