How to monitor if chef-client is actually working

Robert_Keng · June 29, 2011, 11:24pm

Hi all,

Just wondering if anyone knows whether there is a way to monitor chef-client to make sure it isn't stuck in retries because of an error in some recipe? Does the client actually report back to the chef-server if it's able to apply all recipes and things are happy?

I apologize if this is a really simple question; I'm fairly new to Chef. I'd just rather not have to parse /var/log/chef/client.log for this. Thanks!

-Robert

Michael_Herman · June 29, 2011, 11:31pm

Robert,

You can use exception and report handlers to run arbitrary code after a
successful or failed chef-client run.

http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers

We use a report handler so that Nagios will alarm if a successful run
hasn't completed in the last 60 minutes or so.
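A minimal sketch of the Nagios side of that setup, assuming (hypothetically) that the report handler updates a stamp file such as /var/chef/last_successful_run after every successful run. The path, the 45/60-minute thresholds, and the function names are illustrative; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
"""Nagios-style freshness check for chef-client (sketch).

Assumes a report handler (hypothetical) updates STAMP_FILE after every
successful chef-client run; the path and thresholds are illustrative.
"""
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

STAMP_FILE = "/var/chef/last_successful_run"  # hypothetical path
WARN_SECONDS = 45 * 60
CRIT_SECONDS = 60 * 60


def check_freshness(path, warn, crit, now=None):
    """Return (exit_code, message) based on how old `path` is."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc.strerror}"
    summary = f"last successful chef-client run {int(age // 60)} min ago"
    if age >= crit:
        return CRITICAL, f"CRITICAL - {summary}"
    if age >= warn:
        return WARNING, f"WARNING - {summary}"
    return OK, f"OK - {summary}"


def main(argv):
    """Print one status line and return the exit code for Nagios."""
    path = argv[1] if len(argv) > 1 else STAMP_FILE
    code, message = check_freshness(path, WARN_SECONDS, CRIT_SECONDS)
    print(message)
    return code
```

Installed as a plugin, the script would finish with `sys.exit(main(sys.argv))` (after `import sys`) so Nagios receives the exit code. Because only a successful run updates the stamp, the same check also fires when the client is stuck in retries or not running at all: in every one of those cases the stamp simply stops getting newer.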

Rgds,

mgh

KC_Braunschweig · June 30, 2011, 1:11am

It'd be awesome if people shared their custom report/exception
handlers for tying into common monitoring/alerting tools like Nagios,
Zenoss, etc. This is something I haven't looked at yet but will need to
get sorted before we go to production. However, I also intend to get all
Chef logging flowing through syslog and into Splunk so we can do
additional processing and alerting based on parsing the logs. Adam
has already published a handler for Splunk to help make this easier, though
I don't know whether much has been done with it beyond the basic stubbing
out. I think there was also a stub for a Splunk app designed for Chef
logs. I haven't played with either yet, but both are on my short-term
roadmap.

KC

Ranjib_Dey · June 30, 2011, 5:44am

We use Nagios as our alerting/monitoring solution, and I have written an
NSCA-based Chef report handler that submits the chef-client run status via
the send_nsca command to an NSCA server (which in turn submits a passive check
result to the nagios.cmd pipe). It assumes the NSCA client is installed on the
Chef client nodes and the NSCA daemon is running (and configured) on the Nagios box.
You can get the script here: https://github.com/ranjibd/nsca_handler
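The handler itself is Ruby code in the repository above, and what follows is not that code. It is a Python sketch of the data such a handler hands to send_nsca: one tab-separated line per passive service result (host, service description, return code, plugin output) on stdin. The service name "chef-client", the config path /etc/send_nsca.cfg, and the helper names are assumptions; the service must also be defined as a passive check on the Nagios side.

```python
"""Sketch: the passive-check line a chef-client handler could feed to send_nsca.

Illustrative only; this is not the Ruby handler from the linked repository.
"""
import subprocess

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_status_to_result(success, elapsed_seconds, exception=None):
    """Map a chef-client run outcome to (return_code, plugin_output)."""
    if success:
        return OK, f"OK - chef-client run completed in {elapsed_seconds:.1f}s"
    # Collapse newlines/tabs so the message stays on one send_nsca line.
    detail = " ".join(str(exception).split())
    return CRITICAL, f"CRITICAL - chef-client run failed: {detail}"


def nsca_service_line(host, service, return_code, output):
    """Format one service result in send_nsca's default tab-delimited input."""
    for field in (host, service, output):
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def submit(line, nagios_host, config="/etc/send_nsca.cfg"):
    """Pipe the line to send_nsca; assumes it is installed on the node."""
    subprocess.run(
        ["send_nsca", "-H", nagios_host, "-c", config],
        input=line, text=True, check=True,
    )
```

For example, `submit(nsca_service_line("web01", "chef-client", *run_status_to_result(True, 42.0)), "nagios.example.com")` would report one successful run (both hostnames are placeholders). A passive result only arrives when a run finishes, so a client that has crashed sends nothing; Nagios's freshness checking on the passive service (`check_freshness` / `freshness_threshold`) is one way to cover that gap.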

You can also use NRPE-based active checks. I guess that will introduce
some delay, but it also helps you detect when the client crashes (which
I'm experiencing with Ruby 1.9.2).
regards
ranjib

KC_Braunschweig · June 30, 2011, 8:23am

Thanks!

Do you have a bug open for the Ruby 1.9.2 crashes? We were looking at
moving to 1.9.2 in the hope of resolving some intermittent segfaults with
1.8.7. Maybe we should consider a Ruby Enterprise Edition 1.8 build instead?

KC

Rob_Guttman · July 1, 2011, 2:46pm

I monitor the chef-client log via Nagios NRPE (an active check) for two
purposes: (1) checking the log's last-modified time to know whether the service
is running correctly, and (2) reporting multi-line log errors. The problem
with just monitoring chef-client (or any process) via check_procs
(or something comparable) is that the process could be running but wedged.
Ensuring that the process keeps writing to its log, while not
perfect, goes a bit further toward making sure it is not wedged.

Getting NRPE to report multiple lines takes a bit of work, as that's not
standard, but it is possible (whereas AFAIK it's not possible at all via
NSCA passive checks). I can dig up my notes on how I did this if people are
interested.
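A rough sketch of the two checks described above, written as a Python NRPE-style plugin: the log's modification time stands in for "is chef-client alive", and ERROR/FATAL entries are grouped with their continuation lines so a multi-line backtrace is reported as one unit. The 60-minute threshold, the helper names, and the assumption that each log entry begins with a bracketed timestamp are illustrative; this is not the actual check from the post.

```python
"""Sketch of an NRPE-style check on the chef-client log.

Assumes (hypothetically) that each log entry starts with "[" plus a
timestamp, and that following lines without "[" belong to that entry.
"""
import os
import time
from collections import deque

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def error_blocks(lines):
    """Group each ERROR/FATAL entry with its continuation lines."""
    blocks, current = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("["):
            if current:
                blocks.append(current)
            is_error = " ERROR: " in line or " FATAL: " in line
            current = [line] if is_error else None
        elif current is not None:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def check_chef_log(path, max_age_seconds, now=None, tail_lines=500):
    """Return (exit_code, output_lines); the first line is the summary."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path, errors="replace") as fh:
            lines = list(deque(fh, maxlen=tail_lines))  # last N lines only
    except OSError as exc:
        return UNKNOWN, [f"UNKNOWN - cannot read {path}: {exc.strerror}"]
    minutes = int(age // 60)
    # A stale log suggests chef-client is dead or wedged.
    if age > max_age_seconds:
        return CRITICAL, [f"CRITICAL - {path} not written for {minutes} min"]
    blocks = error_blocks(lines)
    if blocks:
        detail = [line for block in blocks[-3:] for line in block]
        summary = f"WARNING - {len(blocks)} ERROR/FATAL entries in last {len(lines)} lines"
        return WARNING, [summary] + detail
    return OK, [f"OK - {path} written {minutes} min ago, no errors"]


def main(argv):
    """Print the summary line, then the detail lines; return the exit code."""
    path = argv[1] if len(argv) > 1 else "/var/log/chef/client.log"
    code, output = check_chef_log(path, max_age_seconds=60 * 60)
    print("\n".join(output))
    return code
```

Printing the summary first and the detail lines after it follows the Nagios 3 long-output convention. Whether those extra lines survive the trip through NRPE depends on the NRPE version and its output limits, which is presumably the extra work mentioned above; the post doesn't say how it was solved. As with any plugin, the wrapper would end with `sys.exit(main(sys.argv))`.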

Rob

Joshua_Timberman · July 10, 2011, 7:26am

Hello,

We encourage people to edit the exception and report handler page for
custom handlers they've written.

http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers#ExceptionandReportHandlers-ExistingHandlers

"Other existing handlers"

--
Opscode, Inc
Joshua Timberman, Director of Training and Services
IRC, Skype, Twitter, Github: jtimberman
