How to tell when recipes fail?


#1

On the Chef web_ui “Status” page, I can see when each node last checked in with the server, but there does not appear to be any indication of failures during recipe runs!

What’s the recommended way of detecting and reporting on failures?

When chef-client uploads node state to the server, at the end of a run, does it persist any information about the success/failure of the run_list?

Should I be registering an exception/report handler? And if so, has anyone written a handler that persists the run-reports to a central DB of some kind?


cheers,
Mike Williams


#2

Ohai!

On Sun, Nov 7, 2010 at 4:42 PM, Mike Williams
mike@cogentconsulting.com.au wrote:

Should I be registering an exception/report handler? And if so, has anyone written a handler that persists the run-reports to a central DB of some kind?

Yep, you want a report/exception handler. I started the wiki page for
this feature earlier this week:
http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers

Also, I’ve been working on a handler paired with a simple Sinatra app
that stores run history in Redis. It’s still pretty early going, but
you can find the source here:
https://github.com/danielsdeleo/nom_nom_nom
Just a word of warning: this hasn’t been run in a production scenario
anywhere, ever. I’ve been meaning to set it up for all of our
non-production environments at Opscode but I haven’t had the time.
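For anyone wiring this up, a handler is essentially a Ruby class with a report method. A minimal sketch (not Dan's code; in a real setup the class would subclass Chef::Handler and be registered in client.rb as described on the wiki page above, so the run status is stubbed as a plain hash here to keep the example self-contained):

```ruby
# Minimal sketch of a Chef report/exception handler (illustrative, not
# Dan's code). In a real setup this class would subclass Chef::Handler,
# live in a file referenced from client.rb, and be registered via the
# report_handlers / exception_handlers config arrays; run_status is
# stubbed as a plain hash here so the sketch runs on its own.
require 'json'

class RunReportHandler
  # run_status stands in for Chef::RunStatus (node name, success flag,
  # elapsed time, and any exception from the run).
  def report(run_status)
    payload = {
      'node'      => run_status[:node_name],
      'success'   => run_status[:success],
      'elapsed'   => run_status[:elapsed_time],
      'exception' => run_status[:exception]
    }
    # A real handler would POST payload.to_json to a central DB here.
    payload.to_json
  end
end

handler = RunReportHandler.new
report = handler.report(node_name: 'web01', success: false,
                        elapsed_time: 42.0,
                        exception: 'RuntimeError: boom')
puts report
```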

HTH,
Dan DeLeo




#3

On 08/11/2010, at 15:49 , Daniel DeLeo wrote:

Yep, you want a report/exception handler. I started the wiki page for
this feature earlier this week:
http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers

Also, I’ve been working on a handler paired with a simple sinatra app
that stores run history in redis. It’s still pretty early going, but
you can find the source here:
https://github.com/danielsdeleo/nom_nom_nom

Nice one, thanks Daniel.

Out of interest, why Redis, as opposed to CouchDB? I was thinking that it might make sense to store the result of the last Chef run in the existing server data-store, along with everything else, so that (for example) you could target failed nodes using “knife”. Are you intentionally trying to keep “node status” and “run history” separate?


cheers,
Mike Williams


#4

On Sun, Nov 7, 2010 at 9:25 PM, Mike Williams
mike@cogentconsulting.com.au wrote:

On 08/11/2010, at 15:49 , Daniel DeLeo wrote:

Also, I’ve been working on a handler paired with a simple sinatra app
that stores run history in redis. It’s still pretty early going, but
you can find the source here:
https://github.com/danielsdeleo/nom_nom_nom

Nice one, thanks Daniel.

Out of interest, why Redis, as opposed to CouchDB? I was thinking that it might make sense to store the result of the last Chef run in the existing server data-store, along with everything else, so that (for example) you could target failed nodes using “knife”. Are you intentionally trying to keep “node status” and “run history” separate?

I wrote it on the side as a way to get some exposure to technology I
find interesting but don’t use day-to-day. My plan is to get some
experience using/operating what I have so far and then re-evaluate the
technology decisions.

The app as it is allows you to fetch the list of only successful or
failed nodes (it’s part of how the data is modeled in redis) and the
UI shows failed/success in the node list, so I’m not trying to keep
them separate. But I did want to take the opportunity to design the UI
(and the whole app, really) from first principles instead of trying to
shoehorn it into what already exists so I can have a fresh perspective
on the problem.

Cheers,

Dan DeLeo




#5

I’ve taken a different approach to monitoring failed recipes. I use a
Nagios passive check to monitor each chef-client’s log file using LMF
(similar to swatch) for both failed and successful runs. By adding a
freshness threshold, it also doubles as a check that each chef-client is
still running (i.e., the check goes critical if it hasn’t received a passive
check update within the chef-client interval + splay + some buffer). Works
well so far.
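For reference, submitting a passive result comes down to writing one line to Nagios’s external command file. A rough sketch (the host, service name, and command-file path are placeholders, not Rob’s actual config):

```ruby
# Sketch of submitting a Nagios passive check result (illustrative; the
# host, service name, and command-file path are placeholders). The
# external command format is:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output

def passive_check_line(host, service, return_code, output, now = Time.now)
  "[#{now.to_i}] PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};#{return_code};#{output}"
end

line = passive_check_line('web01', 'chef-client', 2,
                          'CRITICAL: chef-client run failed',
                          Time.at(1_289_200_000))
puts line

# A real submitter would append the line to the command file, e.g.:
# File.open('/var/lib/nagios3/rw/nagios.cmd', 'a') { |f| f.puts(line) }
```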

  • Rob



#6

I run chef-client from cron and check the exit status with $?
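A small wrapper along those lines might look like this (a sketch; the notification step is a placeholder):

```ruby
# Sketch of a cron wrapper that runs chef-client and checks the exit
# status via $? (illustrative; the notification step is a placeholder).

def run_and_check(command)
  `#{command}`                 # run the command, discarding stdout
  status = $?.exitstatus
  unless status.zero?
    # Swap in mail, a pager, or a Nagios passive check here.
    warn "#{command} failed with exit status #{status}"
  end
  status
end

# A harmless stand-in so the sketch runs as-is; from cron you would
# call run_and_check('chef-client') instead.
puts run_and_check('true')   # prints 0
puts run_and_check('false')  # prints 1 (and warns on stderr)
```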



#7

Hello!

FWIW, someone mentioned on IRC that they were planning to open-source a
Hoptoad exception handler that works with recent versions of Chef.

I will definitely use that.

– Thibaut


#8

I have a script that runs in cron, checking the last time each client checked in.

I found that node['ohai_time'] records when a client last checked in. So if Time.now - node['ohai_time'] > threshold, notify.

This is the code that runs via cron on my servers:

Please feel free to use it if you’d like, and feedback is welcome.
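The core of such a check is only a few lines. A sketch of the comparison described above (the two-hour threshold and helper names are assumptions, not Paul's actual script, and a real script would fetch each node's ohai_time via the Chef API or knife):

```ruby
# Sketch of the ohai_time staleness check described above (illustrative;
# the threshold is an assumption, and a real script would fetch
# node['ohai_time'] for each node via the Chef API or knife).
THRESHOLD = 2 * 60 * 60 # two hours, in seconds

def stale?(ohai_time, now = Time.now, threshold = THRESHOLD)
  now.to_f - ohai_time > threshold
end

now = Time.at(1_289_300_000)
fresh_node = now.to_f - 600          # checked in 10 minutes ago
stale_node = now.to_f - 3 * 60 * 60  # checked in 3 hours ago

puts stale?(fresh_node, now)  # prints false
puts stale?(stale_node, now)  # prints true
# A real script would then notify (mail, Nagios, etc.) for stale nodes.
```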

-Paul



#9

On 09/11/2010, at 09:26 , Paul Choi wrote:

I have a script that runs in cron, checking for the last time that a client has checked in.

That’s definitely useful, but (correct me if I’m wrong) only ensures that the chef-client daemon is still running … not that the recipes actually apply without error.


cheers,
Mike Williams


#10

Ah, you are right, sir… Monday brain fart on my part. :) What I have been meaning to do is write a reporter cookbook, run at the end of the run_list, that does an HTTP POST to Chef’s data bags so the status can be parsed later.

Thanks for the feedback.

-Paul



#11

Same here

On Mon, Nov 8, 2010 at 7:01 AM, Rob Guttman robguttman@gmail.com wrote:

I’ve taken a different approach to monitoring failed recipes. I use a
nagios passive check to monitor each chef-client’s log file using LMF
(similar to swatch) for both failed and successful runs. By adding a
freshness threshold, it also doubles as a check that each chef-client is
still running (i.e., the check goes critical if it hasn’t received a passive
check update within the chef client interval + splay + some buffer). Works
well so far.

  • Rob
