How to tell when recipes fail?


#1

On the Chef web_ui “Status” page, I can see when each node last checked in with the server, but there does not appear to be any indication of failures during recipe runs!

What’s the recommended way of detecting and reporting on failures?

When chef-client uploads node state to the server, at the end of a run, does it persist any information about the success/failure of the run_list?

Should I be registering an exception/report handler? And if so, has anyone written a handler that persists the run-reports to a central DB of some kind?


cheers,
Mike Williams


#2

Ohai!

On Sun, Nov 7, 2010 at 4:42 PM, Mike Williams
mike@cogentconsulting.com.au wrote:

Should I be registering an exception/report handler? And if so, has anyone written a handler that persists the run-reports to a central DB of some kind?

Yep, you want a report/exception handler. I started the wiki page for
this feature earlier this week:
http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers

Also, I’ve been working on a handler paired with a simple Sinatra app
that stores run history in Redis. It’s still pretty early going, but
you can find the source here:
https://github.com/danielsdeleo/nom_nom_nom
Just a word of warning: this hasn’t been run in a production scenario
anywhere, ever. I’ve been meaning to set it up for all of our
non-production environments at Opscode but I haven’t had the time.
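For anyone wiring this up, a handler is essentially a Ruby class with a report method. A minimal sketch (not Dan's code; in a real setup the class would subclass Chef::Handler and be registered in client.rb as described on the wiki page above, so the run status is stubbed as a plain hash here to keep the example self-contained):

```ruby
# Minimal sketch of a Chef report/exception handler (illustrative, not
# Dan's code). In a real setup this class would subclass Chef::Handler,
# live in a file referenced from client.rb, and be registered via the
# report_handlers / exception_handlers config arrays; run_status is
# stubbed as a plain hash here so the sketch runs on its own.
require 'json'

class RunReportHandler
  # run_status stands in for Chef::RunStatus (node name, success flag,
  # elapsed time, and any exception from the run).
  def report(run_status)
    payload = {
      'node'      => run_status[:node_name],
      'success'   => run_status[:success],
      'elapsed'   => run_status[:elapsed_time],
      'exception' => run_status[:exception]
    }
    # A real handler would POST payload.to_json to a central DB here.
    payload.to_json
  end
end

handler = RunReportHandler.new
report = handler.report(node_name: 'web01', success: false,
                        elapsed_time: 42.0,
                        exception: 'RuntimeError: boom')
puts report
```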

HTH,
Dan DeLeo




#3

On 08/11/2010, at 15:49 , Daniel DeLeo wrote:

Yep, you want a report/exception handler. I started the wiki page for
this feature earlier this week:
http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers

Also, I’ve been working on a handler paired with a simple sinatra app
that stores run history in redis. It’s still pretty early going, but
you can find the source here:
https://github.com/danielsdeleo/nom_nom_nom

Nice one, thanks Daniel.

Out of interest, why Redis, as opposed to CouchDB? I was thinking that it might make sense to store the result of the last Chef run in the existing server data-store, along with everything else, so that (for example) you could target failed nodes using “knife”. Are you intentionally trying to keep “node status” and “run history” separate?


cheers,
Mike Williams


#4

On Sun, Nov 7, 2010 at 9:25 PM, Mike Williams
mike@cogentconsulting.com.au wrote:

On 08/11/2010, at 15:49 , Daniel DeLeo wrote:

Also, I’ve been working on a handler paired with a simple sinatra app
that stores run history in redis. It’s still pretty early going, but
you can find the source here:
https://github.com/danielsdeleo/nom_nom_nom

Nice one, thanks Daniel.

Out of interest, why Redis, as opposed to CouchDB? I was thinking that it might make sense to store the result of the last Chef run in the existing server data-store, along with everything else, so that (for example) you could target failed nodes using “knife”. Are you intentionally trying to keep “node status” and “run history” separate?

I wrote it on the side as a way to get some exposure to technology I
find interesting but don’t use day-to-day. My plan is to get some
experience using/operating what I have so far and then re-evaluate the
technology decisions.

The app as it is allows you to fetch the list of only successful or
failed nodes (it’s part of how the data is modeled in redis) and the
UI shows failed/success in the node list, so I’m not trying to keep
them separate. But I did want to take the opportunity to design the UI
(and the whole app, really) from first principles instead of trying to
shoehorn it into what already exists so I can have a fresh perspective
on the problem.

Cheers,

Dan DeLeo




#5

I’ve taken a different approach to monitoring failed recipes. I use a
Nagios passive check to monitor each chef-client’s log file using LMF
(similar to swatch) for both failed and successful runs. By adding a
freshness threshold, it also doubles as a check that each chef-client is
still running (i.e., the check goes critical if it hasn’t received a passive
check update within the chef-client interval + splay + some buffer). Works
well so far.
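For reference, submitting a passive result comes down to writing one line to Nagios’s external command file. A rough sketch (the host, service name, and command-file path are placeholders, not Rob’s actual config):

```ruby
# Sketch of submitting a Nagios passive check result (illustrative; the
# host, service name, and command-file path are placeholders). The
# external command format is:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output

def passive_check_line(host, service, return_code, output, now = Time.now)
  "[#{now.to_i}] PROCESS_SERVICE_CHECK_RESULT;#{host};#{service};#{return_code};#{output}"
end

line = passive_check_line('web01', 'chef-client', 2,
                          'CRITICAL: chef-client run failed',
                          Time.at(1_289_200_000))
puts line

# A real submitter would append the line to the command file, e.g.:
# File.open('/var/lib/nagios3/rw/nagios.cmd', 'a') { |f| f.puts(line) }
```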

  • Rob



#6

I run chef-client from cron and check the exit status with $?
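A small wrapper along those lines might look like this (a sketch; the notification step is a placeholder):

```ruby
# Sketch of a cron wrapper that runs chef-client and checks the exit
# status via $? (illustrative; the notification step is a placeholder).

def run_and_check(command)
  `#{command}`                 # run the command, discarding stdout
  status = $?.exitstatus
  unless status.zero?
    # Swap in mail, a pager, or a Nagios passive check here.
    warn "#{command} failed with exit status #{status}"
  end
  status
end

# A harmless stand-in so the sketch runs as-is; from cron you would
# call run_and_check('chef-client') instead.
puts run_and_check('true')   # prints 0
puts run_and_check('false')  # prints 1 (and warns on stderr)
```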



#7

Hello!

FWIW, someone mentioned on IRC that they were planning to open-source a
Hoptoad exception handler that works with recent versions of Chef.

I will definitely use that.

– Thibaut


#8

I have a script that runs in cron, checking the last time each client checked in.

I found that node['ohai_time'] records when a client last checked in. So if Time.now - node['ohai_time'] > threshold, notify.

This is the code that runs via cron on my servers:

Please feel free to use it if you’d like, and feedback is welcome.
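The core of such a check is only a few lines. A sketch of the comparison described above (the two-hour threshold and helper names are assumptions, not Paul's actual script, and a real script would fetch each node's ohai_time via the Chef API or knife):

```ruby
# Sketch of the ohai_time staleness check described above (illustrative;
# the threshold is an assumption, and a real script would fetch
# node['ohai_time'] for each node via the Chef API or knife).
THRESHOLD = 2 * 60 * 60 # two hours, in seconds

def stale?(ohai_time, now = Time.now, threshold = THRESHOLD)
  now.to_f - ohai_time > threshold
end

now = Time.at(1_289_300_000)
fresh_node = now.to_f - 600          # checked in 10 minutes ago
stale_node = now.to_f - 3 * 60 * 60  # checked in 3 hours ago

puts stale?(fresh_node, now)  # prints false
puts stale?(stale_node, now)  # prints true
# A real script would then notify (mail, Nagios, etc.) for stale nodes.
```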

-Paul



#9

On 09/11/2010, at 09:26 , Paul Choi wrote:

I have a script that runs in cron, checking for the last time that a client has checked in.

That’s definitely useful, but (correct me if I’m wrong) only ensures that the chef-client daemon is still running … not that the recipes actually apply without error.


cheers,
Mike Williams


#10

Ah, you are right, sir… Monday brain fart on my part. :) What I have been meaning to do is write a reporter cookbook, run at the end of the run_list, that does an HTTP POST to Chef’s data bags so the status can be parsed later.

Thanks for the feedback.

-Paul



#11

Same here

On Mon, Nov 8, 2010 at 7:01 AM, Rob Guttman robguttman@gmail.com wrote:

I’ve taken a different approach to monitoring failed recipes. I use a
nagios passive check to monitor each chef-client’s log file using LMF
(similar to swatch) for both failed and successful runs. By adding a
freshness threshold, it also doubles as a check that each chef-client is
still running (i.e., the check goes critical if it hasn’t received a passive
check update within the chef client interval + splay + some buffer). Works
well so far.

  • Rob
