Monitoring chef runs


#1

Hi all,

We’re fairly new to chef and have been manually executing chef runs on one or many nodes. We just made the move to run it as a service via the chef-client cookbook.

What’s the quickest way to make sure that we’re getting notified when there are problems during a chef run? What’s the best way?

Thanks,
Paul


#2

Check out http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers :slight_smile:

–Noah



#3

I’m not sure if it’s the quickest, but we use Chef handlers:
http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers
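For reference, a handler boils down to a class with a `report` method that inspects the run status. A standalone sketch of the idea (in a real handler you'd subclass `Chef::Handler`, which provides the real `run_status`; the hash below just stands in for it):

```ruby
require 'json'

# Minimal standalone sketch of what an exception handler does: inspect
# the run status and build a notification when the run failed. A real
# handler would subclass Chef::Handler and read its run_status object
# instead of this plain hash.
class FailureNotifier
  def report(run_status)
    return unless run_status[:failed]

    # Deliver however you like: email, HTTP POST to a dashboard, etc.
    # Here we just return the JSON payload that would be sent.
    JSON.generate(
      node:      run_status[:node_name],
      elapsed:   run_status[:elapsed_time],
      exception: run_status[:exception].to_s
    )
  end
end
```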



#4

+1 for handlers. We have an email exception handler. The handler is applied
in our base role (early on) in QA and Prod environments, but disabled in
other environments so we don’t spam ourselves while developing and testing
updates. We are still pretty early on in our implementation, but so far
this has been working for us.
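The gating described above is just a predicate on the node's environment. A minimal sketch (the environment names are illustrative, not from the post):

```ruby
# Enable alerting only in environments where pages are wanted, so dev
# and test runs don't spam anyone. Environment names are illustrative.
ALERTING_ENVIRONMENTS = %w[qa prod].freeze

def alerting_enabled?(chef_environment)
  ALERTING_ENVIRONMENTS.include?(chef_environment)
end
```

In a recipe this check might wrap the handler registration, e.g. as an `only_if` guard on the `chef_handler` resource from the chef_handler cookbook.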



#5

Hello,

You can use an exception handler to get notified of runtime exceptions in your recipes.

To monitor the chef-client process itself, you can use god, a process monitor written in Ruby.

By the way, running chef-client as a daemon is not a good idea in some cases. Be careful with that.

Regards,

Tetsu



#6

Tetsu, can you elaborate on the concerns you’ve got for running chef-client
as a daemon?


#7

Hi,

Well, it really depends on how you manage the operations carried out by chef.

When you run chef as a daemon, chef applies your changes on its next run. The longer the interval between two converges, the more changes may accumulate, and the more changes you apply at once, the more risk you take on. For example, if one recipe fails, none of the recipes after it will run, so you need to figure out which one failed, fix it, and run everything again.

IMO, running chef-client on demand is a better solution.

Regards,

Tetsu



#8

Thanks everyone, it sounds like handlers are the way to go.



#9

There are a number of different considerations here, and it’s worth teasing them all apart:

  1. Do you want to know about impending change, possibly without acting? If so, check out “why run” in the latest builds: http://wiki.opscode.com/display/chef/Whyrun+Testing
  2. Do you want to reduce the amount of change in a time period, thinking that’s less likely to result in errors stacking up? (NOTE: that’s not a provable claim, but an intuitive notion). If so, either run chef-client from cron or as a daemon.
  3. Do you want to directly control how and when change occurs? If so, run chef-client manually.
  4. Do you want to directly control how and when Ruby consumes memory? If so, run chef-client manually. Running any Ruby process as a daemon may result in a fair amount of memory being consumed / committed all the time.

After you’ve chosen your strategy for running chef-client, report / exception handlers are there to tell you exactly what happened: http://wiki.opscode.com/display/chef/Exception+and+Report+Handlers
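For completeness, registering handlers is a couple of lines of client.rb. A sketch (the handler file path and class name here are hypothetical):

```ruby
# /etc/chef/client.rb fragment (sketch; path and class name assumed):
# load the handler code, then register it for both outcomes.
require '/var/chef/handlers/email_notify'

report_handlers    << EmailNotify.new  # invoked after successful runs
exception_handlers << EmailNotify.new  # invoked after failed runs
```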

-C




#10

One more thing needs your attention: exception handlers can only handle runtime exceptions, not compile-time exceptions.

If your recipe breaks at compile time, chef-client will stop without running any exception handler.
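The two phases can be pictured in plain Ruby: recipe files are evaluated first (compile), collecting actions that run afterwards (converge). This toy model (not Chef itself) shows why a compile failure can mean handlers never see the run:

```ruby
# Toy model of chef-client's two-phase run. Top-level recipe code runs
# at compile time; resource actions run later, at converge time.
compiled  = []
converged = []

recipes = [
  -> { compiled << :recipe_a },
  -> { raise 'syntax-ish error at compile time' },
  -> { compiled << :recipe_b },                    # never evaluated
]
actions = [-> { converged << :restart_service }]   # never reached

error = nil
begin
  recipes.each(&:call)  # phase 1: compile every recipe
  actions.each(&:call)  # phase 2: converge
rescue => e
  error = e.message     # a compile failure skips converge entirely
end
```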



#11

Yikes. Will this at least get captured in the logs?



#12

Yes, it will be written to the log.



#13

IMO, if you run it on at least a 24h basis then you know that your
systems are converged. This is good in Enterprise Situations™ since
change management quickly becomes somewhat brutal and you’ll often only
be able to get approval to push changes and manually run chef a few
times a month. That can build up to a really substantial amount of lint
that you need to worry about between changes staged to go out that you
weren’t aware were uploaded and changes made manually to servers by SAs
getting frisky with root typey-typey. This becomes a political issue
since it means that every time you deploy something with chef you have
to worry about things blowing up and it being your fault, not the fault
of the SA who made with the typey-typey (it’s easy to correctly throw
the typist under the bus the first time, but by the third time this
happens, people in the CM meetings who have zero clue about systems
management are going to start wondering what is wrong with your
deployment process with chef). With more frequent convergences you
remediate issues with root typists much more rapidly and it becomes
their fault much more clearly since changes were made outside of the CM
process and CM windows which affected production. You also have
substantially more confidence that the approved changes you are pushing
on any given day are the only changes staged to go out with your
change. Since chef ran last night and converged the entire
infrastructure, it is only going to be the changes you are releasing to
your production environment which will go out.

You can try to address all this after-the-fact with whyrun, and that
will be a useful weapon to wield in CM meetings, but at some point
you’re going to be generating cruft and doing a ton of analysis on your
whyrun output in prep for the CM process, and it’ll get large enough and
complex enough that you’ll still miss important changes buried in an
avalanche of whyrun.

And if RAM is a concern (and I think it probably is), then I’d suggest
running chef-client one-shot periodically from cron.

I also like to run periodic convergences once every 24 hours with 12
hours of splay from 8pm to 8am since then there’s more of a chance that
you’ll catch a really bad error caused by periodic convergences as it
starts to turn individual servers’ lights red rather than pushing out
some accidental “chmod 600 /etc/resolv.conf” code in a 5 minute window
to every server you have. It also lets you stage changes during the day
at work, and manually poke systems during the day, allowing you time to
analyze your changes and abort if you find something you don’t like,
then you can let periodic convergence overnight handle making sure that
all the servers are in lockstep. YMMV.
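If you do stick with the daemon rather than cron one-shots, chef-client's own scheduling knobs in client.rb express the same interval-plus-splay idea:

```ruby
# client.rb sketch: converge once per day with up to 12 hours of random
# splay so the whole fleet doesn't converge in the same minute.
interval 86_400  # 24h between runs, in seconds
splay    43_200  # random 0-12h offset added to each interval
```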

(and if you think that everyone involved in your change management
process and your IT management would be smart enough to correctly blame
SAs who make with the typey-typey and implement strong rules against
that, then you have been extremely lucky in your exposure to
Enterprise-Class IT management…)

Anyway, things are not as clear-cut as they seem and what works for the
startup with 20 servers, or the tech-heavy enterprise with a lot of
smart people, does not necessarily apply across the board… I can also
see where SAs in the financial services sector may not do this at all,
and all change may be controlled, nobody allowed to login to servers
without forms filled out in triplicate, no config pushed to a chef
server without approval, all convergences submitted to change management
with associated whyrun output and justification, etc… Or, you know,
just quit and find yourself a startup with a handful of employees and
cowboy it all up… =)


#14

It’s pretty simple to wrap chef runs in a “deploy” flag so that you can kick off certain tasks like code deploys only when you want to. That way your base system configs are constantly updated, but your application remains the same.

http://www.therealtimsmith.com/home/2012/06/only-deploying-when-you-want-to-with-chef-aka-dont-break-prod/
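The linked post's approach can be sketched as a guarded resource; the attribute name and repository URL here are assumptions, not taken from the post:

```ruby
# Recipe sketch: base configuration above this point converges on every
# run, but the application deploy only fires when the flag attribute
# has been set deliberately (e.g. via -j on the command line).
deploy '/srv/myapp' do
  repo 'git@example.com:myapp.git'
  action :deploy
  only_if { node['myapp']['perform_deploy'] }
end
```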

Tim Smith

Operations Engineer, SaaS Operations




#15

On 9/6/12 7:52 AM, Tetsu Soh wrote:
well, it really depends on how you manage operations carried out by chef.

In the end you have to sit down and come up with a workflow and risk
model that fits your business and they’re all somewhat different. A
few other things to consider besides the info above:

  • Just because you run chef all the time across your whole
    infrastructure doesn’t mean you have to push changes to your whole
    infrastructure all the time. Consider designing your change control
    process around the environments and/or chef servers you’re pushing
    to. In the past I’ve seen things like: dev/qa always gets new
    cookbooks all the time, prod used a separate server/organization so a
    separate push was required to migrate into prod which was done in a
    scheduled window but those windows were easy to come by as long as
    your change was targeted at a specific (usually out of service)
    cluster (modeled with environments). It was a somewhat complex flow
    but it was clear, let people move fast and kept chef running
    everywhere every 30 minutes. This is good because chef could still
    remediate any out-of-band changes it finds very quickly while new
    changes go out in a slower and more controlled way.

  • Another reason to keep chef running all the time is that you’re less
    likely to fight with it. If it takes minimum 24 hours to push a change
    with chef (example above) and you need to do something faster (OMG
    prod is on fire!), you’re not going to use chef, you’re going to do
    something else (ssh for loop?). Maybe you’ll remember to backport your
    change into chef. Maybe when you backport it into chef it’ll actually
    come out compatible rather than just slightly not the same. Maybe not,
    and every time that fails everyone who doesn’t know better blames chef
    for undoing their productive work. Now they’re working against the
    tool instead of the tool making their lives better.

KC


#16

On 9/6/12 1:20 PM, KC Braunschweig wrote:

  • Another reason to keep chef running all the time is that you’re less
    likely to fight with it. If it takes minimum 24 hours to push a change
    with chef (example above) and you need to do something faster (OMG
    prod is on fire!), you’re not going to use chef, you’re going to do
    something else (ssh for loop?). Maybe you’ll remember to backport your
    change into chef. Maybe when you backport it into chef it’ll actually
    come out compatible rather than just slightly not the same. Maybe not,
    and every time that fails everyone who doesn’t know better blames chef
    for undoing their productive work. Now they’re working against the
    tool instead of the tool making their lives better.

Or just:

knife ssh '*:*' 'sudo chef-client'

To force a push in an emergency, or if you have changes that must be
rolled out in a given window and can’t be done lazily. I used lazy
overnight config rollouts at one job, but at others I’ve had to fit all
of the change within a given window; so while I’ve still used 24h
cronjobs for periodic convergence, I’ve used knife ssh to kick off
changes across the whole fleet at once.


#17


Most of the options Christopher listed can be further automated using something like MCollective. We’re looking into using it to add some “push” capabilities for on-demand changes.

Andrea


#18

On Thu, Sep 6, 2012 at 8:26 AM, Tetsu Soh tetsu.soh@gmail.com wrote:

One more thing needs your attention: exception handlers can only handle
runtime exceptions, not compile-time exceptions.

If your recipe breaks at compile time, chef-client will stop without
running any exception handler.

Would people consider this a bug? I (unfortunately) do a lot at
compile-time so dying there is fairly likely and I’d really like to
know about that too. Anyone have a good workaround? Would it be
appropriate to always run exception handlers? Or should there be an
alternate mechanism for compile-time failures?

A couple related points:

  • I noticed (probably for the same reason) that when runs die early
    you don’t get the run duration, which I’d still like to have.
  • I’ve run into this scenario a lot:
    1. Run updates resource X
    2. Resource X notifies a restart of service Xd (delayed)
    3. An exception happens on resource Y and the chef run dies before
      delayed notifications are processed. At this point service Xd hasn’t
      been restarted which is probably good because that might cause a bad
      state since the run failed.
    4. I come along to troubleshoot, fix things and run chef again
    5. chef run (including resource Y) now completes successfully but
      the restart of service Xd never happened. Now I have a system that
      seems like it should be in a good state but it might not be because
      service Xd still has old config in memory and I have no way to know
      that and chef will never fix it.

Anyone have a good way to prevent or address the above?
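One general mitigation for the compile-time-failure part (a common pattern, not something proposed in this thread) is to defer risky work to converge time with `ruby_block` or `lazy {}`, where exception handlers do fire. The underlying mechanism is just eager versus deferred evaluation:

```ruby
# Eager vs. deferred evaluation -- the mechanism behind ruby_block and
# lazy {}. The deferred version never touches the risky code until it
# is actually called, i.e. at converge time.
def risky_lookup
  raise 'backend unavailable'
end

eager_failed =
  begin
    risky_lookup  # compile-time style: raises while the file is loaded
    false
  rescue RuntimeError
    true
  end

deferred = -> { risky_lookup } # converge-time style: built, not yet run
```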

Thanks,

KC


#19

On Friday, September 7, 2012 at 3:06 PM, KC Braunschweig wrote:

There’s a patch in master that makes delayed notifications always run. This is a pretty significant behavior change, so we’re waiting until Chef 11 to ship it.


Daniel DeLeo


#20

On Fri, Sep 7, 2012 at 3:11 PM, Daniel DeLeo dan@kallistec.com wrote:

There’s a patch where delayed notifications are always run in master. This
is a pretty significant behavior change so we’re waiting until Chef 11 to
ship it.

Interesting. I suspect that would be good much of the time, and more
obvious, but sorta goes against the normal chef behavior that if
something bad happens we bail out immediately. Thanks,

KC