It’s pretty simple to wrap chef runs in a “deploy” flag so that you can kick off certain tasks like code deploys only when you want to. That way your base system configs are constantly updated, but your application remains the same.
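A hedged sketch of what such a "deploy" flag could look like (the attribute and application names are illustrative, not from the original post): the base recipes always converge, while the deploy resource is gated behind a node attribute.

```ruby
# Illustrative recipe fragment: the code deploy only fires when a
# node attribute flips it on; everything else converges every run.
deploy_revision '/srv/myapp' do
  repo 'git://github.com/example/myapp.git'
  revision 'production'
  action :deploy
  only_if { node['myapp'] && node['myapp']['deploy'] }
end
```

You could then set that attribute only on the runs where you actually want a deploy (e.g. via a `-j attrs.json` passed to chef-client), and the routine converges leave the application alone.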
Operations Engineer, SaaS Operations
From: Lamont Granquist <email@example.com>
Reply-To: <email@example.com>
Date: Thursday, September 6, 2012 11:14 AM
To: <email@example.com>
Subject: [chef] Re: Re: Re: Re: Monitoring chef runs
IMO, if you run it on at least a 24h basis then you know that your systems are converged. This is good in Enterprise Situations™, since change management quickly becomes somewhat brutal and you’ll often only be able to get approval to push changes and manually run chef a few times a month. That can build up a really substantial amount of lint that you need to worry about between changes: changes staged to go out that you weren’t aware were uploaded, and changes made manually to servers by SAs getting frisky with root typey-typey.

This becomes a political issue, since it means that every time you deploy something with chef you have to worry about things blowing up and it being your fault, not the fault of the SA who made with the typey-typey. (It’s easy to correctly throw the typist under the bus the first time, but by the third time this happens, people in the CM meetings who have zero clue about systems management are going to start wondering what is wrong with your chef deployment process.)

With more frequent convergences you remediate issues with root typists much more rapidly, and it becomes much more clearly their fault, since changes affecting production were made outside of the CM process and CM windows. You also have substantially more confidence that the approved changes you are pushing on any given day are the only changes staged to go out with your change: since chef ran last night and converged the entire infrastructure, only the changes you are releasing to your production environment will go out.
You can try to address all this after-the-fact with whyrun, and that will be a useful weapon to wield in CM meetings, but at some point you’re going to be generating cruft and doing a ton of analysis on your whyrun output in prep for the CM process, and it’ll get large enough and complex enough that you’ll still miss important changes buried in an avalanche of whyrun output.
And if RAM is a concern (and I think it probably is), then I’d suggest running chef-client one-shot periodically from cron.
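A minimal sketch of that cron-driven one-shot run, written with Chef’s own `cron` resource (the schedule and paths here are illustrative assumptions, not from the original post):

```ruby
# Illustrative recipe fragment: schedule a nightly one-shot converge
# from cron instead of keeping the chef-client daemon resident in RAM.
cron 'chef-client-nightly' do
  minute  '15'
  hour    '2'   # pick a time inside your own maintenance window
  command '/usr/bin/chef-client --once --logfile /var/log/chef/client.log'
end
```

The `--once` flag tells chef-client to run a single converge and exit, ignoring any interval/splay settings, so nothing stays resident between runs.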
I also like to run periodic convergences once every 24 hours, with 12 hours of splay from 8pm to 8am. That way there’s more of a chance that you’ll catch a really bad error as it starts to turn individual servers’ lights red, rather than pushing some accidental “chmod 600 /etc/resolv.conf” code to every server you have in a 5-minute window. It also lets you stage changes during the day at work, and manually poke systems during the day, giving you time to analyze your changes and abort if you find something you don’t like; then you let periodic convergence overnight handle making sure that all the servers are in lockstep. YMMV.
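That kind of per-node splay can be sketched in plain Ruby (the hostname and window size are assumptions for illustration): hash the hostname into a minute offset inside a 12-hour window, so each node converges at a stable, spread-out time after 8pm.

```ruby
require 'digest'

# Deterministic splay: map a hostname to a minute offset within a
# 12-hour window, so every node lands at a stable but spread-out time.
def splay_minutes(hostname, window_minutes = 12 * 60)
  Digest::MD5.hexdigest(hostname).to_i(16) % window_minutes
end

offset = splay_minutes('web01.example.com')  # illustrative hostname
hour   = (20 + offset / 60) % 24             # window opens at 20:00
minute = offset % 60
```

Because the offset is derived from the hostname rather than drawn at random on each run, a given node always converges at the same time of night, which makes "whose light turned red, and when" much easier to reason about.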
(and if you think that everyone involved in your change management process and your IT management would be smart enough to correctly blame SAs who make with the typey-typey and implement strong rules against that, then you have been extremely lucky in your exposure to Enterprise-Class IT management…)
Anyway, things are not as clear-cut as they seem and what works for the startup with 20 servers, or the tech-heavy enterprise with a lot of smart people, does not necessarily apply across the board… I can also see where SAs in the financial services sector may not do this at all, and all change may be controlled, nobody allowed to login to servers without forms filled out in triplicate, no config pushed to a chef server without approval, all convergences submitted to change management with associated whyrun output and justification, etc… Or, you know, just quit and find yourself a startup with a handful of employees and cowboy it all up… =)
On 9/6/12 7:52 AM, Tetsu Soh wrote:
well, it really depends on how you manage operations carried out by chef.
By running chef as a daemon, chef will apply your changes the next time it runs.
So the longer the interval between two converges, the more changes may accumulate.
And the more changes you apply at once, the more risk you take.
For example, if one recipe fails, all recipes after it will not be run.
So you need to figure out which one failed, fix it, and run everything again.
IMO, running chef-client on demand is a better solution.
On Sep 6, 2012, at 11:36 PM, Joshua Miller <email@example.com> wrote:
Tetsu, can you elaborate on the concerns you’ve got for running chef-client as a daemon?
On Sep 6, 2012 7:26 AM, “Tetsu Soh” <email@example.com> wrote:
You can use an exception handler to get runtime exceptions from your recipes.
To monitor the chef-client process, you can use god, which is a process monitor written in Ruby.
BTW, running chef-client as a daemon is not good in some cases. Be careful with that.
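The exception-handler approach mentioned above can be sketched roughly like this (the class name, file paths, and notification hook are illustrative; it relies on Chef's `Chef::Handler` API, where `report` is invoked with a `run_status` describing the failed run):

```ruby
# /var/chef/handlers/failure_notifier.rb -- illustrative path
require 'chef/handler'

class FailureNotifier < Chef::Handler
  # chef-client calls `report` on exception handlers when a run fails
  def report
    Chef::Log.error("Run failed on #{run_status.node.name}: " \
                    "#{run_status.formatted_exception}")
    # hook your email/pager/IRC notification in here
  end
end
```

wired up in client.rb:

```ruby
require '/var/chef/handlers/failure_notifier'
exception_handlers << FailureNotifier.new
```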
On Sep 6, 2012, at 8:32 AM, Paul McCallick <PMcCallick@paraport.com> wrote:
We’re fairly new to chef and have been manually executing chef runs on one or many nodes. We just made the move to run it as a service via the chef-client cookbook.
What’s the quickest way to make sure that we’re getting notified when there are problems during a chef run? What’s the best way?