Delayed actions don't run if a recipe fails

Michael_Rowe · November 23, 2010, 2:25am

Hi folks,

I am currently bumping up against the issue described here:
http://tickets.opscode.com/browse/CHEF-1067

In a nutshell:

a recipe sets up some file and notifies a service it needs to restart
the restart is delayed
further things happen…
a recipe fails
the chef run ends, without running the delayed actions

On the next chef-client run, the original change is already in place,
so the service isn’t notified to restart again. But it never actually
got restarted because the previous run ended prematurely. This is a
Bad Thing.

The last update to that issue was in July, saying “maybe discuss on
the list”… but no one seemed to at the time, so here I am!

Thoughts?

–
Michael Rowe

kallistec · November 23, 2010, 3:13am

On Mon, Nov 22, 2010 at 7:25 PM, Michael Rowe mrowe@mojain.com wrote:

Hi folks,

I am currently bumping up against the issue described here:
http://tickets.opscode.com/browse/CHEF-1067

In a nutshell:

a recipe sets up some file and notifies a service it needs to restart

the restart is delayed

further things happen...

a recipe fails

the chef run ends, without running the delayed actions

On the next chef-client run, the original change is already in place,
so the service isn't notified to restart again. But it never actually
got restarted because the previous run ended prematurely. This is a
Bad Thing.

The last update to that issue was in July, saying "maybe discuss on
the list"... but no one seemed to at the time, so here I am!

Thoughts?

It's a tough situation, services are unique in that they have state
in-memory that's difficult/impossible for chef to query (unless you
somehow instrument your services).

The specific problem with delayed notifications is that you may have
depended on some later resource in your resource collection to modify
some other bit of configuration relevant to that service. So in the
case of a failure, you'd prefer to fix everything by hand rather than
restarting your service in a broken state.

To get around this issue, you might consider keeping some state in a
file on disk that marks the service as "dirty" and add some mechanism
to the service resource to trigger a restart if the service is marked
dirty. The problem with this is that you might erroneously restart the
service if an administrator has manually resolved the issue in between
the failed chef run and the next one. Even so, it may be possible on
some systems to work back to the last time the service was restarted
using information from something like ps -o etime and detect if the
service has been manually restarted.

That all said, it's definitely not a general solution, so delayed
notifications to execute resources would still be a problem. The best
you could do there as far as I can figure is print the notifications
that did not execute.

I haven't thought all of this through completely, so if there's some
problem with the solution I described, I'd like to hear it.

Dan DeLeo

--
Michael Rowe

Dan_DeMaggio · November 23, 2010, 7:16am

On Mon, Nov 22, 2010 at 10:13 PM, Daniel DeLeo dan@kallistec.com wrote:

It's a tough situation, services are unique in that they have state
in-memory that's difficult/impossible for chef to query (unless you
somehow instrument your services).

Instead of thinking of the service, think about logging Chef's
intentions in an idempotent way:

For each file change (that has a notification), Chen can create an
INTENTION file (or append a line in a file). The intention file simply
has the filename, and the new MD5.
when the associated notification is delivered, we delete that file
(or that line in the file).
At the end of a successful run, we can delete all intention files/entries.

Now, when a file is not changed, we consult the intention DB. if it
has our file+MD5, we pretend the file was just changed. This sets up
the delayed notification again. If we fail again, we continue to think
that "the file is still new if it has this MD5".

Note that:

if the recipe isn't run, this feature is ignored (we won't see our
file + md5, so we won't trigger. At the end of the run we forget about
it)
if the file changes again, we won't trigger (will have a different
MD5, which will only trigger the normal change notification)
if the recipe changes (add/remove notifications) we will trigger
whatever the new recipe says
Things get a little tricky if one file notifies multiple services.
We could not do notifications in this case, and we'd be no worse
off. (i.e. we'd be fixing the single-notification case, and leaving
the multi-notification case broken as it is now.)

-=Dan=-

coderanger · November 23, 2010, 7:25am

On Nov 22, 2010, at 7:13 PM, Daniel DeLeo wrote:

On Mon, Nov 22, 2010 at 7:25 PM, Michael Rowe mrowe@mojain.com
wrote:

Hi folks,

I am currently bumping up against the issue described here:
http://tickets.opscode.com/browse/CHEF-1067

In a nutshell:

a recipe sets up some file and notifies a service it needs to
restart

the restart is delayed

further things happen...

a recipe fails

the chef run ends, without running the delayed actions

On the next chef-client run, the original change is already in place,
so the service isn't notified to restart again. But it never actually
got restarted because the previous run ended prematurely. This is a
Bad Thing.

The last update to that issue was in July, saying "maybe discuss on
the list"... but no one seemed to at the time, so here I am!

Thoughts?

It's a tough situation, services are unique in that they have state
in-memory that's difficult/impossible for chef to query (unless you
somehow instrument your services).

The specific problem with delayed notifications is that you may have
depended on some later resource in your resource collection to modify
some other bit of configuration relevant to that service. So in the
case of a failure, you'd prefer to fix everything by hand rather than
restarting your service in a broken state.

To get around this issue, you might consider keeping some state in a
file on disk that marks the service as "dirty" and add some mechanism
to the service resource to trigger a restart if the service is marked
dirty. The problem with this is that you might erroneously restart the
service if an administrator has manually resolved the issue in between
the failed chef run and the next one. Even so, it may be possible on
some systems to work back to the last time the service was restarted
using information from something like ps -o etime and detect if the
service has been manually restarted.

That all said, it's definitely not a general solution, so delayed
notifications to execute resources would still be a problem. The best
you could do there as far as I can figure is print the notifications
that did not execute.

I haven't thought all of this through completely, so if there's some
problem with the solution I described, I'd like to hear it.

Chef could internally do something similar. The client could note in a
state file that there is a pending notification and wait until the end
of the next successful cycle to execute it. There isn't a whole lot of
state attached to notifications, so a simple JSON file would probably
suffice. This still isn't quite bullet-proof if the client itself
crashes, but it would cover the 99% case of bad recipes.

--Noah

ptrstr · November 23, 2010, 8:35am

On Mon, Nov 22, 2010 at 11:25 PM, Noah Kantrowitz noah@coderanger.netwrote:

On Nov 22, 2010, at 7:13 PM, Daniel DeLeo wrote:

On Mon, Nov 22, 2010 at 7:25 PM, Michael Rowe mrowe@mojain.com wrote:

Hi folks,

I am currently bumping up against the issue described here:
http://tickets.opscode.com/browse/CHEF-1067

In a nutshell:

a recipe sets up some file and notifies a service it needs to restart

the restart is delayed

further things happen...

a recipe fails

the chef run ends, without running the delayed actions

On the next chef-client run, the original change is already in place,
so the service isn't notified to restart again. But it never actually
got restarted because the previous run ended prematurely. This is a
Bad Thing.

The last update to that issue was in July, saying "maybe discuss on
the list"... but no one seemed to at the time, so here I am!

Thoughts?

It's a tough situation, services are unique in that they have state
in-memory that's difficult/impossible for chef to query (unless you
somehow instrument your services).

The specific problem with delayed notifications is that you may have
depended on some later resource in your resource collection to modify
some other bit of configuration relevant to that service. So in the
case of a failure, you'd prefer to fix everything by hand rather than
restarting your service in a broken state.

To get around this issue, you might consider keeping some state in a
file on disk that marks the service as "dirty" and add some mechanism
to the service resource to trigger a restart if the service is marked
dirty. The problem with this is that you might erroneously restart the
service if an administrator has manually resolved the issue in between
the failed chef run and the next one. Even so, it may be possible on
some systems to work back to the last time the service was restarted
using information from something like ps -o etime and detect if the
service has been manually restarted.

That all said, it's definitely not a general solution, so delayed
notifications to execute resources would still be a problem. The best
you could do there as far as I can figure is print the notifications
that did not execute.

I haven't thought all of this through completely, so if there's some
problem with the solution I described, I'd like to hear it.

Chef could internally do something similar. The client could note in a
state file that there is a pending notification and wait until the end of
the next successful cycle to execute it. There isn't a whole lot of state
attached to notifications, so a simple JSON file would probably suffice.
This still isn't quite bullet-proof if the client itself crashes, but it
would cover the 99% case of bad recipes.

--Noah

My two cents..

Chef-client should probably execute all pending delayed notifications by
default, even if there was an exception during the run. To me this behavior
would be preferable and most of the time it would be the correct thing to do
anyway. If it breaks a service because of inconsistent state, I can live
with that. You could even make this behavior optional and be able to
explicitly enable/disable it per notification. e.g. add a :persistent option
to notifies.
This also has the benefit of attempting to resolve it within a single chef
run. I do not like the idea of trying to preserve state between two (or
more) runs.

--Peter

Mike_Williams · November 23, 2010, 10:02am

On 23/11/2010, at 19:35 , Peter Struijk wrote:

Chef-client should probably execute all pending delayed notifications by default, even if there was an exception during the run.

+1.

The only conceivable downside is that this might cause services to fail immediately, rather than keep running despite an inconsistent environment. But that could also be considered an upside - fail fast, rather than some later date, when the service happens to restart for a different reason.

--
cheers,
Mike Williams

Dan_DeMaggio · November 23, 2010, 2:17pm

On Tue, Nov 23, 2010 at 3:35 AM, Peter Struijk peter@delftblues.net wrote:

On Mon, Nov 22, 2010 at 11:25 PM, Noah Kantrowitz noah@coderanger.net
execute all pending delayed notifications by default

Hmm, this "seems wrong" somehow, but I have to admit I'd like it
better than the current behavior ("do the wrong thing and stay
there").

+1, even if it's a short-term solution.

-=Dan=-

Topic		Replies	Views
Chef Failure and Lost Notifications Chef Infra (archive)	3	632	April 7, 2012
Force restart/start Chef Infra (archive)	2	282	May 14, 2011
RE : * PROBABLY SPAM * Re: Re: Re: Re: Re: Re: delayed execution order error? Chef Infra (archive)	0	294	February 21, 2013
Re: Re: Re: Re: Re: delayed execution order error? Chef Infra (archive)	1	275	February 21, 2013
Delayed execution order error? Chef Infra (archive)	10	1592	February 22, 2013

Delayed actions don't run if a recipe fails

Related topics