Dry-run / no-op mode

On Mon, Nov 30, 2009 at 2:28 AM, jon stuart lists@zomo.co.uk wrote:

I had a great weekend getting Chef up and running on some lab kit,
looking at it as an alternative to homegrown scripts and Puppet. It's an
impressive system, so firstly thanks for writing and sharing it :)

I was wondering about the no-op stuff hinted at in CHEF-13. For me the
ability to eyeball proposed changes on (a sample of) nodes is pretty
important, both for avoiding silly mistakes whilst learning the ropes
and as a checkpoint before rolling out scarily large changes.

If Chef has this ability I can't find it, and if it doesn't then I'm
wondering if the problem is something a keen Rubyist could work on. Only
48 hours in, I'm admittedly naive about Chef's internals; it might be a hard
one that needs serious attention from core developers rather than me
blundering in!

Sorry if this is a FAQ or similar, I couldn't find much mention of it
other than the ticket.

As Bryan pointed out, Chef doesn't have a --noop mode right now.

I want to take a minute to talk about how this might work, and why
doing this with Chef might produce results that are less satisfactory
than they might otherwise be.

First off, Chef values repeatability and consistency over resiliency
when applying recipes. What that means is that we do things in the
order you tell us to, and we do the same thing every time you run
Chef. This buys us a couple of neat things: the first is that it's
easy to reason about what happens when Chef gets run, and the second
is that, given the same set of inputs and the same original system
state, Chef will always fail in the same way (assuming it's a bug in
your recipe at fault).

This decision causes an interesting condition to exist when talking
about things like --noop. Because Chef is built of idempotent
resources that expect to be run in order, there is no way for us to
tell that a particular resource later in the resource collection would
succeed if a resource that came before it did not succeed - we have
to assume that everything works.

As an example, let's take a recipe that adds apt.opscode.com to
/etc/apt/sources.list, runs apt-get update, and then installs the
latest version of ohai.

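template "/etc/apt/sources.list.d/opscode.list" do
  # .. some stuff ..
end

# Does nothing on its own; runs only when the template above changes.
execute "apt-get update" do
  action :nothing
  subscribes :run, resources(:template => "/etc/apt/sources.list.d/opscode.list")
end

# Install ohai, or upgrade it to the latest available version.
package "ohai" do
  action :upgrade
end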

In a normal Chef run, if the template fails, or the apt-get update
fails, the package won't even attempt to be installed. In a dry-run
world, you would have to assume that every resource would either take
no action, or be successful. So if the template did not need to be
rendered, we would know that we didn't need to run apt-get update; but
what about the package? We would likely fail to find it in the
package list at all, at least on the first pass, causing a failure
that may cascade through the rest of the resource collection.
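
The cascade described above can be sketched in plain Ruby. This is a minimal model, not Chef's internals: the Resource struct, the dry_run method, and the :ohai_in_package_list fact are all hypothetical names invented for illustration. Each resource is checked against the system as it is before the run, because a dry run applies nothing.

```ruby
# Hypothetical model of an ordered resource collection; not Chef's internals.
Resource = Struct.new(:name, :needs_change, :requires)

# Walk the resources in order. A dry run cannot apply anything, so each
# resource is checked against the system state as it was before the run.
def dry_run(resources, system_state)
  resources.map do |res|
    missing = res.requires.reject { |fact| system_state.include?(fact) }
    if !missing.empty?
      [res.name, :would_fail, missing]
    elsif res.needs_change
      [res.name, :would_change, []]
    else
      [res.name, :up_to_date, []]
    end
  end
end

resources = [
  Resource.new("template[opscode apt source]", true, []),
  Resource.new("execute[apt-get update]", true, []),
  # The package only becomes visible once the two resources above have run.
  Resource.new("package[ohai]", true, [:ohai_in_package_list])
]

# A pristine node: nothing has been applied yet.
dry_run_report = dry_run(resources, [])
dry_run_report.each do |name, status, missing|
  puts "#{name}: #{status} #{missing.inspect}"
end
```

On a pristine node the package is reported as a failure even though a real run would have added the source and refreshed the package list first.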

This problem gets exacerbated when you start thinking about the
dynamism that is present in Chef - you can alter the resource
collection at run time, you can search across the entire
infrastructure, you can query data bags, etc. Each of these can
potentially alter the resource collection, or alter the way a resource
might be rendered. Which means that, between the output of your dry
run and the actual run, the actions taken might change.
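
A rough sketch of how that drift can happen, assuming a recipe that declares one template per web node returned by a search. The search_web_nodes and compile_resources methods and the node data are hypothetical stand-ins, not Chef's search API.

```ruby
# Hypothetical stand-in for a search against the server: the result depends
# on which nodes are registered at the moment the recipe is compiled.
def search_web_nodes(registered_nodes)
  registered_nodes.select { |n| n[:role] == "web" }.map { |n| n[:name] }
end

# Build the list of resources a load-balancer recipe would declare.
def compile_resources(registered_nodes)
  search_web_nodes(registered_nodes).map { |name| "template[backend-#{name}.conf]" }
end

nodes_at_dry_run = [{ name: "web1", role: "web" }, { name: "db1", role: "db" }]
# A new web node registers between the dry run and the real run.
nodes_at_real_run = nodes_at_dry_run + [{ name: "web2", role: "web" }]

planned = compile_resources(nodes_at_dry_run)
actual  = compile_resources(nodes_at_real_run)
unplanned = actual - planned

puts "planned:   #{planned.inspect}"
puts "actual:    #{actual.inspect}"
puts "unplanned: #{unplanned.inspect}"
```

The same recipe and the same node produce two different resource collections, purely because the rest of the infrastructure moved in between.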

All of this means that, while a dry-run mode is possible, it is also
likely to tell you lies about what might really happen.

The use-case you specify above is the ability to eyeball proposed
changes on a sample of nodes, and only apply them to the entire world
once you are comfortable with them. That problem sounds like it could
be solved by our adding Infrastructure support (the ability to have
more than one environment, say dev->test->staging->production) with
the ability to propagate a cookbook version from one environment to
another, along with some great reporting about what each chef run has
done to your system after the fact. Would that satisfy your use case
for a dry-run mode?

The above is true of most other configuration management systems'
dry-run modes - Puppet at the very least (although its resource-level
dependency tracking gives Puppet some interesting options for
chopping off limbs of the tree as failures happen). Bcfg2 and
Cfengine2 actually have some potential to be valuable here, since they
are basically policy engines that are order-agnostic - Bcfg2 can tell
you that N packages are out of policy, and Y services are out of
policy, etc.
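
To illustrate why an order-agnostic policy check reports well, here is a small sketch that compares a desired policy against observed state entry by entry. It does not use Bcfg2's or Cfengine's actual data model; the policy and observed hashes and the out_of_policy method are assumptions made up for the example.

```ruby
# Hypothetical desired policy and observed state, keyed by resource kind.
policy = {
  packages: { "ohai" => "installed", "ntp" => "installed" },
  services: { "ntp" => "running", "apache2" => "running" }
}
observed = {
  packages: { "ohai" => "installed" },
  services: { "ntp" => "stopped", "apache2" => "running" }
}

# Compare each entry independently. No entry depends on the result of
# another, so the order of comparison does not matter.
def out_of_policy(policy, observed)
  policy.each_with_object({}) do |(kind, entries), result|
    state = observed.fetch(kind, {})
    result[kind] = entries.reject { |name, want| state[name] == want }.keys
  end
end

policy_report = out_of_policy(policy, observed)
policy_report.each do |kind, names|
  puts "#{names.size} #{kind} out of policy: #{names.join(', ')}"
end
```

Because nothing here depends on an earlier step having succeeded, the report stays accurate without executing anything.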

What other use-cases are there here, that aren't under-cut by the very
real potential for lies? Would it be enough to enable the visibility
into how the system is really behaving that would allow you to gain
the level of trust you need, rather than a full-on dry run mode?

Adam

Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Adam Jacob wrote:

[ snip ]

All of this means that, while a dry-run mode is possible, it is also
likely to tell you lies about what might really happen.

[ snip ]

What other use-cases are there here, that aren't under-cut by the very
real potential for lies? Would it be enough to enable the visibility
into how the system is really behaving that would allow you to gain
the level of trust you need, rather than a full-on dry run mode?

Hi,

Thanks for explaining your reservations about dry-run.

I'd suggest however that the caveats of dry-runs not perfectly
reflecting what happens in real runs are understood by the users of most
tools that provide such a facility, especially in the case of an
intermediate step's failure changing or halting execution.

For example, make's -n dry-run assumes every compile succeeds, and walks
the dependency graph accordingly. And, probably less so than Chef,
intermediate steps can alter the graph too beyond just success or
failure. However it's making no promises that a real build will behave
like that, but it's providing a massively useful facility: "Roughly, in
a perfect world, what would you do to build this target?"
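
As a rough illustration of that optimistic walk (not how make is implemented), the sketch below visits a hypothetical dependency graph depth-first and lists the targets it would build, assuming every step succeeds. The graph contents and the plan method are invented for the example.

```ruby
# Hypothetical build graph: target => prerequisites.
GRAPH = {
  "app"    => ["main.o", "util.o"],
  "main.o" => [],
  "util.o" => []
}

# Depth-first walk: list prerequisites before the target that needs them.
# Nothing is executed, so every step is simply assumed to succeed.
def plan(target, graph, done = [])
  graph.fetch(target).each { |dep| plan(dep, graph, done) }
  done << target unless done.include?(target)
  done
end

build_order = plan("app", GRAPH)
build_order.each { |target| puts "would build #{target}" }
```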

Concerning use-cases: of particular interest to me is being able to diff
the outputs of template generation. Not just to see how a template
change might look, but more importantly to see the outcome of changing
the attributes that govern their population.

A contrived but hopefully valid example: the address of an internal web
service (a key-value store, perhaps) is being managed by Chef. It uses
this attribute to populate templates that tell other applications on the
platform where to connect for this service.

One day I decide to pair up two hosts to provide this service and expose
them via a VIP managed by some appliance, and update the attribute in
Chef accordingly. Great, the apps can survive the failure of one of the
service's hosts.

What I've forgotten is that I've also got Chef to use this attribute to
indicate where to poll SNMP data from and where overnight backups should
rsync from. These templates now point at the VIP-managing appliance
rather than the real host. Things go weird, things break.
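
A minimal sketch of the kind of check that would have caught this, using Ruby's standard ERB library. The template contents, file names, host names, and the render_all and changed_templates methods are hypothetical; a real recipe's templates and attribute layout would differ.

```ruby
require "erb"

# Hypothetical templates that all read the same attribute.
TEMPLATES = {
  "app.conf"   => "kv_store = <%= attrs[:kv_host] %>\n",
  "snmpd.conf" => "poll_target <%= attrs[:kv_host] %>\n",
  "backup.sh"  => "rsync <%= attrs[:kv_host] %>:/var/lib/kv /backups/kv\n"
}

# Render every template against one set of attributes.
def render_all(attrs)
  TEMPLATES.transform_values { |src| ERB.new(src).result(binding) }
end

# Names of the templates whose rendered output differs between the two sets.
def changed_templates(old_attrs, new_attrs)
  before = render_all(old_attrs)
  after  = render_all(new_attrs)
  before.keys.select { |name| before[name] != after[name] }
end

old_attrs = { kv_host: "kv1.example.com" }
new_attrs = { kv_host: "kv-vip.example.com" }

changed = changed_templates(old_attrs, new_attrs)
changed.each do |name|
  puts "would change #{name}:"
  puts "  - #{render_all(old_attrs)[name]}"
  puts "  + #{render_all(new_attrs)[name]}"
end
```

Seeing snmpd.conf and backup.sh in that list before the run is the prompt to ask whether they should really follow the VIP.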

Yes, I should've thought about that. Yes, I can discover this by looking
at what Chef did after it runs. Yes, I can fix by making a clearer
distinction in my Chef configuration between hosts and services. But
finding out how feeble my brain is after the change has broken stuff
isn't great.

It's possible such a situation might arise even when testing changes in
environments prior to production. I might not back up the key-value
service in staging so there was no change to notice there. (As I said,
contrived example!)

True, having staging not reflecting production is always going to hurt.
But I don't want my configuration manager to increase that hurt by
making me drink the "everything that works on staging works in
production" koolaid. IME, there's always some variance somewhere, even
just in the dimensions.

I'd be happy with the potential for a dry-run mode to be optimistic and
sometimes inaccurate if it helps save me from myself. Given the
precedent of dry-run modes in other complex tools I'd like to believe
it's not just my capacity for forgetfulness and oversight that makes
them useful, other people think so too :)

Regards, jon.

jon stuart wrote:

Thanks for explaining your reservations about dry-run.

<.. snip super great use case description ..>

I'd be happy with the potential for a dry-run mode to be optimistic and
sometimes inaccurate if it helps save me from myself. Given the
precedent of dry-run modes in other complex tools I'd like to believe
it's not just my capacity for forgetfulness and oversight that makes
them useful, other people think so too :)

Thanks for the answer, Jon. It certainly tilted me a bit back in favor
of adding a dry run mode. :)

Anyone else?

Adam


On Thu, Dec 3, 2009 at 4:52 AM, Adam Jacob adam@opscode.com wrote:

Anyone else?

Yes, I'd say a dry-run mode would be helpful. I think there are two
ways to look at a dry-run mode:

  1. it will tell you what will happen

  2. it will tell you what it wants to change

I think we all agree that #1 is a fantasy -- it can't predict 100%
what will happen. The analogy to Make is a good one. Dry run isn't
going to be very useful if you're running Chef for the first time ever
-- there's just too much going on. (But it might be useful to say
"chef --dry-run | grep Package" to see which packages it would install
as a sanity check).

But #2 is very useful. Much of the time we make small changes and just
want to make sure nothing is screwed up ("oh, no, it pulled in the
wrong recipe and now my DNS server is running MySQL"). I think of
Dry-run as more of a sanity check to see what's changed since the last
run.
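
A small sketch of that sanity check, assuming a dry run could list the resources it wants to touch as strings such as "Package[name]". That output format, and both lists below, are hypothetical.

```ruby
# Hypothetical resource lists: what the last real run managed, and what a
# dry run says it now wants to manage.
previous_run = [
  "Package[bind9]",
  "Service[bind9]",
  "Template[/etc/bind/named.conf]"
]
proposed_run = previous_run + ["Package[mysql-server]", "Service[mysql]"]

# Anything new since the last run is worth a second look.
new_resources = proposed_run - previous_run

# The Ruby equivalent of piping the dry-run output through "grep Package".
new_packages = new_resources.grep(/\APackage\[/)

puts "new since last run: #{new_resources.join(', ')}"
puts "new packages:       #{new_packages.join(', ')}"
```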

Of course, nothing is going to be 100% accurate (or even safe. See
"ruby -c" with BEGIN blocks). I think we can set users' expectations
with just a paragraph or two, and it will be a useful tool.

Down the road it would be neat to put in warnings during dry run, such
as: "warning package doesn't exist in repo" or "warning file doesn't
exist when trying to set owner". That way the user gets hints on what
would fail, but dry-run can still assume nothing will fail. Some of
the warnings would be invalid (i.e. trying to modify httpd.conf after
installing apache during a dry run), and some would be valid (i.e.
trying to modify "/etc/hsots").
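
One way such warnings might look, sketched under the assumption that the dry run can see which files currently exist. The step hashes, the dry_run_with_warnings method, and the Apache config path are invented for illustration; the existing files are passed in as a list so the example does not depend on the machine it runs on.

```ruby
# Check each step against the files that exist right now. A missing file
# produces a warning, but the step is still assumed to succeed.
def dry_run_with_warnings(steps, existing_files)
  steps.map do |step|
    missing = !existing_files.include?(step[:path])
    warning = missing ? "warning: #{step[:path]} doesn't exist" : nil
    step.merge(warning: warning)
  end
end

steps = [
  { action: "set owner", path: "/etc/hosts" },
  # A typo in the recipe: this warning is valid.
  { action: "set owner", path: "/etc/hsots" },
  # Created by an earlier step in a real run: this warning is invalid.
  { action: "modify", path: "/etc/apache2/httpd.conf" }
]

results = dry_run_with_warnings(steps, ["/etc/hosts"])
results.each do |r|
  line = "would #{r[:action]} #{r[:path]}"
  line += " (#{r[:warning]})" if r[:warning]
  puts line
end
```

The dry run cannot tell the valid warning from the invalid one; it can only surface both and leave the judgment to the reader.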

-=Dan=-