Double converge, blocking on search, event-driven chef-client?


#1

Hi all,

One thing I see come up regularly when deploying a set of recipes to a
network of machines is that search-populated config data isn’t necessarily
available when it’s needed.

Let’s use a simple example-- Nagios server and NRPE server. In our
hypothetical example, we want to use nrpe’s ability to respond only to a
configured list of IP addresses in nrpe.conf. Based on the general
principles behind converged infrastructure, the IP list would be populated
based on the results of a search.

With a brand-new network deployment, however, it is likely that nrpe will
converge on one node before the nagios server’s entire run list converges
on another node. Therefore, the nrpe server recipe will have no results
from its search until the nagios server node converges successfully once.
On the first converge to take place after the nagios server’s node data has
been stored in solr, the nrpe server will get data to write to the nrpe
configuration file.

This is what I mean by “double converge”-- it takes at least two converges
to complete the nrpe server installation and configuration because the
necessary data is not available in the first converge.

One way to reduce the number of converges is to poll search until a result
comes back. Something like this in the nrpe-server recipe code:

    results = []
    loop do
      results = search(…)
      break unless results.empty?
      sleep(10)
    end

would cause the recipe to poll the Chef server every ten seconds until a
response came back, or until the client token expired (usually around 15
minutes).

Such behavior is not ideal (particularly because of the client token expiry
issue) but besides supporting a second converge, it’s the only way I have
seen that will accomplish the desired result.
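
For anyone who does go the polling route, here is a minimal sketch of a bounded version in plain Ruby. It gives up after a deadline instead of spinning until the token expires. The helper name poll_search, its parameters, and the 600-second default are assumptions for illustration, not part of Chef; search_fn stands in for the recipe's search(…) call.

```ruby
# Poll a search until it returns results or a deadline passes.
# `search_fn` is a stand-in for the recipe's search(...) call. `sleeper` and
# `clock` are injectable so the helper can be exercised without real waiting.
def poll_search(search_fn, timeout: 600, interval: 10,
                sleeper: ->(seconds) { sleep(seconds) },
                clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  deadline = clock.call + timeout
  loop do
    results = search_fn.call
    return results unless results.empty?
    # Give up if another sleep would carry us past the deadline.
    return [] if clock.call + interval > deadline
    sleeper.call(interval)
  end
end
```

An empty return value means the deadline passed, so the recipe can fall back to guarding the dependent resources and let the next converge finish the job.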

One slightly wild idea kicking around would be to have some ability for the
client to register for an event, and associate a set of resources with the
event triggering. In our hypothetical, that would allow the nrpe server’s
chef client to converge everything it could (perhaps, a set of other
recipes in the run list) and be idle until the nagios server’s chef client
completes converging its run list.

Chef Listers: How have you addressed this situation in your environment?


Justin Dossey
Practice Owner
New Context Services, Inc


#2

On Wed, Oct 1, 2014 at 12:48 PM, Justin Dossey
<justin.dossey@newcontext.com> wrote:

> With a brand-new network deployment, however, it is likely that nrpe will
> converge on one node before the nagios server’s entire run list converges on
> another node. Therefore, the nrpe server recipe will have no results from
> its search until the nagios server node converges successfully once. On the
> first converge to take place after the nagios server’s node data has been
> stored in solr, the nrpe server will get data to write to the nrpe
> configuration file.
>
> This is what I mean by “double converge”-- it takes at least two converges
> to complete the nrpe server installation and configuration because the
> necessary data is not available in the first converge.

You’re focusing on the window right at the initial provisioning time
for the whole network of machines, asking for a bunch of servers at
once including a brand new nagios server. The general case is to add
servers over time, and that is going to naturally take 1 run on a new
server and 1 run afterwards on the nagios server, because the new
servers didn’t exist for all the prior chef runs, and can’t be said
to be ready for service or monitoring until after their first chef
run.

> One way to reduce the number of converges is to poll search until a result
> comes back. Something like this in the nrpe-server recipe code:
>
>     results = []
>     loop do
>       results = search(…)
>       break unless results.empty?
>       sleep(10)
>     end
>
> would cause the recipe to poll the Chef server every ten seconds until a
> response came back, or until the client token expired (usually around 15
> minutes).
>
> Such behavior is not ideal (particularly because of the client token expiry
> issue) but besides supporting a second converge, it’s the only way I have
> seen that will accomplish the desired result.

Every time chef runs on the nagios server is also a “poll”, without
any downsides. Any reason that won’t work here?

A server in a tight polling loop is not being actively managed by
chef, and this logic doesn’t help you with stragglers. You could wait
for N servers, but you may not be sure N is the right number, and now
a single straggler or bad provision causes the nagios server to get
stuck, fail the run, and repeat, never getting past the poll.

> One slightly wild idea kicking around would be to have some ability for the
> client to register for an event, and associate a set of resources with the
> event triggering. In our hypothetical, that would allow the nrpe server’s
> chef client to converge everything it could (perhaps, a set of other recipes
> in the run list) and be idle until the nagios server’s chef client completes
> converging its run list.

Which is what Chef push jobs (https://docs.getchef.com/push_jobs.html) can
do, and this could have the effect of shortening the lag time for a
new server to be monitored, at the cost of additional complexity.
It’s a hard problem to know when a distributed system reaches a
particular global state - e.g. “all servers that I want to monitor
have run chef once”.

Any way you slice it, this is a lot of work to optimize for the very
specific case of the first nagios server chef run. Do you have a goal
that makes this important or necessary? Why not just let the
infrastructure converge?

-Aaron Peterson


#3

On 10/1/14, 12:48 PM, Justin Dossey wrote:

> would cause the recipe to poll the Chef server every ten seconds until
> a response came back, or until the client token expired (usually
> around 15 minutes).
>
> Such behavior is not ideal (particularly because of the client token
> expiry issue) but besides supporting a second converge, it’s the only
> way I have seen that will accomplish the desired result.

You can set no_lazy_load true in client.rb to avoid cookbook_file
resources failing after 15m. This will be the default in Chef 12
client. The Chef 12 server also uses solr 4 and is going to populate
search results in seconds rather than minutes.


#4

Thank you for focusing on the generic case rather than my hypothetical
example. This situation-- where one node needs to locate another via
search in order to write its configuration correctly-- arises in many
cases, not just with Nagios and NRPE.

To review the responses (in order to create something useful for future
readers)

  1. (paraphrased) “Just wait for the second converge to complete for the
    node to be functional.” This is what we do already-- a pattern like
        matches = search(…)
    then, in resources, using only_if { !matches.empty? }
    does a decent job here. It’s a bit more complex with wrapper cookbooks,
    but it’s doable.
  2. (paraphrased) A blocking poll is not ideal because the number of
    matches necessary to complete convergence may vary as the infrastructure
    grows, leading to said poll requiring additional parameterization over
    time. This is a source of unnecessary complexity. Note that in my
    hypothetical, I was describing the NRPE server polling search for the
    Nagios server, not the Nagios server polling for nodes to monitor.
  3. (paraphrased) It is inappropriate to expect to be able to converge an
    entire infrastructure in a single pass, so we should deploy nodes in a
    defined sequence in order to minimize the number of client runs necessary
    to configure the infrastructure. I disagree with this sentiment, as
    delaying service availability by a multiple of the client run interval adds
    to the complexity of the environment and increases the time to resolution
    for legitimate problems.
  4. (paraphrased) Use push jobs to notify nodes when the infrastructure
    is ready for them to converge. The very first sentence on the Chef Push
    Jobs page is “Chef push jobs is an extension of the Chef server that allows
    jobs to be run against nodes independently of a chef-client run.” We are
    talking about triggering individual resources within a cookbook within a
    chef-client run, so push jobs don’t really address this issue at all.
    Also, it would appear that push jobs were meant to be triggered by
    administrators and not by the chef-client. While I’m sure it would be
    possible to set things up in such a way that a successful initial converge
    could trigger a push job, such an implementation diverges considerably from
    the role of push jobs as designed.
  5. Chef 12 doesn’t have the 15-minute client token expiry in the same
    way Chef 11 does (by default).
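
To make the guard in item 1 concrete, here is a minimal sketch in plain Ruby (so it runs outside Chef) for the nrpe hypothetical. The helper name allowed_hosts_line is made up for illustration, and it assumes search results behave like hashes with an 'ipaddress' key. Returning nil plays the role of the only_if guard: nothing is written until search has something to offer.

```ruby
# Build the nrpe allowed_hosts line from search results.
# `nodes` stands in for what search(:node, ...) would return; this sketch
# assumes each entry responds to ['ipaddress'] (plain hashes here).
def allowed_hosts_line(nodes, always: ['127.0.0.1'])
  ips = nodes.map { |n| n['ipaddress'] }.compact
  # No monitoring server indexed yet: signal "skip this resource for now".
  return nil if ips.empty?
  "allowed_hosts=#{(always + ips).uniq.join(',')}"
end
```

On the first converge the helper returns nil and the config file is left alone; on a later converge, once the Nagios node is searchable, it yields the line to render.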

Did I miss anything? While I’m not surprised by the information here, it
confirms that there is a functional hole here-- there are many situations
in a converged infrastructure that require multiple client runs before the
infrastructure is fully functional, and this means it takes longer than
absolutely necessary to bring up a new infrastructure from scratch.

I am aware that implementing callbacks, or resource hooks on remote
nodes, is a huge job, but there are other ways to approach the situation.
Another solution that comes to mind (besides the search-subscriber model):
Make it easy for a recipe to advise the chef-client to reduce its run
interval based on conditions detected in the chef-client run. I like this
one because it is useful beyond the specific situation-- it gives the
chef-client a primitive way to learn about the infrastructure it manages
and respond appropriately.

At its core, I believe that requiring configuration to come via
application-specific config files is the real issue here, but I don’t
expect to see network-wide integrated config resource support (such as in
CoreOS’s etcd) in most *nix tools in the near term, nor do I expect to see
direct support for communication with Chef servers in our tools (so, for
our trusty hypothetical example, nrpe could be instructed simply to perform
a Chef search directly to determine the allowed nrpe-client IPs rather than
have the chef-client write the nrpe.conf as part of convergence, then
notify the nrpe server of the change).

TL;DR: Now I’m thinking that the best thing would be for the recipe to be
able to advise the client to reduce the interval before the next run. This
would enable code like

    result = search(…)

    if result.empty?
      # request a chef-client run 5 minutes (+/- splay) after this converge
      Chef::Client.advise_interval(300)
    end

    # define resources below, using not_if { result.empty? } where appropriate
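
To show what the client side of such a request might look like, here is a rough sketch in plain Ruby. Everything in it is hypothetical: Chef has no advise_interval API, and the IntervalAdvisor class below only models the idea that a recipe can shorten, but never lengthen, the wait before the next run.

```ruby
# Hypothetical model only: this is not part of Chef.
class IntervalAdvisor
  def initialize(default_interval)
    @default = default_interval
    @advised = nil
  end

  # A recipe asks for an earlier next run; the shortest request wins.
  def advise(seconds)
    @advised = [@advised, seconds].compact.min
  end

  # Seconds the daemon loop would wait before the next run.
  # Advice can only shorten the interval and is cleared once consumed.
  def next_interval(splay: 0)
    base = [@default, @advised].compact.min
    @advised = nil
    base + splay
  end
end
```

Clearing the advice after one use means a node drops back to its normal interval as soon as a converge stops asking for an early rerun.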

Justin Dossey
Practice Owner
New Context Services, Inc


#5

On Thu Oct 2 09:52:34 2014, Justin Dossey wrote:

> 4. (paraphrased) Use push jobs to notify nodes when the
>    infrastructure is ready for them to converge. The very first
>    sentence on the Chef Push Jobs page is “Chef push jobs is an
>    extension of the Chef server that allows jobs to be run against
>    nodes independently of a chef-client run.” We are talking about
>    triggering individual resources within a cookbook within a
>    chef-client run, so push jobs don’t really address this issue at
>    all. Also, it would appear that push jobs were meant to be
>    triggered by administrators and not by the chef-client. While I’m
>    sure it would be possible to set things up in such a way that a
>    successful initial converge could trigger a push job, such an
>    implementation diverges considerably from the role of push jobs as
>    designed.

Triggering a single individual resource is not a pattern we’re likely to
ever support[*]. We have support for override run lists, which you can
use to do a software deployment or configure nagios or whatever without
having to go through and create all the home directories for your admins,
reapply all your sysctl configuration and ssh_known_hosts entries, and
run other orthogonal stuff you don’t care about. Aside from whatever the
docs might state, that is certainly a use case for push jobs. I also don’t
know where the idea that it’s only supposed to be kicked off by an
administrator is coming from, since it is designed to be an
orchestration agent. Based on the design, you should be able to have
edge clients (webservers and whatnot) send a push jobs notification that
amounts to announcing that they’ve newly been built; this could then be
used to kick off chef-client override run lists on the nagios host to have
it hit search and add the new host to monitoring. I’m not certain how
polished push jobs is for all of that right now, but it’s definitely the
tool you want for the use case you describe.

[*] you could extract the resource you want to signal and put it in its
own stand-alone recipe (which you could include_recipe from the run_list
you normally use) and then only trigger that one-resource-recipe, so I
guess we do support that if you do the work to extract it… it won’t
work like a magic cross-server resource notification though.
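
As a rough illustration of the footnote, this plain-Ruby sketch builds the chef-client invocation for an override run of one extracted recipe. The helper name override_run_command and the recipe name nagios::refresh_hosts are placeholders invented for the example; how the command gets triggered on the nagios host (push jobs or otherwise) is left open.

```ruby
# Build the argument list for a one-off chef-client override run.
# The recipe names are supplied by the caller; nothing here is executed.
def override_run_command(*recipes)
  items = recipes.map { |name| "recipe[#{name}]" }.join(',')
  ['chef-client', '--override-runlist', items]
end

# Hypothetical usage: run only the extracted one-resource recipe.
# system(*override_run_command('nagios::refresh_hosts'))
```

Returning an array rather than a string lets the caller pass it to system without going through a shell, which sidesteps quoting the square brackets.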


#6

To me, it sounds like you’re using the wrong hammer. Chef is a nice tool,
but it’s simply the wrong tool in this case. The underlying assumption in
Chef and every CM system I’ve ever looked at is that anything that gets
“registered” in the DB is going to be around long enough to be “stable” in
some sense.

I’m not sure if this will make sense or not, but the generalization of your
example is where you have a client that needs to find a dynamic pool of
servers. I’m not aware of any CM that solves this problem well. Dynamic pool
of clients to relatively static pool of servers is the model they were
designed for.

Something like serf might be much more appropriate to the kind of dynamic
configuration you’re talking about.

http://www.serfdom.io

  • Booker C. Bense


#7

Booker, you’re exactly right.

I was trying to use Chef as a service discovery tool. It looks like Consul
(which uses Serf internally) is a better fit for ephemeral services
especially.

Thanks for the tip.

Justin Dossey
Practice Owner
New Context Services, Inc


#8

My 2 cents here: I use search for MySQL clusters. I configure replication only if my search returns a peer server. If not, the recipe assumes the server is meant to be standalone.

In any case, the server registers itself with the load balancer and with Nagios through a web service (the full stack involves GLPI, Centreon and Nagios), so it’s monitored as soon as possible within the run; any delay is due to Nagios itself and not to an indexing service.

On the Nagios box (and on the load balancer) there’s a recipe that uses search to catch any leftover service/host for which the registration did not work.

It involves a lot of moving parts, but for now it works ;)
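
The two decisions described above can be sketched in a few lines of plain Ruby. The function names and the use of simple hostname arrays are assumptions for illustration; in a real recipe the inputs would come from search and from whatever the registration service reports.

```ruby
# Pick the MySQL role from search results: no peer found means standalone.
def mysql_role(peers)
  peers.empty? ? :standalone : :replicated
end

# Hosts that search knows about but that never registered themselves,
# i.e. the leftovers the catch-up recipe still has to add.
def leftover_hosts(found_by_search, registered)
  found_by_search - registered
end
```

Treating an empty search as "standalone" keeps the first converge useful, while the set difference lets the Nagios and load balancer recipes act only as a safety net.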

---- Justin Dossey a écrit ----

Booker, you’re exactly right.

I was trying to use Chef as a service discovery tool. It looks like Consul (which uses Serf internally) is a better fit for ephemeral services especially.

Thanks for the tip.

On Thu, Oct 2, 2014 at 11:05 AM, Booker Bense bbense@gmail.com wrote:

To me, it sounds like you’re using the wrong hammer. Chef is a nice tool, but it’s simply the wrong tool in this case.

The underlying assumption in Chef and every CM system I’ve ever looked at is that anything that gets “registered”

in the DB is going to be around long enough to be “stable” in some sense.

I’m not sure if this will make sense or not, but the generalization of your example is where you have a client that needs

to find a dynamic pool of servers. I’m not aware of any CM that solves this problem well. Dynamic pool of

clients to relatively static pool of servers is the model they were designed for.

Something like serf might be much more appropriate to the kind of dynamic configuration you’re talking about.

http://www.serfdom.io

  • Booker C. Bense

On Thu, Oct 2, 2014 at 9:52 AM, Justin Dossey justin.dossey@newcontext.com wrote:

Thank you for focusing on the generic case rather than my hypothetical example. This situation-- where one node needs to locate another via search in order to write its configuration correctly-- arises in many cases, not just with Nagios and NRPE.

To review the responses (in order to create something useful for future readers)
(paraphrased) Just wait for the second converge to complete for the node to be functional." This is what we do already-- a pattern like
matches = search(…)
then in resources, use only_if { ! matches.empty? }
does a decent job here. It’s a bit more complex with wrapper cookbooks, but it’s doable.(paraphrased) A blocking poll is not ideal because the number of matches necessary to complete convergence may vary as the infrastructure grows, leading to said poll requiring additional parameterization over time. This is a source of unnecessary complexity. Note that in my hypothetical, I was describing the NRPE server polling search for the Nagios server, not the Nagios server polling for nodes to monitor.
(paraphrased) It is inappropriate to expect to be able to converge an entire infrastructure in a single pass, so we should deploy nodes in a defined sequence in order to minimize the number of client runs necessary to configure the infrastructure. I disagree with this sentiment, as delaying service availability by a multiple of the client run interval adds to the complexity of the environment and increases the time to resolution for legitimate problems.(paraphrased) Use push jobs to notify nodes when the infrastructure is ready for them to converge. The very first sentence on the Chef Push Jobs page is “Chef push jobs is an extension of the Chef server that allows jobs to be run against nodes independently of a chef-client run.” We are talking about triggering individual resources within a cookbook within a chef-client run, so push jobs don’t really address this issue at all. Also, it would appear that push jobs were meant to be triggered by administrators and not by the chef-client. While I’m sure it would be possible to set things up in such a way that a successful initial converge could trigger a push job, such an implementation diverges considerably from the role of push jobs as designed.Chef 12 doesn’t have the 15-minute client token expiry in the same way Chef 11 does (by default).

Did I miss anything? While I’m not surprised by the information here, it confirms that there is a functional hole here-- there are many situations in a converged infrastructure that require multiple client runs before the infrastructure is fully functional, and this means it takes longer than absolutely necessary to bring up a new infrastructure from scratch.

I am aware that implementation of callbacks, or resource hooks on remote nodes, is a huge job, there are other ways to approach the situation. Another solution that comes to mind (besides the search-subscriber model): Make it easy for a recipe to advise the chef-client to reduce its run interval based on conditions detected in the chef-client run. I like this one because it is useful beyond the specific situation-- it gives the chef-client a primitive way to learn about the infrastructure it manages and respond appropriately.

At its core, I believe that requiring configuration to come via application-specific config files is the real issue here, but I don’t expect to see network-wide integrated config resource support (such as in CoreOS’s etcd) in most *nix tools in the near term, nor do I expect to see direct support for communication with Chef servers in our tools (so, for our trusty hypothetical example, nrpe could be instructed simply to perform a Chef search directly to determine the allowed nrpe-client IPs rather than have the chef-client write the nrpe.conf as part of convergence, then notify the nrpe server of the change).

TL;DR: Now I’m thinking that the best thing would be for the recipe to be able to advise the client to reduce the interval before the next run. This would enable code like

result = search(…)

if result.empty?
  # request a chef-client run 5 minutes (+/- splay) after this converge
  Chef::Client.advise_interval(300)
end

# define resources below, using not_if { result.empty? } where appropriate
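
To make the idea a little more concrete, here is a minimal plain-Ruby sketch of the decision the client would have to make after such advice. Nothing in it is Chef API: next_run_delay and its parameters are hypothetical names, the 1800-second default interval is an assumed value, and splay is modeled as a random offset between zero and the splay value.

```ruby
# Hypothetical helper, not part of Chef. Models how a client could pick the
# delay before its next run when a recipe advises a shorter interval.
def next_run_delay(default_interval:, advised_interval: nil, splay: 0, rng: Random.new)
  # Honor the advice only when it shortens the wait.
  base = [default_interval, advised_interval].compact.min
  # Assumed splay model: add a random offset of 0..splay seconds.
  base + (splay.positive? ? rng.rand(0..splay) : 0)
end

search_results = [] # stand-in for an empty search result
advised = search_results.empty? ? 300 : nil

puts next_run_delay(default_interval: 1800, advised_interval: advised)
```

With no splay this prints 300. Once the search returns data, advised is nil and the client falls back to its normal interval, so the shorter interval only applies while the node is still waiting on its peers.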

On Thu, Oct 2, 2014 at 8:53 AM, Lamont Granquist lamont@opscode.com wrote:

On 10/1/14, 12:48 PM, Justin Dossey wrote:

would cause the recipe to poll the Chef server every ten seconds until a response came back, or until the client token expired (usually around 15 minutes).

Such behavior is not ideal (particularly because of the client token expiry issue) but besides supporting a second converge, it’s the only way I have seen that will accomplish the desired result.

You can set no_lazy_load true in client.rb to avoid cookbook_file resources failing after 15m. This will be the default in Chef 12 client. The Chef 12 server also uses solr 4 and is going to populate search results in seconds rather than minutes.
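
For reference, that setting is a single line in client.rb. A minimal fragment, assuming the common /etc/chef/client.rb location (the path is an assumption; the setting name comes from the message above):

```ruby
# client.rb (assumed location: /etc/chef/client.rb)
# Download cookbook files at the start of the run rather than lazily,
# so cookbook_file resources do not fail late in a long-running converge.
no_lazy_load true
```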

Justin Dossey

Practice Owner

New Context Services, Inc


#9

Hi,

You asked:

Chef Listers: How have you addressed this situation in your environment?

Chef can’t converge a Nagios server often enough to keep up with a dynamic
environment. I was in the same boat a couple of years ago.

The big win here is to abandon Nagios for Sensu. I can say more if you’re
willing to consider it. http://sensuapp.org/

But it was way cool to move monitoring to the end of my run so I didn’t get
alerts on down services as a node was being instantiated for the first time.

Cheers,

Peter
