# Fwd: How do I know if my application has really been "provisioned"? a suggestion

#1

Sorry for breaking the thread – when I first signed up I used a plus hack address and your list software is stricter than I thought.

Anyhow my reply is included here.

Begin forwarded message:

From: Erik Hollensbe erik@hollensbe.org
Subject: Re: How do I know if my application has really been “provisioned”? a suggestion
Date: December 9, 2012 10:21:21 AM PST
To: Bryan Berry bryan.berry@gmail.com
Cc: chef@lists.opscode.com, Chef Dev chef-dev@lists.opscode.com

On Dec 9, 2012, at 4:22 AM, Bryan Berry bryan.berry@gmail.com wrote:

Erik Hollensbe is doing some freaking awesome work on workflow
orchestration w/ chef-workflow and I think it illustrates the problem
here

require 'chef-workflow/helper'

class MyTest < MiniTest::Unit::VagrantTestCase
  def before
    @json_msg = '{ "id": "dumb message json msg" }'
  end

  def setup
    provision('elasticsearch')
    provision('logstash')
    wait_for('elasticsearch')
    wait_for('logstash')
    inject_logstash_message(@json_msg)
  end

  def test_message_indexed_elasticsearch
    assert es_has_message?(@json_msg)
  end
end

If I understand Erik’s code correctly, the wait_for('elasticsearch')
only waits for the vagrant provisioner to return. The vagrant
provisioner in turn only waits for `service elasticsearch start` to
return a zero exit code.

Not exactly. It doesn’t matter for the purposes of this discussion, but I feel compelled to explain anyway: chef-workflow’s provisioner is multithreaded and dependency-based out of the box. When you ask for something to be provisioned, it gets scheduled, and a scheduler in the background tries to provision it as soon as all of its dependencies are satisfied. The call doesn’t actually wait for anything to happen beyond the message being sent to the scheduler. In the meantime, the scheduler may provision other machines needed to satisfy the requirements of that machine or group of machines.

The wait_for statement is simply a way to say, “I can’t continue until this machine actually exists” but is not coupled with a provision statement at all – the behavior you’re seeing is partially a side effect of being unable to multithread vagrant and virtualbox for provisioning (the knife side of this is already multithreaded, and the gains are huge when you provision more than one machine at a time for a specific role).
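The provision/wait_for split can be sketched in plain Ruby. This is a toy, with hypothetical names – not chef-workflow’s actual code: provision only enqueues work for a background scheduler thread, and wait_for blocks on a condition variable until the machine is actually marked up.

```ruby
# Toy scheduler: provision returns immediately; wait_for blocks.
class TinyScheduler
  def initialize
    @queue = Queue.new
    @up    = {}
    @lock  = Mutex.new
    @cond  = ConditionVariable.new
    @worker = Thread.new { loop { bring_up(@queue.pop) } }
  end

  def provision(name)
    @queue << name # returns immediately; nothing is up yet
  end

  def wait_for(name)
    @lock.synchronize { @cond.wait(@lock) until @up[name] }
    true
  end

  private

  def bring_up(name)
    sleep 0.01 # stand-in for the real vagrant/ec2 work
    @lock.synchronize do
      @up[name] = true
      @cond.broadcast
    end
  end
end
```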

This is relevant because my current task is supporting EC2 as a first-class provisioner, which means that in your test this would actually be quite a bit faster:

def setup
  provision('elasticsearch')
  provision('logstash', 1, %w[elasticsearch]) # logstash depends on ES here
  wait_for('logstash')
  inject_logstash_message(@json_msg)
end

Because the scheduler now cares about logstash’s dependency on ES. If you needed to provision other machines, you could put these wait_for statements in the unit tests themselves and literally have your tests provisioning tons of machines in the background, without halting the testing process until the one you care about has been provisioned. That’s a significant gain over time, as they’re all provisioning at once whether or not your test suite has reached the point where they matter yet.

Anyhow, this is important to point out because I think this dependency system and parallelism code can be adapted to chef converges – resource converge lists and this work are extremely similar from a conceptual standpoint – and I’m about to suggest an alternative that would solve this problem in a way that lets that happen, should the actual patches be written. Please raise your hand if you’d like chef to try to parallelize as much as it can about your converge.

We need an optional way to determine whether a server has been
completely provisioned, or that all the resources have entered a "done"
state. The only way I know that elasticsearch has started
successfully is if I see “Elasticsearch has started” in the log w/ a
timestamp more recent than when I started the service.

The before block would run before the service is actually actioned.
Chef would then need some additional machinery to collect all the done
:after blocks and the related @before_results. This could be done by
chef_handler but may be better as part of chef itself. Let’s call it
the done_handler for now. This done_handler would mark the time before
it starts handling any done_after blocks, then loop through the
collected done_after blocks for the specified timeout. Once all blocks
are complete it would continue on to other handlers, such as the
minitest_handler.

I think I have a more general suggestion that takes its cue from typical exception handling schemes in languages, but not exactly.

When you have an exception in ruby, the program aborts unless you catch it. Here’s an example:

def foo
  something_that_might_raise
rescue
  $stderr.puts "omg! we raised"
end

This is a common problem in writing routines like ‘foo’:

def foo
  create_a_file
  something_that_might_raise
  delete_that_file
rescue
  $stderr.puts "omg! we raised"
end

The problem being that if “something_that_might_raise” does indeed raise an exception, no amount of error handling is going to get “delete_that_file” called.

Luckily ruby (and other languages that use exceptions) provides us with “ensure”, which allows us to specify a bit of code that always runs, no matter what happens. The right way to write the last example:

def foo
  create_a_file
  something_that_might_raise
rescue
  $stderr.puts "omg! we raised"
ensure
  delete_that_file if file_exists
end

Saying that a chef converge is exceptions-as-flow-control isn’t exactly a leap of logic – a resource application breaks and chef blows up – that’s the end of the story. Your job is to write your cookbooks and recipes in a way that’s tolerant of these issues.

Our ensure block can raise a clearer error, but it can also clean up, and it can also verify that some side effect indeed worked. You can see above that it checks whether the file exists before attempting to delete it – in the event the create_a_file call failed, it does nothing.

Anyhow, this long-winded explanation more or less amounts to a simplification of what you’re asking for – an ensure block that spans all resource classes:

service "foo" do
  action :start
  ensure do
    sleep 10
    # ensures the socket the service foo created is open
    TCPSocket.new('localhost', 8675309)
  end
end

But it’s also general enough to gracefully handle failures:

cookbook_file "foo.tar.gz" do
  action :create
end

execute "untar foo.tar.gz" do
  code <<-EOF
    tar xzf foo.tar.gz
  EOF
  ensure do
    FileUtils.rm('foo.tar.gz') # always runs, even if the above untar fails
  end
end

A corollary call would be allowing some kind of state predicate to determine whether the ensure block is being fired due to success or failure. These could be implemented as supersets of ensure:

service "foo" do
  action :start
  success do
    # check socket
  end

  failure do
    # maybe kill process if it got started anyway?
  end
end

(Which is the way things like jQuery’s ajax tooling work.)

Or with a simple predicate you check yourself:

service "foo" do
  action :start
  ensure do
    if success
      # check socket
    else
      # maybe kill process?
    end
  end
end

It’d be nice if notifies worked here too – so you could signal another resource to run depending on what happened.

Anyhow, I think this is considerably more general and solves many more use-cases than this specific problem, but that does not exempt the specific problem from being handled. Anyhow, back to my hole.
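As a plain-Ruby illustration of the success/failure predicate idea (hypothetical helper, not a proposed Chef API), an action wrapper can record whether the yielded action raised and branch on that in its ensure-style hooks:

```ruby
# Run an action, record whether it raised, branch in the hooks.
def run_action
  succeeded = false
  begin
    yield
    succeeded = true
  rescue => e
    warn "action failed: #{e.message}"
  ensure
    if succeeded
      # success hook: e.g. check the service's socket
    else
      # failure hook: e.g. kill a half-started process
    end
  end
  succeeded
end
```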
-Erik

#2

Hey Erik, thanks for your thoughtful comments!

service "foo" do
  action :start
  ensure do
    if success
      # check socket
    else
      # maybe kill process?
    end
  end
end

I think that your example above could work for a lot of use cases, and I could definitely see myself using it. However, it doesn’t really apply to the specific use case I have in mind. I need chef to loop for a maximum specified timeout value, checking whether a condition is true. For example, a JBoss instance that uses the standalone-full configuration takes around 20 seconds to start. A one-time check after an indeterminate period will not be sufficient for my needs.

I will have to take some more time to read through your full response. Thanks for taking the time to make a thoughtful response!

On Sun, Dec 9, 2012 at 7:28 PM, Erik Hollensbe erik@hollensbe.org wrote:

[full quote of #1 snipped]

#3

-1 Internet for me. Apologies for the duplicate emails Bryan, sending
this to the whole list now.

On 12/9/12 11:14 AM, Bryan Berry wrote:

I think that your example above could work for a lot of use cases and
I could definitely see myself using it. However, it doesn’t really
apply to the specific use case I have in mind. I need chef to loop for
a maximum specified timeout value, checking if a condition is true.
For example, starting a JBoss instance that uses the standalone-full
configuration will take around 20 seconds. A one-time check
after an indeterminate period will not be sufficient for my needs.

I’m wondering if, rather than tying this to the service resource in
particular, a “lock” resource might be more useful. The API might look
something like:

lock "wait for foobar" do
until { some_ruby_code}
timeout 30 # default to nil to never timeout
end

lock "wait for wombats" do
until "some shell command"
end


I’ve seen a number of recipes that use the execute and ruby_block
resources with the retries and retry_delay attributes to mimic this
behavior.
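What retries/retry_delay amount to can be sketched in plain Ruby (this is the shape of the pattern, not Chef’s implementation): re-run a block that raises until it stops raising, up to a limit, sleeping between attempts.

```ruby
# Re-run the block on failure, up to `retries` extra attempts.
def with_retries(retries: 5, retry_delay: 0.1)
  attempts = 0
  begin
    yield
  rescue => e
    attempts += 1
    raise e if attempts > retries
    sleep retry_delay
    retry
  end
end
```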

In your example with elasticsearch, you would then have two options:
the elasticsearch node could lock, waiting for its service to be up,
or the logstash node could lock, waiting for elasticsearch to be up.
(I would paint my bikeshed this latter color.)

As an aside, you might want to check if elasticsearch is up simply by
making an HTTP call against one of its endpoints. The _status endpoint
might be useful here.
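The HTTP-check suggestion reduces to polling an endpoint until it answers. A sketch in plain Ruby – the port and path here are assumptions for a stock elasticsearch install, not anything from this thread:

```ruby
require 'net/http'

# Poll an HTTP endpoint until it answers 2xx or the deadline passes.
def wait_for_http(host, port, path, timeout: 30, interval: 0.5)
  deadline = Time.now + timeout
  until Time.now > deadline
    begin
      res = Net::HTTP.get_response(host, path, port)
      return true if res.is_a?(Net::HTTPSuccess)
    rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT, SocketError
      # service not accepting connections yet
    end
    sleep interval
  end
  false
end

# e.g. wait_for_http('localhost', 9200, '/_status', timeout: 60)
```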

Cheers,

Steven

#4

On Dec 9, 2012, at 11:32 AM, Steven Danna steve@opscode.com wrote:


I’m wondering if, rather than tying this to the service resource in
particular, a “lock” resource might be more useful. The API might look
something like:

lock "wait for foobar" do
  until { some_ruby_code }
  timeout 30 # default to nil to never timeout
end

lock "wait for wombats" do
  until "some shell command"
end

This isn’t really a lock though – nothing’s setting a lock and nothing’s waiting for a set lock.

This would work fine if it was named ‘wait’ or ‘timeout’ though, as it communicates what you’re actually doing.

-Erik

#5

Hi,

We use a couple of strategies to tackle this problem.

• Firstly, we generate our init scripts to block until the service is "up"
for some definition of up. Typically this means that the service is
listening on all the correct ports on the correct interfaces.
• Secondly, we use some outside-in test that exercises the service as if it
were a client and makes sure we get the correct response, e.g. ensure an HTTP
service responds with the correct status code when you hit particular URLs.
• Thirdly, we monitor some local system to wait for some impact the service
has on the machine (i.e. scan logs for keys, look for files created, etc).
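The first strategy’s definition of "up" – listening on the right port – reduces to a connect test. A minimal plain-Ruby sketch of that check, not the actual init scripts described above:

```ruby
require 'socket'

# True if something accepts a TCP connection on host:port.
def port_open?(host, port)
  Socket.tcp(host, port, connect_timeout: 0.5) { true }
rescue Errno::ECONNREFUSED, Errno::ETIMEDOUT, SocketError
  false
end
```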

In most cases we use execute blocks or LWRPs to ensure the service is up
which is why …

On Mon, Dec 10, 2012 at 6:32 AM, Steven Danna steve@opscode.com wrote:

I’m wondering if, rather than tying this to the service resource in
particular, a “lock” resource might be more useful. The API might look
something like:

lock "wait for foobar" do
until { some_ruby_code}
timeout 30 # default to nil to never timeout
end

lock "wait for wombats" do
until "some shell command"
end


That sounds like a neat idea!

Cheers,

Peter Donald

#6

On 12/9/12 11:50 AM, Erik Hollensbe wrote:

This isn’t really a lock though – nothing’s setting a lock and nothing’s waiting for a set lock.

This would work fine if it was named ‘wait’ or ‘timeout’ though, as it communicates what you’re actually doing.

Completely agree. ‘Lock’ is the wrong name, ‘wait’ makes more sense.

Steven Danna
Systems Engineer, Opscode, Inc

#7

On Dec 9, 2012, at 11:57 AM, Steven Danna steve@opscode.com wrote:

On 12/9/12 11:50 AM, Erik Hollensbe wrote:

This isn’t really a lock though – nothing’s setting a lock and nothing’s waiting for a set lock.

This would work fine if it was named ‘wait’ or ‘timeout’ though, as it communicates what you’re actually doing.

Completely agree. ‘Lock’ is the wrong name, ‘wait’ makes more sense.

Here ya go – untested, but the resource is called ‘wait_until’ and should work because all it really does is wrap timeout.

https://github.com/erikh/chef-wait
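What such a wait_until resource reduces to can be sketched in a few lines (this is the general shape – see the repo above for the real resource): run a predicate in a loop under Timeout.timeout.

```ruby
require 'timeout'

# Poll the block until it returns truthy or the timeout elapses.
def wait_until(timeout: 30, interval: 0.1)
  Timeout.timeout(timeout) { sleep interval until yield }
  true
rescue Timeout::Error
  false
end
```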

#8

Dear PeterD,

I have to disagree w/ the approach of changing the init script to block
until the desired condition has been reached. 99% of the time I
want a call to restart a service to return quickly. Only 1% of the
time – usually during some kind of orchestration or testing
activity – do I want to block until the state indicating the
action has fully completed is reached.

Even more so, I don’t want to actually block all of Chef until this
desired state is reached. There may be additional resources that I
want to continue to be processed after the service is started – for
example, cron tasks to clean up log files and collectd plugins to be
configured to monitor my giant, slow-ass J2EE service. Checking for
the :until condition to be met should be deferred until the end of the
chef run, or at least before other handlers run, like
minitest-chef-handler. A chef_handler is the correct place for them to
live.

erikh’s wait lwrp might do the job. However, I want the possibility
of having multiple wait resources in a chef run. I can easily
imagine trying to run several super-slow J2EE recipes on a single
machine. This might be premature optimization, but the extra
engineering effort to write a wait_handler isn’t that significant.
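The wait_handler idea sketches out roughly like this in plain Ruby (hypothetical names, not a real chef_handler): resources register their :until checks during the run, and a handler polls them all at the end, so slow services warm up while the rest of the converge proceeds.

```ruby
# Collect deferred checks during the run; drain them afterwards.
class WaitHandler
  def initialize
    @checks = []
  end

  def register(name, timeout: 30, &check)
    @checks << [name, timeout, check]
  end

  # In Chef this would run from a report handler, after the converge.
  # Returns the names of checks that never passed.
  def drain(interval: 0.1)
    failures = []
    @checks.each do |name, timeout, check|
      deadline = Time.now + timeout
      passed = false
      until passed || Time.now > deadline
        passed = check.call
        sleep interval unless passed
      end
      failures << name unless passed
    end
    failures
  end
end
```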


#9

On Dec 9, 2012, at 12:38 PM, Bryan Berry bryan.berry@gmail.com wrote:


I really think the ‘ensure’ thing I suggested has more utility and solves this problem better. I know the email is long, but I encourage its consideration for solving this (and many other) problem.

-Erik

#10

Hi,

On Mon, Dec 10, 2012 at 7:38 AM, Bryan Berry bryan.berry@gmail.com wrote:

I have to disagree w/ the approach to change the init script to block
until the satisfied condition has been reached. 99% of the time I
want a call to restart a service to return quickly. Only 1% of the
time, usually during some kind of orchestration activity or testing
activity, that I want to block for desired state that indicates the
action is fully completed.

I guess it depends on your use case. In most of the scenarios where we have
implemented this “wait til up, for some definition of up” behaviour, it is
because chef is expected to interact with the underlying service
again. i.e. when we bring up a glassfish domain, we immediately configure
it with LWRPs, and these LWRPs talk across the wire using a custom admin
protocol – thus if glassfish is not up, the LWRP would fail.

For other services where this is not a requirement we don’t bother waiting
after starting the service.

Even more so, I don’t want to actually block all of Chef until this
desired state is reached. There may be additional resources that I
want to continue to be processed after the service is started – for
example, cron tasks to clean up log files and collectd plugins to be
configured to monitor my giant, slow-ass J2EE service. Checking for
the :until condition to be met should be deferred until the end of the
chef run, or at least before other handlers run, like
minitest-chef-handler. A chef_handler is the correct place for them to
live.

It would be nice if chef allowed you to converge recipes/resources
in parallel where possible, and even offered futures to join
against when you wanted to wait between resources; however, that would
significantly increase the complexity of chef.
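The "futures to join against" idea can be illustrated with plain Ruby threads (only the shape of the idea – chef has no such API): each resource converges in a thread, and dependents join the futures they need before continuing.

```ruby
# Threads as poor-man's futures: logstash joins elasticsearch first.
order = []
futures = {}
futures[:elasticsearch] = Thread.new { sleep 0.05; order << :es; :es_up }
futures[:logstash] = Thread.new do
  futures[:elasticsearch].join # wait on the dependency's future
  order << :ls
  :ls_up
end
futures.each_value(&:join)
```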

erikh’s wait lwrp might do the job. However, I want there to be the
possibility to have multiple wait resources in chef run. I can easily

We do this as well … and use similar terminology.

It ultimately depends on how you interact with the service within the chef
converge phase.

Cheers,

Peter Donald

#11

Hi,

On Mon, Dec 10, 2012 at 7:59 AM, Peter Donald peter@realityforge.org wrote:

It would be nice if chef allowed to to converge recipes/resources in
parallel where possible and even the possibility of futures to join against
when you wanted to wait between resources however that would significantly
increase the complexity of chef.

TL;DR:
+1 Erik Hollensbe’s wait/ensure{}
-∞ parallelism in Chef
Of course, +2 my Observer suggestion.

Anyway:

Erik Hollensbe also mentioned a desire for parallelism in Chef runs.

Parallel runs are very useful.

Without raining on the Unicorn parade, can I suggest this be addressed
in other ways? My experiences with MPI and PETSc make me more than a
little leery of casual talk about parallelism. Essentially, outside of
embarrassingly parallel tasks, you’ll get yourself into a whole lot of
pain very quickly. Also, I think the ZeroMQ guys made a compelling case
against multithreading; again, just from experience, these are tough
problems to execute reliably, never mind rigorously – the early MPI
list has some interesting debates about whether some parts of the spec
are even theoretically possible, and how people can be claiming to have
implemented it fully.

I don’t think the Chef community will get a great deal of benefit by
trying to bake parallelism into Chef. I’m pretty sure they will get a
whole lot of pain. Does anyone really want to reach the point where
we start saying we need a Chef plugin for TotalView or Eclipse/PTP in
order to find out why some Chef run behaves the way it does?

However, we do all want things to run in parallel. I’m writing a lib,
to be released shortly, that does use the parallelism provided by
another library, so I’m not against this on principle.

A chef run can get complex, so I think the way to address needs for
parallelism is to suggest multiple chef-solo/client runs, and leave it
up to the recipes in each process/run to wait as they need to. Some
best-practice suggestions/patterns will emerge. One simple one is
'share nothing' between these processes, where share refers to
roles/recipes/cookbooks, not databags etc., which should be shared.
Another is: don’t change any data that another run uses without it
being communicated via the wait/ensure{} block. Even this opens up
a Pandora’s box for the unaware. But hopefully we can make an effort
to push back and keep things simple, and not have this list descend
into communication-topology discussions around parallel Chef runs.

This suggests Erik Hollensbe’s wait LWRP should suffice for immediate
needs. Longer term, I agree that something like the ensure{}
suggestion might work to allow one chef-solo/client run to coordinate
with another, irrespective of whether that other process/chef-solo run
is on another machine or the same machine.

The benefit of this is that:

• Inexperienced users are encouraged to adopt embarrassingly parallel
setups (run x instances of chef-solo, each with their 'role'/'cookbook',
waiting/ensuring as appropriate)
• Attention/emphasis switches to making/keeping chef-solo/client runs
as lean as possible – which benefits everyone.

For users who are adamant that their problem requires Chef be
"parallelized": perhaps they can be gently nudged in the direction of
Ruote, or Condor. If they are really adamant that directed acyclic
graphs just won’t suffice for them, then more firmly point them to MPI
and hope they write an ffi-MPI + chef extension for some kind of
virtual topology?

Hope that helps

Best wishes

#12

Hi,

On Mon, Dec 10, 2012 at 10:07 AM, Mark Van De Vyver mark@taqtiqa.com wrote:

Longer term I agree that something like the ensure{} suggestion might
work to allow one chef-solo/client run to coordinate with another.
Irrespective of whether that other process/Chef-Solo run is on another
machine or the same machine.

I don’t know of a good solution for coordination of chef across multiple
nodes. We currently use a separate command and control service that
orchestrates the updating of databags and the kick off of chef-client runs
where appropriate. However I have been watching flock_of_chefs [1] which is
a really interesting take on the whole deal. I am extremely interested in
that sort of approach for more choreographed releases rather than
orchestrated releases. No idea if it will eventuate or it is just an
experiment but it is an interesting project to watch and the author usually
writes good stuff.

Cheers,

Peter Donald

#13

We’ve got Flock working for cross-node resource notifications,
subscriptions, wait_for and wait_until (FYI)

–AJ


#14

On Mon, Dec 10, 2012 at 10:22 AM, Peter Donald peter@realityforge.org wrote:

Hi,

On Mon, Dec 10, 2012 at 10:07 AM, Mark Van De Vyver mark@taqtiqa.com wrote:

Longer term I agree that something like the ensure{} suggestion might
work to allow one chef-solo/client run to coordinate with another.
Irrespective of whether that other process/Chef-Solo run is on another
machine or the same machine.

I don’t know of a good solution for coordination of chef across multiple
nodes. We currently use a separate command and control service that
orchestrates the updating of databags and the kick off of chef-client runs
where appropriate.

There are good solutions, Condor being one, but they can be resource heavy.

However I have been watching flock_of_chefs [1] which is
a really interesting take on the whole deal. I am extremely interested in
that sort of approach for more choreographed releases rather than
orchestrated releases. No idea whether it will eventuate or is just an
experiment, but it is an interesting project to watch and the author usually
writes good stuff.

My sense so far is to regard thoughts like ‘this Chef run would be so
much better done in parallel’ as a kind of code smell.
Of course not everything can be redesigned the way it should have been
done, so the call for parallel will be a perennial one.

Flock does look interesting, I don’t doubt it scratches some itches.
To my mind the only long-term advantage it could offer over Condor/MPI
is its resource profile and ease of use.
Given it is written in Ruby, the only way it’ll get a resource edge is
to delimit the scope of the problem it addresses.
I can’t see that it has done that, so I think it’ll either: a) come to
constrain its communication pattern scope more aggressively, or b)
re-implement Condor/MPI, likely making Condor look resource-lite and
MPI look simple by comparison.
The high risk of b) means I’d be wary of adding it to any dev environment.

Right now I think Ruote would be the Ruby project to look at and I’d
be interested if Flock aims to address some ‘thing’ that is missing in
Ruote? Or makes it trivial to do what is tricky in Ruote?
AJ you might have an insight?

Best wishes
Mark

Cheers,

Peter Donald

#15

On Dec 9, 2012, at 3:07 PM, Mark Van De Vyver mark@taqtiqa.com wrote:

Hi,

On Mon, Dec 10, 2012 at 7:59 AM, Peter Donald peter@realityforge.org wrote:

It would be nice if chef allowed you to converge recipes/resources in
parallel where possible, and even offered futures to join against when
you wanted to wait between resources; however, that would significantly
increase the complexity of chef.

TL;DR:
+1 Erik Hollensbe’s wait/ensure{}
-{\infty} parallelism in Chef
Of course +2 My Observer suggestion.

Anyway:

Erik Hollensbe also mentioned a desire for parallelism in Chef runs.

Parallel runs are very useful.

Without raining on the Unicorn parade can I suggest this be addressed
in other ways?
My experiences with MPI and PETSc make me more than a little leery of
casual talk about parallelism. Essentially, outside of embarrassingly
parallel tasks, you’ll get yourself into a whole lot of pain very
quickly. Also, I think the ZeroMQ guys made a compelling case against
multithreading; again, from experience, these are tough problems to
execute reliably, never mind rigorously - the early MPI list has some
interesting debates about whether some parts of the spec are even
theoretically possible, and how people can claim to have
implemented it fully.

I don’t think the Chef community will get a great deal of benefit by
trying to bake parallelism into Chef. I’m pretty sure they will get a
whole lot of pain - does anyone really want to reach the point where
we start saying we need a Chef plugin for TotalView or Eclipse/PTP in
order to find out why some Chef run behaves the way it does?

Well, I should explain some things. The scheduler I wrote isn’t really built for chef – it’s a generic scheduler that works in kernel style and is used for a testing system I’m working on that needs to do a lot of expensive, blocking things. It just turns out that after I built it, that I realized it really wouldn’t take much to bolt this right over chef’s resource runs because they already do exactly what I’m doing, so it may turn out that this is a worthwhile effort.

I also agree with you on multithreading and parallel runs in general – this isn’t for everyone or every use case, including the use cases that the scheduler was built for – it has both parallel and serial modes, which work interchangeably because there is no shared state between threads and the coordinator is always a single thread – whether that’s ruby’s “main” thread or a secondary coordination thread is arguably unimportant to how it operates.

Personally I don’t care about coordination coming from chef – chef doesn’t need to be a be-everything-to-everyone tool, that’s the path to madness, and there’s plenty of work in this area that could be hooked into chef (things like mcollective come to mind) without being so invasive that chef needs to change.

The scheduler is just a coordination thread that pops a queue. Other calls are allowed to write the queue but the scheduler itself is only expected to read from it. The collection of resources (server groups in my case) has an associated list of dependencies, currently with a restriction that the dependencies have already been declared, and it also has a group of code to execute (exposed as objects similar to the actor model). Items come through the queue and are lined up with these resources – if there are no dependencies to satisfy, execution happens in a child thread and when it comes back its state is inspected – if it succeeded, it gets put in a “satisfied” set; on failure the scheduler barfs (it is built for a testing system, after all). If there are any outstanding dependencies, it gets pushed back on to the queue and will be re-evaluated until its dependencies are satisfied.
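To make the queue-and-dependencies mechanics concrete, here is a minimal single-threaded sketch. The class and method names are mine, not chef-workflow’s, and it deliberately omits child threads, failure handling, and cycle detection:

```ruby
# Toy dependency scheduler: items wait on the queue until all of their
# dependencies are in the "satisfied" set, then their block runs.
class ToyScheduler
  def initialize
    @queue     = []   # pending item names
    @deps      = {}   # name => array of dependency names
    @actions   = {}   # name => block to run when dependencies are met
    @satisfied = []   # names that have completed successfully
  end

  def schedule(name, depends_on: [], &block)
    @deps[name]    = depends_on
    @actions[name] = block
    @queue << name
  end

  # Serial mode: pop the queue; run an item only once all of its
  # dependencies are satisfied, otherwise push it back for re-evaluation.
  def run
    until @queue.empty?
      name = @queue.shift
      if (@deps[name] - @satisfied).empty?
        @actions[name].call
        @satisfied << name
      else
        @queue << name
      end
    end
    @satisfied
  end
end
```

With this sketch, scheduling 'logstash' with `depends_on: ['elasticsearch']` before 'elasticsearch' still runs elasticsearch first: logstash is popped, found unsatisfied, and pushed back until its dependency completes. A real scheduler would also need to detect unsatisfiable dependencies rather than spin forever.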

In serial mode, the child threads just execute in the main thread and the queue coordinator also runs in the main thread. This 100% translates to a single threaded scenario – all this business about shared state (or the lack thereof) is also what the ZeroMQ guys mean when they say don’t use threading (I have a bit of history with ZeroMQ and similar tech when I was still doing application programming).

I think if you think about it some, you’ll see the parallels (heh!) to chef runs here. During compilation phase the resources are arranged into a queue based on what order they need to execute in, and then that queue is acted upon. Resources are independent, just like my provisioner classes – there are a few cases where you can abuse rubyisms to modify state outside of the resource (local variables and calls like String#replace come to mind) but they are hard to do and firmly in the “I know what I’m doing” department. The big difference is that by the time a chef resource converge starts almost all of the details are sorted out, and this scheduler is a bit more dynamic in the “I’ve just been given something to do” department.

Anyhow, I want to doubly express that it wasn’t built for chef, that chef itself wasn’t the reason I wrote it, and I have no interest in throwing myself at a wall over and over again to get it merged or even tried out. It just turns out that it’s a disturbingly good fit for how chef already works, and I think it was important to mention as I know some people @ opscode have been thinking about how to solve this.

Anyhow, probably gonna drop off this thread – more code to write.

-Erik

#16

This thread has drifted pretty far from the original post’s topic, so I’m forking it.

On Sunday, December 9, 2012 at 12:59 PM, Peter Donald wrote:

Hi,

It would be nice if chef allowed to to converge recipes/resources in parallel where possible and even the possibility of futures to join against when you wanted to wait between resources however that would significantly increase the complexity of chef.

A few points here:

First of all, we’ve started talking about solutions without stating the problem. I’ll go ahead and guess that the problem being solved is optimizing the convergence time of the initial chef-client run on a bare node.

If that’s the case, then the next step is to understand the problem: why do your runs take so long? You could use something like the elapsed time handler[1] to see how long each resource type takes. From there you can dig deeper: which resource are you constrained on? Network IO? Disk IO? CPU? Depending on the answer to that question, there are probably a variety of solutions, such as creating local package mirrors, moving compilation of source code from Chef to CI (then packaging the result), etc.
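As a hedged sketch of that kind of measurement: in a real Chef report handler you would subclass Chef::Handler and read resources from `run_status` inside `#report` (check the handler docs for your Chef version). The aggregation step itself is Chef-free, so it can be shown standalone:

```ruby
# Sketch of the aggregation an elapsed-time report handler would do:
# sum time spent per resource type so you can see where a run goes slow.
module ElapsedTimeReport
  # resources: objects responding to #resource_name and #elapsed_time
  # (Chef resources expose both; any duck-typed stand-in works here).
  def self.totals_by_type(resources)
    resources.each_with_object(Hash.new(0.0)) do |r, totals|
      totals[r.resource_name] += r.elapsed_time
    end
  end
end
```

Sorting the resulting hash by value immediately tells you whether you are waiting on packages, templates, source compiles, and so on, which is the data you need before reaching for concurrency.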

Once you have a set of options, you can look at the relative value of each of them. Without data from the above step, I can’t list the “pros” for making chef concurrent, but I can list some cons:

• complication of the Chef model: you need a way to specify which parts of your run_list can run concurrently, which other parts they depend on, etc.
• Opportunity for user error: if you miss a dependency in the above, you’ll see intermittent errors due to a race condition
• Need to build concurrency primitives and make users use them instead of core ruby classes: unlike languages like Erlang, Haskell, etc., ruby’s core data structures are mutable, so wherever users have the opportunity to modify data, concurrency needs to be taken into account. Adding mutexes to an existing program is a game of whack-a-mole, and bugs could remain hidden for a long time, so the better approach would be to rearchitect Chef using some sort of actor library. Then users would need to understand that library to use Chef’s code.
• Despite the above, you still have opportunity for concurrency bugs through side effects: does your package manager get a global lock? Do different parts of your chef run implicitly rely on shared system state (files, etc.)?

I think generally if a problem is identified where concurrency is the most effective solution, you’re better off solving a small focused issue and keeping the concurrent part of the code as contained as possible. For an off-the-top-of-my-head example, if you need to download a bunch of files and you absolutely can’t cache them (or there’s no value in adding more caching), then a parallel version of the remote_file resource would be preferable to adding all the stuff necessary to run a handful of remote_file resources in parallel.
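The “keep the concurrent part contained” idea can be sketched as a small worker-pool helper: the threads live entirely inside one function, and the rest of the run stays single-threaded. This is an illustration, not Chef code; `parallel_each` is a name I made up:

```ruby
# Run a batch of blocking jobs (e.g. file downloads) across a fixed
# pool of threads. The threads are created, drained, and joined inside
# this method, so no threading leaks into the caller's code.
def parallel_each(items, workers: 4)
  queue = Queue.new
  items.each { |i| queue << i }
  threads = Array.new(workers) do
    Thread.new do
      loop do
        item = begin
          queue.pop(true)   # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break             # queue drained: this worker is done
        end
        yield item
      end
    end
  end
  threads.each(&:join)
end
```

A hypothetical parallel remote_file would call something like `parallel_each(urls) { |u| download(u) }`, where `download` is whatever single-file fetch you already trust; all items are enqueued before the workers start, so there is no producer/consumer race to reason about.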

Daniel DeLeo

#17

I think this thread is conflating (at least) three different topics:

1. parallel execution of resources within a Chef client run;
2. synchronization between changes on different nodes;
3. dependencies within services

#1 sounds scary. Tackling it within Chef would open a big can of worms,
breaking assumptions in many resources and cookbooks. This would probably
take a lot of effort, and be a real “major” release (for lack of an even
bigger level of granularity).
I have to second Dan’s point: we would need a very strong definition of a
problem before trying to solve it.
Personally I don’t have this problem so I would only see the downsides of
such a move. Consequently it would take me a long time to trust such a
"make it or break it" release.

#2 is interesting, and it seems to me Flock of Chefs is well positioned to
address it. I have never felt the need for it so I’m not going to comment on it further.

#3 is probably what is needed by most people.

In simple cases, I agree with the person (sorry, long thread, I lost track)
who said that init scripts can take care of this.
If service A depends on service B, that dependency typically goes well
beyond Chef: at boot you would still have to wait.

Of course some people are going to have more complex requirements.
Doing something that fits your own needs is not terribly hard: you add
monitoring for the presence of a pid file; then you realize you also need
to check that the process is really up.
But then you also need to check that something else is listening over TCP.
And next thing you know, you realize that when a given process restarts,
you also need to restart services that depend on it.
Jump forward a few iterations…
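One of those early iterations, the “is something really listening over TCP” check, might look like this sketch (`wait_for_tcp` is a hypothetical helper, not part of Chef):

```ruby
require 'socket'
require 'timeout'

# Poll a TCP port until something accepts a connection or we give up.
# Returns true if the port became reachable within timeout_secs.
def wait_for_tcp(host, port, timeout_secs: 30, interval: 1)
  deadline = Time.now + timeout_secs
  until Time.now > deadline
    begin
      Timeout.timeout(interval) { TCPSocket.new(host, port).close }
      return true
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError,
           Timeout::Error
      sleep interval   # not up yet; try again until the deadline
    end
  end
  false
end
```

Note how little this actually proves: the port accepting a connection says nothing about whether the service behind it is healthy, which is exactly why each check keeps sprouting the next one.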

What you have built by now is effectively a PaaS.

Wouldn’t it be more effective then to use a PaaS to begin with?

I’m not saying Chef cannot do that; it’s a programmable toolkit with full