Race conditions during resource allocation


#1

Hi all-

I have a question about possible race conditions during resource allocation. Let me give a simple (but somewhat silly) example. Imagine I have two types of nodes: web-server nodes and database nodes. As far as Chef is concerned, new nodes can come alive at any random time (in reality, they come alive when an auto-scaler determines that more resources are needed in the deployment). I want to make sure that each web-server, once configured and running, is connected to exactly one database. I can accomplish this by storing an attribute on the web-server nodes called “using-database”. When a new web-server comes alive, it searches the chef-server db for the list of web-server and database nodes, and it looks for a database node that is not being used by any other web-server. Then it sets that database node in its “using-database” attribute.

This works fine, but you get the usual race condition problem with shared read-write resources. If two web-servers come alive at the same time, they will search for an available database at the same time, and possibly select the same one, and save it to their “using-database” attributes. The one-web-server-per-database constraint is then violated.
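
A minimal sketch of that interleaving, written in Python rather than as a Chef recipe. The in-memory dictionary stands in for the Chef server's node data, and the names `find_free_database` and `save_choice` are hypothetical, not Chef APIs. The point is only that the search step and the save step are separate, so two nodes can both search before either saves.

```python
# Toy in-memory stand-in for the Chef server's node data (hypothetical names).
databases = ["db1", "db2"]
using_database = {}  # web-server name -> database name


def find_free_database():
    """Search step: return the first database no web-server is using."""
    taken = set(using_database.values())
    for db in databases:
        if db not in taken:
            return db
    return None


def save_choice(web_server, db):
    """Save step: record the choice as the node's "using-database" value."""
    using_database[web_server] = db


# Both web-servers search before either one has saved its choice.
choice_a = find_free_database()
choice_b = find_free_database()
save_choice("web-a", choice_a)
save_choice("web-b", choice_b)

print(using_database)  # both web-servers end up pointing at db1
```

If the second search ran only after the first save, the two nodes would pick different databases; the constraint breaks only in the window between search and save.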

Has anyone else come across this problem? Does this situation go against the intent of a Chef system? Or is there support for it already, or some other Chef best-practice?

Thanks in advance.

-Nick O


#2

On Fri, Oct 2, 2009 at 3:08 PM, Nick Ohanian nick@geodelic.com wrote:

Has anyone else come across this problem? Does this situation go against
the intent of a Chef system? Or is there support for it already, or some
other Chef best-practice?
Thanks in advance.

This is an interesting one, Nick. Nobody has brought it up before,
but the potential certainly exists. The fact that the write time for
the search indexes is not immediate makes the window larger as well.
As things stand right now, you can’t really get this sort of behavior
reliably from Chef. CouchDB provides some internal mechanisms for
doing conflict resolution on documents, and Chef basically uses them
to get a ‘last write wins’ model.

What you really wind up needing is an external locking mechanism that
deals with the correct distribution of these sorts of resources, and
integrating with it via a library.
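
As a rough illustration of the locking idea, here is a sketch in Python. The `DatabaseAllocator` class is a hypothetical stand-in for an external service, and `threading.Lock` stands in for whatever lock that service would really provide; none of this is part of Chef. What matters is that "find a free database" and "record who owns it" happen as one indivisible step.

```python
import threading


class DatabaseAllocator:
    """Hypothetical stand-in for an external allocation service."""

    def __init__(self, databases):
        self._lock = threading.Lock()
        self._free = list(databases)
        self._owner = {}  # database -> web-server that claimed it

    def claim(self, web_server):
        """Atomically hand out a free database, or None if none are left."""
        with self._lock:  # search and save happen under the same lock
            if not self._free:
                return None
            db = self._free.pop(0)
            self._owner[db] = web_server
            return db


allocator = DatabaseAllocator(["db1", "db2", "db3"])
results = {}


def boot(web_server):
    """Simulates a web-server coming alive and asking for a database."""
    results[web_server] = allocator.claim(web_server)


threads = [threading.Thread(target=boot, args=(f"web-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

claimed = sorted(results.values())
print(claimed)  # each database appears exactly once
```

Which web-server gets which database depends on thread scheduling, but no database is ever handed out twice.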

Another way to solve this problem would be to always bring up the
webservers and databases in pairs, and remove the need for the lock
altogether.

Regards,
Adam


Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com


#3

Thanks Adam. After working with Chef a bit more, it occurred to me that Chef’s “Convergence” model helps mitigate this problem. If the nodes are checking in with the Chef server frequently, then on the next chef-run one of the nodes could realize that it’s breaking a rule and act appropriately (thank you, splay). As long as a small period of rule-violation doesn’t blow up the entire deployment, the system should heal itself over time.
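
A small sketch of that self-healing step, in Python rather than Chef code. The tie-break rule (the web-server whose name sorts first keeps the database) is an assumption made for this example, as are the names `converge` and `using_database`; any rule works as long as every node applies the same one.

```python
databases = ["db1", "db2"]
# State left behind by the race: both web-servers picked db1.
using_database = {"web-a": "db1", "web-b": "db1"}


def converge(web_server):
    """One simulated chef-run: move to a free database if this node must yield."""
    mine = using_database[web_server]
    rivals = [ws for ws, db in using_database.items()
              if db == mine and ws != web_server]
    # Assumed tie-break: the node whose name sorts first keeps the database.
    if not rivals or web_server < min(rivals):
        return
    taken = set(using_database.values())
    free = [db for db in databases if db not in taken]
    using_database[web_server] = free[0] if free else None


for node in ("web-a", "web-b"):
    converge(node)

print(using_database)  # web-a keeps db1, web-b moves to db2
```

The deployment is inconsistent between the race and the next run, which is the short period of rule-violation that has to be tolerable.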

-Nick


#4

On Mon, Oct 19, 2009 at 9:32 AM, Nick Ohanian nick@geodelic.com wrote:

Thanks Adam. After working with Chef a bit more, it occurred to me that
Chef’s “Convergence” model helps mitigate this problem. If the nodes are
checking in with the Chef server frequently, then on the next chef-run one
of the nodes could realize that it’s breaking a rule and act appropriately
(thank you, splay). As long as a small period of rule-violation doesn’t
blow up the entire deployment, the system should heal itself over time.

Right - if you can remove the need for absolute consistency and allow
things to be eventually consistent instead, you can work it out much
easier.

http://books.couchdb.org/relax/intro/eventual-consistency

Check out the section here on CAP theorem for the knobs you are tweaking. :)

Adam




#5

Another thing to consider when designing our solution is that if the clients
are running on the same interval, it’s possible to get into a
both-on-both-off periodic behavior, i.e. livelock
(http://en.wikipedia.org/wiki/Deadlock#Livelock).
-chris
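
To make the both-on-both-off pattern concrete, here is a toy Python model. It assumes two web-servers, two databases, and a naive repair rule ("if I share my database, switch to the other one") with no tie-break; the function names are hypothetical. In the lockstep case both nodes act on the same stale snapshot. In the splay case a random delay is modeled simply as a random run order, so the second node always sees the first node's write; real splay only makes overlapping runs unlikely, not impossible.

```python
import random


def other(db):
    """Return the database that is not `db` (only db1 and db2 exist here)."""
    return "db2" if db == "db1" else "db1"


def run_lockstep(state, rounds):
    """No splay: both nodes read the same snapshot, then both write."""
    for _ in range(rounds):
        snapshot = dict(state)
        for node in state:
            rival = next(n for n in snapshot if n != node)
            if snapshot[node] == snapshot[rival]:
                state[node] = other(snapshot[node])
    return state


def run_with_splay(state, rounds, rng):
    """Splay modeled as a random run order within each round."""
    for _ in range(rounds):
        order = sorted(state, key=lambda n: rng.random())
        for node in order:
            rival = next(n for n in state if n != node)
            if state[node] == state[rival]:  # sees the rival's latest write
                state[node] = other(state[node])
    return state


stuck = run_lockstep({"web-a": "db1", "web-b": "db1"}, rounds=5)
healed = run_with_splay({"web-a": "db1", "web-b": "db1"}, rounds=5,
                        rng=random.Random(0))
print(stuck)   # both nodes still share one database
print(healed)  # the nodes ended up on different databases
```

A deterministic tie-break avoids the oscillation even without splay, because then only one of the two nodes ever moves.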

On Mon, Oct 19, 2009 at 11:28 AM, Adam Jacob adam@opscode.com wrote:

Right - if you can remove the need for absolute consistency and allow
things to be eventually consistent instead, you can work it out much
easier.