Race conditions during resource allocation

Nick_Ohanian · October 2, 2009, 10:08pm

Hi all-

I have a question about possible race conditions during resource allocation. Let me give a simple (but somewhat silly) example. Imagine I have two types of nodes: web-server nodes and database nodes. As far as Chef is concerned, new nodes can come alive at any random time (in reality, they come alive when an auto-scaler determines that more resources are needed in the deployment). I want to make sure that each web-server, once configured and running, is connected to exactly one database. I can accomplish this by storing an attribute on the web-server nodes called “using-database”. When a new web-server comes alive, it searches the chef-server db for the list of web-server and database nodes, and it looks for a database node that is not being used by any other web-server. Then it sets that database node in its “using-database” attribute.

This works fine, but you get the usual race condition problem with shared read-write resources. If two web-servers come alive at the same time, they will search for an available database at the same time, and possibly select the same one, and save it to their “using-database” attributes. The one-web-server-per-database constraint is then violated.
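
The interleaving described above can be reproduced deterministically. What follows is a minimal sketch in plain Python, not Chef code: the `nodes` dictionary stands in for the node data on the Chef server, and `find_free_database` and the node names are hypothetical, chosen only for illustration.

```python
# In-memory stand-in for the node data held by the Chef server (assumption).
nodes = {
    "web1": {"role": "web-server", "using-database": None},
    "web2": {"role": "web-server", "using-database": None},
    "db1": {"role": "database"},
    "db2": {"role": "database"},
}


def find_free_database(snapshot):
    """Return the first database that no web-server in the snapshot uses."""
    used = {
        attrs["using-database"]
        for attrs in snapshot.values()
        if attrs["role"] == "web-server" and attrs["using-database"]
    }
    free = sorted(
        name
        for name, attrs in snapshot.items()
        if attrs["role"] == "database" and name not in used
    )
    return free[0] if free else None


# Both web-servers search before either one saves, so both see db1 as free.
choice_web1 = find_free_database(nodes)
choice_web2 = find_free_database(nodes)

# Each then saves its choice; neither save can tell that the other happened.
nodes["web1"]["using-database"] = choice_web1
nodes["web2"]["using-database"] = choice_web2

print(choice_web1, choice_web2)
```

Both searches run against the same data, so both web-servers pick `db1`. The search and the save are two separate steps, and nothing ties them together.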

Has anyone else come across this problem? Does this situation go against the intent of a Chef system? Or is there support for it already, or some other Chef best-practice?

Thanks in advance.

-Nick O

Adam_Jacob · October 14, 2009, 12:02am

This is an interesting one, Nick. Nobody has brought it up before,
but the potential certainly exists. The fact that the write time for
the search indexes is not immediate makes the window larger as well.
As things stand right now, you can't really get this sort of behavior
reliably from Chef. CouchDB provides some internal mechanisms for
doing conflict resolution on documents, and Chef basically uses them
to get a 'last write wins' model.
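
To make the 'last write wins' point concrete, here is a small sketch in plain Python. Both store classes are toy models written for this illustration; neither is a claim about how Chef or CouchDB is implemented, and the `db1-owner` key is a hypothetical claim record.

```python
class LastWriteWinsStore:
    """Toy document store: every save overwrites whatever is there."""

    def __init__(self):
        self.docs = {}

    def save(self, key, value):
        self.docs[key] = value


class RevisionCheckedStore:
    """Toy document store: a save succeeds only against the revision it read."""

    def __init__(self):
        self.docs = {}  # key -> (revision, value)

    def read(self, key):
        return self.docs.get(key, (0, None))

    def save(self, key, value, expected_rev):
        current_rev, _ = self.read(key)
        if current_rev != expected_rev:
            return False  # someone else wrote first; the caller must re-read
        self.docs[key] = (current_rev + 1, value)
        return True


# Last write wins: both claims are accepted, and web1 is never told it lost.
lww = LastWriteWinsStore()
lww.save("db1-owner", "web1")
lww.save("db1-owner", "web2")
lww_owner = lww.docs["db1-owner"]

# Revision check: both read revision 0, but only the first save is accepted.
checked = RevisionCheckedStore()
rev_seen_by_web1, _ = checked.read("db1-owner")
rev_seen_by_web2, _ = checked.read("db1-owner")
web1_ok = checked.save("db1-owner", "web1", rev_seen_by_web1)
web2_ok = checked.save("db1-owner", "web2", rev_seen_by_web2)

print(lww_owner, web1_ok, web2_ok)
```

Under last write wins the record ends up naming `web2` while `web1` still believes it holds the database. With a revision check the second save is refused, which is the signal a client would need in order to retry with a different database.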

What you really wind up needing is an external locking mechanism that
deals with the correct distribution of these sorts of resources, and
integrating with it via a library.
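
As a rough sketch of what such a locking mechanism has to guarantee, the following plain Python models the allocator inside one process. `DatabaseAllocator` is hypothetical; a real deployment would put the same claim operation behind a separate service that every node can reach, and the `threading.Lock` here only stands in for whatever mutual exclusion that service provides.

```python
import threading


class DatabaseAllocator:
    """Hypothetical stand-in for an external allocation service."""

    def __init__(self, databases):
        self._lock = threading.Lock()
        self._free = sorted(databases)
        self.assignments = {}

    def claim(self, web_server):
        # Check and assignment happen under one lock, so no two callers
        # can both observe the same database as free.
        with self._lock:
            if web_server in self.assignments:
                return self.assignments[web_server]
            if not self._free:
                return None  # nothing left to hand out
            database = self._free.pop(0)
            self.assignments[web_server] = database
            return database


allocator = DatabaseAllocator(["db1", "db2", "db3", "db4"])
web_servers = [f"web{i}" for i in range(1, 7)]
results = {}


def converge(name):
    """Simulate one web-server coming alive and asking for a database."""
    results[name] = allocator.claim(name)


threads = [threading.Thread(target=converge, args=(name,)) for name in web_servers]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

claimed = [db for db in results.values() if db is not None]
print(sorted(claimed))
```

Six web-servers compete for four databases. Which web-server gets which database depends on thread scheduling, but no database is ever handed out twice, and the two web-servers that arrive last get `None` instead of a shared database.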

Another way to solve this problem would be to always bring up the
webservers and databases in pairs, and remove the need for the lock
altogether.

Regards,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Nick_Ohanian · October 19, 2009, 4:32pm

Thanks Adam. After working with Chef a bit more, it occurred to me that Chef's "Convergence" model helps mitigate this problem. If the nodes are checking in with the Chef server frequently, then on the next chef-run one of the nodes could realize that it's breaking a rule and act appropriately (thank you, splay). As long as a small period of rule-violation doesn't blow up the entire deployment, the system should heal itself over time.

-Nick
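
The self-healing idea can be sketched as a check that each web-server performs on every run. This is plain Python rather than a recipe, and the tie-break rule (the web-server whose name sorts first keeps the database) is an assumption added for illustration; the thread does not say how the node that "acts appropriately" is chosen.

```python
def converge(name, assignments, databases):
    """One simulated chef-run for web-server `name` (hypothetical logic)."""
    mine = assignments.get(name)
    rivals = [
        other
        for other, db in assignments.items()
        if other != name and db == mine
    ]
    # Assumed tie-break: the web-server whose name sorts first keeps the database.
    if mine is not None and all(name < other for other in rivals):
        return
    # Otherwise give up the contested database and look for an unused one.
    used = {
        db
        for other, db in assignments.items()
        if other != name and db is not None
    }
    free = [db for db in databases if db not in used]
    assignments[name] = free[0] if free else None


databases = ["db1", "db2"]
# Start in the bad state produced by the race: both web-servers on db1.
assignments = {"web1": "db1", "web2": "db1"}

# Next round of runs: web1 keeps db1, web2 notices the violation and moves.
for node in ["web1", "web2"]:
    converge(node, assignments, databases)
healed = dict(assignments)

# A further round changes nothing, so the state is stable.
for node in ["web1", "web2"]:
    converge(node, assignments, databases)

print(healed, assignments)
```

The rule violation lasts from the colliding saves until the losing node's next run, which is the "small period" mentioned above.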

Adam_Jacob · October 19, 2009, 6:28pm

Right - if you can remove the need for absolute consistency and allow
things to be eventually consistent instead, you can work it out much
more easily.

http://books.couchdb.org/relax/intro/eventual-consistency

Check out the section here on CAP theorem for the knobs you are tweaking.

Adam


Chris_Walters · October 19, 2009, 6:34pm

Another thing to consider when designing your solution is that if the clients
are running on the same interval, it's possible to get into a
both-on-both-off periodic behavior (see "Deadlock" on Wikipedia).
-chris
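
The both-on-both-off pattern can be shown with a small simulation. This is a sketch in plain Python under two simplifying assumptions: nodes on the same interval are modeled as all reading one stale snapshot per round, and staggered runs (for example, through splay) are modeled as each node seeing the previous node's save. The function names are hypothetical.

```python
def next_choice(name, view, databases):
    """Decide which database `name` should use, given the data it can see."""
    mine = view[name]
    taken_by_others = {db for other, db in view.items() if other != name}
    if mine not in taken_by_others:
        return mine  # no conflict, keep the current database
    free = [db for db in databases if db not in taken_by_others]
    return free[0] if free else mine


def run_lockstep(assignments, databases, rounds):
    """Every node reads the same stale snapshot, then all of them save."""
    history = []
    for _ in range(rounds):
        snapshot = dict(assignments)
        for name in list(assignments):
            assignments[name] = next_choice(name, snapshot, databases)
        history.append(dict(assignments))
    return history


def run_staggered(assignments, databases, rounds):
    """Nodes run one after another, each seeing the previous node's save."""
    history = []
    for _ in range(rounds):
        for name in list(assignments):
            assignments[name] = next_choice(name, assignments, databases)
        history.append(dict(assignments))
    return history


databases = ["db1", "db2"]
lockstep = run_lockstep({"web1": "db1", "web2": "db1"}, databases, 4)
staggered = run_staggered({"web1": "db1", "web2": "db1"}, databases, 2)

print([state["web1"] for state in lockstep])
print(staggered[-1])
```

In lockstep both web-servers see the conflict, both move to the other database, and they collide again on every round. Once the runs are staggered, the second node sees that the first has already moved and keeps its database, so the conflict is resolved within one round.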

