RE: Re: Re: chef client locked


#1

Hello again,

For the record, I have created a ticket and proposed a fix (http://tickets.opscode.com/browse/CHEF-4010).


Grégoire

From: Grégoire Seux
Sent: Thursday, 14 March 2013 09:29
To: chef@lists.opscode.com
Subject: RE: [chef] Re: Re: chef client locked

Thanks for both replies.
Indeed, I have reproduced this only when the Chef server is not reachable.
It seems to happen quite often, but I don’t know whether that is due to high latency between the nodes and the server (~250 ms), a saturated connection, or Chef server 11.
I’ll wait for the fix, then.


Grégoire

From: Paul Mooring [mailto:paul@opscode.com]
Sent: Wednesday, 13 March 2013 18:09
To: chef@lists.opscode.com
Subject: [chef] Re: Re: chef client locked

This should be the result of loading the node from the server somehow failing. I believe Sascha is working on a proper fix, but in the meantime this shouldn’t happen if you have a connection to the server.

Paul Mooring
Systems Engineer and Customer Advocate

www.opscode.com

From: Sascha Bates <sascha.bates@gmail.com>
Reply-To: "chef@lists.opscode.com" <chef@lists.opscode.com>
Date: Wednesday, March 13, 2013 10:02 AM
To: "chef@lists.opscode.com" <chef@lists.opscode.com>
Subject: [chef] Re: chef client locked

I can confirm this. I was debugging it earlier this week and have been looking for the time to write the code and submit a pull request instead of just filing a bug report :/

On Wed, Mar 13, 2013 at 5:27 AM, Grégoire Seux <g.seux@criteo.com> wrote:
Hello,

Using Chef 11 (11.4.0), I have noticed strange behavior when a run fails: the next run won’t start because of the locking introduced by http://tickets.opscode.com/browse/CHEF-867.

The client log is:

ERROR: Errno::ETIMEDOUT: Error connecting to https://chef03-am5/nodes/mem02-ty5 - Connection timed out - connect(2)
[2013-03-13T11:40:03+01:00] FATAL: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
[2013-03-13T11:40:03+01:00] ERROR: Sleeping for 1800 seconds before trying again
[2013-03-13T12:10:04+01:00] INFO: Chef client is running, will wait for it to finish and then run.

I guess this is not the expected effect of the lock. Is this a bug?

Cheers,


Grégoire


#2

I was looking at this but I found that there was already a discussion and a
fix submitted: http://tickets.opscode.com/browse/CHEF-3367

What I’m really curious about is why we have two different methods of
forking the process: daemon and Chef::Config[:client_fork] true/false. If
client_fork is set to false, which it is by default, the daemon takes care
of forking and that’s when we lose the pid and the client hangs.

I’m planning to push out client_fork true to all my clients this morning to
take care of the problem.
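
As a rough illustration of the setting mentioned above (a config fragment, assuming the client reads /etc/chef/client.rb, which is a common default but may differ on your nodes):

```ruby
# /etc/chef/client.rb (path assumed; adjust to your installation)
# Run each converge in a forked child process.
# This sets Chef::Config[:client_fork] to true.
client_fork true
```

Whether this actually avoids the stale lock in a given setup is the hypothesis in this message, not something the fragment guarantees.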

Sascha



#3

I am not sure this is the same issue. In the case of CHEF-4010, the issue appears when code outside the begin/ensure block fails (such as the load_node function), so the lock file is not released properly.
The fix I proposed solves this by ensuring the lock is released in every case.
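
To make the begin/ensure point concrete, here is a minimal sketch in plain Ruby. It is not Chef's actual locking code: SketchRunLock, run_once and the stubbed load_node are hypothetical names, and the lock is simply an flock on a temporary file. The only point is that the call that can fail sits inside the begin block, so the ensure clause releases the lock even when loading the node times out.

```ruby
require "tmpdir"

# Hypothetical lock wrapper for illustration; not Chef's RunLock.
class SketchRunLock
  def initialize(path)
    @path = path
    @file = nil
  end

  # Returns true if the lock was obtained, false if it is already held.
  def acquire
    @file = File.open(@path, File::RDWR | File::CREAT, 0o644)
    # flock returns 0 (truthy) on success, false if it would block.
    if @file.flock(File::LOCK_EX | File::LOCK_NB)
      true
    else
      @file.close
      @file = nil
      false
    end
  end

  def release
    return unless @file
    @file.flock(File::LOCK_UN)
    @file.close
    @file = nil
  end
end

# Stub that fails the way the log in this thread does (assumed for the demo).
def load_node
  raise Errno::ETIMEDOUT, "connect(2)"
end

def run_once(lock)
  return :busy unless lock.acquire
  begin
    load_node # inside begin, so the ensure below also covers this failure
    :ok
  rescue SystemCallError
    :failed
  ensure
    lock.release # runs whether load_node succeeded or raised
  end
end

path = File.join(Dir.mktmpdir, "sketch-chef-client.lock")
puts run_once(SketchRunLock.new(path)) # the run fails, but the lock is released
puts SketchRunLock.new(path).acquire   # so a later run can still take the lock
```

If load_node were called before the begin, the exception would skip the ensure clause and the lock would stay held, which matches the symptom in the log above.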


#4

On Monday, March 18, 2013 at 9:32 AM, Sascha Bates wrote:

I was looking at this but I found that there was already a discussion and a fix submitted: http://tickets.opscode.com/browse/CHEF-3367

What I’m really curious about is why we have two different methods of forking the process: daemon and Chef::Config[:client_fork] true/false. If client_fork is set to false, which it is by default, the daemon takes care of forking and that’s when we lose the pid and the client hangs.

“daemon” is for OG Unix process backgrounding: fork, set a new process group, replace stdin/stdout with a log file or /dev/null, etc.

“client_fork” is where each chef run forks a new process. It doesn’t create a new process group, detach from the terminal or anything like that. The point of client_fork is that a “disposable” process is used for each run, so a cookbook cannot pollute state that is persisted between runs. This prevents memory leaks in cookbooks (or Chef itself, or some interaction between the two) from impacting the system since the process dies and returns its memory at the end of the run.
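
The fork-per-run idea described above can be sketched in plain Ruby. This is an illustration only, not Chef's implementation: do_run, run_in_disposable_child and the $leaky_state global are made-up names, and it assumes a platform where Kernel#fork is available (Unix-like systems).

```ruby
# Process-wide state that a badly behaved "cookbook" might pollute.
$leaky_state = []

# Hypothetical stand-in for one converge.
def do_run
  $leaky_state << ("x" * 1024) # simulates a leak during the run
end

# Run one converge in a child process and wait for it to finish.
def run_in_disposable_child
  pid = fork do
    do_run
    exit!(0) # leave the child immediately, skipping inherited at_exit hooks
  end
  _, status = Process.wait2(pid)
  status
end

status = run_in_disposable_child
puts "child succeeded: #{status.success?}"
puts "entries leaked into the parent: #{$leaky_state.size}"
```

The child gets its own copy of the parent's memory, so whatever it appends to $leaky_state disappears when it exits; the long-lived parent never sees it.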

I’ll see about getting both of these tickets looked at during today’s code review.


Daniel DeLeo


#5

Dunno if you guys follow the code review emails, but the patch for CHEF-4010 had some small issues and is about 90% there.

http://tickets.opscode.com/browse/CHEF-4010


Daniel DeLeo