Chef-client massive memory usage spikes

We are running ubuntu 10.04 LTS, x64.
chef-client is running with the default service installed with the chef .deb packages.

On two of our machines when chef-client runs and does a check it randomly (once a day to once a week) takes longer than normal (30s becomes 3 minutes+) and takes up a lot more memory than chef should.
5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine running right now which had the issue earlier this morning, I’ve left chef-client running at 5GB in case there is anything that can be pulled from it).

For each of the machines having this problem we have an identically configured machine, same chef roles etc (secondary for failover) which is not exhibiting any of the issues (These machines also have almost no load, whereas the ones having issues have plenty of load).

I really don’t have any idea where to start with investigating this, so any help will be much appreciated!

The runlist for each machine has apt (currently 1.1.1, will update that) which installs a few packages, however none of these have changed recently, so no packages should be being installed.
Each machine has been running with the exact same runlist for quite a while (6 months+), and only started exhibiting this problem in the last month.

The chef logs when these memory/time issues occur appear in the log as follows:

[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing easy_install_package[jonpy] action install (mapnik::default line 43)
[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action install (mapscript::default line 10)
(jonpy is already installed, yet it took 2 minutes to check?)

Or more commonly like this:
[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action install (snmpd::default line 10)
[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action nothing (snmpd::default line 14)

[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action install (apache2::default line 10)
[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action nothing (apache2::default line 14)
(Both packages already installed and happily running, yet took a long time)

One machine is on chef 10.8, the other on 10.0.

Runlist from machines:
http://pastebin.com/kPikRYqd
http://pastebin.com/0uS791S4

If there are any further details that may help investigate this I will happily supply them.

Cheers,
Dave.

run chef via cron
put monit on top of it. just to check memory consumption and execution time

On Fri, Jan 20, 2012 at 2:10 AM, David Leaver dave@smartrak.co.nz wrote:

We are running ubuntu 10.04 LTS, x64.****

chef-client is running with the default service installed with the chef
.deb packages.****


On two of our machines when chef-client runs and does a check it randomly
(once a day to once a week) takes longer than normal (30s becomes 3
minutes+) and takes up a lot more memory than chef should.****

5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine
running right now which had the issue earlier this morning, I’ve left
chef-client running at 5GB in case there is anything that can be pulled
from it).****


For each of the machines having this problem we have an identically
configured machine, same chef roles etc (secondary for failover) which is
not exhibiting any of the issues (These machines also have almost no load,
whereas the ones having issues have plenty of load).****



I really don’t have any idea where to start with investigating this, so
any help will be much appreciated!****



The runlist for each machine has apt (currently 1.1.1, will update that)
which installs a few packages, however none of these have changed recently,
so no packages should be being installed.****

Each machine has been running with the exact same runlist for quite a
while (6 months+), and only started exhibiting this problem in the last
month.****


The chef logs when these memory/time issues occur appear in the log as
follows:****


[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing
easy_install_package[jonpy] action install (mapnik::default line 43)****

[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action
install (mapscript::default line 10)****

(jonpy is already installed, yet it took 2 minutes to check?)****


Or more commonly like this:****

[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action
install (snmpd::default line 10)****

[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action
nothing (snmpd::default line 14)****

….****

[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action
install (apache2::default line 10)****

[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action
nothing (apache2::default line 14)****

(Both packages already installed and happily running, yet took a long time)



One machine is on chef 10.8, the other on 10.0.****


Runlist from machines:****

http://pastebin.com/kPikRYqd****

http://pastebin.com/0uS791S4****


If there are any further details that may help investigate this I will
happily supply them.****


Cheers,****

Dave.****

I wonder if a node attribute is growing out of control. Have you looked at
the node attributes on the server after this run finishes?

Bryan
On Jan 19, 2012 3:40 PM, "David Leaver" dave@smartrak.co.nz wrote:

We are running ubuntu 10.04 LTS, x64.****

chef-client is running with the default service installed with the chef
.deb packages.****


On two of our machines when chef-client runs and does a check it randomly
(once a day to once a week) takes longer than normal (30s becomes 3
minutes+) and takes up a lot more memory than chef should.****

5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine
running right now which had the issue earlier this morning, I’ve left
chef-client running at 5GB in case there is anything that can be pulled
from it).****


For each of the machines having this problem we have an identically
configured machine, same chef roles etc (secondary for failover) which is
not exhibiting any of the issues (These machines also have almost no load,
whereas the ones having issues have plenty of load).****



I really don’t have any idea where to start with investigating this, so
any help will be much appreciated!****



The runlist for each machine has apt (currently 1.1.1, will update that)
which installs a few packages, however none of these have changed recently,
so no packages should be being installed.****

Each machine has been running with the exact same runlist for quite a
while (6 months+), and only started exhibiting this problem in the last
month.****


The chef logs when these memory/time issues occur appear in the log as
follows:****


[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing
easy_install_package[jonpy] action install (mapnik::default line 43)****

[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action
install (mapscript::default line 10)****

(jonpy is already installed, yet it took 2 minutes to check?)****


Or more commonly like this:****

[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action
install (snmpd::default line 10)****

[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action
nothing (snmpd::default line 14)****

….****

[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action
install (apache2::default line 10)****

[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action
nothing (apache2::default line 14)****

(Both packages already installed and happily running, yet took a long time)



One machine is on chef 10.8, the other on 10.0.****


Runlist from machines:****

http://pastebin.com/kPikRYqd****

http://pastebin.com/0uS791S4****


If there are any further details that may help investigate this I will
happily supply them.****


Cheers,****

Dave.****

From “knife node show --long” right now on both nodes the attributes look fine. Spits out 140KB file for one node and 98KB for the other. (Most of that is the list of cpus and abilities)
Will keep that in mind and run it immediately after it breaks next time.

Cheers,
Dave.

From: Bryan McLellan [mailto:btm@loftninjas.org]
Sent: Friday, 20 January 2012 10:40 a.m.
To: chef@lists.opscode.com
Subject: [chef] Re: chef-client massive memory usage spikes

I wonder if a node attribute is growing out of control. Have you looked at the node attributes on the server after this run finishes?

Bryan
On Jan 19, 2012 3:40 PM, “David Leaver” <dave@smartrak.co.nzmailto:dave@smartrak.co.nz> wrote:
We are running ubuntu 10.04 LTS, x64.
chef-client is running with the default service installed with the chef .deb packages.

On two of our machines when chef-client runs and does a check it randomly (once a day to once a week) takes longer than normal (30s becomes 3 minutes+) and takes up a lot more memory than chef should.
5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine running right now which had the issue earlier this morning, I’ve left chef-client running at 5GB in case there is anything that can be pulled from it).

For each of the machines having this problem we have an identically configured machine, same chef roles etc (secondary for failover) which is not exhibiting any of the issues (These machines also have almost no load, whereas the ones having issues have plenty of load).

I really don’t have any idea where to start with investigating this, so any help will be much appreciated!

The runlist for each machine has apt (currently 1.1.1, will update that) which installs a few packages, however none of these have changed recently, so no packages should be being installed.
Each machine has been running with the exact same runlist for quite a while (6 months+), and only started exhibiting this problem in the last month.

The chef logs when these memory/time issues occur appear in the log as follows:

[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing easy_install_package[jonpy] action install (mapnik::default line 43)
[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action install (mapscript::default line 10)
(jonpy is already installed, yet it took 2 minutes to check?)

Or more commonly like this:
[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action install (snmpd::default line 10)
[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action nothing (snmpd::default line 14)

[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action install (apache2::default line 10)
[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action nothing (apache2::default line 14)
(Both packages already installed and happily running, yet took a long time)

One machine is on chef 10.8, the other on 10.0.

Runlist from machines:
http://pastebin.com/kPikRYqd
http://pastebin.com/0uS791S4

If there are any further details that may help investigate this I will happily supply them.

Cheers,
Dave.

It doesn't look like you have logging set to DEBUG either. Try that and
maybe check your log rotation to protect your disk space.

Bryan
On Jan 19, 2012 5:41 PM, "David Leaver" dave@smartrak.co.nz wrote:

From “knife node show --long" right now on both nodes the attributes
look fine. Spits out 140KB file for one node and 98KB for the other. (Most
of that is the list of cpus and abilities)****

Will keep that in mind and run it immediately after it breaks next time.**
**


Cheers,
Dave.****


From: Bryan McLellan [mailto:btm@loftninjas.org]
Sent: Friday, 20 January 2012 10:40 a.m.
To: chef@lists.opscode.com
Subject: [chef] Re: chef-client massive memory usage spikes****


I wonder if a node attribute is growing out of control. Have you looked at
the node attributes on the server after this run finishes?****

Bryan****

On Jan 19, 2012 3:40 PM, "David Leaver" dave@smartrak.co.nz wrote:****

We are running ubuntu 10.04 LTS, x64.****

chef-client is running with the default service installed with the chef
.deb packages.****


On two of our machines when chef-client runs and does a check it randomly
(once a day to once a week) takes longer than normal (30s becomes 3
minutes+) and takes up a lot more memory than chef should.****

5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine
running right now which had the issue earlier this morning, I’ve left
chef-client running at 5GB in case there is anything that can be pulled
from it).****


For each of the machines having this problem we have an identically
configured machine, same chef roles etc (secondary for failover) which is
not exhibiting any of the issues (These machines also have almost no load,
whereas the ones having issues have plenty of load).****



I really don’t have any idea where to start with investigating this, so
any help will be much appreciated!****



The runlist for each machine has apt (currently 1.1.1, will update that)
which installs a few packages, however none of these have changed recently,
so no packages should be being installed.****

Each machine has been running with the exact same runlist for quite a
while (6 months+), and only started exhibiting this problem in the last
month.****


The chef logs when these memory/time issues occur appear in the log as
follows:****


[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing
easy_install_package[jonpy] action install (mapnik::default line 43)****

[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action
install (mapscript::default line 10)****

(jonpy is already installed, yet it took 2 minutes to check?)****


Or more commonly like this:****

[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action
install (snmpd::default line 10)****

[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action
nothing (snmpd::default line 14)****

….****

[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action
install (apache2::default line 10)****

[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action
nothing (apache2::default line 14)****

(Both packages already installed and happily running, yet took a long time)



One machine is on chef 10.8, the other on 10.0.****


Runlist from machines:****

http://pastebin.com/kPikRYqd****

http://pastebin.com/0uS791S4****


If there are any further details that may help investigate this I will
happily supply them.****


Cheers,****

Dave.****

Thanks Bryan, Logging on Debug now.
Results from stalls last night:

[Tue, 24 Jan 2012 00:23:25 +1300] DEBUG: Processing package[update-notifier-common] on slowbro
[Tue, 24 Jan 2012 00:23:25 +1300] INFO: Processing package[update-notifier-common] action install (apt::default line 37)
[Tue, 24 Jan 2012 00:23:25 +1300] DEBUG: package[update-notifier-common] checking package status for update-notifier-common
[Tue, 24 Jan 2012 00:23:25 +1300] DEBUG: sh(apt-cache policy update-notifier-common)
(Note the stall here for 2 minutes)
[Tue, 24 Jan 2012 00:25:29 +1300] DEBUG: package[update-notifier-common] current version is 0.99.3ubuntu0.1
[Tue, 24 Jan 2012 00:25:29 +1300] DEBUG: package[update-notifier-common] candidate version is 0.99.3ubuntu0.1
[Tue, 24 Jan 2012 00:25:29 +1300] DEBUG: package[update-notifier-common] is already installed - nothing to do

There were 2 more with stalls in the same place for different packages.

So the stall appears to be happening between executing apt-cache policy, but before any output is returned?
https://github.com/opscode/chef/blob/master/chef/lib/chef/provider/package/apt.rb

Running apt-cache policy for the 3 packages produces nothing strange looking to me:
http://pastebin.com/6f2AJvkx

Investigating a bit further (taking a guess the issue is in shell_out) leads me to:

This box is on Ubuntu 10.04 LTS with ruby 1.8:
ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

Would it be worth upgrading ruby? (schisamo - you appear to have authored shell_out, any idea if those hacks are required for that version of ruby?)
Could we change the GC.disable to only happen on the affected versions of ruby? (I could remove it and do some more test runs)

And finally, I ran chef-client in a loop until the issue occurred, the results were as follows (after quite a few runs):
[Tue, 24 Jan 2012 09:32:12 +1300] DEBUG: Processing package[iftop] on slowbro
[Tue, 24 Jan 2012 09:32:12 +1300] INFO: Processing package[iftop] action install (iftop::default line 10)
[Tue, 24 Jan 2012 09:32:12 +1300] DEBUG: package[iftop] checking package status for iftop
[Tue, 24 Jan 2012 09:32:12 +1300] DEBUG: sh(apt-cache policy iftop)
iftop:
Installed: 0.17-16
Candidate: 0.17-16
Version table:
*** 0.17-16 0
500 http://nz.archive.ubuntu.com/ubuntu/ lucid/universe Packages
100 /var/lib/dpkg/status

Output stopped here, memory usage shot up to 16GB+ (machine has 32GB), cpu sat at 0, nothing more happened.
I left it running for 10 minutes before killing it with kill -9 (ctrl-c wasn’t responding).

Any other suggestions are very much appreciated.

Cheers,
Dave.

From: Bryan McLellan [mailto:btm@loftninjas.org]
Sent: Friday, 20 January 2012 11:46 a.m.
To: chef@lists.opscode.com
Subject: [chef] Re: RE: Re: chef-client massive memory usage spikes

It doesn’t look like you have logging set to DEBUG either. Try that and maybe check your log rotation to protect your disk space.

Bryan
On Jan 19, 2012 5:41 PM, “David Leaver” <dave@smartrak.co.nzmailto:dave@smartrak.co.nz> wrote:
From “knife node show --long” right now on both nodes the attributes look fine. Spits out 140KB file for one node and 98KB for the other. (Most of that is the list of cpus and abilities)
Will keep that in mind and run it immediately after it breaks next time.

Cheers,
Dave.

From: Bryan McLellan [mailto:btm@loftninjas.orgmailto:btm@loftninjas.org]
Sent: Friday, 20 January 2012 10:40 a.m.
To: chef@lists.opscode.commailto:chef@lists.opscode.com
Subject: [chef] Re: chef-client massive memory usage spikes

I wonder if a node attribute is growing out of control. Have you looked at the node attributes on the server after this run finishes?

Bryan
On Jan 19, 2012 3:40 PM, “David Leaver” <dave@smartrak.co.nzmailto:dave@smartrak.co.nz> wrote:
We are running ubuntu 10.04 LTS, x64.
chef-client is running with the default service installed with the chef .deb packages.

On two of our machines when chef-client runs and does a check it randomly (once a day to once a week) takes longer than normal (30s becomes 3 minutes+) and takes up a lot more memory than chef should.
5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine running right now which had the issue earlier this morning, I’ve left chef-client running at 5GB in case there is anything that can be pulled from it).

For each of the machines having this problem we have an identically configured machine, same chef roles etc (secondary for failover) which is not exhibiting any of the issues (These machines also have almost no load, whereas the ones having issues have plenty of load).

I really don’t have any idea where to start with investigating this, so any help will be much appreciated!

The runlist for each machine has apt (currently 1.1.1, will update that) which installs a few packages, however none of these have changed recently, so no packages should be being installed.
Each machine has been running with the exact same runlist for quite a while (6 months+), and only started exhibiting this problem in the last month.

The chef logs when these memory/time issues occur appear in the log as follows:

[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing easy_install_package[jonpy] action install (mapnik::default line 43)
[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action install (mapscript::default line 10)
(jonpy is already installed, yet it took 2 minutes to check?)

Or more commonly like this:
[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action install (snmpd::default line 10)
[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action nothing (snmpd::default line 14)

[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action install (apache2::default line 10)
[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action nothing (apache2::default line 14)
(Both packages already installed and happily running, yet took a long time)

One machine is on chef 10.8, the other on 10.0.

Runlist from machines:
http://pastebin.com/kPikRYqd
http://pastebin.com/0uS791S4

If there are any further details that may help investigate this I will happily supply them.

Cheers,
Dave.

We had the same issue. It seemed to happen with searches that returned lots
of nodes. Also, occasionally when we got exceptions. We just used monit
which we were using for other things to restart chef when it was about x MB
of RAM 3/5 cycles.

I noticed high usage of memory when exception happens and exception
handlers dumps the run_state to json.
This has happened only when I had notification definitions inside LWRP. For
example

--- providers/foo.rb --

action :create do

template new_resource.name do
   source "foo.erb"
   notifies :restart, "service[zoo]"
end

end

--max

On Mon, Jan 23, 2012 at 2:36 PM, Brian Akins brian@akins.org wrote:

We had the same issue. It seemed to happen with searches that returned
lots of nodes. Also, occasionally when we got exceptions. We just used
monit which we were using for other things to restart chef when it was
about x MB of RAM 3/5 cycles.