Debugging memory leak issues with chef-client?

Folks,

I’ve seen CHEF-3432 and CHEF-3985, but I’m still seeing memory leak issues with chef-client 10.18.2, even though it’s not running in daemon mode. The VM is configured for 2GB of “RAM” and 8GB of swap, and I’m now seeing chef-client reliably running up to ~9GB VSZ and 1.7GB RSS. This just started today, as I’ve been debugging and resolving various other issues in the cookbooks and recipes that we’re using for this system.

However, it’s not exactly clear to me what the best debugging process is, to try and figure out why it’s leaking memory. I did check the node with a “knife node show -l”, and the output is about 3400 lines long, comprising some 113KB of data. That’s a little bigger than usual, but it doesn’t seem to be totally out of whack.

And no, we’re not using search at all. I wish we were, but that’s a different story for a different day.

Are there particular tools or command-line options I should be using to try and figure out why chef-client is growing without bounds?


Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu

Ruby memory profiling is in a rough spot right now, as many of the useful tools developed for Ruby 1.8 have not been updated, and efforts to update them and incorporate them into ruby core have not yet landed.

When I was investigating CHEF-3432, I used remote pry to get a shell in the running ruby process and then ObjectSpace.each_object to count various object types. If you can reproduce your memory usage issue reliably, I'd suggest putting one remote pry invocation at the beginning of a run, to get a baseline, and then a second later in the run after the memory leak has occurred.
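
The baseline-and-compare approach described above could be sketched roughly as follows. This is only an illustration, not the code used for CHEF-3432: the helper names object_counts and count_delta are made up for this example, and the only interpreter facilities assumed are ObjectSpace.each_object and GC.start.

```ruby
# Hypothetical helpers for comparing live object counts between two points
# in a run. The names are illustrative and are not part of Chef or pry.

# Return a hash of class => number of live instances.
def object_counts
  GC.start # collect garbage first so dead objects do not inflate the counts
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |obj| counts[obj.class] += 1 }
  counts
end

# Return [class, growth] pairs for every class whose count went up,
# largest growth first.
def count_delta(before, after)
  delta = {}
  after.each do |klass, n|
    diff = n - before.fetch(klass, 0)
    delta[klass] = diff if diff > 0
  end
  delta.sort_by { |_, diff| -diff }
end
```

In the first pry session you would keep `baseline = object_counts` somewhere that outlives the session (a global, for example), and in the later session run something like `count_delta(baseline, object_counts).first(10)` to see which classes grew the most.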

--
Daniel DeLeo

On Jul 8, 2013, at 1:34 PM, Brad Knowles brad@shub-internet.org wrote:

I've seen CHEF-3432 and CHEF-3985, but I'm still seeing memory leak issues with chef-client 10.18.2, even though it's not running in daemon mode. The VM is configured for 2GB of "RAM" and 8GB of swap, and I'm now seeing chef-client reliably running up to ~9GB VSZ and 1.7GB RSS. This just started today, as I've been debugging and resolving various other issues in the cookbooks and recipes that we're using for this system.

Okay, so I think I may have found a bug in chef-client. If you change the name returned by hostname to something else, I'm guessing that chef-client will flake out in precisely this manner because the node name no longer matches the hostname.

When we changed the hostname back to match the node name, this memory leak behaviour disappeared.

Should I update CHEF-3432 or CHEF-3985 with this new piece of information?

--
Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu

CHEF-3432 is fixed, and I believe based on some internal debug work that CHEF-3985 has a different root cause than what you're describing, so please put this information in a new ticket. Can you also confirm that this has nothing to do with any cookbooks you're using by running with an empty run list or a cookbook that just sleeps for a while?

--
Daniel DeLeo

On Jul 8, 2013, at 5:31 PM, Daniel DeLeo dan@kallistec.com wrote:

CHEF-3432 is fixed, and I believe based on some internal debug work that CHEF-3985 has a different root cause than what you're describing, so please put this information in a new ticket. Can you also confirm that this has nothing to do with any cookbooks you're using by running with an empty run list or a cookbook that just sleeps for a while?

I tried running with an empty run list, and chef-client only ran for a few seconds -- not long enough to show up under "top" that I had running in another terminal window.

I'll try using a cookbook that just sleeps for a while.
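
Since a short run can finish before "top" refreshes, one option is to sample the process directly. The sketch below is an illustration only: it assumes a Linux node where /proc/<pid>/status is available, and the names parse_proc_status and sample_memory are made up for this example.

```ruby
# Hypothetical sampler for a process's VSZ and RSS, assuming Linux /proc.

# Extract VmSize (VSZ) and VmRSS (RSS), in kB, from the text of
# /proc/<pid>/status.
def parse_proc_status(text)
  usage = {}
  text.each_line do |line|
    usage[$1] = $2.to_i if line =~ /\A(VmSize|VmRSS):\s+(\d+)\s+kB/
  end
  usage
end

# Read and parse the status file for the given pid.
def sample_memory(pid)
  parse_proc_status(File.read("/proc/#{pid}/status"))
end

# When run as a script with a pid argument, print one sample per second
# until the process exits.
if __FILE__ == $0 && ARGV[0]
  pid = ARGV[0].to_i
  begin
    loop do
      usage = sample_memory(pid)
      stamp = Time.now.strftime('%H:%M:%S')
      puts "#{stamp} VSZ=#{usage['VmSize']}kB RSS=#{usage['VmRSS']}kB"
      sleep 1
    end
  rescue Errno::ENOENT, Errno::ESRCH
    puts "process #{pid} has exited"
  end
end
```

Saved under any file name and started with the chef-client pid as its only argument, this leaves a timestamped log showing whether memory grows steadily or jumps at a particular point in the run.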

--
Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu

Thanks!

The reason I ask is that we run chef on all of our boxes with generated node ids and we haven't seen this issue. Could still be a bug in Chef, but I'm pretty sure there must be an additional factor besides just the node name.

--
Daniel DeLeo

On Jul 8, 2013, at 6:23 PM, Daniel DeLeo dan@kallistec.com wrote:

The reason I ask is that we run chef on all of our boxes with generated node ids and we haven't seen this issue. Could still be a bug in Chef, but I'm pretty sure there must be an additional factor besides just the node name.

If we're thinking about opening a ticket on this issue, then it makes sense to gather enough information so that we can recreate the condition. If we can't recreate it, then we can't reasonably expect someone to find a piece of code that contributes to the problem.

--
Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu