How to debug poor performance problems on Chef server?

Julian_C_Dunn1 · August 25, 2012, 1:17am

I have a Chef 10.12 server with about 100 client nodes and recently
started having performance problems with it.

The symptom is that the chef-server process constantly pegs the CPU.
I'm running the server on an Amazon m1.xlarge now; this issue was
killing our m1.medium.

The Chef server logs are not much use for figuring out which requests
are being issued by which clients, and what they're doing. All I can
see that's anomalous are constant node updates like this:

merb : chef-server (api) : worker (port 4000) ~ Started request
handling: Fri Aug 24 17:25:21 -0400 2012
merb : chef-server (api) : worker (port 4000) ~ Params:
{"inflated_object"=>#<Chef::Node:0x2b7a6331f778
@run_list=#<Chef::RunList:0x2b7a6331e288
@run_list_items=[#<Chef::RunList::RunListItem:0x2b7a632c6e70
@version=nil, @name="Mgmt", @type=:role>]>, …
@couchdb=#<Chef::CouchDB:0x2b7a6331df40 @db="chef",
@rest=#<Chef::REST:0x2b7a6331ddb0 @redirects_followed=0,
@auth_credentials=#<Chef::REST::AuthCredentials:0x2b7a6331dcc0
@client_name=nil, @key_file=nil>, @cookies={}, @sign_request=true,
@default_headers={}, @disable_gzip=false,
@url="http://localhost:5984", @sign_on_redirect=true,
@redirect_limit=10>>>, "format"=>nil, "action"=>"update",
"id"=>"real-host-name-elided", "controller"=>"nodes"}

even though the chef-client on the node mentioned,
“real-host-name-elided”, is running far less frequently than I see
these messages.

How can I start debugging the issue, short of putting a tcpdump on
port 4000? How come merb doesn’t show what the requesting client is?

Julian

Jesse_Nelson · August 25, 2012, 5:25am

The most common issues I've seen that cause this behavior are:

- Node object size (if you're using Windows, or some other ohai
  plugins that might populate the object with a lot of data). The
  requests can bog down the server.
- Anti-pattern usage of node.save throughout cookbooks (which causes
  the node to be committed back to the server many times in a run).
- Any combination of those with a tight run interval, or a
  synchronized run interval with no splay.
- Issues around CouchDB (but this doesn't seem like that, from the
  little info you provided).
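
To illustrate the splay point: chef-client's client.rb has `interval` and `splay` settings, and a non-zero splay adds a random delay so that nodes do not all hit the server at the same moment. Below is a rough Ruby sketch of why that matters. It is a toy model, not Chef's actual scheduler; the helper names, the 1800/300 second values, and the 10 second bucket are all made up for illustration.

```ruby
# Toy model (an assumption, not Chef's real scheduling code): each node
# starts its next run `interval` seconds from now, plus a random delay
# in the range 0...splay when splay is non-zero.
def next_run_offsets(node_count, interval, splay, rng = Random.new(42))
  Array.new(node_count) do
    delay = splay > 0 ? rng.rand(splay) : 0
    interval + delay
  end
end

# Largest number of runs that start inside the same `window`-second bucket.
def peak_concurrent_starts(offsets, window = 10)
  offsets.group_by { |t| t / window }.values.map(&:size).max
end

# 100 nodes, 30-minute interval, with and without a 5-minute splay
# (illustrative values only).
no_splay   = next_run_offsets(100, 1800, 0)
with_splay = next_run_offsets(100, 1800, 300)

puts "peak starts per 10s without splay: #{peak_concurrent_starts(no_splay)}"
puts "peak starts per 10s with splay:    #{peak_concurrent_starts(with_splay)}"
```

Without splay every node lands in the same bucket, so the server sees all 100 runs at once; with splay the starts are spread across the splay window.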

You can run your chef-server with -N from the command line to see more
about requests to your server.

Jesse Nelson

Julian_C_Dunn1 · August 25, 2012, 5:44am

Thanks for the info about "-N" and all the other suggestions.

I think I figured it out. We were using the lastrun report handler on
all nodes and for some reason it gets stuck in a state where it
hammers the Chef API, repeatedly updating the node information. Not
sure why it would do this, but the clue was that report handlers were
taking upwards of 15 minutes to run.
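
One way to spot this kind of repeated updating from the server side is to tally node updates per node id in the API log. Here is a minimal Ruby sketch, assuming the log looks like the Params dump in the excerpt at the top of the thread (mail-wrapped lines, `"action"=>"update"`, `"controller"=>"nodes"`, and the node name under `"id"`). The function name and the abbreviated sample are hypothetical, and the regexes may need adjusting for a real log.

```ruby
# Count node-update requests per node id in merb API log text.
# Assumption: each request begins with "Started request handling:" and its
# Params dump contains "action"=>"update", "controller"=>"nodes" and
# "id"=>"<node name>", possibly wrapped across several lines.
def count_node_updates(log_text)
  counts = Hash.new(0)
  log_text.split(/Started request\s+handling:/).each do |chunk|
    flat = chunk.gsub(/\s*\n\s*/, " ") # undo line wrapping
    next unless flat =~ /"action"=>"update"/ && flat =~ /"controller"=>"nodes"/
    counts[$1] += 1 if flat =~ /"id"=>"([^"]+)"/
  end
  counts
end

# Abbreviated sample modelled on the log excerpt in the original post.
sample = <<'LOG'
merb : chef-server (api) : worker (port 4000) ~ Started request
handling: Fri Aug 24 17:25:21 -0400 2012
merb : chef-server (api) : worker (port 4000) ~ Params:
{"format"=>nil, "action"=>"update",
"id"=>"real-host-name-elided", "controller"=>"nodes"}
LOG

# Print the busiest nodes first.
count_node_updates(sample).sort_by { |_, n| -n }.each do |id, n|
  puts "#{n}\t#{id}"
end
```

A node whose count is far higher than its run interval would explain is a good candidate for the sort of handler loop described here.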

Once I disabled lastrun and restarted all my chef-clients, things
returned to normal. It might have something to do with the fact that
I'm on CentOS 5 and had to downgrade to Ruby 1.8.7 (long story).

Julian
