How to debug poor performance problems on Chef server?


#1

I have a Chef 10.12 server with about 100 client nodes & recently
started having performance problems with Chef.

The symptoms are that the chef-server process is constantly pegging
the CPU. I’m running the server on an Amazon m1.xlarge; this issue was
killing our m1.medium.

The Chef server logs are not that useful at figuring out what requests
are being issued from what clients, and what they’re doing. All I can
see that’s anomalous are constant node updates like this:

merb : chef-server (api) : worker (port 4000) ~ Started request
handling: Fri Aug 24 17:25:21 -0400 2012
merb : chef-server (api) : worker (port 4000) ~ Params:
{“inflated_object”=>#<Chef::Node:0x2b7a6331f778
@run_list=#<Chef::RunList:0x2b7a6331e288
@run_list_items=[#<Chef::RunList::RunListItem:0x2b7a632c6e70
@version=nil, @name=“Mgmt”, @type=:role>]>, …
@couchdb=#<Chef::CouchDB:0x2b7a6331df40 @db=“chef”,
@rest=#<Chef::REST:0x2b7a6331ddb0 @redirects_followed=0,
@auth_credentials=#<Chef::REST::AuthCredentials:0x2b7a6331dcc0
@client_name=nil, @key_file=nil>, @cookies={}, @sign_request=true,
@default_headers={}, @disable_gzip=false,
@url=“http://localhost:5984”, @sign_on_redirect=true,
@redirect_limit=10>>>, “format”=>nil, “action”=>“update”,
“id”=>“real-host-name-elided”, “controller”=>“nodes”}

even though the chef-client on the node mentioned,
“real-host-name-elided”, is running far less frequently than I see
these messages.

How can I start debugging the issue, short of putting a tcpdump on
port 4000? How come merb doesn’t show what the requesting client is?

  • Julian

#2

Most common issues i’ve seen that cause this behavior are:

  1. Node object size ( if your using windows or some other ohai plugins
    that might populate the object with a lot of data). The requests can
    bog down the server.

  2. Anti-pattern usage of node.save throughout cookbooks (which causes
    node to be committed back to server many times in a run)

  3. Any combo of those and a tight run-interval. Or synchronized run
    interval with no splay.

  4. Issues around couchdb ( but this doesn’t seem like that from the
    tiny info you provided).

You can run your chef-server with -N from the command line to see more
about requests to your server.

Jesse Nelson

On Sat, Aug 25, 2012 at 10:17 AM, Julian C. Dunn lists@aquezada.com wrote:

I have a Chef 10.12 server with about 100 client nodes & recently
started having performance problems with Chef.

The symptoms are that the chef-server process is constantly pegging
the CPU. I’m running the server on an Amazon m1.xlarge; this issue was
killing our m1.medium.

The Chef server logs are not that useful at figuring out what requests
are being issued from what clients, and what they’re doing. All I can
see that’s anomalous are constant node updates like this:

merb : chef-server (api) : worker (port 4000) ~ Started request
handling: Fri Aug 24 17:25:21 -0400 2012
merb : chef-server (api) : worker (port 4000) ~ Params:
{“inflated_object”=>#<Chef::Node:0x2b7a6331f778
@run_list=#<Chef::RunList:0x2b7a6331e288
@run_list_items=[#<Chef::RunList::RunListItem:0x2b7a632c6e70
@version=nil, @name=“Mgmt”, @type=:role>]>, …
@couchdb=#<Chef::CouchDB:0x2b7a6331df40 @db=“chef”,
@rest=#<Chef::REST:0x2b7a6331ddb0 @redirects_followed=0,
@auth_credentials=#<Chef::REST::AuthCredentials:0x2b7a6331dcc0
@client_name=nil, @key_file=nil>, @cookies={}, @sign_request=true,
@default_headers={}, @disable_gzip=false,
@url=“http://localhost:5984”, @sign_on_redirect=true,
@redirect_limit=10>>>, “format”=>nil, “action”=>“update”,
“id”=>“real-host-name-elided”, “controller”=>“nodes”}

even though the chef-client on the node mentioned,
“real-host-name-elided”, is running far less frequently than I see
these messages.

How can I start debugging the issue, short of putting a tcpdump on
port 4000? How come merb doesn’t show what the requesting client is?

  • Julian

#3

On Sat, Aug 25, 2012 at 1:25 AM, Jesse Nelson spheromak@gmail.com wrote:

Most common issues i’ve seen that cause this behavior are:

  1. Node object size ( if your using windows or some other ohai plugins
    that might populate the object with a lot of data). The requests can
    bog down the server.

  2. Anti-pattern usage of node.save throughout cookbooks (which causes
    node to be committed back to server many times in a run)

  3. Any combo of those and a tight run-interval. Or synchronized run
    interval with no splay.

  4. Issues around couchdb ( but this doesn’t seem like that from the
    tiny info you provided).

You can run your chef-server with -N from the command line to see more
about requests to your server.

Thanks for the info about “-N” and all the other suggestions.

I think I figured it out. We were using the lastrun report handler on
all nodes and for some reason it gets stuck in a state where it
hammers the Chef API, repeatedly updating the node information. Not
sure why it would do this, but the clue was that report handlers were
taking upwards of 15 minutes to run.

Once I disabled lastrun & restarted all my chef-clients, things
returned to normal. Might have something to do with the fact that I’m
on CentOS 5 and had to downgrade to Ruby 1.8.7 (long story)

  • Julian