Chef-server 11.0.12 tuning guide?


#1

I know 11.1.0 is right around the corner but I need something sooner… Running open source chef-server 11.0.12 on CentOS-6.

We’re seeing a bunch of nginx timeouts accessing data bags. For example:

2014/05/08 23:22:33 [error] 7637#0: *42766 upstream timed out (110: Connection timed out) while connecting to upstream, client: xx.xx.xx.xx, server: yyy.zz.com, request: "GET /data/bag/item HTTP/1.1", upstream: "http://127.0.0.1:8000/data/bag/item", host: "chef.zz.com:4000"

If I’m understanding this correctly, nginx cannot create a connection to erchef.

I’ve found very little on tuning chef-server. There is erchef['ibrowse_max_sessions'] but that would be for outbound connections, i.e. erchef->solr. Is there a parameter for the number of incoming connections to erchef?

I have 1500 clients with a 15 minute splay. So roughly 100 servers/minute with an average end-to-end chef-client run time of 43 seconds.
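
As a rough sanity check on those figures, the arithmetic can be sketched in Ruby. The variable names are illustrative only, and the concurrency estimate counts whole chef-client runs, not individual HTTP requests to erchef, so it is a ballpark figure and not a measurement:

```ruby
# Back-of-the-envelope load estimate using the numbers quoted in the post.
clients       = 1500 # nodes checking in
splay_minutes = 15   # window over which the runs are spread
run_seconds   = 43   # average end-to-end chef-client run time

runs_per_minute = clients / splay_minutes.to_f

# Little's law: average concurrency = arrival rate * average duration.
concurrent_runs = runs_per_minute / 60.0 * run_seconds

puts format("%.0f runs/minute, about %.1f runs in flight",
            runs_per_minute, concurrent_runs)
# prints "100 runs/minute, about 71.7 runs in flight"
```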

The same server running chef-10 with 10 merbs was able to keep up without issue. 11.0.8 was an improvement but it seems like 11.0.12 has regressed.

On this server we are not running into the depsolver issue.

Any help would be greatly appreciated.

Thanks.

Joe

#2

Hi Joe,

Joe Nuspl nuspl@nvwls.com writes:

I’ve found very little on tuning chef-server. There is
erchef['ibrowse_max_sessions'] but that would be for outbound
connections, i.e. erchef->solr. Is there a parameter for the number
of incoming connections to erchef?

One part of the config that limits the number of concurrent connections
an erchef server is able to handle is the db connection pool
size. Without seeing more details on logs and such, my first guess is
that the erchef server is exhausting its db connection pool.

You can tune the db connection pool and the max connections allowed by
postgres.

erchef['db_pool_size'] = 100
postgresql['max_connections'] = 200
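
When picking values for these two settings, it may help to check that the erchef pool still leaves room under the postgres limit for other connections. The following is a minimal Ruby sketch: `pool_headroom` is a hypothetical helper, not part of Chef, and the `reserved` default of 3 assumes PostgreSQL's stock `superuser_reserved_connections` value, which a given install may override.

```ruby
# Hypothetical helper, not part of Chef: reports how many postgres
# connections remain once the erchef pool and the reserved superuser
# slots are subtracted from max_connections.
def pool_headroom(db_pool_size:, max_connections:, reserved: 3)
  max_connections - reserved - db_pool_size
end

# Values suggested above in this reply.
headroom = pool_headroom(db_pool_size: 100, max_connections: 200)
puts "connections left for other postgres clients: #{headroom}"
warn "db_pool_size leaves no headroom under max_connections" if headroom <= 0
```

The settings themselves belong in chef-server.rb and are applied with `chef-server-ctl reconfigure`.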

+ seth


Seth Falcon | Engineering Lead - Continuous Delivery | @sfalcon
CHEF | http://www.getchef.com/


#3

I bumped those up and things have improved.

Could someone explain:

erchef['max_cache_size'] Default value: 10000.

Is this max number of objects? Total bytes cached? Max size of any single object in the cache?

Joe

On May 10, 2014, at 3:54 PM, Seth Falcon seth@getchef.com wrote:

You can tune the db connection pool and the max connections allowed by
postgres.

erchef['db_pool_size'] = 100
postgresql['max_connections'] = 200



#4

Hey Joe,

Based on the load you’re describing I wouldn’t expect the Chef server to
be having difficulty, especially if the 10.x version, which was much
less efficient, handled it. It’s hard for me to tell from what you
posted what the issue might be. It sounds like the server works
sometimes, but fails other times under load? If you check the erchef
logs, do they provide any more info? It’s also possible you could be
hitting something like a postgres connection limit, so I’d suggest
checking the postgres logs as well.

As far as docs on tuning, I don’t believe we have anything specific to
open source. We lay out many of the options you can tweak here:
http://docs.opscode.com/config_rb_chef_server.html

Note there is a link at the bottom of that page to even more options.

Enterprise Chef has a tuning guide that might be of some help:
http://docs.opscode.com/server_tuning.html

While open source and enterprise chef share the same core, the options
are not one-to-one equivalents, so you might need to do some
inference to determine what applies and what doesn’t. Also note that
enterprise is typically run in a tiered and HA setup, whereas open
source is typically run on a single host (which I infer is what you’re
doing, based on the localhost url for erchef).

If that doesn’t help, reply back with any questions you have and we’ll
get it sorted out.

Mark Mzyk

Joe Nuspl nuspl@nvwls.com
May 8, 2014 at 10:18 PM
I know 11.1.0 is right around the corner but I need something sooner…
Running open source chef-server 11.0.12 on CentOS-6.



#5

My interpretation of

upstream timed out (110: Connection timed out) while connecting to upstream
upstream: "http://127.0.0.1:8000/data/bag/item"

is that nginx tries to do a connect() but erchef does not do an accept() in a reasonable amount of time. Is this correct?

Joe

On May 8, 2014, at 7:47 PM, Mark Mzyk mmzyk@getchef.com wrote:

Hey Joe,

Based on the load you’re describing I wouldn’t expect the Chef server to be having difficulty, especially if the 10.x version, which was much less efficient, handled it. It’s hard for me to tell from what you posted what the issue might be. It sounds like the server works sometimes, but fails other times under load? If you check the erchef logs, do they provide any more info? It’s also possible you could be hitting something like a postgres connection limit, so I’d suggest checking the postgres logs as well.



#6

Hi Joe,

Do you have any error logs from erchef? If not error logs, do you have
request logs that show the response time of this request? The error code
110 from nginx doesn’t always mean that the request timed out during the
connection phase. The request may have failed the read timeout instead.

You could also try configuring chef-shell to point directly at erchef (port
8000) instead of going through nginx. Is this an abnormally large databag
item?
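
One way to separate the connect phase from the read phase is to time a bare TCP connect to erchef, bypassing nginx. The following is a minimal Ruby sketch under stated assumptions: `time_connect` is a hypothetical helper, and the host and port are taken from the upstream address in the nginx log line quoted earlier in the thread.

```ruby
require "socket"

# Hypothetical helper: times only the TCP connect phase to the given
# host and port, with no HTTP request sent. Returns a hash describing
# whether the connect succeeded and how long it took.
def time_connect(host, port, timeout: 5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  sock = Socket.tcp(host, port, connect_timeout: timeout)
  sock.close
  { ok: true,
    seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started }
rescue SystemCallError => e
  # Covers Errno::ETIMEDOUT, Errno::ECONNREFUSED and similar failures.
  { ok: false, error: e.class.name,
    seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started }
end

# Assumed erchef address, taken from the nginx upstream in the log line.
p time_connect("127.0.0.1", 8000, timeout: 2)
```

Running this repeatedly while the nginx errors are occurring should show whether plain connects to port 8000 are slow or failing. If they complete quickly, the delay is more likely in the request handling itself.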

On Fri, May 9, 2014 at 9:40 AM, Joe Nuspl nuspl@nvwls.com wrote:

My interpretation of

upstream timed out (110: Connection timed out) while connecting to upstream

upstream: "http://127.0.0.1:8000/data/bag/item"

is that nginx tries to do a connect() but erchef does not do an accept()
in a reasonable amount of time. Is this correct?

Joe



Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104