Performance Tuning for Open Source Chef Server 11 to serve hundreds of nodes


#1

Hi Chefs,

I’m using Chef to bootstrap a 300-node cluster, and a few chef-clients
throw the error “503 Service Unavailable” when all 300 nodes start
syncing cookbooks at the same time. The Chef server log shows erchef
crashed. Most of the time, syncing cookbooks on all the nodes again
after the first failure succeeds.

There is a ticket, CHEF-4423 "Chef Synchronization of Cookbook Slow"
https://tickets.opscode.com/browse/CHEF-4423, to speed up syncing
cookbooks. I applied the new chef-client that contains the fix.
However, the fix doesn’t help in my case.

Are there any tips for tuning Open Source Chef Server 11 to serve
hundreds of nodes? Thanks a lot.

Here is my chef configuration:

Chef Server

2 cores + 8G Mem

$ cat /etc/issue
CentOS release 5.10 (Final)

$ rpm -q chef-server
chef-server-11.1.1-1.el5

$ cat /etc/chef-server/chef-server.rb
chef_server_webui['enable'] = false
nginx['ssl_port'] = 9443
nginx['non_ssl_port'] = 9080
chef_solr['heap_size'] = 3072
chef_solr['commit_interval'] = 3000
chef_solr['poll_seconds'] = 6
chef_expander['nodes'] = 4
erchef['s3_parallel_ops_timeout'] = 30000
erchef['s3_url_ttl'] = 3600
erchef['db_pool_size'] = 150
postgresql['max_connections'] = 200
rabbitmq['node_port'] = 5672

Chef Client

$ cat /etc/issue
CentOS release 6.4 (Final)

$ rpm -q chef
chef-11.14.0.rc.2-1.el6.x86_64

$ cat /etc/chef/client.rb
log_location STDOUT
chef_server_url "https://<fqdn>:9443"
validation_client_name "chef-validator"
node_name "bigtop-master-0"

no_lazy_load true
ssl_verify_mode :verify_peer
ssl_ca_path "/etc/chef/trusted_certs"

chef-server-ctl tail

full output of “chef-server-ctl tail” is attached.

==> /var/log/chef-server/erchef/crash.log.1 <==
** Last event was: {log,{lager_msg,[],[{application,chef_wm},{module,chef_wm_status},{function,log_failure},{line,77},{pid,"<0.23096.41>"},{node,'erchef@127.0.0.1'}],error,{["2014",45,"07",45,"20"],["22",58,"40",58,"02",46,"080"]},{1405,896002,80894},[47,95,115,116,97,116,117,115,"\n","{{status,fail},{upstreams,{[{<<\"chef_solr\">>,<<\"fail\">>},{<<\"chef_sql\">>,<<\"pong\">>}]}}}"]}}
** When handler state == {state,{mask,127},lager_default_formatter,["[",severity,"] ",message,"\n"],[]}
** Reason == {badarg,[{io,put_chars,[user,unicode,["[","error","] ",[47,95,115,116,97,116,117,115,"\n","{{status,fail},{upstreams,{[{<<\"chef_solr\">>,<<\"fail\">>},{<<\"chef_sql\">>,<<\"pong\">>}]}}}"],"\n"]],[]},{lager_console_backend,handle_event,2,[{file,"src/lager_console_backend.erl"},{line,95}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,504}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,486}]},{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,248}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
2014-07-20 22:40:02 =ERROR REPORT====
{<<"method=GET; path=/_status; status=500; ">>,"Internal Server Error"}
2014-07-20 22:40:02 =ERROR REPORT====
** gen_event handler lager_console_backend crashed.
** Was installed in lager_event
** Last event was: {log,{lager_msg,[],[{pid,<0.101.0>}],error,{["2014",45,"07",45,"20"],["22",58,"40",58,"02",46,"085"]},{1405,896002,85731},[103,101,110,95,101,118,101,110,116,32,"lager_console_backend",32,105,110,115,116,97,108,108,101,100,32,105,110,32,"lager_event",32,116,101,114,109,105,110,97,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"bad argument in call to io:put_chars(user, unicode, [\"[\",\"error\",\"] \",[47,95,115,116,97,116,117,115,\"\\n\",\"{{status,fail},{upstreams,{[{<<\\\"chef_solr\\\">...\"],...]) in lager_console_backend:handle_event/2 line 95"]}}
** When handler state == {state,{mask,127},lager_default_formatter,["[",severity,"] ",message,"\n"],[]}
** Reason == {badarg,[{io,put_chars,[user,unicode,["[","error","] ",[103,101,110,95,101,118,101,110,116,32,"lager_console_backend",32,105,110,115,116,97,108,108,101,100,32,105,110,32,"lager_event",32,116,101,114,109,105,110,97,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"bad argument in call to io:put_chars(user, unicode, [\"[\",\"error\",\"] \",[47,95,115,116,97,116,117,115,\"\\n\",\"{{status,fail},{upstreams,{[{<<\\\"chef_solr\\\">...\"],...]) in lager_console_backend:handle_event/2 line 95"],"\n"]],[]},{lager_console_backend,handle_event,2,[{file,"src/lager_console_backend.erl"},{line,95}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,504}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,486}]},{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,248}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}

==> /var/log/chef-server/erchef/erchef.log <==
2014-07-22 04:15:48.514 [error] Supervisor stats_hero_worker_sup had child stats_hero started with {stats_hero,start_link,undefined} at <0.30506.3> exit with reason {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"358","\n",[[["chefAPI",".application.byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".","10",".byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".application.allRequests"],58,"1866",124,"h","\n"],[["chefAPI",".","10",".allRequests"],58,"1866",124,"h","\n"],[["chefAPI",".application.byRequestType.","roles",".","GET"],58,"1866",124,"h","\n"],[["chefAPI",".upstreamRequests.","rdbms"],58,"1834",124,"h","\n"],[["chefAPI",".upstreamReques...",...],...],...]]}]}} in context child_terminated
2014-07-22 04:15:48.600 [error] Supervisor stats_hero_worker_sup had child stats_hero started with {stats_hero,start_link,undefined} at <0.30501.3> exit with reason {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"356","\n",[[["chefAPI",".application.byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".","10",".byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".application.allRequests"],58,"2000",124,"h","\n"],[["chefAPI",".","10",".allRequests"],58,"2000",124,"h","\n"],[["chefAPI",".application.byRequestType.","data",".","GET"],58,"2000",124,"h","\n"],[["chefAPI",".upstreamRequests.","rdbms"],58,"1987",124,"h","\n"],[["chefAPI",".upstreamRequest...",...],...],...]]}]}} in context child_terminated
2014-07-22 04:15:48.613 [error] Supervisor stats_hero_worker_sup had child stats_hero started with {stats_hero,start_link,undefined} at <0.30527.3> exit with reason {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"110","\n",[[["chefAPI",".application.allRequests"],58,"1",124,"m","\n"],[["chefAPI",".","10",".allRequests"],58,"1",124,"m","\n"],[["chefAPI",".application.byRequestType.","data",".","GET"],58,"1",124,"m","\n"]]]}]}} in context child_terminated
2014-07-22 04:16:12.850 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
2014-07-22 04:16:13.010 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
2014-07-22 04:16:13.024 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
2014-07-22 04:17:05.852 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
2014-07-22 04:17:13.356 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
2014-07-22 04:17:13.363 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
2014-07-22 04:17:13.433 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}

Thanks
Jesse Hu


#2

You have 300 concurrent connections - try tuning the pool size and max DB
connections to be larger than that.
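
For example, something along these lines in /etc/chef-server/chef-server.rb,
followed by a chef-server-ctl reconfigure. The exact numbers below are only
a sketch sized for roughly 300 concurrent clients, not a tested recommendation:

# /etc/chef-server/chef-server.rb -- illustrative values only
erchef['db_pool_size'] = 350          # at least one connection per concurrent chef-client, plus headroom
postgresql['max_connections'] = 400   # keep this above db_pool_size plus any other DB consumers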


#3

Any preference for where I should save these recommendations for posterity?

I probably should add some material on horizontal scaling / HA config. Oh, and how to set up estatsd for pretty graphs. Anything else we should put there?


http://josephholsten.com



#4

As Adam mentioned, the corresponding parameters are erchef['db_pool_size']
and postgresql['max_connections']. You should also observe the Chef
server's CPU profile during the chef runs. We had similar issues, and we
had to bump up both connection limits as well as provision a bigger Chef
server instance.
Even with the new configs and instance, we did stress testing to find the
maximum number of concurrent chef-client runs we could handle, and then
added splay time between batches of chef runs. It is usually only during
the initial phase (cookbook sync, searches, etc.) and the end phase
(posting node data back) that chef-client needs to talk to the Chef
server. If you can design a way to distribute that communication over a
time frame, you get a predictable, fleet-wide chef run, where Chef server
load and fleet-wide run time grow predictably as your cluster grows.
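
For example, if the clients run as daemons, the standard interval/splay
settings in client.rb already give a crude way to spread the load over a
time frame (the numbers below are only illustrative):

# /etc/chef/client.rb
interval 1800   # run chef-client every 30 minutes
splay    900    # plus a random 0-900 second offset per node, so runs don't all line up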



#5

Thanks Adam & Ranjib.

Is there a recommended ratio between erchef['db_pool_size'] and the
number of concurrent chef-clients? Do we need to tune nginx and erchef
themselves? Is the PostgreSQL connection count the only bottleneck when
scaling up Chef Server 11?

In my case, I make several chef node save calls and Chef search API
calls during cookbook execution, so the Chef server throws a 503 or an
EOF exception when saving the node or calling search. The node save
calls record the current bootstrap progress and the task in flight, so
we can tell the user what is being done on each node. The search API is
called when a node needs to find the dependent node running a specified
service: the dependent node adds a tag to indicate it is running the
service, and the other nodes search for nodes by that tag name.
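
In recipe terms the pattern is roughly this (a sketch only; the tag name
'hadoop_namenode' is just a placeholder, not my real tag):

# on the node that provides the service
tag('hadoop_namenode')   # mark this node as running the service
node.save                # push the tag (and bootstrap progress) to the Chef server

# on the nodes that depend on it
providers = search(:node, 'tags:hadoop_namenode')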

Thanks
Jesse Hu | My Profile
https://people.vmware.com/webapp/htdocs/PersonalProfile.php?guid=huh@vmware.com
| Project Serengeti http://projectserengeti.org/


#6

Hi,

Is there a recommended ratio between erchef['db_pool_size'] and the number of concurrent chef-clients?

I’m not sure I would go as far as to call it a recommendation, but I
typically start at a 1:1 ratio plus some headroom if I know the machine
can support that many PostgreSQL connections. Note that as you increase
the number of connections in the pool, erchef takes a bit longer to
start up.

Is the PostgreSQL connection count the only bottleneck when scaling up Chef Server 11?

Based on this:

2014-07-22 04:16:13.024 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}

My suspicion is that you might also be running out of depsolver
workers. In order to take advantage of the gecode solver, erchef
starts a fixed number of workers that do nothing but solve dependency
problems (by default, we start 5). When all workers are busy, you’ll
see a 503 on the API call in the error message above. While the
number of workers is configurable, I’m skeptical about bumping it,
since depsolving is largely a CPU-bound activity and you are already
at 5 workers for your 2 CPUs. However, if you are willing to
sacrifice latency for throughput, you might experiment with bumping

erchef['depsolver_worker_count']
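
For example (an illustrative value only; as noted above, the default is 5 workers):

# /etc/chef-server/chef-server.rb
erchef['depsolver_worker_count'] = 8   # more concurrent depsolves, but they will contend for your 2 CPUs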

Some other observations and questions below:

erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?

no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off; this will cut down the number of files you need to
sync concurrently.
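
That is, either drop the no_lazy_load line from client.rb or set it back explicitly:

# /etc/chef/client.rb
no_lazy_load false   # fetch cookbook files on demand during the run instead of all up front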

Cheers,

Steven



#7

Like Steven mentioned, the 503s are actually coming from a lack of
depsolver workers, and that count can be bumped accordingly. A few more
should help address this, but I typically suggest not increasing that
number past the number of CPUs you have available; in this case, you’re
already 2.5x oversubscribed. When the server returns a 503, the client
should retry the request with an exponential backoff of up to 60 seconds
between retries. Are these 503s actually stopping chef-client runs?
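
If the 503s are failing runs outright rather than just slowing them down,
the client-side retry knobs can also be made more forgiving. This is a
sketch only: http_retry_count and http_retry_delay are the general Chef
HTTP retry settings, separate from the built-in cookbook_versions backoff
described above, and the values here are guesses:

# /etc/chef/client.rb
http_retry_count 10   # retry failed API requests more times before giving up
http_retry_delay 10   # wait longer between retries (seconds)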

I also agree with Steven on no_lazy_load. Setting it to true actually
increases load on the server at the start of a chef-client run instead of
spreading it out across the run. If you’ve found that your node
bootstraps take longer than the 1-hour expiry time of
erchef['s3_url_ttl'], you can just bump that value to compensate instead.
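
For example (an illustrative value only):

# /etc/chef-server/chef-server.rb
erchef['s3_url_ttl'] = 7200   # give slow bootstraps 2 hours before signed cookbook URLs expire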

On Tue, Jul 29, 2014 at 4:36 AM, Steven Danna steve@opscode.com wrote:

Hi,

Is there a recommended ratio between erchef[‘db_pool_size’] and the
number of concurrent chef clients ?

I’m not sure I would go as far to call it a recommendation, but I
typically start at 1:1 ratio plus some headroom if I know the machine can
support that many postgresql connections. Note that as you increase
the number of connections in the pool, erchef does take a bit longer
to start up.

Is the postgresql connection the only bottle neck to scale up Chef
Server 11 ?

Based on this:

2014-07-22 04:16:13.024 [error] {<<"method=POST;
path=/environments/_default/
cookbook_versions; status=503; ">>,“Service Unavailable”}

My suspicion is that you might also be running out of depsolver
workers. In order to take advantage of the gecode solver, erchef
starts a fixed number of workers that do nothing but solve dependency
problems (by default, we start 5). When all worker are busy, you’ll
see a 503 on the API call in the error message above. While the
number of workers is configurable, I’m skeptical about bumping the
number of workers as depsolving is largely a CPU bound activity and
you are already at 5 workers for your 2 CPUs. However, if you are
willing to sacrifice latency for throughput, you might experiment with
bumping

erchef[‘depsolver_worker_
count’]

Some other observations and questions below:

erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?

no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off, this will save a bit on the number of files you
need to sync concurrently.

Cheers,

Steven

On Tue, Jul 29, 2014 at 7:14 AM, Jesse Hu huhui14@gmail.com wrote:

Thanks Adam & Ranjib.

Is there a recommended ratio between erchef[‘db_pool_size’] and the
number
of concurrent chef clients ? Do we need to tune the nginx and erchef
itself
? Is the postgresql connection the only bottle neck to scale up Chef
Server
11 ?

In my case, I have several chef node save calls and chef search API calls
during the cookbook exection, so the Chef Server will throw 503 or EOF
exception when saving the node or calling chef search. The chef node
save
calls is to save current bootstrap progress and task ongoing, so we can
tell
the user what’s being done on each node. The chef search API calls is
called when a node needs to find out dependent node running the specified
service. The dependent node will add a tag to indicate it’s running the
service, and other nodes search nodes by the tag name.

Thanks
Jesse Hu | My Profile | Project Serengeti


Ranjib Dey wrote On Mon, 28 Jul 2014 21:56:28 -0700 :

as Adam mentioned, the corresponding parameters are erchef[‘db_pool_size’]
& postgresql[‘max_connections’]. You should also observe the chef server’s
CPU profile during the chef runs. We had similar issues, and we had to bump
up both the connection limits and provision a bigger chef server instance.
Even with the new configs & instance, we did stress testing to find out the
max concurrent chef-client runs we could handle, and then added splay time
between batches of chef runs. Usually it is only during the initial phase
(cookbook sync, searches etc.) and the end phase (posting node data back)
that the chef client needs to talk to the chef server; if you can design a
way to distribute that communication over a time frame, it will give you a
predictable, fleet-wide chef run, where the chef server load and fleet-wide
chef run time grow predictably as your cluster grows.
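for steady-state runs, the client-side interval/splay settings are one simple way to spread that load (values below are just an example); the initial bootstrap burst still needs the batching described above:

# /etc/chef/client.rb, for daemonized periodic runs
interval 1800   # run every 30 minutes
splay 300       # plus a random 0-300s offset per node, so runs do not line up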

On Mon, Jul 28, 2014 at 9:23 PM, Adam Jacob adam@getchef.com wrote:

You have 300 concurrent connections - try tuning pool size and max db
connections larger than that.

On Jul 28, 2014 8:57 PM, “Jesse Hu” huhui14@gmail.com wrote:

Hi Chefs,

I’m using Chef to bootstrap a 300 nodes cluster and a few chef-clients
throws the error “503 Service Unavailable” when the 300 nodes started
syncing cookbooks at the same time. The chef server log shows erchef
crashed. Most of the time, syncing cookbooks on all the nodes again
after

the 1st failure will succeed.

There is a ticket Chef -4423 Chef Synchronization of Cookbook Slow to
speed up syncing cookbooks. I applied the new chef-client which
contains the

fix. However, the fix doesn’t work in my case.

Is there any tips to tune Open Source Chef Server 11 to serve hundreds
of

nodes ? Thanks a lot.

Here is my chef configuration:

Chef Server

2 cores + 8G Mem

$ cat /etc/issue
CentOS release 5.10 (Final)

$ rpm -q chef-server
chef-server-11.1.1-1.el5

$ cat /etc/chef-server/chef-server.rb
chef_server_webui[‘enable’] = false
nginx[‘ssl_port’] = 9443
nginx[‘non_ssl_port’] = 9080
chef_solr[‘heap_size’] = 3072
chef_solr[‘commit_interval’] = 3000
chef_solr[‘poll_seconds’] = 6
chef_expander[‘nodes’] = 4
erchef[‘s3_parallel_ops_timeout’] = 30000
erchef[‘s3_url_ttl’] = 3600
erchef[‘db_pool_size’] = 150
postgresql[‘max_connections’] = 200
rabbitmq[‘node_port’] = 5672

Chef Client

$ cat /etc/issue
CentOS release 6.4 (Final)

$ rpm -q chef
chef-11.14.0.rc.2-1.el6.x86_64

$ cat /etc/chef/client.rb
log_location STDOUT
chef_server_url "https://:9443"
validation_client_name "chef-validator"
node_name “bigtop-master-0”

no_lazy_load true
ssl_verify_mode :verify_peer
ssl_ca_path “/etc/chef/trusted_certs”


Thanks
Jesse Hu


Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104


#8

Thanks a lot Steven & Stephen!

I’m not sure I would go as far to call it a recommendation, but I
typically start at 1:1 ratio plus some headroom if I know the machine can
support that many postgresql connections. Note that as you increase
the number of connections in the pool, erchef does take a bit longer
to start up.
I will try a 1:1 ratio plus some headroom between
erchef[‘db_pool_size’] and the number of concurrent chef clients. It
should consume more CPU and memory.

Is there a limit on the number of concurrent chef clients served by a
single Chef Server 11 ? The depsolver worker count seems to be a blocker,
unless more CPU cores are added.

If I add more CPU cores, will increasing
erchef[‘depsolver_worker_count’] and nginx[‘worker_processes’] solve the
503 issue? Are both of these attributes’ default values related to the
number of CPU cores ?

By returning a 503, the client should retry said request with an
exponential backoff of up to 60 seconds between retries. Are these
503s stopping chef-client runs?
Yes, the chef-client exits after getting a 503. How can I tell
the chef-client to retry syncing cookbooks or calling the Chef Node or Search
APIs? Is there a setting for this in /etc/chef/client.rb? If not, I may
re-spawn chef-client when it gets a 503 error.
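If there is such a setting, I guess it would be something like the generic HTTP retry knobs below (just my guess, not something confirmed here):

# /etc/chef/client.rb (untested guess)
http_retry_count 10   # attempts for retriable API responses; default is 5
http_retry_delay 10   # seconds to wait between attempts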

Some other observations and questions below:

erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?

With the default value on Chef Server 11.0.8, ‘knife
cookbook upload -a’ would hang forever, so I increased it. I’m not sure
whether that still reproduces on the latest Chef Server 11.1.1.

no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off, this will save a bit on the number of files you
need to sync concurrently.

I tried ‘no_lazy_load false’ in my 300-node cluster and it still
threw 503 errors. I see there is a commit that makes no_lazy_load the default
(https://github.com/opscode/chef/commit/1ee229b2407231d7ddf0f9f2fba1d725a92e6792)
from 7 days ago. As it describes, my chef-client might run for a long time
(1~2 hours), so I think I’d better set no_lazy_load to true.

Thanks
Jesse Hu


Stephen Delano wrote On Tue, 29 Jul 2014 11:57:02 -0700 :

Like Steven mentioned, the 503s are actually coming from a lack of
depsolver workers, and that can be bumped accordingly. A few more
should help address this, but I typically suggest not increasing that
number past the number of CPUs you have available. In this case,
you’re already 2.5x oversubscribed. By returning a 503, the client
should retry said request with an exponential backoff of up to 60
seconds between retries. Are these 503s stopping chef-client runs?

I also agree with Steven on the no_lazy_load. This will actually
increase load on the server at the start of a chef-client run instead
of spreading it out across the run. If you’ve found that your node
bootstraps are taking longer than the 1 hour expiry time of
erchef[‘s3_url_ttl’], then you can actually just bump this value to
compensate for that instead.

On Tue, Jul 29, 2014 at 4:36 AM, Steven Danna <steve@opscode.com
mailto:steve@opscode.com> wrote:

Hi,

> Is there a recommended ratio between erchef['db_pool_size'] and
the number of concurrent chef clients ?

I'm not sure I would go as far to call it a recommendation, but I
typically start at 1:1 ratio plus some headroom if I know the
machine can
support that many postgresql connections.  Note that as you increase
the number of connections in the pool, erchef does take a bit longer
to start up.

> Is the postgresql connection the only bottle neck to scale up
Chef Server 11 ?

Based on this:

   2014-07-22 04:16:13.024 [error] {<<"method=POST;
path=/environments/_default/
cookbook_versions; status=503; ">>,"Service Unavailable"}

My suspicion is that you might also be running out of depsolver
workers.  In order to take advantage of the gecode solver, erchef
starts a fixed number of workers that do nothing but solve dependency
problems (by default, we start 5). When all worker are busy, you'll
see a 503 on the API call in the error message above.  While the
number of workers is configurable, I'm skeptical about bumping the
number of workers as depsolving is largely a CPU bound activity and
you are already at 5 workers for your 2 CPUs.  However, if you are
willing to sacrifice latency for throughput, you might experiment with
bumping

  erchef['depsolver_worker_
count']

Some other observations and questions below:

    erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?

    no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off, this will save a bit on the number of files you
need to sync concurrently.

Cheers,

Steven

On Tue, Jul 29, 2014 at 7:14 AM, Jesse Hu <huhui14@gmail.com
<mailto:huhui14@gmail.com>> wrote:
> Thanks Adam & Ranjib.
>
> Is there a recommended ratio between erchef['db_pool_size'] and
the number
> of concurrent chef clients ? Do we need to tune the nginx and
erchef itself
> ?  Is the postgresql connection the only bottle neck to scale up
Chef Server
> 11 ?
>
> In my case, I have several chef node save calls and chef search
API calls
> during the cookbook exection, so the Chef Server will throw 503
or EOF
> exception when saving the node or calling chef search.  The chef
node save
> calls is to save current bootstrap progress and task ongoing, so
we can tell
> the user what's being done on each node.  The chef search API
calls is
> called when a node needs to find out dependent node running the
specified
> service. The dependent node will add a tag to indicate it's
running the
> service, and other nodes search nodes by the tag name.
>
> Thanks
> Jesse Hu | My Profile | Project Serengeti
>
> ________________________________
> Ranjib Dey wrote On Mon, 28 Jul 2014 21:56:28 -0700 :
>
> as adam mentioned, the corresponding parameters are,
erchef['db_pool_size']
> & postgresql['max_connections'] . you should also observe
chef-server's cpu
> profile during the chef runs. We had similar issues, and we had
to bump up
> both the connection limits as well as provision a bigger chef server
> instance.
> even with new configs & instance, we did the stress testing to
find out the
> max concurrent chef client runs we could do, and then add splay
time between
> batches of chef run. Usually it only during the initial
(cookbook sync,
> searches etc) and end phase (posting node data back) when chef
client needs
> to talk to chef server, if you can design a way to distribute the
> communication over a time frame, it will give you a predictable,
fleet wide
> chef run, where the chef server load or fleet wide chef run time
will grow
> predictably as your cluster grows.
>
>
>
> On Mon, Jul 28, 2014 at 9:23 PM, Adam Jacob <adam@getchef.com
<mailto:adam@getchef.com>> wrote:
>>
>> You have 300/concurrent connections - try tuning pool size and
max db
>> connections larger than that.
>>


Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104


#9

Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu huhui14@gmail.com wrote:

If adding more CPU cores, will increase erchef[‘depsolver_worker_
count’] and nginx[‘worker_processes’] solve the 503 issue? Both these
2 attributes’ default values are related to CPU cores number ?

nginx[‘worker_processes’] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure. You will, however,
need to manually bump the depsolver_worker_count.
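If you do want to pin it explicitly, it is just another chef-server.rb entry (example value only):

# /etc/chef-server/chef-server.rb
nginx['worker_processes'] = 4   # otherwise chef-server-ctl reconfigure derives it from the detected CPU count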

Yes, the chef-client exits after getting 503. How can I tell the
chef-client to retry syncing cookbook or calling Chef Node or Search APIs?
Is there a settings in /etc/chef/client.rb? If not, I may re-spawn
chef-client when getting 503 error.

By default, any API calls should be retried if they receive an HTTP
response code in the 500-599 range. However, I believe that the
calls to actually download the cookbook content from s3/bookshelf are
not retried.

I suspected depsolver because of the serverside error messages from
erchef, but perhaps you are seeing other errors as well that just
didn’t show up in your tail command on the server. It’s possible that
bookshelf or nginx is also failing.

For the server, the Chef support team wrote a handy script that I’ve
uploaded to a gist here:

https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

if you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files. So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on. If you’d like, you could email me (not the whole
list) the tarball and I can try to take a look. It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.
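To turn on debug logging for that run, either pass -l debug on the command line or set it in client.rb, for example:

# /etc/chef/client.rb, just for the failing run
log_level :debug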

Is there a limit for number of the concurrent chef clients served by a single Chef Server 11 ? The depsolver workers number seems a blocker, unless giving more CPU cores.

For any system there is a limit of how many concurrent connections it
can handle based on the available resources (CPU, RAM, etc) and
various limits imposed by the operating system (max file handles, max
processes, etc). Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers. But, in
most cases, the tasks the workers do complete quickly. Further,
chef-client should retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds simply the number of depsolver workers. However, at some
point, you do just need to feed depsolver more CPUs.

I tried ‘no_lazy_load false’ in my 300 nodes cluster, it still threw 503 ERROR. I see there is a commit make no_lazy_load the default in 7 days ago. As it described, my chef-client might run for a long time (1~2 hours), so I think I’d better set no_lazy_load to true.

Since you are running your own server, however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links. That said, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings.

Cheers,

Steven


#10

Hi Steven,

I found that erchef[‘depsolver_worker_count’] is listed on
http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html
, but not on
http://docs.getchef.com/config_rb_chef_server_optional_settings.html.

Does that mean depsolver_worker_count is not applicable to Open Source
Chef Server 11 ?
If it works for Open Source Chef Server 11, how can I tell how many
depsolver workers are running after I set it in
/etc/chef-server/chef-server.rb ?
And does erchef[‘depsolver_timeout’] = 5000 mean the server-side timeout
when resolving cookbook dependencies? What’s the chef-client-side
timeout when syncing cookbooks, and how do I increase it ?

Thanks
Jesse Hu


Steven Danna wrote On Thu, 31 Jul 2014 09:25:22 +0100 :

Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu huhui14@gmail.com wrote:

If adding more CPU cores, will increase erchef[‘depsolver_worker_
count’] and nginx[‘worker_processes’] solve the 503 issue? Both these
2 attributes’ devalue value are related to CPU cores number ?

nginx[‘worker_processes’] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure. You will, however,
need to manually bump the depsolver_worker_count.

Yes, the chef-client exits after getting 503. How can I tell the
chef-client to retry syncing cookbook or calling Chef Node or Search APIs?
Is there a settings in /etc/chef/client.rb? If not, I may re-spawn
chef-client when getting 503 error.

By default, any API calls should be retried if they receive a HTTP
response code in the 500-599 range. However, I believe that, the
calls to actually download the cookbook content from s3/bookshelf are
not retried.

I suspected depsolver because of the serverside error messages from
erchef, but perhaps you are seeing other errors as well that just
didn’t show up in your tail command on the server. It’s possible that
bookshelf or nginx is also failing.

For the server The Chef support team wrote a handy script that I’ve
uploaded to a gist here:

https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

if you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files. So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on. If you’d like, you could email me (not the whole
list) the tarball and I can try to take a look. It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.

Is there a limit for number of the concurrent chef clients served by a single Chef Server 11 ? The depsolver workers number seems a blocker, unless giving more CPU cores.
For any system there is a limit of how many concurrent connections it
can handle based on the available resources (CPU, RAM, etc) and
various limits imposed by the operating system (max file handles, max
processes, etc). Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers. But, in
most cases, the task the workers do complete quickly. Further,
chef-client should retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds simply the number of depsolver workers. However, at some
point, you do just need to feed depsolver more CPUs.

I tried ‘no_lazy_load false’ in my 300 nodes cluster, it still threw 503 ERROR. I see there is a commit make no_lazy_load the default in 7 days ago. As it described, my chef-client might run for a long time (1~2 hours), so I think I’d better set no_lazy_load to true.
Since you are running your own server; however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links. However, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings anyway.

Cheers,

Steven


#11

On Fri, Aug 1, 2014 at 4:28 AM, Jesse Hu huhui14@gmail.com wrote:

Hi Steven,

I found that erchef[‘depsolver_worker_count’] is listed on
http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html
, but not on
http://docs.getchef.com/config_rb_chef_server_optional_settings.html.

Does it mean depsolver_worker_count is not applicable on Open Source Chef
Server 11 ?
If it works for Open Source Chef Server 11, how can I know how many
depsolver_workers are running after I set it in
/etc/chef-server/chef-server.rb ?

Yes, this works as well in recent versions of Open Source Chef Server 11
(11.1+), and the docs should be updated to reflect this. A quick way to
verify the number of depsolver workers that are running is executing pgrep -fl depselector. You’ll see the number of ruby processes that are waiting
to receive cookbook dependency problems for solving.

And erchef[‘depsolver_timeout’] = 5000 means the server side timeout
when resolving cookbook dependencies? What’s the chef-client side timeout
when syncing cookbooks and how to increase it ?

Correct, 5000ms is the time that erchef gives depselector / gecode to reach
a solution given the current cookbook universe, run list, and environment
constraints.
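Raising it is just another chef-server.rb entry (the value below is only an example):

# /etc/chef-server/chef-server.rb
erchef['depsolver_timeout'] = 10000   # milliseconds; default is 5000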

What do you mean by “client side timeouts”? What timeouts are you seeing?

Thanks
Jesse Hu


Steven Danna wrote On Thu, 31 Jul 2014 09:25:22 +0100 :

Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu huhui14@gmail.com huhui14@gmail.com wrote:

If adding more CPU cores, will increase erchef['depsolver_worker_

count’] and nginx[‘worker_processes’] solve the 503 issue? Both these
2 attributes’ devalue value are related to CPU cores number ?

nginx[‘worker_processes’] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure. You will, however,
need to manually bump the depsolver_worker_count.

Yes, the chef-client exits after getting 503. How can I tell the
chef-client to retry syncing cookbook or calling Chef Node or Search APIs?
Is there a settings in /etc/chef/client.rb? If not, I may re-spawn
chef-client when getting 503 error.

By default, any API calls should be retried if they receive a HTTP
response code in the 500-599 range. However, I believe that, the
calls to actually download the cookbook content from s3/bookshelf are
not retried.

I suspected depsolver because of the serverside error messages from
erchef, but perhaps you are seeing other errors as well that just
didn’t show up in your tail command on the server. It’s possible that
bookshelf or nginx is also failing.

For the server The Chef support team wrote a handy script that I’ve
uploaded to a gist here:
https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

if you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files. So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on. If you’d like, you could email me (not the whole
list) the tarball and I can try to take a look. It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.

Is there a limit for number of the concurrent chef clients served by a single Chef Server 11 ? The depsolver workers number seems a blocker, unless giving more CPU cores.

For any system there is a limit of how many concurrent connections it
can handle based on the available resources (CPU, RAM, etc) and
various limits imposed by the operating system (max file handles, max
processes, etc). Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers. But, in
most cases, the task the workers do complete quickly. Further,
chef-client should retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds simply the number of depsolver workers. However, at some
point, you do just need to feed depsolver more CPUs.

I tried ‘no_lazy_load false’ in my 300 nodes cluster, it still threw 503 ERROR. I see there is a commit make no_lazy_load the default in 7 days ago. As it described, my chef-client might run for a long time (1~2 hours), so I think I’d better set no_lazy_load to true.

Since you are running your own server; however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links. However, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings anyway.

Cheers,

Steven


Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104


#12

Here’s the git commit for the docs update:
https://github.com/opscode/chef-docs/commit/19e60cf2b108f59fbcf2197482534d8244a69fb8

james

On Fri, Aug 1, 2014 at 9:30 AM, Stephen Delano stephen@opscode.com wrote:

On Fri, Aug 1, 2014 at 4:28 AM, Jesse Hu huhui14@gmail.com wrote:

Hi Steven,

I found that erchef[‘depsolver_worker_count’] is listed on
http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html
, but not on
http://docs.getchef.com/config_rb_chef_server_optional_settings.html.

Does it mean depsolver_worker_count is not applicable on Open Source Chef
Server 11 ?
If it works for Open Source Chef Server 11, how can I know how many
depsolver_workers are running after I set it in
/etc/chef-server/chef-server.rb ?

Yes, this works as well in recent versions of Open Source Chef Server 11
(11.1+), and the docs should be updated to reflect this. A quick way too
verify the number of depsolver workers that are running is executing pgrep -fl depselector. You’ll see the number of ruby processes that are waiting
to receive cookbook dependency problems for solving.

And erchef[‘depsolver_timeout’] = 5000 means the server side timeout
when resolving cookbook dependencies? What’s the chef-client side timeout
when syncing cookbooks and how to increase it ?

Correct, 5000ms is the time that erchef gives depselector / gecode to
reach a solution given the current cookboook universe, run list, and
environment constraints.

What do you mean by “client side timeouts”? What timeouts are you seeing?

Thanks
Jesse Hu


Steven Danna wrote On Thu, 31 Jul 2014 09:25:22 +0100 :

Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu huhui14@gmail.com huhui14@gmail.com wrote:

If adding more CPU cores, will increase erchef['depsolver_worker_

count’] and nginx[‘worker_processes’] solve the 503 issue? Both these
2 attributes’ devalue value are related to CPU cores number ?

nginx[‘worker_processes’] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure. You will, however,
need to manually bump the depsolver_worker_count.

Yes, the chef-client exits after getting 503. How can I tell the
chef-client to retry syncing cookbook or calling Chef Node or Search APIs?
Is there a settings in /etc/chef/client.rb? If not, I may re-spawn
chef-client when getting 503 error.

By default, any API calls should be retried if they receive a HTTP
response code in the 500-599 range. However, I believe that, the
calls to actually download the cookbook content from s3/bookshelf are
not retried.

I suspected depsolver because of the serverside error messages from
erchef, but perhaps you are seeing other errors as well that just
didn’t show up in your tail command on the server. It’s possible that
bookshelf or nginx is also failing.

For the server The Chef support team wrote a handy script that I’ve
uploaded to a gist here:
https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

if you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files. So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on. If you’d like, you could email me (not the whole
list) the tarball and I can try to take a look. It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.

Is there a limit for number of the concurrent chef clients served by a single Chef Server 11 ? The depsolver workers number seems a blocker, unless giving more CPU cores.

For any system there is a limit of how many concurrent connections it
can handle based on the available resources (CPU, RAM, etc) and
various limits imposed by the operating system (max file handles, max
processes, etc). Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers. But, in
most cases, the task the workers do complete quickly. Further,
chef-client should retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds simply the number of depsolver workers. However, at some
point, you do just need to feed depsolver more CPUs.

I tried ‘no_lazy_load false’ in my 300 nodes cluster, it still threw 503 ERROR. I see there is a commit make no_lazy_load the default in 7 days ago. As it described, my chef-client might run for a long time (1~2 hours), so I think I’d better set no_lazy_load to true.

Since you are running your own server; however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links. However, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings anyway.

Cheers,

Steven


Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104


#13

Thanks Steven, James.

I see there are 5 depselector processes, as expected.

$ pgrep -fl depselector
4243 ruby
/opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4319 ruby
/opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4321 ruby
/opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4347 ruby
/opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4349 ruby
/opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb

Correct, 5000ms is the time that erchef gives depselector / gecode to
reach a solution given the current cookboook universe, run list, and
environment constraints.

What do you mean by “client side timeouts”? What timeouts are you seeing?
Here I mean: if I want to increase the depsolver_timeout to give more
time for resolving the cookbook dependencies, do I also need to
increase the chef-client timeout for syncing the cookbooks, so that the
chef-client timeout is longer than the depsolver_timeout ?
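If a client-side knob is needed, my guess would be the generic REST timeout in client.rb (just my assumption, not confirmed here):

# /etc/chef/client.rb (assumption)
rest_timeout 600   # seconds for Chef server API requests; default is 300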

Thanks
Jesse Hu


James Scott wrote On Fri, 1 Aug 2014 09:35:07 -0700 :

Here’s the git commit for the docs update:

https://github.com/opscode/chef-docs/commit/19e60cf2b108f59fbcf2197482534d8244a69fb8

james

On Fri, Aug 1, 2014 at 9:30 AM, Stephen Delano <stephen@opscode.com
mailto:stephen@opscode.com> wrote:

On Fri, Aug 1, 2014 at 4:28 AM, Jesse Hu <huhui14@gmail.com
<mailto:huhui14@gmail.com>> wrote:

    Hi Steven,

    I found that erchef['depsolver_worker_count'] is listed on
    http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html
    , but not on
    http://docs.getchef.com/config_rb_chef_server_optional_settings.html.


    Does it mean depsolver_worker_count is not applicable on Open
    Source Chef Server 11 ?
    If it works for Open Source Chef Server 11, how can I know how
    many depsolver_workers are running after I set it in
    /etc/chef-server/chef-server.rb ?


Yes, this works as well in recent versions of Open Source Chef
Server 11 (11.1+), and the docs should be updated to reflect this.
A quick way too verify the number of depsolver workers that are
running is executing `pgrep -fl depselector`. You'll see the
number of ruby processes that are waiting to receive cookbook
dependency problems for solving.

    And erchef['depsolver_timeout'] = 5000 means the server side
    timeout when resolving cookbook dependencies? What's the
    chef-client side timeout when syncing cookbooks and how to
    increase it ?


Correct, 5000ms is the time that erchef gives depselector / gecode
to reach a solution given the current cookboook universe, run
list, and environment constraints.

What do you mean by "client side timeouts"? What timeouts are you
seeing?


    Thanks
    Jesse Hu

    ------------------------------------------------------------------------
    Steven Danna wrote On Thu, 31 Jul 2014 09:25:22 +0100 :
    Hi,

    On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu<huhui14@gmail.com>  <mailto:huhui14@gmail.com>  wrote:
    If adding more CPU cores,  will increase erchef['depsolver_worker_
    count'] and nginx['worker_processes'] solve the 503 issue? Both these
    2 attributes' devalue value are related to CPU cores number ?

    nginx['worker_processes'] should automatically scale based on CPU
    count if you run a chef-server-ctl reconfigure.  You will, however,
    need to manually bump the depsolver_worker_count.
    <jesse>  Yes, the chef-client exits after getting 503.  How can I tell the
    chef-client to retry syncing cookbook or calling Chef Node or Search APIs?
    Is there a settings in /etc/chef/client.rb?  If not, I may re-spawn
    chef-client when getting 503 error.
    By default, any API calls should be retried if they receive a HTTP
    response code in the 500-599 range.  However, I believe that, the
    calls to actually download the cookbook content from s3/bookshelf are
    not retried.

    I suspected depsolver because of the serverside error messages from
    erchef, but perhaps you are seeing other errors as well that just
    didn't show up in your tail command on the server.  It's possible that
    bookshelf or nginx is also failing.

    For the server The Chef support team wrote a handy script that I've
    uploaded to a gist here:

    https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

    if you run it with the argument OSC, it will create a tarball of the
    last bits of all of your log files.  So if you generate some failures
    and then run it, it should contain all the easily reachable evidence
    of what is going on.  If you'd like, you could email me (not the whole
    list) the tarball and I can try to take a look.  It would also be
    helpful if you sent along an example of a chef-client run that is
    failing with debug logging turned on.
    Is there a limit for number of the concurrent chef clients served by a single Chef Server 11 ?  The depsolver workers number seems a blocker, unless giving more CPU cores.
    For any system there is a limit of how many concurrent connections it
    can handle based on the available resources (CPU, RAM, etc) and
    various limits imposed by the operating system (max file handles, max
    processes, etc).  Since EC has a number of components it is hard to
    say what the hard limit is for a given machine.

    Depsolving is limited by the number of depsolver workers.  But, in
    most cases, the task the workers do complete quickly.  Further,
    chef-client *should* retry 503s when depsolving, so the number of
    concurrent /chef-client runs/ that can be handled typically far
    exceeds simply the number of depsolver workers.  However, at some
    point, you do just need to feed depsolver more CPUs.
    <jesse>  I tried 'no_lazy_load false' in my 300 nodes cluster, it still threw 503 ERROR.  I see there is a commit make no_lazy_load the default in 7 days ago. As it described, my chef-client might run for a long time (1~2 hours), so I think I'd better set no_lazy_load to true.
    Since you are running your own server; however, you can set your
    s3_url_ttl to anything you want, avoiding the problem of expired
    links.  However, I think it is probably best to definitively identify
    where your bottleneck is before tweaking more settings anyway.

    Cheers,

    Steven
-- 
Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104

#14

Hi Stephen,

Thanks for your valuable tips. There is a Chef bug, CHEF-ISSUE-1904
(https://github.com/opscode/chef/issues/1904), where chef-client doesn’t
retry when getting HTTP 50x errors from the Chef Server. I made a patch for
it: https://github.com/opscode/chef/pull/1912

With this patch, I’m able to provision a 500-node cluster using Chef 11
without any chef-client exception. From the chef-client output, all the
chef-clients together retried about 100+ times in total, and a single
chef-client retried at most 2 times.

Here is my chef version and configuration:

Open Source Chef Server 11.1.4 on CentOS 5.10 with a 4-core CPU and 8G
memory
Chef Client 11.14.2

==== /etc/chef-server/chef-server.rb ====

chef_server_webui[‘enable’] = false
nginx[‘ssl_port’] = 9443
nginx[‘non_ssl_port’] = 9080
chef_solr[‘heap_size’] = 3072
chef_solr[‘commit_interval’] = 3000
chef_solr[‘poll_seconds’] = 6
chef_expander[‘nodes’] = 4
erchef[‘s3_parallel_ops_timeout’] = 30000
erchef[‘s3_url_ttl’] = 3600
erchef[‘depsolver_timeout’] = 5000
erchef[‘depsolver_timeout’] = 5000
erchef[‘depsolver_worker_count’] = 10 # default value is 5 (per 2 CPU cores), so I set it to 10 for my 4 CPU cores
erchef[‘db_pool_size’] = 250 # because I have a few Chef Node Save/Search API calls in the cookbook during chef-client execution
postgresql[‘max_connections’] = 350

==== /etc/chef/client.rb ====

log_location STDOUT
chef_server_url "https://hostname:9443"
validation_client_name "chef-validator"
node_name “mynode”

log_level :info
no_lazy_load true
ssl_verify_mode :verify_peer
ssl_ca_path “/etc/chef/trusted_certs”

# this is to reduce the chef node data size for speeding up Chef Search API calls
Ohai::Config[:disabled_plugins] = [:Azure, :Filesystem, :Cloudv2,
:Virtualization, :Virtualizationinfo, :Dmi, :Zpools, :Blockdevice, :Lsb,
:Nodejs, :Languages, :Php, :Lua, :Perl, :C, :Java, :Python, :Erlang,
:Groovy, :Ruby, :Mono, :Os, :Openstack, :Cloud, :Rackspace, :Ps,
:Command, :Initpackage, :Rootgroup, :Keys, :Sshhostkey, :Ohai, :Chef,
:Ohaitime, :Passwd, :Gce, :Systemprofile, :Linode, :Ipscopes,
:Eucalyptus, :Ec2]

Thanks
Jesse Hu


Stephen Delano wrote On Fri, 1 Aug 2014 09:30:12 -0700 :

On Fri, Aug 1, 2014 at 4:28 AM, Jesse Hu <huhui14@gmail.com
mailto:huhui14@gmail.com> wrote:

Hi Steven,

I found that erchef['depsolver_worker_count'] is listed on
http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html
, but not on
http://docs.getchef.com/config_rb_chef_server_optional_settings.html.

Does it mean depsolver_worker_count is not applicable on Open
Source Chef Server 11 ?
If it works for Open Source Chef Server 11, how can I know how
many depsolver_workers are running after I set it in
/etc/chef-server/chef-server.rb ?

Yes, this works as well in recent versions of Open Source Chef Server
11 (11.1+), and the docs should be updated to reflect this. A quick
way too verify the number of depsolver workers that are running is
executing pgrep -fl depselector. You’ll see the number of ruby
processes that are waiting to receive cookbook dependency problems for
solving.

And erchef['depsolver_timeout'] = 5000   means the server side
timeout when resolving cookbook dependencies? What's the
chef-client side timeout when syncing cookbooks and how to
increase it ?

Correct, 5000ms is the time that erchef gives depselector / gecode to
reach a solution given the current cookboook universe, run list, and
environment constraints.

What do you mean by “client side timeouts”? What timeouts are you seeing?

Thanks
Jesse Hu

------------------------------------------------------------------------
Steven Danna wrote On Thu, 31 Jul 2014 09:25:22 +0100 :
Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu<huhui14@gmail.com>  <mailto:huhui14@gmail.com>  wrote:
If adding more CPU cores,  will increase erchef['depsolver_worker_
count'] and nginx['worker_processes'] solve the 503 issue? Both these
2 attributes' devalue value are related to CPU cores number ?

nginx['worker_processes'] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure.  You will, however,
need to manually bump the depsolver_worker_count.
<jesse>  Yes, the chef-client exits after getting 503.  How can I tell the
chef-client to retry syncing cookbook or calling Chef Node or Search APIs?
Is there a settings in /etc/chef/client.rb?  If not, I may re-spawn
chef-client when getting 503 error.
By default, any API calls should be retried if they receive a HTTP
response code in the 500-599 range.  However, I believe that, the
calls to actually download the cookbook content from s3/bookshelf are
not retried.

I suspected depsolver because of the serverside error messages from
erchef, but perhaps you are seeing other errors as well that just
didn't show up in your tail command on the server.  It's possible that
bookshelf or nginx is also failing.

For the server The Chef support team wrote a handy script that I've
uploaded to a gist here:

https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

if you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files.  So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on.  If you'd like, you could email me (not the whole
list) the tarball and I can try to take a look.  It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.
Is there a limit for number of the concurrent chef clients served by a single Chef Server 11 ?  The depsolver workers number seems a blocker, unless giving more CPU cores.
For any system there is a limit of how many concurrent connections it
can handle based on the available resources (CPU, RAM, etc) and
various limits imposed by the operating system (max file handles, max
processes, etc).  Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers.  But, in
most cases, the task the workers do complete quickly.  Further,
chef-client *should* retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds simply the number of depsolver workers.  However, at some
point, you do just need to feed depsolver more CPUs.
<jesse>  I tried 'no_lazy_load false' in my 300 nodes cluster, it still threw 503 ERROR.  I see there is a commit make no_lazy_load the default in 7 days ago. As it described, my chef-client might run for a long time (1~2 hours), so I think I'd better set no_lazy_load to true.
Since you are running your own server; however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links.  However, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings anyway.

Cheers,

Steven


Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104