Clients timing out... where to start?

Jesse_Campbell · January 19, 2013, 12:21pm

I have about 400 clients connecting in what should be a staggered pattern
(splay is set to 10 minutes), but every night at least half of them are
getting errors like this:

chef-client[20246]: [2013-01-19T07:36:46+00:00] 1: *** Chef 10.16.2 ***
chef-client[20246]: [2013-01-19T07:41:47+00:00] 3: Timeout connecting to chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 1/5
chef-client[20246]: [2013-01-19T07:46:52+00:00] 3: Timeout connecting to chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 2/5
chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Errno::ECONNRESET: Connection reset by peer
chef-client[6790]: [2013-01-19T07:51:46+00:00] 1: *** Chef 10.16.2 ***
chef-client[6790]: [2013-01-19T07:53:34+00:00] 4: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
chef-client[6790]: [2013-01-19T07:53:34+00:00] 4: Errno::ECONNRESET: Connection reset by peer
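Reading the timestamps relative to each run's start makes the pattern easier to see: in run 20246 the retries land roughly 300 seconds apart (301 s, then 606 s after the banner), which would fit a client giving up on each request after a 300-second REST timeout (Chef's `rest_timeout` setting; 300 is its default as far as I know, so check your `client.rb`), while run 6790 got a reset about 108 seconds in with no timeout first. A minimal Python sketch that does this, assuming the log lines are in exactly the format pasted above; the `LOG` list and `elapsed_by_run` helper are hypothetical, just for illustration:

```python
import re
from collections import defaultdict
from datetime import datetime

# One chef-client syslog line in the format pasted above, e.g.
# chef-client[20246]: [2013-01-19T07:36:46+00:00] 3: Timeout connecting to ...
LINE_RE = re.compile(r"chef-client\[(\d+)\]: \[([^\]]+)\] \d: (.*)")

# Sample input: the log lines from the post, one entry per log line.
LOG = [
    "chef-client[20246]: [2013-01-19T07:36:46+00:00] 1: *** Chef 10.16.2 ***",
    "chef-client[20246]: [2013-01-19T07:41:47+00:00] 3: Timeout connecting to "
    "chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 1/5",
    "chef-client[20246]: [2013-01-19T07:46:52+00:00] 3: Timeout connecting to "
    "chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 2/5",
    "chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Stacktrace dumped to "
    "/var/cache/chef/chef-stacktrace.out",
    "chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Errno::ECONNRESET: "
    "Connection reset by peer",
    "chef-client[6790]: [2013-01-19T07:51:46+00:00] 1: *** Chef 10.16.2 ***",
    "chef-client[6790]: [2013-01-19T07:53:34+00:00] 4: Stacktrace dumped to "
    "/var/cache/chef/chef-stacktrace.out",
    "chef-client[6790]: [2013-01-19T07:53:34+00:00] 4: Errno::ECONNRESET: "
    "Connection reset by peer",
]


def elapsed_by_run(lines):
    """Hypothetical helper: map each chef-client PID to a list of
    (seconds since that PID's first line, message) pairs."""
    runs = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # ignore anything that is not a chef-client log entry
        pid, stamp, message = match.groups()
        runs[int(pid)].append((datetime.fromisoformat(stamp), message))
    report = {}
    for pid, events in runs.items():
        start = events[0][0]  # assumes the "*** Chef x.y.z ***" banner comes first
        report[pid] = [(int((ts - start).total_seconds()), msg) for ts, msg in events]
    return report


if __name__ == "__main__":
    for pid, events in elapsed_by_run(LOG).items():
        for seconds, message in events:
            print(f"{pid:>6} +{seconds:>4}s  {message}")
```

Pointed at a full night of syslog instead of the sample, the same grouping would show whether the failures cluster at particular offsets or particular times of night.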

I'm not sure what I should be looking at to diagnose the issue... Are
there caps on what the Merb/Ruby API server can handle? Do I need more
RAM or CPU? (It's currently 8 GB of RAM on a dual-core Xeon.)
Maybe cluster the chef-server API? Maybe drop in the Chef 11 Erchef server?
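Splay only staggers when each run starts; it says nothing about how long each run keeps talking to the server. Before deciding whether to add RAM/CPU or cluster the API, one rough way to see how many clients could be hitting the server at once is a small Python simulation. It assumes every client wakes at the same moment and then sleeps a uniformly random delay of up to `splay` seconds (roughly how chef-client's splay behaves, as I understand it); the run lengths are hypothetical numbers, not measurements from this setup:

```python
import random


def peak_concurrency(starts, duration):
    """Largest number of runs in progress at once; each run covers [start, start + duration)."""
    events = []
    for s in starts:
        events.append((s, 1))              # a run begins
        events.append((s + duration, -1))  # that run ends
    # At equal times (t, -1) sorts before (t, 1), so a finishing run frees its slot first.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def simulate_splay(clients=400, splay_s=600, run_s=120, seed=1):
    """Assumption: all clients wake together, then each sleeps a uniform random
    delay of up to splay_s seconds. run_s is a hypothetical run length."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same answer
    starts = [rng.uniform(0, splay_s) for _ in range(clients)]
    return peak_concurrency(starts, run_s)


if __name__ == "__main__":
    print("average starts per second:", 400 / 600)
    for run_s in (30, 120, 600):
        peak = simulate_splay(run_s=run_s)
        print(f"run length {run_s:>3}s -> peak concurrent runs: {peak}")
```

For scale: 400 clients over a 600-second splay average about 0.67 new runs per second, so if each run held the server for two minutes you would expect around 0.67 × 120 ≈ 80 runs in flight in the middle of the window. The point of the sketch is that run length, not just splay, drives how many concurrent requests the API server has to absorb.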

thanks in advance!
-jesse

Jesse_Campbell · January 19, 2013, 12:48pm

Okay... well... I found one thing that might be contributing: every node
in the environment was downloading a 1.6 MB data bag item on every Chef
run.
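Back-of-the-envelope, that item alone is 400 × 1.6 MB = 640 MB per round of runs, or about 1.07 MB/s averaged over a 10-minute splay, presumably on top of the server loading and serializing the same 1.6 MB JSON document 400 times. A tiny Python helper for redoing the arithmetic with other numbers; `data_bag_traffic` is a hypothetical name, and it assumes each node fetches the item exactly once per run and ignores HTTP/JSON overhead:

```python
def data_bag_traffic(nodes, item_mb, window_s):
    """Hypothetical helper: total MB served and average MB/s, assuming every
    node fetches the item once per run and all runs fall inside window_s.
    Ignores HTTP/JSON overhead and any server-side cost of loading the item."""
    total_mb = nodes * item_mb
    return total_mb, total_mb / window_s


if __name__ == "__main__":
    total, rate = data_bag_traffic(nodes=400, item_mb=1.6, window_s=600)
    print(f"{total:.0f} MB per round of runs, about {rate:.2f} MB/s over the splay")
```

Average throughput like this is modest for one server; the bigger concern is that each fetch keeps a worker busy longer, which feeds directly into the concurrency problem above.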
