Clients timing out... where to start?

I have about 400 clients connecting in what should be a staggered pattern
(splay is set to 10 minutes), but every night at least half of them are
getting errors like this:

chef-client[20246]: [2013-01-19T07:36:46+00:00] 1: *** Chef 10.16.2 ***
chef-client[20246]: [2013-01-19T07:41:47+00:00] 3: Timeout connecting to
chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 1/5
chef-client[20246]: [2013-01-19T07:46:52+00:00] 3: Timeout connecting to
chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 2/5
chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Stacktrace dumped to
/var/cache/chef/chef-stacktrace.out
chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Errno::ECONNRESET:
Connection reset by peer
chef-client[6790]: [2013-01-19T07:51:46+00:00] 1: *** Chef 10.16.2 ***
chef-client[6790]: [2013-01-19T07:53:34+00:00] 4: Stacktrace dumped to
/var/cache/chef/chef-stacktrace.out
chef-client[6790]: [2013-01-19T07:53:34+00:00] 4: Errno::ECONNRESET:
Connection reset by peer
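
One place to start is the spacing of the timestamps in the log itself. The sketch below rejoins the wrapped lines for PID 20246 and prints the seconds between consecutive entries; `gaps_in_seconds` is a hypothetical helper, not part of Chef. Each "Timeout connecting" entry lands roughly five minutes after the previous entry, which would be consistent with a 300-second request timeout on the client. That is an assumption worth checking against `rest_timeout` in client.rb rather than a confirmed diagnosis.

```ruby
require "time"

# Log lines copied from the post above, with the wrapped lines rejoined.
LOG = <<~LOG
  chef-client[20246]: [2013-01-19T07:36:46+00:00] 1: *** Chef 10.16.2 ***
  chef-client[20246]: [2013-01-19T07:41:47+00:00] 3: Timeout connecting to chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 1/5
  chef-client[20246]: [2013-01-19T07:46:52+00:00] 3: Timeout connecting to chef-app01.ops.atl.setg:4000 for /nodes/nagios.ops, retry 2/5
  chef-client[20246]: [2013-01-19T07:48:44+00:00] 4: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out
LOG

# Hypothetical helper: seconds elapsed between consecutive log entries.
def gaps_in_seconds(log)
  stamps = log.scan(/\[(\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d[+-]\d\d:\d\d)\]/).flatten
  times = stamps.map { |s| Time.iso8601(s) }
  times.each_cons(2).map { |a, b| (b - a).to_i }
end

puts gaps_in_seconds(LOG).inspect  # prints [301, 305, 112]
```

The first two gaps are the ones that look like a server not answering within the timeout; the last gap ends in the connection reset.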

I’m not sure what I should be looking at to diagnose this. Are there caps
on what the Merb/Ruby API server can handle? Do I need to boost RAM or CPU?
(Currently 8 GB of RAM and a dual-core Xeon.)
Maybe cluster the chef-server API? Maybe drop in the Chef 11 Erlang
(erchef) server?

thanks in advance!
-jesse

Okay, well, I found one thing that might be contributing: every node in
the environment was downloading a 1.6 MB data bag item on every chef run.
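
For a rough sense of scale, here is the arithmetic on that data bag item, using only the figures from this thread (about 400 clients, 1.6 MB, a 10-minute splay). It assumes every client fetches the item exactly once per run and that the fetches are spread evenly across the splay window; real arrivals are random and can bunch up, so treat the average as a floor, not a peak. The method names are made up for the example.

```ruby
# Figures from the thread.
CLIENTS = 400
ITEM_MB = 1.6
SPLAY_SECONDS = 10 * 60

# Total payload served if every client fetches the item once per run.
def total_mb(clients, item_mb)
  clients * item_mb
end

# Average throughput if the fetches are spread evenly over the window
# (an assumption; random splay will produce bursts above this).
def avg_mb_per_second(clients, item_mb, window_seconds)
  total_mb(clients, item_mb) / window_seconds
end

puts format("%.0f MB per round, %.2f MB/s average",
            total_mb(CLIENTS, ITEM_MB),
            avg_mb_per_second(CLIENTS, ITEM_MB, SPLAY_SECONDS))
# prints "640 MB per round, 1.07 MB/s average"
```

The raw bandwidth is modest, so if this item is contributing, the cost is more likely in the API server serializing and serving a large JSON document 400 times than in the network itself. That is a guess, not something the logs above show.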
