Intermittent failures with server 12.0.8?


#1

We run all of our chef clients every 60 minutes.

Throughout the day, we’re seeing clients, both 11.18.0-1 and 12.3.0-1,
report intermittent 403 authorization errors against Chef Server 12.0.8.

Is anyone else seeing this?


#2

On 5/29/2015 9:08 PM, Jeff Blaine wrote:

> We run all of our chef clients every 60 minutes.
>
> Throughout the day, we’re seeing clients, both 11.18.0-1 and 12.3.0-1,
> report intermittent 403 authorization errors against Chef Server 12.0.8.

I meant 401 error. More info below.

> Is anyone else seeing this?

Here’s one of the hourly cron jobs that failed with 401.

99% of its hourly runs work fine, for over a year. The hour after this
run, it worked fine too, and the next = … intermittent.

===================================================================

Authentication Error:

Failed to authenticate to the chef server (http 401).

Server Response:

An error occurred while trying to find ‘neon’. Please contact support.

Relevant Config Settings:

chef_server_url "https://cm.our.org"
node_name "neon"
client_key "/etc/chef/client.pem"

If these settings are correct, your client_key may be invalid.

===================================================================

ws% knife client show neon
admin: false
chef_type: client
json_class: Chef::ApiClient
name: neon
public_key: -----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQ…snip

===================================================================

% ssh neon sudo /opt/chef/bin/chef-client
[ works fine! ]
%


Jeff Blaine
kickflop.net
PGP/GnuPG Key ID: 0x0C8EDD02


#3

Hi,

The most likely cause of intermittent 401s is database timeouts
talking to PostgreSQL. The erchef logs would tell for sure
(/var/log/opscode/opscode-erchef/request.log.N where N is an integer).
Chef Server 12.1.0 should improve database performance significantly;
however, for 12.0.8, you may see improvement by turning on queueing of
sql requests so that they don’t fail immediately when all connections
are in use:

opscode_erchef['db_pool_queue_max'] = 40
opscode_erchef['db_pooler_timeout'] = 2000

Placing that in your chef-server.rb and reconfiguring will instruct
erchef to queue up to 40 database requests when all connections are in
use. If the connection waits in the queue for more than 2000ms, it
will time out with an error. An alternative would be to increase
the database connection pool size, but we’ve been preferring
queuing where possible.
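
To illustrate the difference between failing immediately and queuing, here is a toy Ruby model of a fixed-size connection pool. It is only a sketch and not erchef's actual pooler: the method name, the pool size of 10, and the 100ms service time are all made up for the example. One simplification is that a request that would have to wait longer than the timeout is failed right away instead of holding a queue slot until it times out.

```ruby
# Toy, deterministic model of a fixed-size connection pool with a wait queue.
# NOT erchef's implementation; every name and number here is hypothetical.
#
# arrivals_ms: sorted request arrival times (ms)
# pool_size:   number of database connections
# queue_max:   how many requests may wait when all connections are busy
# timeout_ms:  longest a queued request may wait before it fails
# service_ms:  how long each request holds a connection
def simulate_pool(arrivals_ms, pool_size:, queue_max:, timeout_ms:, service_ms:)
  free_at = Array.new(pool_size, 0) # time at which each connection becomes free
  queued_starts = []                # start times of requests that are still waiting
  ok = 0
  failed = 0

  arrivals_ms.each do |t|
    # Requests that have already been handed a connection no longer occupy the queue.
    queued_starts.reject! { |start| start <= t }

    idx = free_at.index(free_at.min) # connection that frees up first
    if free_at[idx] <= t
      # A connection is free right now: no queuing needed.
      free_at[idx] = t + service_ms
      ok += 1
      next
    end

    wait = free_at[idx] - t
    if queued_starts.size < queue_max && wait <= timeout_ms
      # Wait in the queue, then run on that connection.
      queued_starts << free_at[idx]
      free_at[idx] += service_ms
      ok += 1
    else
      # Queue full, or the wait would exceed the timeout: the request fails.
      failed += 1
    end
  end

  { ok: ok, failed: failed }
end

burst = Array.new(30, 0) # 30 requests arriving at the same instant

fail_fast = simulate_pool(burst, pool_size: 10, queue_max: 20, timeout_ms: 0, service_ms: 100)
queued    = simulate_pool(burst, pool_size: 10, queue_max: 40, timeout_ms: 2000, service_ms: 100)

puts "timeout 0:    #{fail_fast[:ok]} ok, #{fail_fast[:failed]} failed" # 10 ok, 20 failed
puts "timeout 2000: #{queued[:ok]} ok, #{queued[:failed]} failed"       # 30 ok, 0 failed
```

In this model a burst larger than the pool loses every excess request when the timeout is 0, while the same burst is fully absorbed once requests are allowed to wait.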

We’ve seen this reduce intermittent 401s caused by database issues at
large customer sites. We also have a patch in the works to make these
types of errors return 503s rather than 401s.
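
For checking whether the 401s cluster at particular times, a small Ruby sketch like the following could tally status codes per hour from the erchef request logs. It assumes each log line starts with an ISO 8601 timestamp and contains a `status=NNN;` field; that format is an assumption, so adjust the patterns to whatever your log lines actually look like. The method name and the sample lines are invented for illustration.

```ruby
# Tally HTTP status codes per hour from erchef-style request log lines.
# ASSUMPTION: lines begin with an ISO 8601 timestamp and carry "status=NNN;".
# The method name is hypothetical; this is not part of any Chef tooling.
def tally_statuses(lines)
  counts = Hash.new { |h, hour| h[hour] = Hash.new(0) }
  lines.each do |line|
    hour   = line[/\A(\d{4}-\d{2}-\d{2}T\d{2})/, 1] # e.g. "2015-05-30T14"
    status = line[/\bstatus=(\d{3});/, 1]           # e.g. "401"
    next unless hour && status                      # skip lines that don't match
    counts[hour][status] += 1
  end
  counts
end

# Made-up sample lines, for illustration only.
sample = [
  "2015-05-30T14:00:03Z erchef@127.0.0.1 method=GET; path=/nodes/neon; status=200;",
  "2015-05-30T14:00:04Z erchef@127.0.0.1 method=GET; path=/nodes/neon; status=401;",
  "2015-05-30T14:00:04Z erchef@127.0.0.1 method=GET; path=/nodes/argon; status=401;",
  "2015-05-30T15:00:02Z erchef@127.0.0.1 method=GET; path=/nodes/neon; status=200;"
]

tally_statuses(sample).each do |hour, by_status|
  puts "#{hour}: " + by_status.map { |code, n| "#{code} x#{n}" }.join(", ")
end
```

To run it against a real file, pass `File.foreach(path)` instead of `sample`, where `path` is one of the request.log.N files.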

Cheers,

Steven


#4

> The most likely cause of intermittent 401s is database timeouts
> talking to PostgreSQL. The erchef logs would tell for sure
> (/var/log/opscode/opscode-erchef/request.log.N where N is an integer).
> Chef Server 12.1.0 should improve database performance significantly;
> however, for 12.0.8, you may see improvement by turning on queueing of
> sql requests so that they don’t fail immediately when all connections
> are in use:
>
> opscode_erchef['db_pool_queue_max'] = 40
> opscode_erchef['db_pooler_timeout'] = 2000

Thanks Steve. I’ve made the changes and will see how things work out.

The defaults I saw in place were db_pool_queue_max = 20 and
db_pooler_timeout = 0

What negative effect would there be to 40/2000 being a new Chef server
default going forward? Obviously it changes from a fail-now behavior to
a fail-in-2sec behavior, but I don’t immediately (and ignorantly) see
how anyone would care about that.

> We’ve seen this reduce intermittent 401s caused by database issues at
> large customer sites. We also have a patch in the works to make these
> types of errors return 503s rather than 401s.

Good to hear. The 401 code threw us off.


Jeff Blaine
kickflop.net
PGP/GnuPG Key ID: 0x0C8EDD02


#5

Hi,

On Mon, Jun 1, 2015 at 3:50 PM, Jeff Blaine <jblaine@kickflop.net> wrote:

> What negative effect would there be to 40/2000 being a new Chef server
> default going forward? Obviously it changes from a fail-now behavior to
> a fail-in-2sec behavior, but I don’t immediately (and ignorantly) see
> how anyone would care about that.

Most people I’ve talked to about this agree and we are likely going to
make something similar to 40/2000 the default in an upcoming release
(instead of 40, I’m thinking something dynamic based on the
db_pool_size).

Cheers,

Steven


#6

On 6/2/2015 1:24 PM, Steven Danna wrote:

> Hi,
>
> On Mon, Jun 1, 2015 at 3:50 PM, Jeff Blaine <jblaine@kickflop.net> wrote:
>
>> What negative effect would there be to 40/2000 being a new Chef server
>> default going forward? Obviously it changes from a fail-now behavior to
>> a fail-in-2sec behavior, but I don’t immediately (and ignorantly) see
>> how anyone would care about that.
>
> Most people I’ve talked to about this agree and we are likely going to
> make something similar to 40/2000 the default in an upcoming release
> (instead of 40, I’m thinking something dynamic based on the
> db_pool_size).

Cool.

The 40/2000 changes made our failures go away, BTW.


Jeff Blaine
kickflop.net
PGP/GnuPG Key ID: 0x0C8EDD02