EOFError when decrypting data bags


#1

Ohai chefs!

Since upgrading chef-client from 11.12.8 to 11.14.6, we are seeing
frequent EOFErrors when decrypting an encrypted data bag. We are running Chef
server 10.28.X. A subsequent run usually clears it up, which makes it hard to
debug. The failure looks like this:

[2014-09-11T22:21:07+00:00] DEBUG: Re-raising exception: EOFError - end of file reached
/opt/chef/embedded/lib/ruby/1.9.1/openssl/b

Is anyone else running into this, or does anyone have an idea why?

Thanks,
Jay


#2

These are some notes I wrote to myself to capture some of the recent
EOFError issues we’ve been seeing:

https://gist.github.com/lamont-granquist/e25af8f50cb4ae4f8050

So far they all seem to be related to networking issues on large
packets, involving proxies and/or MTU problems. Those issues mostly did
not clear up on a retry, though, other than when the connection to S3
hiccuped, so none of that may help in your case. You’ll probably need to
collect full stack traces, and ideally tcpdumps on both ends. If you can
get server logs from a relatively quiet server at the time the client
errors, that could be useful (but probably less so if they’re interleaved
with dozens of other clients converging).

And there’s nothing I’m aware of between 11.12.8 and 11.14.6 that would
cause this behavior to change (the openssl version most likely got bumped,
but there are no known issues with more recent openssl raising more
EOFErrors).
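
For collecting the full stack traces mentioned above, here is a minimal Ruby sketch. It assumes you can wrap the failing call yourself; `with_eof_logging` is a hypothetical helper name, not part of Chef. It logs every backtrace frame on an EOFError and retries a bounded number of times, so a transient failure still leaves a trace behind.

```ruby
require "logger"

# Hypothetical helper (not part of Chef): runs the block and, on EOFError,
# logs the full backtrace before retrying, up to `attempts` tries in total.
def with_eof_logging(attempts: 3, logger: Logger.new($stderr))
  tries = 0
  begin
    tries += 1
    yield tries
  rescue EOFError => e
    # Log every frame, not just the message, so the failing call is visible.
    logger.error("EOFError on attempt #{tries}: #{e.message}")
    Array(e.backtrace).each { |frame| logger.error("  #{frame}") }
    retry if tries < attempts
    raise
  end
end
```

The block receives the attempt number, and the error is re-raised once the attempts are used up, so a persistent failure still surfaces instead of being hidden by the retries.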



#3

We experience infrequent EOF errors during node.save on otherwise successful
runs. Will read your notes, thanks Lamont!
On Sep 12, 2014 2:39 PM, “Lamont Granquist” lamont@opscode.com wrote:

These are some notes I wrote to myself to capture some of the recent
EOFError issues we’ve been seeing:

https://gist.github.com/lamont-granquist/e25af8f50cb4ae4f8050



#4

Yeah, node.save failing at the end of the run sounds like the size of the
node grew past a point where something in the network decided to drop it.
It is probably something in the network path (the network drivers and
TCP/IP stacks at both ends included), and it should be reproducible with
any large enough HTTP POST to the server.
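
One way to try that reproduction is a small Ruby sketch that POSTs JSON bodies of increasing size and records which sizes fail. Everything here is an assumption for illustration: `padded_json` and `probe_post_sizes` are hypothetical names, the endpoint is whatever test URL you choose behind the same network path, and a real Chef server would expect signed requests, so it would likely answer with an authentication error rather than a success.

```ruby
require "json"
require "net/http"
require "uri"

# Builds a JSON document whose serialized size is at least `bytes` bytes.
def padded_json(bytes)
  JSON.generate("padding" => "x" * bytes)
end

# POSTs bodies of increasing size and records the outcome for each size.
# `url` is a placeholder endpoint; nothing here is Chef-specific.
def probe_post_sizes(url, sizes)
  uri = URI(url)
  sizes.map do |size|
    begin
      res = Net::HTTP.post(uri, padded_json(size), "Content-Type" => "application/json")
      [size, res.code]
    rescue EOFError, Errno::ECONNRESET, Net::ReadTimeout => e
      # Record the error class so the failing size threshold is easy to spot.
      [size, e.class.name]
    end
  end
end
```

Calling something like `probe_post_sizes(url, [1_000, 100_000, 1_000_000])` against a test endpoint returns one `[size, outcome]` pair per request. If small bodies get an HTTP status back and large ones end in `EOFError`, that points at the network path rather than at chef-client.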


#5

Thanks, Lamont, for the reference. Hopefully we can track something down and
at least make it happen less frequently; right now, in a separate AWS
account, it is happening almost 50% of the time when bootstrapping a node.
