Bizarre EOFError on Google Compute Engine, behind NAT

Scenario: a hybrid server environment running Chef 11.14.2/11.16.4 on a mix of
physical and virtual Ubuntu 12.04 systems. The virtual instances are a mix of
EC2 and Google Compute Engine, and they are a combination of publicly
accessible (static/Elastic IP) and private (behind NAT) systems. The systems on
GCE behind NAT receive this error when they attempt to request
/environments//cookbook_versions:

[2014-10-18T07:49:41+00:00] DEBUG: EOFError: end of file reached
/opt/chef/embedded/lib/ruby/1.9.1/openssl/buffering.rb:174:in `sysread_nonblock'
/opt/chef/embedded/lib/ruby/1.9.1/openssl/buffering.rb:174:in `read_nonblock'
/opt/chef/embedded/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
/opt/chef/embedded/lib/ruby/1.9.1/net/protocol.rb:92:in `read'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:2780:in `ensure in read_chunked'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:2780:in `read_chunked'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:2751:in `read_body_0'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:2711:in `read_body'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:262:in `block (2 levels) in send_http_request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http/basic_client.rb:74:in `block in request'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1323:in `block (2 levels) in transport_request'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:2672:in `reading_body'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1322:in `block in transport_request'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1317:in `catch'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1317:in `transport_request'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1294:in `request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/rest-client-1.6.7/lib/restclient/net_http_ext.rb:51:in `request'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1287:in `block in request'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:746:in `start'
/opt/chef/embedded/lib/ruby/1.9.1/net/http.rb:1285:in `request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/rest-client-1.6.7/lib/restclient/net_http_ext.rb:51:in `request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http/basic_client.rb:65:in `request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:262:in `block in send_http_request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:294:in `block in retrying_http_errors'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:292:in `loop'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:292:in `retrying_http_errors'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:256:in `send_http_request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:143:in `request'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/http.rb:126:in `post'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/policy_builder/expand_node_object.rb:168:in `sync_cookbooks'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/policy_builder/expand_node_object.rb:66:in `setup_run_context'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/client.rb:265:in `setup_run_context'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/client.rb:429:in `do_run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/client.rb:213:in `block in run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/client.rb:207:in `fork'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/client.rb:207:in `run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/application.rb:236:in `run_chef_client'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/application/client.rb:338:in `block in run_application'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/application/client.rb:327:in `loop'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/application/client.rb:327:in `run_application'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/lib/chef/application.rb:55:in `run'
/opt/chef/embedded/lib/ruby/gems/1.9.1/gems/chef-11.16.4/bin/chef-client:26:in `<top (required)>'
/usr/bin/chef-client:23:in `load'
/usr/bin/chef-client:23:in

The issue is 100% tied to the NAT topology, or my implementation thereof, which
is vanilla iptables masquerading on Ubuntu 12.04. If I route the same instance
over a public IP, it works fine; the second I push the route back through the
NAT, I get this error again. The Chef server indicates that a 200 was served,
both in Erchef and Nginx. The NAT gateway itself is also managed through Chef,
and the Chef client on that system works just fine without generating the error
above.
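For reference, the masquerading setup is nothing fancier than the standard
recipe. A minimal sketch (the interface name and internal subnet below are
placeholders, not the exact values from this environment):

  # Enable forwarding and masquerade the internal range out the external NIC
  sysctl -w net.ipv4.ip_forward=1
  iptables -t nat -A POSTROUTING -s 10.240.0.0/16 -o eth0 -j MASQUERADE
  iptables -A FORWARD -i eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
  iptables -A FORWARD -o eth0 -s 10.240.0.0/16 -j ACCEPT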

The error seems to be reproducible about 99% of the time; when it does not fail
here, it fails at another API call somewhere further along in the run. It's
frustrating that it occurs just slightly less than always.

I do not receive this error on SSL connections to other services, including
large file downloads: I can comfortably pull a 1 GB+ file from Amazon S3 or
Google Cloud Storage, and I can clone the Linux kernel repository from GitHub
over HTTPS. Only this Chef server triggers the EOFError, only on this
particular REST API call, and only when the system sits behind a NAT gateway on
Google Compute Engine (the same setup behind NAT on Amazon is fine).

Packet captures don't show anything tremendously out of the ordinary besides
some out-of-order packets that I'm blaming on GCE. If anyone knows what's
special about this particular call that might point me at what's wrong with
this networking configuration, it would be very much appreciated.
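If you want to take the same kind of capture, something like this on the
client and on the NAT gateway should do (interface and server name are
placeholders):

  # Capture the Chef API conversation for later inspection in Wireshark
  tcpdump -i eth0 -s 0 -w chef-api.pcap host chef.example.com and tcp port 443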

Is the NAT gateway changing the path MTU, and have you broken PMTU discovery by
blocking all ICMP with iptables or something similar? Also look at jumbo
frames, etc. You should be able to debug this with large ping packets, large
GET/POST requests with curl/wget, or by forcing the MTU on the interfaces at
both ends of the connection lower until it starts working.
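Roughly, something like this (hostname, interface, and sizes are only
examples):

  # Probe the path MTU with non-fragmenting pings; 1472 + 28 bytes of headers = 1500
  ping -M do -s 1472 chef.example.com
  ping -M do -s 1432 chef.example.com   # 1460-byte packet

  # Push a large body over the same HTTPS path
  dd if=/dev/zero of=/tmp/payload bs=1M count=10
  curl -k -o /dev/null --data-binary @/tmp/payload https://chef.example.com/

  # Or force the interface MTU down until the failure stops
  ip link set dev eth0 mtu 1400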

Thanks for the response! The MTU was my first gut instinct, but it didn’t take me anywhere right away. I spent a few hours Wiresharking (including force-disabling DHE ciphers so I could decrypt the SSL payloads), then disabled gzip on the server and started getting Content-Length mismatches instead of EOFErrors. That made me notice something suspicious about the payload sizes of the TLS continuations, and that led me to suspect a fragmentation issue with the GCE networking layer. This took me to a recent issue on the GCE project:

https://code.google.com/p/google-compute-engine/issues/detail?id=118&colspec=ID%20Type%20Component%20Resource%20Service%20Status%20Stars%20Summary%20Log

So, problem solved, I guess. Very weird that it only visibly impacted a handful of Chef API calls, and no other traffic.
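For anyone who lands here with the same symptoms: one plausible workaround
(not necessarily what we ended up doing) is to clamp TCP MSS on the NAT
gateway, or simply drop the instance MTU to GCE's 1460 bytes. A sketch, with
eth0 as a placeholder:

  # On the NAT gateway: clamp MSS on forwarded SYNs to the discovered path MTU
  iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

  # Or, on the NAT'd instances: match GCE's 1460-byte network MTU directly
  ip link set dev eth0 mtu 1460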

Yeah, we ran into a similar problem with TCP offloading in Rackspace Windows
images that produced EOFErrors as well, and Rackspace engineers produced some
magic incantations for Windows to turn off offloading, which made it go away.
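The Linux-side analogue, if you want to rule offloading in or out on the GCE
client or the NAT gateway, is ethtool; eth0 below is just an example:

  # Show current offload settings, then disable the segmentation/receive offloads
  ethtool -k eth0
  ethtool -K eth0 tso off gso off gro off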

Generally, EOFErrors like this seem to be not-a-Chef-bug and some kind of
busted networking.