Intermittent chef-expander problem


#1

We’ve an intermittent chef-expander problem… every few days it’ll
get invalid JSON from rabbitmq, the pattern’s always the same, 8
octets of corruption, the last 4 of which are always NUL. If we
restart rabbitmq & chef-expander the problem goes away for a few days,
then comes back… a (heavily snipped) example is:

[Thu, 25 Aug 2011 18:24:48 +0100] ERROR: cannot index object because it is inval
id JSON: {“action”:“add”,“payload”:{“enqueued_at”:1314293088,“id”:“92a1d57c-4579
-446c-8fda-c3331f8477d9”,“type”:“node”,“database”:“chef”,“item”:{

“dmi”:^@<80><9E>^Q^@^@^@^@{“address”:“0xEA2E0”,“firmware_revision”:“0.0”,

Any ideas?

This is Chef 0.10.0 on RHEL 5.5


Alex Kiernan


#2

On Fri, Aug 26, 2011 at 11:16 AM, Alex Kiernan alex.kiernan@gmail.com wrote:

We’ve an intermittent chef-expander problem… every few days it’ll
get invalid JSON from rabbitmq, the pattern’s always the same, 8
octets of corruption, the last 4 of which are always NUL. If we
restart rabbitmq & chef-expander the problem goes away for a few days,
then comes back… a (heavily snipped) example is:

[Thu, 25 Aug 2011 18:24:48 +0100] ERROR: cannot index object because it is inval
id JSON: {“action”:“add”,“payload”:{“enqueued_at”:1314293088,“id”:“92a1d57c-4579
-446c-8fda-c3331f8477d9”,“type”:“node”,“database”:“chef”,“item”:{

“dmi”:^@<80><9E>^Q^@^@^@^@{“address”:“0xEA2E0”,“firmware_revision”:“0.0”,

Any ideas?

This is Chef 0.10.0 on RHEL 5.5

Sorry should have added RabbitMQ is 2.5.1, Erlang is R14B.


Alex Kiernan


#3

On Friday, August 26, 2011 at 3:37 AM, Alex Kiernan wrote:

On Fri, Aug 26, 2011 at 11:16 AM, Alex Kiernan <alex.kiernan@gmail.com (mailto:alex.kiernan@gmail.com)> wrote:

We’ve an intermittent chef-expander problem… every few days it’ll
get invalid JSON from rabbitmq, the pattern’s always the same, 8
octets of corruption, the last 4 of which are always NUL. If we
restart rabbitmq & chef-expander the problem goes away for a few days,
then comes back… a (heavily snipped) example is:

Why restart chef-expander? Does the problem persist until you restart it?

[Thu, 25 Aug 2011 18:24:48 +0100] ERROR: cannot index object because it is inval
id JSON: {“action”:“add”,“payload”:{“enqueued_at”:1314293088,“id”:“92a1d57c-4579
-446c-8fda-c3331f8477d9”,“type”:“node”,“database”:“chef”,“item”:{

“dmi”:^@<80><9E>^Q^@^@^@^@{“address”:“0xEA2E0”,“firmware_revision”:“0.0”,

Any ideas?

This is Chef 0.10.0 on RHEL 5.5

Sorry should have added RabbitMQ is 2.5.1, Erlang is R14B.


Alex Kiernan
I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?


Dan DeLeo


#4

On Fri, Aug 26, 2011 at 4:53 PM, Daniel DeLeo dan@kallistec.com wrote:

On Friday, August 26, 2011 at 3:37 AM, Alex Kiernan wrote:

On Fri, Aug 26, 2011 at 11:16 AM, Alex Kiernan <alex.kiernan@gmail.com (mailto:alex.kiernan@gmail.com)> wrote:

We’ve an intermittent chef-expander problem… every few days it’ll
get invalid JSON from rabbitmq, the pattern’s always the same, 8
octets of corruption, the last 4 of which are always NUL. If we
restart rabbitmq & chef-expander the problem goes away for a few days,
then comes back… a (heavily snipped) example is:

Why restart chef-expander? Does the problem persist until you restart it?

Pretty sure we could just restart rabbitmq… not actually tried that,
but it would make sense for that to fix it since all that happens now
is the chef-expander child process dies, is respawned and promptly
dies with the exact same error (repeat until your disk fills with the
expander.log)

[Thu, 25 Aug 2011 18:24:48 +0100] ERROR: cannot index object because it is inval
id JSON: {“action”:“add”,“payload”:{“enqueued_at”:1314293088,“id”:“92a1d57c-4579
-446c-8fda-c3331f8477d9”,“type”:“node”,“database”:“chef”,“item”:{

“dmi”:^@<80><9E>^Q^@^@^@^@{“address”:“0xEA2E0”,“firmware_revision”:“0.0”,

Any ideas?

This is Chef 0.10.0 on RHEL 5.5

Sorry should have added RabbitMQ is 2.5.1, Erlang is R14B.

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

System is 64 bit, ruby is 64 bit:

ruby 1.8.7 (2010-04-19 patchlevel 253) [x86_64-linux], MBARI 0x6770,
Ruby Enterprise Edition 2010.02

I’ve been hacking on an omnibus version of chef-server
(https://github.com/akiernan/omnibus) maybe I should expend a bit more
effort getting that so we can actually roll it out, though I guess I’m
not clear that might not suffer from the same problems, though at
least we’d have an entirely isolated/known environment.

Thinking about it some more I think we’ve the same problem where
chef-client gets bogus JSON at startup on our clients (also 64bit
RHEL) - I get this problem on my workstation when using an omnibus
chef-client (0.10.4), so it’d suggest that the problem’s somewhere
underlying :frowning: In fact just checking the logs the corruption that sees
is also in the dmi object.


Alex Kiernan


#5

[Thu, 25 Aug 2011 18:24:48 +0100] ERROR: cannot index object because it is inval
id JSON: {“action”:“add”,“payload”:{“enqueued_at”:1314293088,“id”:“92a1d57c-4579
-446c-8fda-c3331f8477d9”,“type”:“node”,“database”:“chef”,“item”:{

“dmi”:^@<80><9E>^Q^@^@^@^@{“address”:“0xEA2E0”,“firmware_revision”:“0.0”,

Now you’ve given me the clue, I’m guessing this is a 64 bit little
endian pointer that’s gotten stamped in there…

Any ideas?

This is Chef 0.10.0 on RHEL 5.5

Sorry should have added RabbitMQ is 2.5.1, Erlang is R14B.

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

I’m guessing this:

might be the problem. Time to get that omnibus build working!


Alex Kiernan


#6

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

I’m guessing this:

https://github.com/flori/json/issues/46

might be the problem. Time to get that omnibus build working!

Tried this just on a client (once I’d hacked the gemspecs so it didn’t
continue to pick up 1.5.2) and I still got some corrupt JSON through,
but I’m guessing the JSON that a client starts with comes from the
server, so any corruption there would propagate through?

To test that theory I’ve just hacked up the server similarly…


Alex Kiernan


#7

[moved to chef-dev]

On Wed, Aug 31, 2011 at 11:12 AM, Alex Kiernan alex.kiernan@gmail.com wrote:

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

I’m guessing this:

https://github.com/flori/json/issues/46

might be the problem. Time to get that omnibus build working!

Tried this just on a client (once I’d hacked the gemspecs so it didn’t
continue to pick up 1.5.2) and I still got some corrupt JSON through,
but I’m guessing the JSON that a client starts with comes from the
server, so any corruption there would propagate through?

To test that theory I’ve just hacked up the server similarly…

And it’s just died in the same way. Though this time I just restarted
rabbitmq and chef-expander is (as expected) happy.

Anyone any ideas? It definitely looks like it’s native code corruption
:frowning: My thinking at the moment is to patch chef-expander to discard (and
log) invalid messages since there’s little point in it getting into a
die/fork/die loop.


Alex Kiernan


#8

On Wednesday, August 31, 2011 at 5:00 AM, Alex Kiernan wrote:

[moved to chef-dev]

On Wed, Aug 31, 2011 at 11:12 AM, Alex Kiernan <alex.kiernan@gmail.com (mailto:alex.kiernan@gmail.com)> wrote:

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

I’m guessing this:

https://github.com/flori/json/issues/46

might be the problem. Time to get that omnibus build working!

Tried this just on a client (once I’d hacked the gemspecs so it didn’t
continue to pick up 1.5.2) and I still got some corrupt JSON through,
but I’m guessing the JSON that a client starts with comes from the
server, so any corruption there would propagate through?

To test that theory I’ve just hacked up the server similarly…

And it’s just died in the same way. Though this time I just restarted
rabbitmq and chef-expander is (as expected) happy.

Anyone any ideas? It definitely looks like it’s native code corruption
:frowning: My thinking at the moment is to patch chef-expander to discard (and
log) invalid messages since there’s little point in it getting into a
die/fork/die loop.


Alex Kiernan
Just to be clear, you’re now running with the json gem version 1.5.4 [1] on both the client and the server and you’re still seeing this issue? When I’ve seen this problem on the client side, the server will drop the invalid JSON and will end up returning a 400 or 500. So I think what you’re seeing is that the server gets valid JSON in from the client, then generates the data to send to chef-expander, and then triggers the bug when converting this data to JSON.

Also, I agree that chef-expander should drop and log invalid messages. It’s a tricky balance between ensuring that messages get retried for intermittent errors and dropping messages that will always cause an error, but in this case log and drop is definitely correct.

  1. https://github.com/flori/json/commits/v1.5.4


Dan DeLeo


#9

On Wed, Aug 31, 2011 at 4:21 PM, Daniel DeLeo dan@kallistec.com wrote:

On Wednesday, August 31, 2011 at 5:00 AM, Alex Kiernan wrote:

[moved to chef-dev]

On Wed, Aug 31, 2011 at 11:12 AM, Alex Kiernan <alex.kiernan@gmail.com (mailto:alex.kiernan@gmail.com)> wrote:

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

I’m guessing this:

https://github.com/flori/json/issues/46

might be the problem. Time to get that omnibus build working!

Tried this just on a client (once I’d hacked the gemspecs so it didn’t
continue to pick up 1.5.2) and I still got some corrupt JSON through,
but I’m guessing the JSON that a client starts with comes from the
server, so any corruption there would propagate through?

To test that theory I’ve just hacked up the server similarly…

And it’s just died in the same way. Though this time I just restarted
rabbitmq and chef-expander is (as expected) happy.

Anyone any ideas? It definitely looks like it’s native code corruption
:frowning: My thinking at the moment is to patch chef-expander to discard (and
log) invalid messages since there’s little point in it getting into a
die/fork/die loop.

Just to be clear, you’re now running with the json gem version 1.5.4 [1] on both the client and the server and you’re still seeing this issue?

Almost but not quite 1.5.4 - what I had was HEAD from github with
commits up to fe046d68c5ed88b32b1cf3343babcf367b5cc79f, but I see
there’s work since then with more GC guard stuff so I’ll pull that and
give it a whirl (and the log references that exact defect - I could’ve
sworn I had those changes :|)

I’ve not got this code on all our clients, but the failure I observed
was from the client I upgraded (and verified using lsof… got bitten
by the version constraints first time around)

When I’ve seen this problem on the client side, the server will drop the invalid JSON and will end up returning a 400 or 500. So I think what you’re seeing is that the server gets valid JSON in from the client, then generates the data to send to chef-expander, and then triggers the bug when converting this data to JSON.

Yeah, that fits.

Also, I agree that chef-expander should drop and log invalid messages. It’s a tricky balance between ensuring that messages get retried for intermittent errors and dropping messages that will always cause an error, but in this case log and drop is definitely correct.

Cool will give that a go when I get a chance!


Alex Kiernan


#10

On Wed, Aug 31, 2011 at 5:23 PM, Alex Kiernan alex.kiernan@gmail.com wrote:

On Wed, Aug 31, 2011 at 4:21 PM, Daniel DeLeo dan@kallistec.com wrote:

On Wednesday, August 31, 2011 at 5:00 AM, Alex Kiernan wrote:

[moved to chef-dev]

On Wed, Aug 31, 2011 at 11:12 AM, Alex Kiernan <alex.kiernan@gmail.com (mailto:alex.kiernan@gmail.com)> wrote:

I’ve seen something like this in a different context that leads me to believe it’s a bug in the JSON gem or possibly Ruby. It seems to only occur on Red Hat systems. What version of ruby are you using? Is your system 64 bit? Is ruby 64 bit?

I’m guessing this:

https://github.com/flori/json/issues/46

might be the problem. Time to get that omnibus build working!

Tried this just on a client (once I’d hacked the gemspecs so it didn’t
continue to pick up 1.5.2) and I still got some corrupt JSON through,
but I’m guessing the JSON that a client starts with comes from the
server, so any corruption there would propagate through?

To test that theory I’ve just hacked up the server similarly…

And it’s just died in the same way. Though this time I just restarted
rabbitmq and chef-expander is (as expected) happy.

Anyone any ideas? It definitely looks like it’s native code corruption
:frowning: My thinking at the moment is to patch chef-expander to discard (and
log) invalid messages since there’s little point in it getting into a
die/fork/die loop.

Just to be clear, you’re now running with the json gem version 1.5.4 [1] on both the client and the server and you’re still seeing this issue?

Almost but not quite 1.5.4 - what I had was HEAD from github with
commits up to fe046d68c5ed88b32b1cf3343babcf367b5cc79f, but I see
there’s work since then with more GC guard stuff so I’ll pull that and
give it a whirl (and the log references that exact defect - I could’ve
sworn I had those changes :|)

Well… 16 hours in from pushing 1.5.4 on and we’re still running,
which is about as good as we’ve ever managed previously. We’ve
occasionally had several days so will leave it running and see how it
goes.

Also, I agree that chef-expander should drop and log invalid messages. It’s a tricky balance between ensuring that messages get retried for intermittent errors and dropping messages that will always cause an error, but in this case log and drop is definitely correct.

Cool will give that a go when I get a chance!

Had a look last night… I can’t for the life of me figure out what
the code path is which makes it exit when it gets invalid JSON as it
looks like it’s designed to cope :expressionless:


Alex Kiernan


#11

Had a look last night… I can’t for the life of me figure out what
the code path is which makes it exit when it gets invalid JSON as it
looks like it’s designed to cope :expressionless:


Alex Kiernan
Everything still working?

Also, do you still have a stack trace from when expander was crashing? Looks like we might need to explicitly return nil here:

https://github.com/opscode/chef/blob/master/chef-expander/lib/chef/expander/solrizer.rb#L110


Dan DeLeo


#12

On Fri, Sep 2, 2011 at 5:23 PM, Daniel DeLeo dan@kallistec.com wrote:

Had a look last night… I can’t for the life of me figure out what
the code path is which makes it exit when it gets invalid JSON as it
looks like it’s designed to cope :expressionless:

Everything still working?

It is. I didn’t want to jinx it by declaring success too soon, but if
we get through the weekend I think we can say it’s fixed.

Also, do you still have a stack trace from when expander was crashing?

I don’t :frowning: Once it’s clear it’s fixed I’ll swap the old json back in
and get a new one.

Looks like we might need to explicitly return nil here:
https://github.com/opscode/chef/blob/master/chef-expander/lib/chef/expander/solrizer.rb#L110

/me slaps head

yes - that makes perfect sense of the failure we were seeing… I’ll
make that change and swap back to the old json after the weekend and
see if that fixes it up.


Alex Kiernan


#13

On Fri, Sep 2, 2011 at 7:52 PM, Alex Kiernan alex.kiernan@gmail.com wrote:

On Fri, Sep 2, 2011 at 5:23 PM, Daniel DeLeo dan@kallistec.com wrote:

Had a look last night… I can’t for the life of me figure out what
the code path is which makes it exit when it gets invalid JSON as it
looks like it’s designed to cope :expressionless:

Everything still working?

It is. I didn’t want to jinx it by declaring success too soon, but if
we get through the weekend I think we can say it’s fixed.

Died over the weekend, same symptoms, though something (probably chef)
has put it back to my pre-hacked state.

Also, do you still have a stack trace from when expander was crashing?

I don’t :frowning: Once it’s clear it’s fixed I’ll swap the old json back in
and get a new one.

Just realised I was looking at chef-client stackstraces when I had
them not expander - does expander drop them anywhere I’m missing?

Looks like we might need to explicitly return nil here:
https://github.com/opscode/chef/blob/master/chef-expander/lib/chef/expander/solrizer.rb#L110

/me slaps head

yes - that makes perfect sense of the failure we were seeing… I’ll
make that change and swap back to the old json after the weekend and
see if that fixes it up.

With it broken this morning I left rabbitmq running and added in that
nil to chef-expander, it resolves the problem, you get the two log
messages you’d exect and the message gets discarded:

diff --git a/chef-expander/lib/chef/expander/solrizer.rb
b/chef-expander/lib/chef/expander/solrizer.rb
index 1a6fed9…350d0ba 100644
— a/chef-expander/lib/chef/expander/solrizer.rb
+++ b/chef-expander/lib/chef/expander/solrizer.rb
@@ -108,6 +108,7 @@ module Chef
Yajl::Parser.parse(serialized_object)
rescue Yajl::ParseError
log.error { “cannot index object because it is invalid JSON:
#{serialized_object}” }

  •    nil
     end
    
     def run
    

I’ll throw in a ticket later.


Alex Kiernan


#14

On Mon, Sep 5, 2011 at 10:11 AM, Alex Kiernan alex.kiernan@gmail.com wrote:

On Fri, Sep 2, 2011 at 7:52 PM, Alex Kiernan alex.kiernan@gmail.com wrote:

On Fri, Sep 2, 2011 at 5:23 PM, Daniel DeLeo dan@kallistec.com wrote:

Had a look last night… I can’t for the life of me figure out what
the code path is which makes it exit when it gets invalid JSON as it
looks like it’s designed to cope :expressionless:

Everything still working?

It is. I didn’t want to jinx it by declaring success too soon, but if
we get through the weekend I think we can say it’s fixed.

Died over the weekend, same symptoms, though something (probably chef)
has put it back to my pre-hacked state.

Sorry that wasn’t very clear… chef-server has gotten rolled back to
json 1.4.6 and isn’t using 1.5.4, so I’m still pretty sure that this
is the right fix, it just needs to be applied in a way which means the
chef-server doesn’t come along and roll it back.


Alex Kiernan