Of Ohai plugins and chef server crashes


#1

Hi Chefs,

I have a bit of a mystery (at least to me) on my hands. One of my
environments has hit the Solr maxFieldLength issue
(http://tickets.opscode.com/browse/CHEF-2346), and following the advice
in the ticket hasn’t worked since it just led to a server crash. So
after ignoring the problem for the last couple of months, the pain of not
having half of my production nodes available in search became
unbearable. The idea this time was to attack the size of the node
object itself. Since this environment was linked to a larger Active
Directory domain, the ‘etc’ hash that Ohai creates was pretty big, so I
decided to remove it by disabling the passwd plugin. To do this I
added a small bit of code to the client.rb.erb template:

<% if node.attribute?("ohai") && node["ohai"].attribute?("disabled_plugins") -%>
Ohai::Config[:disabled_plugins] = [<%= node["ohai"]["disabled_plugins"].join(",") %>]
<% end -%>

and this to the environment:

"ohai": {
  "disabled_plugins": [
    "\"passwd\""
  ]
},
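
As a rough illustration of why the environment value carries escaped quotes, the sketch below renders the template line outside of Chef. The `render` helper, the `plugins` local, and the `WITH_INSPECT` variant are hypothetical names introduced here for illustration and are not part of the original setup. Because the posted template joins the array elements verbatim, each element has to bring its own quotes. A variant that calls `inspect` would let the environment hold the plain name instead.

```ruby
require "erb"

# Template line as posted: elements are joined verbatim, so each one must
# already contain its own quotes (hence "\"passwd\"" in the environment).
AS_POSTED = 'Ohai::Config[:disabled_plugins] = [<%= plugins.join(",") %>]'

# Hypothetical variant (not from the thread): inspect adds the quotes, so
# the environment could store the plain string "passwd".
WITH_INSPECT = 'Ohai::Config[:disabled_plugins] = [<%= plugins.map(&:inspect).join(", ") %>]'

# Render a template string with `plugins` visible as a local variable.
def render(template, plugins)
  ERB.new(template).result(binding)
end

puts render(AS_POSTED, ['"passwd"'])
puts render(WITH_INSPECT, ["passwd"])
```

Both calls should produce the same line in the generated client.rb, so the choice only affects how the attribute is written in the environment.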

I added the disabled plugin to one of the environments in our lab and
everything went pretty smoothly. I saw a slight increase in the server
CPU load, but it was still within tolerance. After letting that burn in for
24 hours I got the go-ahead to apply this to the one production
environment that wasn’t getting indexed. My nodes are on the default
30-minute interval, so about an hour after making the change the CPU
usage for the chef-server process went to 100%, it started to
consume all the available memory, and it eventually stopped responding to
clients. Restarting the process didn’t help, as it would immediately hit
100% usage and quickly consume all the memory/swap that was regained.

I rolled the changes back and spent the next couple of hours babysitting
the server, and eventually ended up restarting chef-client on all the
nodes in that environment. My question is, why would disabling an Ohai
plugin do this? Or did it? Disabling it in ‘test’ didn’t have the
same result. The test chef-server and the production chef-server share
the same couch/rabbit/solr/expander, and all the clients pass
through a proxy which decides which server to send them to, so I’m
pretty certain they’re configured the same. The only real difference
is that production has about 60 more clients.

Any thoughts/suggestions on what this could be? Or more likely, what I
screwed up?

Thanks

– chris

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.


#2

On Friday, April 27, 2012 at 8:39 AM, Chris wrote:

Any thoughts/suggestions on what this could be? Or more likely, what I
screwed up?

It’s hard to imagine how these could be related. Did you get any more info about what chef server was doing? Was there a stack trace when you killed it? Did the CPU spike happen during a particular request, or during the startup routines?


Dan DeLeo


#3

Unfortunately the server was set to :warn. But this is the last thing
in the log prior to the restart: https://gist.github.com/2510531

Looks like it can’t connect to Solr? Solr is remote and set to
solr_url "http://servername:8983/solr" in server.rb. The /solr was
added recently after a 0.10.4 to 0.10.8 upgrade.

On Fri, Apr 27, 2012 at 8:57 AM, Daniel DeLeo dan@kallistec.com wrote:

It’s hard to imagine how these could be related. Did you get any more info about what chef server was doing? Was there a stack trace when you killed it? Did the CPU spike happen during a particular request, or during the startup routines?



#4

On Friday, April 27, 2012 at 9:23 AM, Chris wrote:

Unfortunately the server was set to :warn. But this is the last thing
in the log prior to the restart: https://gist.github.com/2510531

Looks like it can’t connect to Solr? Solr is remote and set to
solr_url "http://servername:8983/solr" in server.rb. The /solr was
added recently after a 0.10.4 to 0.10.8 upgrade.

Connection refused indicates that nothing is listening on that port. Is Solr up? Can you talk to it with curl?
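
For anyone who would rather check this from Ruby than from curl, here is a minimal sketch of a TCP reachability test. The `port_open?` helper is a hypothetical name introduced for illustration, and the host and port in the usage comment are simply the placeholder values from the solr_url above. It only tells you whether something accepts connections on the port, not whether Solr itself is healthy.

```ruby
require "socket"

# Returns true if something accepts TCP connections on host:port.
# A refused connection (nothing listening) or an unreachable host
# returns false instead of raising.
def port_open?(host, port, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Errno::ETIMEDOUT, SocketError
  false
end

# Example usage, run from the API server (placeholder host from the thread):
#   puts port_open?("servername", 8983)
```

If this returns true while chef-server still logs connection errors, the listening address or the URL path would be the next things to look at.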


Dan DeLeo


#5

Yes, Solr is running and is reachable from the API server. All the
nodes that were missing got indexed too.

On Fri, Apr 27, 2012 at 9:29 AM, Daniel DeLeo dan@kallistec.com wrote:

Connection refused indicates that nothing is listening on that port. Is Solr up? Can you talk to it with curl?

