Problems with Search?

Some of my nagios checks started failing just now because the search that fills out allowed_hosts in nrpe.cfg came back empty. This is the second time this has happened to me; the first was immediately after upgrading to chef-server 0.9.12.

The clients in question are still on 0.9.8.

My cookbook code is (probably the standard nagios::client cookbook, possibly a few versions out of date):

search(:node, "roles:monitoring AND app_environment:#{node[:app_environment]}") do |n|
  mon_host << n['ipaddress']
end

...

template "/etc/nagios/nrpe.cfg" do
  source "nrpe.cfg.erb"
  owner "nagios"
  group "nagios"
  mode "0644"
  variables :mon_host => mon_host
  notifies :restart, resources(:service => "nagios-nrpe-server")
end

The initial problem on upgrading the server was that the search had previously been "role:...", which was working fine but needed to be changed to "roles:...".

However, this time nothing changed, and simply logging in to the problem hosts and re-running chef-client fixed it.

What went wrong, and how do I track it down so it doesn't happen again?

-ash

Ash, can you give us some gists of the log output from the Chef
Server at the time?

Solr logs would help as well.

Adam

--
Opscode, Inc.
Adam Jacob, Chief Product Officer
T: (206) 619-7151 E: adam@opscode.com

https://gist.github.com/860921611bbcaca7b338

Let me know if you need more and I'll get it in the morning - it's 00:30 here now.

-ash

Odd - can you give us another gist with some of the solr log going
further back, say a few minutes before where the log starts now?

In your chef server logs, what was happening with node convergence for
the machine with the actual monitoring role at the time the search
request came in?

Adam

I reported the same issue under the title "search sometimes returns no
result"; I pasted my logs in that email.

On Thu, Jan 27, 2011 at 8:54 AM, Gilles Devaux gilles.devaux@gmail.com wrote:

I reported the same issue under the title "search sometimes returns no
result", I pasted my logs in the email.

Is this a search for one specific host? Do the bad searches coincide
with chef client runs on the searched-for system? If so, I think this
is a resolved bug on the client, and you can fix it by upgrading the
client on the searched-for box to 0.9.12.

The issue I'm thinking of is that the node is saved at the beginning
of the run so that it has the correct run list when the client asks
the server for the list of cookbooks to sync. However, some of the
attributes have been wiped and not yet rebuilt when this node save
occurs. Sound like your issue?

Thanks,
Dan
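
A minimal sketch of the window Dan describes, in plain Ruby with no Chef dependency. FakeIndex, the node name nagios1, the address 10.0.0.5 and the choice of which attributes go missing in the early save are all assumptions made for illustration; this is not how the Chef server or Solr index is implemented.

```ruby
# Hypothetical stand-in for the server-side search index (not a Chef API).
class FakeIndex
  def initialize
    @nodes = {}
  end

  # Saving a node replaces whatever was indexed for it before.
  def save(name, attrs)
    @nodes[name] = attrs
  end

  # Return the nodes whose attributes match every key/value pair given.
  def search(criteria)
    @nodes.values.select do |attrs|
      criteria.all? { |key, value| Array(attrs[key]).include?(value) }
    end
  end
end

index = FakeIndex.new
query = { 'roles' => 'monitoring', 'app_environment' => 'production' }

# 1. End of a previous, complete run on the monitoring host.
full = { 'roles' => ['monitoring'], 'app_environment' => 'production', 'ipaddress' => '10.0.0.5' }
index.save('nagios1', full)
before = index.search(query).map { |n| n['ipaddress'] }

# 2. Start of the next run: the node is saved early with only part of its
#    data (assumed here to be just the run list).
index.save('nagios1', { 'run_list' => ['role[monitoring]'] })
during = index.search(query).map { |n| n['ipaddress'] }

# 3. End of that run: the full attribute set is saved again.
index.save('nagios1', full)
after = index.search(query).map { |n| n['ipaddress'] }

puts "before: #{before.inspect}, during: #{during.inspect}, after: #{after.inspect}"
```

A client that happens to search during step 2 gets an empty list, which would match the symptom of re-running chef-client a little later and finding everything fine.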

This could be it, yes. Basically I'm searching on the nrpe client machines for the host running nagios, to allow nrpe connections from just that host, and all my clients are still on 0.9.8.

-ash

I've had similar problems in the past where nagios would fail to restart
because a client saved its node[:ipaddress] as nil, and the next chef-client
run on my nagios server broke it. Anecdotally, it only happens when I reboot
a node. I suspect there's a race condition at startup where chef-client
starts running (and collects the ohai attributes) before the node gets a
DHCP lease, and ends up writing a nil node[:ipaddress] when the run
completes.

Perhaps you have the same problem, but with the node[:ipaddress]
disappearing from your nagios node rather than the monitored nodes?

-Paul
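
One way to keep a nil node[:ipaddress] like the one Paul describes out of a generated config is to filter the search results before using them. This is only a sketch: usable_addresses is a hypothetical helper name, and plain hashes stand in for the node objects a real search would yield.

```ruby
# Hypothetical helper: keep only results that carry a usable IP address.
def usable_addresses(nodes)
  nodes.map { |n| n['ipaddress'] }
       .reject { |ip| ip.nil? || ip.to_s.strip.empty? }
       .uniq
end

# Made-up sample data standing in for search results.
results = [
  { 'name' => 'web1', 'ipaddress' => '10.0.0.11' },
  { 'name' => 'web2', 'ipaddress' => nil }, # e.g. saved before a DHCP lease arrived
  { 'name' => 'web3' }                      # attribute missing altogether
]

addresses = usable_addresses(results)
puts addresses.inspect
```

In a recipe the same filter could sit inside the search block, skipping any node whose ipaddress is nil instead of appending it.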

I've seen it happen on a machine that gets into an OOM condition when
running ohai. The partial data for the run still gets submitted and indexed
into the search engine. We tend to write very defensive recipes when using
search to guard against malformed data. We also switched to starting chef
from cron, though I think the memory leak issues we saw in daemon mode are
mostly fixed.

Jason
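
As an illustration of the kind of defensive handling Jason mentions, the sketch below refuses to accept an empty search result outright. The helper name allowed_hosts and the idea of keeping a "last known good" list (for example, read from the nrpe.cfg already on disk) are assumptions for this example, not something taken from the nagios cookbook.

```ruby
# Hypothetical guard: prefer fresh results, fall back to the last known good
# list, and fail loudly if neither is available.
def allowed_hosts(fresh, last_known_good)
  return fresh unless fresh.empty?
  return last_known_good unless last_known_good.empty?

  raise ArgumentError, 'search returned no monitoring hosts and no previous list exists'
end

puts allowed_hosts(['10.0.0.5'], []).inspect # fresh result wins
puts allowed_hosts([], ['10.0.0.5']).inspect # empty search, previous list kept

begin
  allowed_hosts([], [])
rescue ArgumentError => e
  puts "refusing to write nrpe.cfg: #{e.message}"
end
```

Failing the run when there is nothing to fall back on leaves the existing nrpe.cfg untouched, so a transient empty search does not lock the monitoring host out.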