Chef-expander not working / search problem

Nanukkrinner · September 11, 2012, 2:45pm

Hi all,

I have a problem with my Chef 10.12.0 installation as 'knife search'
gives no results. I think it is connected to chef-expander.

Here is what I have done so far:

I start chef-expander with

chef-expander -n 1 -i 1 -l debug

and chef-solr with

chef-solr -l debug

The console output can be seen here: [2012-09-11T13:56:49+00:00] INFO: Starting Solr with java -Xmx256M -Xms256M -Dso - Pastebin.com

chef-expander and a worker are running:

ps aux | grep expander

root 16510 0.1 0.1 150860 18672 pts/0 Sl 13:55 0:00
/usr/bin/ruby1.9 /usr/bin/chef-expander -n 1 -i 1 -l debug
root 16515 0.1 0.1 153924 19484 pts/0 Sl 13:55 0:00
chef-expander worker #1 (vnodes 0-1023)

But when I look at the node status, I get the following output:

chef-expanderctl node-status

Collecting status info from the cluster...done

Host PID GUID Vnodes

No entries. I think this wrong and there should be something, right?

All the other chef-expanderctl commands like queue-depth or queue-status
just do nothing and I have to kill them by myself with Ctrl-C. There is
also nothing in the logs which you can see here:
[Tue, 11 Sep 2012 13:55:53 +0000] INFO: Chef Expander 10.12.0 starting cluster - Pastebin.com . I expected it to have more entries. That
I ran chef-expanderctl queue-depth is not visible in the log.

Also all searches done with 'knife search' always gve zero results. I
already adjusted maxFieldLength though.

Any clues where my problem is? I would especially like to know what you
think abut the output of 'chef-serverctl node-status'.

Thanks in advance.

Regards,
Nanuk

Nanukkrinner · September 11, 2012, 8:14pm

Hi all,

When I follow the chef-expander and the chef-expander worker process
with strace, I get the following output:

strace -p

restart_syscall(<… resuming interrupted call …>) = -1 ETIMEDOUT
(Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19547, {2782149, 248024819},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19549, {2782150, 248305174},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
[…]

For the chef-expander worker I get the following:

strace -p

select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
getppid() = 20344
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 1 (in [10], left {0, 82382})
read(10, “\1\0\1\0\0\0\26\0002\0\v\tvnode-812\0\0\0\0\0\0\0\1\316”,
16384) = 30
rt_sigprocmask(SIG_SETMASK, ~[INT KILL SEGV TERM STOP VTALRM RTMIN
RT_1], NULL, 8) = 0
read(10, 0x7fff7cb07d70, 16384) = -1 EAGAIN (Resource
temporarily unavailable)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
getppid()
[…]

I don’t really understand the output, but I get the impression something
is wrong here. How does it look on a working system?

On very rare occasions I get something like this in the chef-expander
worker strace output:

getppid() = 20344
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], [], {0, 90000}) = 1 (out [10], left {0, 89997})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-0\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-1\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-2\1\0\0\0\0\316", 27}], 3) = 81
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], [], {0, 90000}) = 1 (out [10], left {0, 89995})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-3\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-4\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-5\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-6\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-7\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-8\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-9\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-10\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-11\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-12\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-13\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-14\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-15\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-16\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-17\1\0\0\0\0\316", 28}], 15) = 413
[…]

Is this how it looks when chef-expander does what it should?

Thanks in advance.

Regards,
Nanuk

kallistec · September 12, 2012, 1:11am

On Tuesday, September 11, 2012 at 1:14 PM, Nanuk Krinner wrote:

Hi all,

When I follow the chef-expander and the chef-expander worker process
with strace, I get the following output:

strace -p

restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT
(Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19547, {2782149, 248024819},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19549, {2782150, 248305174},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
[...]

For the chef-expander worker I get the following:

strace -p

select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
getppid() = 20344
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 1 (in [10], left {0, 82382})
read(10, "\1\0\1\0\0\0\26\0002\0\v\tvnode-812\0\0\0\0\0\0\0\1\316",
16384) = 30
rt_sigprocmask(SIG_SETMASK, ~[INT KILL SEGV TERM STOP VTALRM RTMIN
RT_1], NULL, 8) = 0
read(10, 0x7fff7cb07d70, 16384) = -1 EAGAIN (Resource
temporarily unavailable)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
getppid()
[...]

I don't really understand the output, but I get the impression something
is wrong here. How does it look on a working system?

On very rare occasions I get something like this in the chef-expander
worker strace output:

getppid() = 20344
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], , {0, 90000}) = 1 (out [10], left {0, 89997})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-0\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-1\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-2\1\0\0\0\0\316", 27}], 3) = 81
select(11, [8 10], , , {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], , {0, 90000}) = 1 (out [10], left {0, 89995})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-3\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-4\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-5\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-6\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-7\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-8\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-9\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-10\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-11\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-12\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-13\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-14\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-15\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-16\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-17\1\0\0\0\0\316", 28}], 15) = 413
[...]

Is this how it looks when chef-expander does what it should?

Thanks in advance.

Regards,
Nanuk

Is there a firewall dropping packets to rabbitmq?

--
Daniel DeLeo

Nanukkrinner · September 12, 2012, 2:33pm

Hi,

On 09/12/2012 03:11 AM, Daniel DeLeo wrote:
[...]

Is there a firewall dropping packets to rabbitmq?

Rabbitmq, Couchdb, chef-expander and chef-solr are all running on the
same machine and communication between the components seems to work.

I tested to talk to rabbitmq-server and chef-solr with curl and got
replies. I also listened to the solr port with nc and started a 'knife
index rebuild' command and could see incoming traffic. This would have
come through chef-expander.

The problem remains that 'knife search' does not work properly. To
clarify my problems:

I have two nodes. with 'knife search node :' I sometimes get one
result - the node where all the components are installed. The other one
never appeared.

I have number of roles set up. With 'knife search role :' I always get
only one single role as a result. The role I created first as a test
role. base_example or other roles that were created never show up in the
results.

Of course I can see roles and nodes and everything via the 'knife
list' queries.

So at least at some point the process did work somehow. Also what I find
strange is that all 'chef-expanderctl ' queries except
'chef-expanderctl node-status' do fail, while the later one just shows
no data.

The logs show no errors, so I am kinda clueless now where to search for
the source of my problem. I will post the solrconfig.xml below, but
everything looks ok for me there.

Any ideas what my problem could be?

Regards,
Nanuk

solarconfig.xml:

<?xml version="1.0" encoding="UTF-8" ?>

${solr.abortOnConfigurationError:true}

${solr.data.dir:./solr/data}

 <useCompoundFile>false</useCompoundFile>

 <mergeFactor>10</mergeFactor>

 <ramBufferSizeMB>32</ramBufferSizeMB>
 <maxMergeDocs>2147483647</maxMergeDocs>
 <maxFieldLength>2000000</maxFieldLength>
 <writeLockTimeout>1000</writeLockTimeout>
 <commitLockTimeout>10000</commitLockTimeout>


 <lockType>single</lockType>

false 32 25 2147483647 2000000 false true false 1

 <autoCommit>
   <maxDocs>100</maxDocs>
   <maxTime>10000</maxTime>
 </autoCommit>

 <maxBooleanClauses>1024</maxBooleanClauses>

 <filterCache
   class="solr.FastLRUCache"
   size="512"
   initialSize="512"
   autowarmCount="128"/>

 <queryResultCache
   class="solr.LRUCache"
   size="512"
   initialSize="512"
   autowarmCount="32"/>

 <documentCache
   class="solr.LRUCache"
   size="512"
   initialSize="512"
   autowarmCount="0"/>

 <enableLazyFieldLoading>true</enableLazyFieldLoading>

 <queryResultWindowSize>50</queryResultWindowSize>

 <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

 <useColdSearcher>false</useColdSearcher>

 <maxWarmingSearchers>20</maxWarmingSearchers>

 <httpCaching lastModifiedFrom="openTime"
              etagSeed="Solr">

 </httpCaching>

  <lst name="defaults">
    <str name="echoParams">explicit</str>

  </lst>

standard solrpingquery all explicit true solr

Topic		Replies	Views
Chef Solr/search issue Chef Infra (archive)	3	318	August 12, 2011
Chef Search Broken Chef Infra (archive)	5	804	December 5, 2012
Knife search node name does not yield results Chef Infra (archive)	0	318	June 22, 2012
Upgrade from 0.9.x -> 0.10.4: search ':' finds all hosts, search ''name:*" gets 0 results Chef Infra (archive)	2	283	September 5, 2011
Trouble with Chef-solr indexing Chef Infra (archive)	2	304	January 13, 2013

Chef-expander not working / search problem

chef-expander -n 1 -i 1 -l debug

chef-solr -l debug

ps aux | grep expander

But when I look at the node status, I get the following output:

chef-expanderctl node-status

When I follow the chef-expander and the chef-expander worker process with strace, I get the following output:

strace -p

For the chef-expander worker I get the following:

strace -p

On very rare occasions I get something like this in the chef-expander worker strace output:

When I follow the chef-expander and the chef-expander worker process with strace, I get the following output:

strace -p

For the chef-expander worker I get the following:

strace -p

On very rare occasions I get something like this in the chef-expander worker strace output:

solarconfig.xml:

Related topics

When I follow the chef-expander and the chef-expander worker process
with strace, I get the following output:

On very rare occasions I get something like this in the chef-expander
worker strace output:

When I follow the chef-expander and the chef-expander worker process
with strace, I get the following output:

On very rare occasions I get something like this in the chef-expander
worker strace output: