Chef-expander not working / search problem


#1

Hi all,

I have a problem with my Chef 10.12.0 installation as 'knife search’
gives no results. I think it is connected to chef-expander.

Here is what I have done so far:

I start chef-expander with

chef-expander -n 1 -i 1 -l debug

and chef-solr with

chef-solr -l debug

The console output can be seen here: http://pastebin.com/QX8bUppv

chef-expander and a worker are running:

ps aux | grep expander

root 16510 0.1 0.1 150860 18672 pts/0 Sl 13:55 0:00
/usr/bin/ruby1.9 /usr/bin/chef-expander -n 1 -i 1 -l debug
root 16515 0.1 0.1 153924 19484 pts/0 Sl 13:55 0:00
chef-expander worker #1 (vnodes 0-1023)

But when I look at the node status, I get the following output:

chef-expanderctl node-status

Collecting status info from the cluster…done

Host PID GUID Vnodes


No entries. I think this wrong and there should be something, right?

All the other chef-expanderctl commands like queue-depth or queue-status
just do nothing and I have to kill them by myself with Ctrl-C. There is
also nothing in the logs which you can see here:
http://pastebin.com/vV8ppMJ4 . I expected it to have more entries. That
I ran chef-expanderctl queue-depth is not visible in the log.

Also all searches done with ‘knife search’ always gve zero results. I
already adjusted maxFieldLength though.

Any clues where my problem is? I would especially like to know what you
think abut the output of ‘chef-serverctl node-status’.

Thanks in advance.

Regards,
Nanuk


#2

Hi all,

When I follow the chef-expander and the chef-expander worker process
with strace, I get the following output:

strace -p

restart_syscall(<… resuming interrupted call …>) = -1 ETIMEDOUT
(Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19547, {2782149, 248024819},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19549, {2782150, 248305174},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
[…]

For the chef-expander worker I get the following:

strace -p

select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
getppid() = 20344
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 1 (in [10], left {0, 82382})
read(10, “\1\0\1\0\0\0\26\0002\0\v\tvnode-812\0\0\0\0\0\0\0\1\316”,
16384) = 30
rt_sigprocmask(SIG_SETMASK, ~[INT KILL SEGV TERM STOP VTALRM RTMIN
RT_1], NULL, 8) = 0
read(10, 0x7fff7cb07d70, 16384) = -1 EAGAIN (Resource
temporarily unavailable)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
getppid()
[…]

I don’t really understand the output, but I get the impression something
is wrong here. How does it look on a working system?

On very rare occasions I get something like this in the chef-expander
worker strace output:

getppid() = 20344
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], [], {0, 90000}) = 1 (out [10], left {0, 89997})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-0\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-1\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-2\1\0\0\0\0\316", 27}], 3) = 81
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], [], {0, 90000}) = 1 (out [10], left {0, 89995})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-3\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-4\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-5\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-6\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-7\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-8\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-9\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-10\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-11\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-12\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-13\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-14\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-15\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-16\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-17\1\0\0\0\0\316", 28}], 15) = 413
[…]

Is this how it looks when chef-expander does what it should?

Thanks in advance.

Regards,
Nanuk


#3

On Tuesday, September 11, 2012 at 1:14 PM, Nanuk Krinner wrote:

Hi all,

When I follow the chef-expander and the chef-expander worker process
with strace, I get the following output:

strace -p

restart_syscall(<… resuming interrupted call …>) = -1 ETIMEDOUT
(Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19547, {2782149, 248024819},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x603d98, FUTEX_WAKE_PRIVATE, 1) = 0
wait4(20349, 0x7fff7cb0c52c, WNOHANG, NULL) = 0
futex(0x603d34, FUTEX_WAIT_BITSET_PRIVATE, 19549, {2782150, 248305174},
ffffffff) = -1 ETIMEDOUT (Connection timed out)
[…]

For the chef-expander worker I get the following:

strace -p

select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
getppid() = 20344
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 1 (in [10], left {0, 82382})
read(10, “\1\0\1\0\0\0\26\0002\0\v\tvnode-812\0\0\0\0\0\0\0\1\316”,
16384) = 30
rt_sigprocmask(SIG_SETMASK, ~[INT KILL SEGV TERM STOP VTALRM RTMIN
RT_1], NULL, 8) = 0
read(10, 0x7fff7cb07d70, 16384) = -1 EAGAIN (Resource
temporarily unavailable)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
getppid()
[…]

I don’t really understand the output, but I get the impression something
is wrong here. How does it look on a working system?

On very rare occasions I get something like this in the chef-expander
worker strace output:

getppid() = 20344
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], [], {0, 90000}) = 1 (out [10], left {0, 89997})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-0\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-1\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-2\1\0\0\0\0\316", 27}], 3) = 81
select(11, [8 10], [], [], {0, 90000}) = 0 (Timeout)
select(11, [8 10], [10], [], {0, 90000}) = 1 (out [10], left {0, 89995})
writev(10, [{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-3\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-4\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-5\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-6\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-7\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-8\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\23\0002\0\n\0\1\7vnode-9\1\0\0\0\0\316", 27},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-10\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-11\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-12\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-13\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-14\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-15\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-16\1\0\0\0\0\316", 28},
{"\1\0\1\0\0\0\24\0002\0\n\0\1\10vnode-17\1\0\0\0\0\316", 28}], 15) = 413
[…]

Is this how it looks when chef-expander does what it should?

Thanks in advance.

Regards,
Nanuk

Is there a firewall dropping packets to rabbitmq?


Daniel DeLeo


#4

Hi,

On 09/12/2012 03:11 AM, Daniel DeLeo wrote:
[…]

Is there a firewall dropping packets to rabbitmq?

Rabbitmq, Couchdb, chef-expander and chef-solr are all running on the
same machine and communication between the components seems to work.

I tested to talk to rabbitmq-server and chef-solr with curl and got
replies. I also listened to the solr port with nc and started a ‘knife
index rebuild’ command and could see incoming traffic. This would have
come through chef-expander.

The problem remains that ‘knife search’ does not work properly. To
clarify my problems:

I have two nodes. with ‘knife search node :’ I sometimes get one
result - the node where all the components are installed. The other one
never appeared.

I have number of roles set up. With ‘knife search role :’ I always get
only one single role as a result. The role I created first as a test
role. base_example or other roles that were created never show up in the
results.

Of course I can see roles and nodes and everything via the ‘knife
list’ queries.

So at least at some point the process did work somehow. Also what I find
strange is that all ‘chef-expanderctl ’ queries except
’chef-expanderctl node-status’ do fail, while the later one just shows
no data.

The logs show no errors, so I am kinda clueless now where to search for
the source of my problem. I will post the solrconfig.xml below, but
everything looks ok for me there.

Any ideas what my problem could be?

Regards,
Nanuk

solarconfig.xml:

<?xml version="1.0" encoding="UTF-8" ?>

${solr.abortOnConfigurationError:true}

${solr.data.dir:./solr/data}

 <useCompoundFile>false</useCompoundFile>

 <mergeFactor>10</mergeFactor>

 <ramBufferSizeMB>32</ramBufferSizeMB>
 <maxMergeDocs>2147483647</maxMergeDocs>
 <maxFieldLength>2000000</maxFieldLength>
 <writeLockTimeout>1000</writeLockTimeout>
 <commitLockTimeout>10000</commitLockTimeout>


 <lockType>single</lockType>
false 32 25 2147483647 2000000 false true false 1
 <autoCommit>
   <maxDocs>100</maxDocs>
   <maxTime>10000</maxTime>
 </autoCommit>
 <maxBooleanClauses>1024</maxBooleanClauses>

 <filterCache
   class="solr.FastLRUCache"
   size="512"
   initialSize="512"
   autowarmCount="128"/>

 <queryResultCache
   class="solr.LRUCache"
   size="512"
   initialSize="512"
   autowarmCount="32"/>

 <documentCache
   class="solr.LRUCache"
   size="512"
   initialSize="512"
   autowarmCount="0"/>

 <enableLazyFieldLoading>true</enableLazyFieldLoading>

 <queryResultWindowSize>50</queryResultWindowSize>

 <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

 <useColdSearcher>false</useColdSearcher>

 <maxWarmingSearchers>20</maxWarmingSearchers>
 <httpCaching lastModifiedFrom="openTime"
              etagSeed="Solr">

 </httpCaching>

  <lst name="defaults">
    <str name="echoParams">explicit</str>

  </lst>

standard solrpingquery all explicit true solr