[RESOLVED] Maximum knife SSH nodes?

I have a chef 12.11.1 server with 1278 nodes. If I do either
knife node list | wc -l
-or-
knife search node "*" -i | wc -l

I get 1278 nodes. These are all linux hosts and that’s about the number I expect. However, when I do a:
knife ssh "*:*" -- hostname

I only get 996 lines of results. What’s more telling is that even if I alter the search a bit, I always get 996 lines back.

For example, I have some hosts that are currently turned off. If I don’t exclude those hosts, I get one error line for each host it cannot SSH into, and the total number of lines returned is 996. If I do exclude them, the error lines are gone, but I still get 996 lines of output. If I exclude a few other hosts, I still get 996 lines of results.

So it seems that somewhere, I’m hitting a limit. I’ve tried this with a “knife ssh -VV” and I get the following:

jrhodes-mbp:chef-repo jrhodes$ knife ssh -VV -x root "*:*" -- hostname
INFO: Using configuration from /Users/jrhodes/workspace/chef-repo/.chef/knife.rb
DEBUG: Chef::HTTP calling Chef::HTTP::JSONInput#handle_request
DEBUG: Chef::HTTP calling Chef::HTTP::JSONOutput#handle_request
DEBUG: Chef::HTTP calling Chef::HTTP::CookieManager#handle_request
DEBUG: Chef::HTTP calling Chef::HTTP::Decompressor#handle_request
DEBUG: Chef::HTTP calling Chef::HTTP::Authenticator#handle_request
DEBUG: Signing the request as jrhodes
DEBUG: Chef::HTTP calling Chef::HTTP::RemoteRequestID#handle_request
DEBUG: Chef::HTTP calling Chef::HTTP::ValidateContentLength#handle_request
DEBUG: Initiating GET to https://vap-chef.tpn.thinkingphones.net/organizations/fuze/search/node?q=*:*&sort=X_CHEF_id_CHEF_X%20asc&start=0
DEBUG: ---- HTTP Request Header Data: ----
DEBUG: Accept: application/json
DEBUG: Accept-Encoding: gzip;q=1.0,deflate;q=0.6,identity;q=0.3
DEBUG: X-Ops-Server-API-Version: 1
DEBUG: X-OPS-SIGN: algorithm=sha1;version=1.1;
DEBUG: X-OPS-USERID: jrhodes
DEBUG: X-OPS-TIMESTAMP: 2017-01-19T13:33:25Z
DEBUG: X-OPS-CONTENT-HASH: 2jmj7l5rSw0yVb/vlWAYkK/YBwk=
DEBUG: X-OPS-AUTHORIZATION-1: removed
DEBUG: X-OPS-AUTHORIZATION-2: removed
DEBUG: X-OPS-AUTHORIZATION-3: removed
DEBUG: X-OPS-AUTHORIZATION-4: removed
DEBUG: X-OPS-AUTHORIZATION-5: removed
DEBUG: X-OPS-AUTHORIZATION-6: removed
DEBUG: HOST: vap-chef.tpn.thinkingphones.net:443
DEBUG: X-REMOTE-REQUEST-ID: 20499a7e-cdb3-4705-8c6c-2221807920df
DEBUG: ---- End HTTP Request Header Data ----
DEBUG: ---- HTTP Status and Header Data: ----
DEBUG: HTTP 1.1 200 OK
DEBUG: date: Thu, 19 Jan 2017 13:33:27 GMT
DEBUG: content-type: application/json
DEBUG: transfer-encoding: chunked
DEBUG: connection: close
DEBUG: server: openresty/1.11.2.1
DEBUG: x-ops-server-api-version: {"min_version":"0","max_version":"1","request_version":"1","response_version":"1"}
DEBUG: x-ops-api-info: flavor=cs;version=12.0.0;oc_erchef=12.11.1+20161118001025
DEBUG: content-encoding: gzip
DEBUG: ---- End HTTP Status/Header Data ----
DEBUG: Chef::HTTP calling Chef::HTTP::ValidateContentLength#handle_response
DEBUG: HTTP server did not include a Content-Length header in response, cannot identify truncated downloads.
DEBUG: Chef::HTTP calling Chef::HTTP::RemoteRequestID#handle_response
DEBUG: Chef::HTTP calling Chef::HTTP::Authenticator#handle_response
DEBUG: Chef::HTTP calling Chef::HTTP::Decompressor#handle_response
DEBUG: Decompressing gzip response
DEBUG: Chef::HTTP calling Chef::HTTP::CookieManager#handle_response
DEBUG: Chef::HTTP calling Chef::HTTP::JSONOutput#handle_response
DEBUG: Chef::HTTP calling Chef::HTTP::JSONInput#handle_response
DEBUG: Using node attribute 'fqdn' as the ssh target
DEBUG: Using node attribute 'fqdn' as the ssh target

I’ve tried this on several different hosts (Mac OS X 10.12 and CentOS 6.8) with the same results. I originally tried this with chef-client/knife 12.4.3, and I got the same results after upgrading to chef-client/knife 12.17.44.

I’ve also tried limiting things with concurrency (“knife ssh -C 100…”) and I still get 996 lines of output.

Is there a config switch somewhere to be able to “knife ssh” into more than what seems like 1000 hosts?

I’m not seeing any obvious errors in my chef server logs, but I could be missing something.

My first guess is that you may be hitting a system limit somewhere. Each SSH connection will be a separate process, and each will need one ephemeral port. I don’t know the internal implementation of knife ssh, but it may actually start a separate thread for each SSH connection, and it probably needs three file descriptors to communicate stdin/stdout/stderr.

What does ulimit -a report?

Kevin Keane
Whom the IT Pros Call
The NetTech
http://www.4nettech.com
Our values: Privacy, Liberty, Justice
See https://www.4nettech.com/corp/the-nettech-values.html

Yeah, I thought of that as well. On OS X, ulimit reports:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 524288
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1064
virtual memory          (kbytes, -v) unlimited

And on CentOS it reports:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 46622
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I had to bump up my limits under OS X (which proved to be more complicated than it should have been, but whatever). When I hit the open file limit, I get very obvious error messages, which are gone now that I’ve raised the limit. So I don’t think it’s a ulimit issue.

I found this old ticket: http://tickets.chef.io/browse/CHEF-5204, where you can see in one of the comments that, when using the “-VV” option, there was a default limit of “rows=1000” in the URL. I don’t see that in the debug output of my knife ssh command, but I’m wondering if it’s perhaps a default that can be overridden with the right switch.
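
If it helps with checking, here’s a rough sketch (run with “knife exec”; it assumes the standard rows/start/total fields in the Chef server search API response that the ticket refers to) that asks for far more rows than the suspected cap in a single request, so you can see whether the server stops short of the total:

require 'chef/config'
require 'chef/server_api'

# One raw search request, asking for far more rows than the suspected 1000-row cap.
rest = Chef::ServerAPI.new(Chef::Config[:chef_server_url])
resp = rest.get('search/node?q=*:*&start=0&rows=5000')

# If a row limit is being enforced, "rows returned" will stop short of "total".
puts "rows returned:   #{resp['rows'].length}"
puts "total on server: #{resp['total']}"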

Just a wild guess, but I notice that in both ulimits, the max user processes is only a tad above 1000.

Interestingly, on my own CentOS system, max user processes is far higher (30790).

Kevin Keane

Nice idea, but not it. I raised my process limit:

[jrhodes@bos-devops1 chef-repo]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 46622
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16384
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

And I’m still getting exactly 996 lines of results back, regardless of whether I filter some hosts out or not.

Ulimit is a red herring; knife ssh doesn’t use a subprocess for SSH, it uses the native Ruby implementation. More likely this is because it isn’t handling search pagination correctly. As mentioned, the default page size is 1000, and the 4 missing from that probably don’t have an FQDN or IP address.
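
In other words, a caller that wants everything past the first page has to either pass a block (in which case Chef::Search::Query pages for it) or advance the start offset itself. A rough sketch of the manual loop, assuming the Chef 12 behaviour where the non-block form returns the rows along with the start offset and total:

require 'chef/search/query'

# Manual pagination: keep advancing the start offset until 'total' rows have been seen.
query = Chef::Search::Query.new
names = []
start = 0
loop do
  rows, _offset, total = query.search(:node, '*:*', start: start, rows: 1000)
  break if rows.empty?
  names.concat(rows.map(&:name))
  start += rows.length
  break if start >= total
end
puts "nodes found: #{names.length}"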

This probably doesn’t come up a whole lot because 1000 servers is well past the point where SSH stops making sense as a comms medium. The crypto overhead just makes it take way longer than you probably want to wait in most cases. Tools like mcollective or salt stack provide non-SSH comms systems that have much less overhead.

Knife ssh, if I recall correctly, uses the Ruby net/ssh/multi library – which, while it reads the OpenSSH config files, doesn’t actually fork an external ssh process for each connection. I believe it just allocates a new thread.
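
For reference, the basic net/ssh/multi pattern looks roughly like this (a minimal sketch, not knife’s actual code; the hostnames are made up):

require 'net/ssh/multi'

# Hypothetical host list; knife builds this from the Chef search results instead.
hosts = %w[web1.example.com web2.example.com web3.example.com]

Net::SSH::Multi.start(concurrent_connections: 100) do |session|
  hosts.each { |h| session.use("root@#{h}") }  # one connection per host
  session.exec('hostname')                     # output streams back prefixed per host
  session.loop                                 # drive the event loop until all commands finish
end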

That said, I’ve found that setting the concurrency with knife ssh is a good idea… opening a lot of connections all at once actually slows things down, especially once you start adding in nodes that may not respond (decommissioned at your provider but not removed from Chef, powered off, or locked up), which hangs things up until whatever timeout is set expires (and last time I looked, that didn’t seem to be configurable without modifying your ~/.ssh/config). And if you’re SSHing to hosts to kick off a chef-client run, or otherwise access some centralized resource, you don’t want them all hammering that at once, either.
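
If ~/.ssh/config really is the only knob for that timeout, the standard OpenSSH ConnectTimeout option would be the place to set it (whether the net/ssh version knife uses honours it is worth testing), e.g.:

# ~/.ssh/config (illustrative)
Host *
    ConnectTimeout 10   # give up on unreachable hosts after 10 seconds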

Changing the concurrency limit doesn’t make a difference in this case. No matter what, I always get 996 lines of response back.

coderanger: In my case, I’m kinda stuck with SSH, and “knife ssh” makes the task very simple for me. What I’m hoping to find out is whether there’s a way to raise or adjust that 1,000-row limit. I agree that seems like what I’m hitting, even though I don’t see it in the debug output (-VV).

Do you have to do everything all at once, though, or can you do say one environment at a time, or some iteration that’s less than 1000? Or are you using a different query in reality vs the example given and still bumping against the 1000 limit?

You could also use net/ssh/multi and the Chef API yourself. If it’s going to be a repeatable task, that’s probably more ideal for you anyway. I have an old script that I just pasted up here: https://gist.github.com/stormerider/0a76923a38e5f8e83ac9d277ac7e8644 – it could probably be cleaned up a bit at this point; I think partial searching has been pulled into the API since then, and things like that. It automatically uses knife winrm if the node is Windows, otherwise it uses net/ssh/multi (I just didn’t feel like spending the time figuring out the WinRM Ruby code myself). The script was designed to find nodes that hadn’t checked in to the Chef server in a given amount of time and to force a Chef run, mostly so that I could see what kind of errors came back: host was down, couldn’t log in, or the chef-client converge failed. I would not recommend using it as-is on a prod setup, but it can serve as a starting point or an example of what can be done. I haven’t used it in a few years, but still had it kicking around.

Well, it seems that I’m going to have to split things into sub-1K runs or I risk missing hosts. My preference is to find a way to continue using “knife ssh”, just because our entire organization is already comfortable with it.

I know I can work around the limitation. The question is whether there’s a way to remove or change what seems like a built-in limitation.

I’ll take a look at that script you’ve posted. Perhaps going our own way has some value. One thing I’ve noticed is that knife seems to retrieve the entire node object from the server and then parses all of that data to build its list of SSH targets. With a large number of hosts, that makes the ruby process swell to around 4 GB. It seems like that script could be updated to search with a filter, as in:

https://docs.chef.io/chef_search.html#filter-search-results

That would certainly trim down the time and resources needed on the workstation to build the initial list.
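
Something along these lines is what I have in mind: a rough sketch assuming Chef 12’s filter_result support in Chef::Search::Query (the attribute names in the filter are just the obvious candidates, not necessarily what knife itself uses), which pulls back only the requested fields instead of whole node objects:

require 'chef/search/query'

# Ask the server for just the attributes needed to build an SSH target list.
query  = Chef::Search::Query.new
filter = { 'name' => ['name'], 'fqdn' => ['fqdn'], 'ip' => ['ipaddress'] }

targets = []
query.search(:node, '*:*', filter_result: filter) do |row|
  # Each row comes back as a plain hash containing only the filtered fields.
  targets << (row['fqdn'] || row['ip'] || row['name'])
end
puts "#{targets.length} ssh targets"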

Or just use a tool built for mass management… there are a bunch. Speaking for the one I know, Rundeck is quite lightweight, and there’s the chef_rundeck gem to retrieve nodes from the Chef server itself and then filter by environment/recipe/role/tag. This also lets you keep a record of who did what and when, and lets you schedule things.

Knife ssh is not intended for this usage, and I don’t think pushing it in this direction is a good idea at all.

This is indeed a bug in Chef; as @coderanger notes, it’s not paginating correctly.
https://github.com/chef/chef/pull/5744 is the fix; hopefully it’ll be in next month’s release. I also switched to partial search.
-Thom

Outstanding news on both counts! Thanks.

I’d just like to follow up that I did some testing today with Chef client 12.19.36. I was able to ‘knife ssh’ into more than 1,000 hosts and the memory footprint of the ruby process stayed below 200 MB. As a result, the operation was also much faster.

Thanks @thommay for the fix!