[RESOLVED] Timeouts and delays contacting Chef Server

Hi,

I am trying to learn Chef by setting up a test installation in my Hyper-V lab. Thus far, I have been able to install Chef Server on one VM that is running Ubuntu 16.04. I also installed chef-manage and push-jobs-server on this same Ubuntu server. On another VM loaded with Windows Server 2012 R2, I installed ChefDK and configured it to talk to the Chef server. I have a knife.rb file and a user.pem file, and I was able to fetch the server's certificate by running ‘knife ssl fetch’.

So at this point, client and server are able to talk to each other. The problem I am having is that routine commands take a loooong time to run. From the ChefDK console I run the command ‘knife cookbook list’, for instance… After many minutes and often one or more timeout messages, it will work. However, I am talking easily 5-10 minutes.

Here is a sample output from trying to list cookbooks:

PS C:\Users\administrator.COATELAB\chef> knife cookbook list
ERROR: Timeout connecting to https://server6/organizations/coatelab/cookbooks?num_versions=1, retry 1/5
ERROR: Timeout connecting to https://server6/organizations/coatelab/cookbooks?num_versions=1, retry 2/5
learn_chef_iis 0.2.0

The last line is what I expected. Only one cookbook has been uploaded so far. So obviously, it is able to make a connection eventually. So most likely it is trying to connect with some kind of protocol that does not work, before it fails over to something that does work. I am thinking it is either name resolution (NetBIOS name vs FQDN) or a problem with the certificates between client and server.

Ideas anyone?

-Dave

I run a test server using VMware Workstation: CentOS 7.2 with 4 cores and 4GB RAM on a USB3 SSD (lower than recommended, I believe), with Manage, push jobs and reporting installed. Although it isn’t what I would call lightning quick, it only takes seconds to react.

Now, as the VM is running on a Hyper-V platform, I would be looking at this as the culprit, or more to the point, the network card/NIC drivers. Make sure that after installing you remove the NIC and then add a ‘synthetic NIC’ within Hyper-V. Ubuntu has the correct drivers for it, and as far as I know it will use paravirtualization that way. On Windows guest OSes you’ll really want the synthetic NIC as well.

The default NIC will get you going straight out of the box with most OSes (including things like PXE boot), but as it is designed for compatibility, generally speaking you lose performance.

This Microsoft page is worth a review. It discusses the Linux Integration Services (LIS) for Hyper-V and the support for each OS. Down the bottom of the page there are links to the specific operating systems and their support tables within Hyper-V. Ubuntu info is found here.

So, as you have Ubuntu 16.04 installed, LIS is also installed in the default image and should work with the Synthetic Adapter.

Interested to know if this resolves your issue.

How much memory did you give your Chef server? I find that it can be very hungry even for small installations. I would suggest at least 4 GB.

The second thing to look at is networking. With timeouts like that, DNS is a prime suspect. To test that, temporarily add the chef server’s FQDN to the hosts file on your client (/etc/hosts on Linux, C:\Windows\System32\drivers\etc\hosts on Windows).
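To check whether name resolution itself is the slow part, something like the following rough Python sketch could be run on the client. It only times how long the operating system's resolver takes for a given name. The function name time_resolution is made up for this illustration, and the names in the usage comment are the ones from this thread.

```python
import socket
import time


def time_resolution(name, port=443):
    """Return (seconds, addresses) for resolving `name` via the OS resolver.

    An empty address list means the lookup failed.
    """
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the IP address as a string.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    return time.perf_counter() - start, addresses


# Example usage (names taken from this thread):
#   for name in ("server6", "server6.coatelab.com"):
#       seconds, addresses = time_resolution(name)
#       print(name, round(seconds, 3), addresses)
```

If the short name takes many seconds and the FQDN comes back immediately (or the other way round), that points at name resolution rather than at Chef.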

Kevin Keane
Whom the IT Pros Call
The NetTech
http://www.4nettech.com
Our values: Privacy, Liberty, Justice
See https://www.4nettech.com/corp/the-nettech-values.html

Thanks for the replies!

@svucich, I am trying to understand how to use/specify synthetic NICs. My HyperV host is Server 2012 R2. When I go to add a new NIC, I do not see a choice when selecting NIC. Based on my reading so far today, the default NIC in 2012R2 is a synthetic NIC??

@kkeane, I had memory on the Ubuntu Chef server set to dynamic and it ramped up to 6GB. Just for fun I hard set it to use 8GB, but that did not seem to help.

I rather agree with the idea that this is likely a DNS issue. I added the server FQDN to the Hosts file as you suggested, but that did not seem to help. My background is as a Windows System Administrator. My HyperV lab has 5 Windows Server 2012R2 servers. One of them is a Domain Controller and runs DNS for the lab. I have several environments I am working on here. Until now they have all been Windows. For learning Chef, I added one more VM and loaded it with Ubuntu Server. For DNS on the Ubuntu/Chef server I simply added a line to /etc/network/interfaces for dns-nameservers and gave it the ip address of the Windows DNS server. I am a bit concerned that the Ubuntu server is not entirely aware that it is a part of a DNS domain/zone.

Does that make sense?

-Dave

OK, using my local machine (Windows 10) I only get a choice between 2 types of network adapter when I choose a Generation 1 virtual machine. There is no option to choose anything other than ‘Network Adapter’ when using a ‘Generation 2’ VM. On a Gen 1 virtual machine I can choose between ‘Legacy Network Adapter’ and ‘Network Adapter’. ‘Legacy Network Adapter’ has the most OS support, whereas ‘Network Adapter’ is the full synthetic adapter.

I still feel like the network or network interface is the issue. I can’t see it being DNS once you have set the hosts file, unless of course the Ubuntu machine can’t resolve its own FQDN. (On CentOS you run hostname from Bash to get the short host name and hostname -f to see what FQDN the machine thinks it has.) The FQDN must be set correctly when using Chef.
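For what it's worth, roughly the same check as hostname and hostname -f can be done from Python on either machine. This is only a sketch; hostname_report is a name invented here, and the "is qualified" test is simply "contains a dot".

```python
import socket


def hostname_report():
    """Report what this machine believes its short name and FQDN are."""
    short_name = socket.gethostname()
    fqdn = socket.getfqdn()
    return {
        "hostname": short_name,
        "fqdn": fqdn,
        # A name without any dot is not fully qualified.
        "fqdn_is_qualified": "." in fqdn,
    }


# Example usage:
#   print(hostname_report())
```

If fqdn_is_qualified comes back False on the Chef server, the machine does not know its own domain, which would be worth fixing before looking further.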

I am using Gen 2 VMs, so I think I am already using a Synthetic NIC.

Does the physical hardware of the host support VMQ? If not, then try turning this feature off.

Have a quick read of this article. This discusses the VMQ options. It is also detailed here in a spiceworks post.

My gut feeling is that this is still related to the network config, the adapter type, or the host’s physical NIC (drivers?) in some way, even if it is specific to Linux. Happy to be wrong, and I would love to know if you find a resolution.

I am near on the end of what I can help you with. I get the feeling that this isn’t a Chef specific issue.

The last few things I would be looking at are re-creating the VM using Ubuntu on Gen 1 virtual hardware (BIOS, NOT UEFI), or using CentOS 7.x (which I have working successfully in a VM environment). CentOS is, in my opinion, a better and more fully featured enterprise option when considering Linux as an operating system. Ubuntu is great, but it is an OS I prefer to use on the desktop, not on a server (but that is just my opinion).

I wonder – is it possible to profile the code while it’s running?

I ask because I’ve seen complaints on the testssl.sh project regarding how fast that code runs under LI on Windows, and I’m pretty sure that’s a problem with how dead-dog slow the bash emulation code is. I know that Chef client isn’t running in bash, but I do wonder if the problem is in the underlying emulation libraries.

I do not see VMQ as an option on any of the NICs in this lab. So I assume the physical hardware does not support it. I have been using this lab for other purposes for months now. It is not screamingly fast, but it seems to work just fine for these other purposes. I loaded Samba on the Linux box to allow for easy file transfer. (I mentioned before that I am mostly a Windows SA) I have been able to transfer large files from the windows VMs at reasonable speeds. It is only when I try to run the knife command from the windows server on which I have ChefDK installed that I have any problems.

I have a couple of ‘reality check’ questions to pose.

  1. Does Chef need to be connected to the Internet during normal operations? I saw an Internet connection was a requirement for some of the tutorials I am following, but I assumed that meant I would need for downloading software. I generally run my HyperV lab entirely within its host. The VMs are normally on an internal switch. I occasionally switch them to an external switch just long enough to download software and patches as needed.

  2. It seems like the only time I have a problem is when I run the knife command from a Windows server loaded with the ChefDK. The knife.rb file on this box uses the short/NetBIOS name for its chef_server_url (e.g. “https://server6/organizations/coatelab”). I keep thinking that it ought to use the FQDN (“https://server6.coatelab.com/organizations/coatelab”), but when I changed this manually in the file, it did not work at all. Can someone explain how this URL is generated and how I might change it for experimental purposes?

That URL problem sounds suspicious. If it really uses NetBIOS to resolve the name, that would explain the slowness. OTOH, it may also be the unqualified DNS name. Finally, it may also be using zeroconf/bonjour/mdns to resolve the name, which may also be slow.

First things first: are you using FQDNs that end with .local? I recently learned the hard way that this is not a good idea. Unfortunately, the mDNS RFCs recently “hijacked” this TLD, so there can be a conflict between a .local FQDN and zeroconf. That is especially true in the most recent versions of Windows that have mDNS built in (earlier versions required that you install Bonjour).

Can you ping that server using the FQDN? Does the FQDN resolve using nslookup? And conversely, using nbtstat, do you see the short name in the cache?
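To make the distinction between the three kinds of names concrete, here is a small Python sketch. The function classify_hostname and its return labels are made up for this illustration, and server6.coatelab.local in the usage comment is a hypothetical name, not one from the lab.

```python
def classify_hostname(name):
    """Classify a host name by how it is likely to be resolved.

    Returns one of:
      "unqualified" - a single label; may fall back to NetBIOS or search
                      suffixes, which can be slow
      "mdns-local"  - ends in .local, which is reserved for mDNS/zeroconf
      "fqdn"        - an ordinary fully qualified DNS name
    """
    labels = name.rstrip(".").lower().split(".")
    if len(labels) == 1:
        return "unqualified"
    if labels[-1] == "local":
        return "mdns-local"
    return "fqdn"


# Example usage:
#   classify_hostname("server6")                 # "unqualified"
#   classify_hostname("server6.coatelab.com")    # "fqdn"
#   classify_hostname("server6.coatelab.local")  # "mdns-local"
```

Only the "fqdn" case is one I would expect to resolve quickly and predictably through plain DNS.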

Kevin Keane

The FQDN of the chef server is literally ‘server6.coatelab.com’. I do not mind revealing the details of this lab because it is not normally connected to the Internet and contains no data of consequence. I have name resolution working now in the lab, but it is possible I may not have had everything configured when I installed/configured Chef Server. I brought up the Ubuntu server specifically to run chef server.

At this time, I can ping the chef server via FQDN and Short Name. The FQDN resolves with nslookup and the short name appears in NetBIOS cache.

Because my real goal is to learn Chef, I am mostly interested in what chef_server_url is normally set to in a typical knife.rb. I recently got access to Chef at my place of employment, and I notice that chef_server_url on my laptop is an FQDN that makes sense for the enterprise domain I am working in.

If I have to, I will blow away my current Chef installation and reinstall now that I am sure I have a more complete network configuration, but I think I would learn more if I understood how to change it in my current installation. Having researched it a bit, I seem to have found multiple conflicting solutions that involve editing chef-server.rb??

Does that make sense? Am I on the right track?

Fundamentally, the chef server is an HTTP(S) server. So the proper URL needs to meet these criteria:

  • It needs to resolve to an IP address using DNS (or the hosts file), not through mDNS or NetBIOS. It does not have to be the hostname, but can also be a CNAME. I find a CNAME to be a better idea because you can easily move it to a different physical host.
  • It needs to match the SSL certificate you are using (unless you use HTTP, of course, but that’s probably only acceptable in a lab environment).

Also, the FQDN should resolve both on the client and on the server. IIRC, the host name (not necessarily the CNAME) on the server should be listed in /etc/hosts and point to 127.0.0.1 and ::1 respectively.
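The two criteria above can be sketched in Python roughly as follows. This is an illustration only: host_from_url and name_matches are invented names, the certificate names have to be supplied by hand (for example, read off the certificate that knife ssl fetch saved), and the wildcard handling is a simplification of what real TLS libraries do.

```python
from urllib.parse import urlsplit


def host_from_url(url):
    """Extract the lower-cased host name from a chef_server_url value."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError("no host name in URL: %r" % url)
    return host


def name_matches(host, cert_names):
    """Simplified check that `host` is covered by the certificate names.

    Supports exact matches and a single leading wildcard label ("*.").
    """
    host = host.lower().rstrip(".")
    for pattern in cert_names:
        pattern = pattern.lower().rstrip(".")
        if pattern == host:
            return True
        # "*.example.com" covers exactly one extra label on the left.
        if pattern.startswith("*.") and "." in host:
            if host.split(".", 1)[1] == pattern[2:]:
                return True
    return False


# Example usage (URL from this thread; certificate names are assumed):
#   host = host_from_url("https://server6/organizations/coatelab")
#   name_matches(host, ["server6.coatelab.com"])  # False: short name
```

A short name in chef_server_url with an FQDN in the certificate (or the reverse) fails the second criterion even when both names resolve.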

You may also want to dig through log files on the server to see if you can find any indication of internal delays. There are a lot of TCP connections between the various components of a chef server. The client’s request kicks off a whole cascade of more internal TCP connections.

Kevin Keane

Hi,

I am going to try posing this again.

Over the last few days, I completely reinstalled my Ubuntu server, Chef-Server and Chef Manage. There are two relevant servers in my HyperV lab environment. Server4 is Windows 2012R2 with ChefDK installed on it. Server6 is Ubuntu 16.04 with Chef-Server 12.12 installed. There is a DNS server on Server1 (Windows), but I have largely bypassed that by placing entries in the hosts files on Server4 and Server6 to point at each other.

After I finished installing Chef-Server and Chef-Manage, I was able to use a browser on Server4 to connect to Chef-Manage on Server6. I used this to generate a Knife.rb file to be placed on Server4. This time, the knife.rb file uses the FQDN for Server6.coatelab.com. Subsequently from Server4 I was able to obtain a cert using “knife ssl fetch” and then verify using “knife ssl check”. Everything seems to be fine up to this point. However when I then try to use knife on Server4 to query the Chef server, I get timeouts and long delays.

Specifically, after setting this all up, I run “knife user list” from the ChefDK/PowerShell prompt and it takes many minutes (3 - 10) to return an answer. As output, I get one or more timeout messages before it does return the expected list of users (one user… my admin user). Here is the message:

ERROR: Timeout connecting to https://server6.coatelab.com/organizations/coatelab/users, retry 1/5

To be very clear, I can put “https://server6.coatelab.com/organizations/coatelab/users” into a browser on Server 4 and I get a very quick answer. (less than 5 sec) It is only when I try to use it in ChefDK/PowerShell that it takes up to 5 MINUTES and returns one or more timeout errors before surrendering the requested list.
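One way to narrow down where the time goes would be to time the individual connection phases separately from the same machine. The following Python sketch does that for DNS lookup, TCP connect and TLS handshake. All function names are made up for this illustration, and certificate verification is deliberately switched off because the lab server uses a self-signed certificate, so this is for diagnostics only.

```python
import socket
import ssl
import time


def time_dns(host, port=443):
    """Return (seconds, first addrinfo entry) for resolving host."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return time.perf_counter() - start, infos[0]


def time_tcp_connect(addrinfo, timeout=30.0):
    """Return (seconds, connected socket) for a plain TCP connect."""
    family, socktype, proto, _canonname, sockaddr = addrinfo
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    start = time.perf_counter()
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return time.perf_counter() - start, sock


def time_tls_handshake(sock, host):
    """Return (seconds, TLS socket). Verification is OFF: diagnostics only."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    start = time.perf_counter()
    try:
        tls_sock = context.wrap_socket(sock, server_hostname=host)
    except OSError:
        sock.close()
        raise
    return time.perf_counter() - start, tls_sock


def time_phases(host, port=443):
    """Time DNS, TCP connect and TLS handshake for host:port."""
    dns_s, info = time_dns(host, port)
    tcp_s, sock = time_tcp_connect(info)
    tls_s, tls_sock = time_tls_handshake(sock, host)
    tls_sock.close()
    return {"dns": dns_s, "tcp_connect": tcp_s, "tls_handshake": tls_s}


# Example usage (host name from this thread):
#   print(time_phases("server6.coatelab.com"))
```

If all three phases come back in well under a second, the network path is probably fine and the delay is more likely inside the client process itself.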

I know in the past there have been some issues with ChefDK on windows being slow, but if you’re using a current version of ChefDK that shouldn’t be the case. I would however suggest launching an Ubuntu VM and see if that’s able to operate ChefDK fine or if the trouble is isolated to that Server4 VM.

Wow! The solution to this problem was to increase the number of processors on the ChefDK (Windows Server 2012 R2) VM. Just increasing the processors from one to two decreased the time to run “knife user list” from multiple minutes (3 - 10 minutes) to about 7 seconds. This was all just to enumerate one user. I saw the same behavior when trying to list nodes and cookbooks, of which there were none.
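For anyone hitting the same thing, a quick way to see how many processors a VM actually exposes to the guest is the following Python sketch. The function name check_cpu_count is invented here, and the threshold of two is only what happened to fix it in this lab, not a documented Chef requirement.

```python
import os


def check_cpu_count(minimum=2):
    """Return (logical_cpu_count, meets_minimum) for this machine."""
    # os.cpu_count() can return None if the count cannot be determined.
    count = os.cpu_count() or 1
    return count, count >= minimum


# Example usage:
#   count, ok = check_cpu_count()
#   print("logical CPUs:", count, "OK" if ok else "consider adding a vCPU")
```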