On Fri, Jan 27, 2012 at 9:27 PM, email@example.com wrote:
I guess my question is, if we were to flick the switch on a rollout to 200
nodes at the same time, would Chef server cope with that level of
concurrency? If we have our nodes running as daemons, we can stagger deploys
with a splay time I guess.
I can’t give you an easy answer, but I’ll give you lots to think about.
One of the core architectural designs of Chef from the start was to
make it scalable. Contrary to other configuration management software,
the bulk of the processing is done on the client; recipes and
templates are compiled and converged at the edge. With Service
Oriented Architecture in mind, the server is split up into multiple
servers with an API between them. Opscode built a hosted platform off
of this code base, although we’ve done a lot of work we haven’t gotten
around to merging back to the community yet because it is divergent;
but you can go a long way before you need this.
Of course performance will be affected by basic things like the size
of your hardware. But it also depends a lot on what you’re doing in
your recipes. Think about the book “The Goal” and the “Theory of Constraints.”
If you’re putting a lot of big files in your cookbooks, like a tarball
of the software you want to deploy, you’re likely going to block on a
deploy on all the systems downloading this file. Someone already
mentioned running out of unicorns. It takes a while to move that data,
which is going to generate contention.
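To see why, here’s a quick back-of-envelope calculation; all the numbers are assumed for illustration, not measurements:

```ruby
# Assumed numbers: 200 nodes each pulling a 100 MB tarball from a
# Chef server with a single 1 Gbit/s network link.
nodes       = 200
tarball_mb  = 100
link_mbit_s = 1000

total_mbit = nodes * tarball_mb * 8        # 160,000 Mbit to move
seconds    = total_mbit / link_mbit_s      # 160 seconds of pure transfer

puts "~#{seconds / 60.0} minutes of saturated link, ignoring disk and HTTP overhead"
```

Nearly three minutes of a fully saturated link, best case, before you even count unicorn workers tied up serving those downloads.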
If you’re using search, remember that you’re going to receive full
node objects from the search. (We need to work on that.) If you’re
doing a lot of search, you may have to wait for the solr server to respond.
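One way to blunt the cost on the client side is to throw away the bulk of each result immediately. This is a sketch, not from Chef’s API; `results` here stands in for what a search call would hand back:

```ruby
# Simulated search results; in Chef each entry would be a full node
# object, potentially hundreds of attributes deep.
results = [
  { 'name' => 'web1', 'ipaddress' => '10.0.0.1' },
  { 'name' => 'web2', 'ipaddress' => '10.0.0.2' },
]

# Pull out only the attribute you actually need, instead of holding
# the whole objects in memory for the rest of the run.
web_ips = results.map { |node| node['ipaddress'] }
# web_ips => ["10.0.0.1", "10.0.0.2"]
```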
So where is the bottleneck?
Disk access? The database? Move it to another server. Put it on some fast disks.
Search? Move the solr server off the chef server. Chef was designed so
you could do this. Is there a system resource that should be
increased? We’ve used solr read-slaves in the past, but it was kind of
a band-aid for a design issue that we’ve since fixed. But you can
certainly use read-slaves if you’re search heavy and don’t need
instantaneous results. (Which you’re not going to have anyway, because
you have to wait for new nodes to be indexed. Discuss and argue the
CAP theorem here.)
Chef server connections? Standard tricks for Ruby applications usually
apply; start there. You can run more than one Chef server, too.
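For instance, if the server is fronted by unicorn, the worker count is one of the first knobs to turn. The values here are illustrative, not a recommendation:

```ruby
# Hypothetical unicorn.rb fragment. More workers handle more
# concurrent client connections, at the cost of roughly one Ruby
# process worth of memory each.
worker_processes 8
timeout 120
```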
Some tricks, perhaps obvious, have already been mentioned. If you’re
daemonizing the Chef client, use the splay option so the clients don’t
all run at once. If you’re using another method, there are similar
options. If you need to run all the systems at once, try to build the
systems out beforehand and save the steps that require
synchronization for later.
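What splay does is simple enough to sketch; the interval and splay values below are assumed, and the real options live in the client configuration:

```ruby
# Each daemonized client waits its interval plus a random 0..splay
# offset, so 200 clients spread their runs out instead of hitting
# the server at the same moment.
interval = 1800                    # run every 30 minutes
splay    = 300                     # random extra delay, up to 5 minutes
delay    = interval + rand(splay)  # somewhere in 1800..2099 seconds
```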
Until you get pretty big, scaling the Chef server is easy if you’re
familiar with scaling services in general. There is, as we say, no
magical unicorn. You still need system administrators. You still have
to do the work. If you hit a real wall, then Opscode can help; we’ve
already hit it.