What kind of load can Chef Server support?


#1

Hi all,

We’re firing up the Open Source Chef Server hosted on our own network.

Initially, we’ll have 100-200 nodes. Potentially that could ramp up to
several thousand.

Initially, frequency of deployment could be high as our software hasn’t
been given the TLC it deserves. We’re a startup and the pressure is on
getting something out to secure the next round of funding. So reliance on
Chef will likely start high as we need to deploy fixes for busted software,
and then potentially flatten as we’re given more latitude to bake in some
quality as we go once cash flow becomes a little more predictable.

Our recipe sizes aren’t particularly large and currently all nodes have the
same roles. Just a deploy-revision from Github and then a bit of
templating/file management finishing with an Apache and Tomcat restart.

I guess my question is, if we were to flick the switch on a rollout to 200
nodes at the same time, would Chef Server cope with that level of
concurrency? If we have our nodes running as daemons, we can stagger deploys
with a splay time, I guess.

Has anyone had any experience with loading Chef Server with a large number
of nodes?

Cheers

Ben


#2

Do you really need 200 nodes to communicate with the Chef server at the same
time? You can run chef-client as an hourly cron job and use the last octet
of each server’s IP, mod 60, as the minute value in your crontab. Our Chef
server handles a similar load with 2GB RAM / 1 vCPU, with
CouchDB/RabbitMQ/Solr all running on the same box. You need to do regular
CouchDB compaction to keep disk usage low. We had a few pain points with
Solr, but never any real scalability issues.
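As a sketch of that stagger trick (plain Ruby, not Chef; the IP addresses are made-up examples): derive the crontab minute from the last octet of the node’s IP so clients spread themselves across the hour without coordinating:

```ruby
# Take the last octet of the node's IPv4 address, mod 60,
# and use it as the minute field of an hourly cron entry.
def cron_minute_for(ip)
  ip.split(".").last.to_i % 60
end

minute = cron_minute_for("10.0.1.73")
puts "#{minute} * * * * chef-client"   # => "13 * * * * chef-client"
```

Because the last octets of machines on the same subnet are usually spread out, so are their chef runs.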

For more, see:
http://wiki.opscode.com/display/chef/Scalability+and+High+Availability

regards
ranjib

On Sat, Jan 28, 2012 at 7:57 AM, mailbox@bensullivan.net wrote:

[snip]


#3

On Jan 27, 2012, at 9:27 PM, mailbox@bensullivan.net wrote:

Has anyone had any experience with loading Chef Server with a large number
of nodes?

It depends on what the recipes are doing. We can have several dozen nodes of one type hitting chef-server at the same time and everything is fine. With other types, however, we can kill the server with just a few nodes. If you have a large number of nodes, a full node search – search(:node, “*:*”) – is a good way to kill chef-server. When we were running chef-server under Unicorn, it took x nodes running that search to kill chef-server, where x was the number of Unicorn processes. No other nodes could run chef. Our problem was that we regularly triggered chef to run on those nodes at the same time. We tried a larger splay, etc.

We switched to Rainbows, and when this situation happens now, other nodes can still run chef, just very slowly. Ultimately, we figured out a way to remove the full node search from those recipes. Yes, I know trimming ohai data would have helped, but we want/need that data, and I think it would have just delayed the eventual problem.
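A hypothetical sketch of the kind of change Brian describes, for inside a recipe (the role name, environment, and attribute are made up for illustration, not taken from this thread):

```ruby
# Instead of a full node search, which pulls every node object
# back from the server:
#   all_nodes = search(:node, "*:*")
#
# scope the query to the nodes the recipe actually cares about:
web_nodes = search(:node, "role:web AND chef_environment:production")

# and keep only the attribute you need, so the run isn't holding
# hundreds of full node objects in memory:
web_ips = web_nodes.map { |n| n["ipaddress"] }
```

This is recipe DSL, so it only runs inside a chef-client run against a server; the point is simply that a narrower query moves less data through Solr and the API.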

We still have issues with chef-server getting slow. It generally is caused by certain things happening in recipes (or via knife). We run chef-server, solr, and couch on a fairly “wimpy” box - 4 cores, 8GB RAM. This was our “proof-of-concept” hardware that just got rolled into production. It’s been on our list for weeks to migrate to a larger box. Cobbler’s children and all that.

FWIW, with Rainbows it seems each individual chef run is slightly slower (17 seconds vs. 15 seconds, for example), but we can run several more concurrently before noticing any slowdown. We bumped up the Unicorn process count, but memory usage became an issue. We front chef-server with nginx.

Looking at the chef-server api, it seems it would be fairly easy to write a version in Lua embedded in nginx. But I guess that’s not as cool as erlang :wink:

–Brian


#4

On Fri, Jan 27, 2012 at 9:27 PM, mailbox@bensullivan.net wrote:

I guess my question is, if we were to flick the switch on a rollout to 200
nodes at the same time, would Chef Server cope with that level of
concurrency? If we have our nodes running as daemons, we can stagger deploys
with a splay time, I guess.

I can’t give you an easy answer, but I’ll give you lots to think about.

One of the core architectural designs of Chef from the start was to
make it scalable. Contrary to other configuration management software,
the bulk of the processing is done on the client; recipes and
templates are compiled and converged at the edge. With Service
Oriented Architecture in mind, the server is split up into multiple
servers with an API between them. Opscode built a hosted platform off
of this code base, although we’ve done a lot of work there that we haven’t
gotten around to merging back into the community version yet because it is
divergent; but you can go a long way before you need any of that.

Of course performance will be affected by basic things like the size
of your hardware. But it also depends a lot on what you’re doing in
your recipes. Think about the book “The Goal” [1] and the “Theory of
Constraints.” [2]

If you’re putting a lot of big files in your cookbooks, like a tarball
of the software you want to deploy, you’re likely going to block on a
deploy on all the systems downloading this file. Someone already
mentioned running out of unicorns. It takes a while to move that data,
which is going to generate contention.
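One common way around that contention (not from this thread; a hedged sketch, assuming you have some plain file server available) is to serve large tarballs over HTTP instead of shipping them as cookbook files, so the Chef API workers aren’t the ones streaming the bytes. The host and paths below are placeholders:

```ruby
# Fetch the artifact from a dedicated file server rather than from
# the Chef server; "artifacts.example.com" is a made-up host.
remote_file "/opt/app/app-1.2.3.tar.gz" do
  source "http://artifacts.example.com/app-1.2.3.tar.gz"
  mode "0644"
  # Supplying a checksum lets Chef skip the download when the
  # file is already in place.
end
```

The download still takes time, but it hits a static file server that is cheap to scale, instead of tying up a unicorn per node.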

If you’re using search, remember that you’re going to receive full
node objects from the search. (We need to work on that.) If you’re
doing a lot of search, you may have to wait for the solr server to
return results.

So where is the bottleneck?

Disk access? The database? Move it to another server. Put it on some fast disks.

Search? Move the solr server off the chef server. Chef was designed so
you could do this. Is there a system resource that should be
increased? We’ve used solr read-slaves in the past, but it was kind of
a band-aid for a design issue that we’ve since fixed. But you can
certainly use read-slaves if you’re search-heavy and don’t need
instantaneous results. (Which you’re not going to have anyway, because
you have to wait for new nodes to be indexed. Discuss and argue the
CAP theorem here.)

Chef server connections? Standard tricks for Ruby applications usually
apply, start there. You can run more than one Chef server too.

Some tricks, perhaps obvious, have already been mentioned. If you’re
daemonizing chef-client, use the splay option so the clients don’t
all run at once. If you’re using another method, there are similar
options. If you need to run all the systems at once, try to build the
systems out beforehand and save the steps that require
synchronization for later.
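As a sketch, the daemon interval and splay live in the client’s config file; the values below are illustrative, not recommendations. Splay adds a random 0–N seconds on top of each interval so runs don’t cluster:

```ruby
# /etc/chef/client.rb
interval 1800   # run chef-client every 30 minutes...
splay    300    # ...plus a random 0-300 second offset per run
```

With a daemonized client, each run then lands somewhere in a five-minute window rather than at the exact same second on every node.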

Until you get pretty big, scaling the Chef server is easy if you’re
familiar with scaling services in general. There is, as we say, no
magical unicorn. You still need system administrators. You still have
to do the work. If you hit a real wall then Opscode can help, we’ve
already hit it. :slight_smile:

Bryan

[1] http://en.wikipedia.org/wiki/The_Goal_(novel)
[2] http://en.wikipedia.org/wiki/Theory_of_Constraints


#5

On 28 January 2012 16:13, Bryan McLellan btm@loftninjas.org wrote:

[snip]

Until you get pretty big, scaling the Chef server is easy if you’re
familiar with scaling services in general. There is, as we say, no
magical unicorn. You still need system administrators. You still have
to do the work. If you hit a real wall then Opscode can help, we’ve
already hit it. :slight_smile:

As noted, the complexity of your recipes directly impacts the number of
clients you can host on your setup.

Beyond that, ‘number of convergences per hour’ becomes the useful
measure, and remember: relaxing the interval works for many people
and translates directly into significantly reduced load.

With a reasonably powerful system and a little tuning, getting to
5000+ convergences an hour should be trivial. If you split out the
components to put CouchDB and Solr onto dedicated hardware and then
run multiple API nodes behind a load balancer, you can achieve
significantly more, but it is more of an investment to get that spun
up.
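To put that 5000+ figure in context, a quick back-of-the-envelope calculation (the node counts and intervals here are illustrative, not from the thread):

```ruby
# Each node converges 3600 / interval_seconds times per hour,
# so the server sees nodes * (3600 / interval_seconds) runs/hour.
def convergences_per_hour(nodes, interval_seconds)
  nodes * 3600 / interval_seconds
end

# 200 nodes on a 30-minute interval:
puts convergences_per_hour(200, 1800)    # => 400
# 2500 nodes at the same interval reaches the 5000/hour mark:
puts convergences_per_hour(2500, 1800)   # => 5000
```

Which is the point about relaxing the interval: doubling it halves the load for the same fleet size.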