So yeah, I think there are some knobs that need to be tuned.
This isn’t my area of expertise so I’ll poke one of the other Chef devs
who knows more to weigh in, but I’m curious if you can provide more info on
your setup. I’d like to know what your Chef 10 setup looked like, just for
comparison to Chef 11. While your Chef 11 setup seems to be hurting, what
was the baseline hardware/number of boxes you had with Chef 10? Also, for
complete clarity: this is open source Chef, right? I just want to make
sure we’re all on the same page.
On to actually trying to solve your problem. Having everything on a single
box might be causing some of the backup. While Chef 11 is much more
performant than Chef 10, if you’re throwing 5000 nodes at it with everything
on a single box, that might hurt some. Typically we haven’t seen much need
to tune postgres. You might need to look at upping the connection count on
postgres, but as far as I know that is usually the only tuning that is done.
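If you do end up raising the connection count, on an omnibus-style Chef 11 server that kind of change usually lives in the server config file. A rough sketch is below; both key names are my assumption based on the omnibus Chef server attributes, so verify them against your installed defaults before using:

```ruby
# /etc/chef-server/chef-server.rb -- hypothetical tuning sketch.
# Key names are assumptions from the omnibus Chef 11 server attributes;
# confirm them against your install before applying.
postgresql['max_connections'] = 350  # raise the Postgres connection cap
erchef['db_pool_size']        = 40   # Erchef's pool must stay under that cap
```

Then run `chef-server-ctl reconfigure` to apply the change.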
I’m not aware of much Rabbit tuning that typically happens either, but
Solr, which sits on the other end of RabbitMQ, might need some tuning. Out of
the box it has some fairly vanilla settings and so you might see
improvements if you look there.
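The usual first Solr knob is JVM heap. A hedged sketch, again assuming the omnibus chef-server.rb key name (verify it against your install before applying):

```ruby
# /etc/chef-server/chef-server.rb -- hypothetical Solr sizing.
# The key name is an assumption from the omnibus Chef 11 server
# attributes; the stock heap is fairly modest for thousands of nodes.
chef_solr['heap_size'] = 1024  # JVM heap for Solr, in MB
```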
What Jeff said is valid. Cutting down on the node data sent frees up not only
network bandwidth but also what Solr has to ingest.
Could you possibly do some more monitoring on the box and try to figure
out where the bottleneck is? That would certainly make it easier to give targeted advice.
In the meantime I’ll ask one of the other engineers to weigh in. I’ll also
follow up and see if we can’t get a doc page on ways to tune Chef, as that
seems like it could prove helpful.
One of the first things that comes to mind, having nothing else
to offer aside from “start finding the bottleneck”, is reduction
of node data saved with every Chef run. That might help.
“Allows you to provide a whitelist of node attributes to save on the
server. All of the attributes are still available throughout the chef run,
but only those specifically listed will be saved to the server.”
Ohai’s full output on the CentOS 6.4 box I just tested on
returns 28KB(!) of data, 99% of which I have never wanted
to query the server for yet. So you could find at least some
I/O gain by whitelisting most of it based on your needs. If
your needs change, change the whitelist.
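A hedged example of what that looks like in client.rb — the setting name and slash-path syntax follow the Chef attribute-whitelist docs, and the specific attribute paths here are purely illustrative, so keep whatever your own searches actually use:

```ruby
# client.rb -- example attribute whitelist; the paths listed are
# illustrative assumptions, not a recommended set.
automatic_attribute_whitelist [
  "fqdn",
  "ipaddress",
  "platform",
  "platform_version",
  "network/interfaces",  # nested attributes use slash-separated paths
]
```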
August 31, 2013 4:30 AM
I am wondering about any guidelines on ‘right sizing’ a Chef 11 server. I
understand things like ‘your mileage may vary’, but with a popular
community-supported product that also has a commercial edition, there
are usually at least basic guidelines.
My situation is that we have approximately 2500 nodes distributed across
data centers. We have about 350ms round trip to the worst case data center.
What we did was turn up a single instance of Chef 11 with 8 CPUs and 32GB
of RAM. The guy before me went to all the Chef conferences, and I guess he must have
drunk the Kool-Aid, because we migrated from Chef 10, added a few hundred nodes,
and Chef 11 tipped over with ‘500’ errors.
In all fairness, our original setup had all nodes converge with a
5-minute splay on a standard 30-minute cycle. Meanwhile, our
expectation was that Chef 11 performs better. We also moved everything (no
Couch, etc.) onto a single server.
The workaround we applied, for now, was to increase the splay time to 30 minutes
within the 30-minute schedule.
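The arithmetic behind that workaround is worth spelling out. With the numbers in this post (roughly 2500 nodes on a 30-minute cycle), a 5-minute splay bunches all the runs into the first sixth of each cycle, so the peak arrival rate is about six times what a full-interval splay gives:

```ruby
# Back-of-envelope load estimate using the numbers from this thread.
nodes    = 2500
interval = 1800.0  # 30-minute cycle, in seconds

avg_rate        = nodes / interval  # average client runs per second
peak_5min_splay = nodes / 300.0     # starts bunched into a 5-minute window
peak_full_splay = nodes / 1800.0    # splay equal to the interval

printf("avg %.2f/s, 5-min splay peak %.2f/s, full splay peak %.2f/s\n",
       avg_rate, peak_5min_splay, peak_full_splay)
```

Setting `splay` equal to `interval` in client.rb, as described above, flattens a burst of about 8.3 runs/sec down to about 1.4 runs/sec.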
My impression is that we just installed Chef 11 and did not spend any time
tuning the right knobs. I have seen some posts where Postgres and such is
supposed to auto-size itself, but apparently that auto-sizing happens only
at install, and re-sizing does not work?
Sorry for lengthy post, sometimes context helps. Questions are:
For a use case of:
- 5 data centers
- approximately 2500 nodes total
- expected latency around 300ms

are there any knobs, dials, or other things that should be tuned to ensure
a single Chef 11 instance can handle that? Postgres and RabbitMQ tuning jump
to the forefront for me.
Thanks in advance for the help; this is my first post to the community, and I
generally like your product.