Right sizing Chef11 Server


#1

Hi All,
I am wondering about any guidelines on ‘right sizing’ a Chef11 server. I
understand that your mileage may vary, but a popular community-supported
product that also has a commercial edition usually comes with at least basic
guidelines.

My situation is that we have approximately 2500 nodes distributed across four
data centers. We have about 350ms round trip to the worst case data center.

What we did was turn up a single instance of Chef11 with 8 CPUs and 32GB RAM.
The guy before me went to all the Chef conferences, and I guess he must have
drunk the Kool-Aid, because we migrated from Chef10, added a few hundred nodes,
and Chef11 tipped over with ‘500’.

In all fairness, our original setup was set to have all nodes converge within a
5-minute splay time with a standard 30 minute cycle time. Meanwhile, our
expectation was that Chef11 performs better. We also moved everything (no more
Couch, etc) - onto a single server.

Our workaround for now was to increase the splay time to 30 minutes within the
30-minute schedule.
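
For a rough sense of why widening the splay helps, here is a back-of-the-envelope sketch in Ruby. It assumes run starts are spread uniformly across the splay window, which is an idealization, and `run_starts_per_second` is a hypothetical helper, not part of Chef.

```ruby
# Average number of chef-client runs starting per second, assuming run
# starts are spread uniformly across the splay window (an idealization).
def run_starts_per_second(nodes, splay_seconds)
  nodes.to_f / splay_seconds
end

# Node counts and splay windows taken from the numbers in this post.
[[2500, 5 * 60], [2500, 30 * 60], [5000, 30 * 60]].each do |nodes, splay|
  rate = run_starts_per_second(nodes, splay)
  puts format('%d nodes over %2d min: %.2f run starts/sec', nodes, splay / 60, rate)
end
```

On the client side, the two settings involved are `interval` and `splay` in client.rb, both given in seconds.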

My impression is that we just installed Chef11 and did not spend any time
tuning the right knobs? I have seen some posts where Postgres and such is
supposed to auto-size itself, but apparently that is based only on installation
and re-sizing does not work?

Sorry for lengthy post, sometimes context helps. Questions are:

For a use case of:
5000 nodes
5 data centers
expected latency being around 300ms

Are there any knobs, dials, or other things that should be tuned to ensure that
a single Chef11 instance can handle that? pg-sql, rabbit tuning jump to the
forefront for me.

Thanks in advance for any help; this is my first post to the community, and I
generally like your product.

  • Michael deMan

#2

I am wondering about any guidelines on ‘right sizing’ a Chef11 server.

One of the first things that comes to mind, having nothing else
to offer aside from “start finding the bottleneck”, is reduction
of node data saved with every Chef run. That might help.

See: https://github.com/opscode/whitelist-node-attrs

“Allows you to provide a whitelist of node attributes to save on the
server. All of the attributes are still available throughout the chef
run, but only those specifically listed will be saved to the server.”

Ohai’s full output on the CentOS 6.4 box I just tested on
returns 28KB(!) of data, 99% of which I have never wanted
to query the server for yet. So you could find at least some
I/O gain by whitelisting most of it based on your needs. If
your needs change, change the whitelist.
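
To illustrate the idea (this is not the cookbook's own code), here is a minimal Ruby sketch of whitelist filtering over nested attribute data. The helper `filter_attributes`, the sample data, and the whitelist layout are all assumptions made up for the example; check the cookbook's README for its actual attribute format.

```ruby
# Keep only the keys named in the whitelist. A rule of true keeps the whole
# subtree; a nested hash descends one level further.
def filter_attributes(data, whitelist)
  whitelist.each_with_object({}) do |(key, rule), kept|
    next unless data.key?(key)

    kept[key] = if rule.is_a?(Hash) && data[key].is_a?(Hash)
                  filter_attributes(data[key], rule)
                else
                  data[key]
                end
  end
end

# Made-up sample resembling a small slice of Ohai output.
ohai_like = {
  'fqdn' => 'web01.example.com',
  'kernel' => { 'release' => '2.6.32', 'modules' => { 'ext4' => {} } },
  'filesystem' => { '/dev/sda1' => { 'kb_size' => '1000' } }
}
whitelist = { 'fqdn' => true, 'kernel' => { 'release' => true } }

p filter_attributes(ohai_like, whitelist)
```

Everything outside the whitelist (here the kernel modules and filesystem data) is dropped from what would be saved, which is where the I/O saving comes from.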


#3

Hey Michael,

This isn’t my area of expertise so I’ll poke one of the other Chef devs
who knows more to weigh in, but I’m curious if you can provide more info
on your setup. I’d like to know what your Chef 10 setup looked like,
just for comparison to Chef 11. While your Chef 11 setup seems to be
hurting, what was the baseline hardware/number of boxes you had with
Chef 10? Also, for complete clarity - this is open source Chef, right?
I just want to make sure we’re all on the same page.

On to actually trying to solve your problem. Having everything on a
single box might be causing some of the backup. While Chef 11 is much
more performant than Chef 10, throwing 5000 nodes at it with everything
on a single box might hurt some. Typically we haven’t
seen much need to tune postgres. You might need to look at upping the
connection count on postgres, but as far as I know that is usually the
only tuning that is done.

I’m not aware of much rabbit tuning that typically happens either, but
Solr, which sits on the other end of rabbit, might need some tuning. Out
of the box it has some fairly vanilla settings and so you might see
improvements if you look there.

What Jeff said is valid. Cutting down on node data sent frees up not
only network but what Solr has to ingest.

Could you possibly do some more monitoring on the box and try to
figure out where the bottleneck is? That would certainly make it easier
to give recommendations.

In the meantime I’ll ask one of the other engineers to weigh in. I’ll
also follow up and see if we can’t get a doc page on ways to tune Chef,
as that seems like it could prove helpful.

  • Mark Mzyk
    Opscode Software Engineer

Jeff Blaine mailto:jblaine@kickflop.net
August 31, 2013 10:20 AM

chef@deman.com mailto:chef@deman.com
August 31, 2013 4:30 AM

#4

Hi Michael,

chef@deman.com writes:

What we did was turn up a single instance of Chef11 with 8-CPU and 32GB RAM.
Guy before me went to all the Chef conferences, and I guess he must have drank
the kook-aid because we migrated from Chef10, added a few hundred nodes, and
Chef11 tipped over with ‘500’.

Providing details of how the server tipped over would help us help
you. When you write “tipped over with ‘500’”, do you mean when you added
500 nodes, or that you started to see 500 errors from the server?

What do you observe when things tip over? Is there a hot process on the
Chef server? What do you see in the logs (sudo chef-server-ctl tail)?
What things, if any, have you tuned (what’s in
/etc/chef-server/chef-server.rb)?

Upgrading from Chef 10 to Chef 11 for an infrastructure of your size was
a smart move. From everything I’ve seen, if you were to compare apples
to apples 10 vs 11, you will see a large difference.

In all fairness, our original setup was set to have all nodes converge within a
5-minute splay time with a standard 30 minute cycle time. Meanwhile, our
expectation was that Chef11 performs better. We also moved everything (no more
Couch, etc) - onto a single server.

Workaround we did was to increase splay time to 30-minutes within 30-minute
schedule for now.

My impression is that we just installed Chef11 and did not spend any time
tuning the right knobs? I have seen some posts where Postgres and such is
supposed to auto-size itself, but apparently that is based only on installation
and re-sizing does not work?

For a use case of:
5000 nodes
5 data centers
expected latency being around 300ms

Are there any knobs, dials, or other things that should be tuned to ensure that
a single Chef11 instance can handle that? pg-sql, rabbit tuning jump to the
forefront for me.

Here are a few things to look into:

  1. Search the erchef logs for an error message containing
    "no_connection". This is an indication that the pool of db client
    connections is exhausted. You can tune this via erchef['db_pool_size']
    in chef-server.rb. Keep in mind that erchef will open connections on
    startup and is ultimately limited by the configured max in postgres (see
    postgresql['max_connections']).

  2. Do you have recipes that execute searches that typically return all
    or nearly all nodes in your infrastructure (a dead giveaway is a query
    like “*:*”)? Such searches are relatively costly since all of the node
    data will be fetched from the db and sent to the client. You can reduce
    the impact by making use of more focused queries and by using the
    partial search API.

  3. Review the size of your node data. You may be able to disable some
    ohai plugins and greatly reduce the size of the node data without losing
    data of interest.
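
As a sketch of item 1, the two settings named above would sit in chef-server.rb roughly like this. The numbers are illustrative assumptions, not recommendations; pick values based on what the erchef logs show.

```ruby
# /etc/chef-server/chef-server.rb
# Illustrative values only; neither number is a recommendation.

# Size of erchef's pool of database client connections.
erchef['db_pool_size'] = 40

# Upper bound enforced by PostgreSQL; keep it above the erchef pool size
# so the pool can actually be filled with some headroom to spare.
postgresql['max_connections'] = 250
```

After editing the file, running `sudo chef-server-ctl reconfigure` applies the change.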

  • seth


Seth Falcon | Development Lead | Opscode | @sfalcon


#5

I think the thing that was missing from the predecessor’s ‘kool-aid’ is the
fact that Facebook (whose presentation was likely what got him gaga)
stripped quite a few things out of a standard Chef install - the
amount of node data they save back being one of those things. There were
quite a few other changes they made as well. I don’t think they use search,
for instance.

So yeah, I think there are probably some knobs that need to be tuned.

On Sat, Aug 31, 2013 at 10:43 AM, Mark Mzyk mmzyk@programmersparadox.com wrote:


#6

Predecessor here… You have the whitelist cookbook available to you and it’s configured with everything you’d want to turn off already. Just make sure it’s last in the run list

On Aug 31, 2013, at 7:20 AM, Jeff Blaine jblaine@kickflop.net wrote:


#7

Hi All,

Thanks for the responses.

We are going to schedule a window to change the splay time from 30 minutes back to 5 minutes, try to reproduce the problem and look at things in more detail.
Trimming down the amount of data ohai posts is another good idea, but we are going to try that as a last resort.

Other background information is:

Original Chef10…

  • was split out with separate Couch, API, webserver, etc
  • hosts are decommissioned and I seem to recall the hosts had a variety of CPU/RAM configuration depending on their purpose.

New Chef11…

  • Single server, 8-core, 32GB RAM, Chef version 11.0.8 on CentOS 6.4; almost all the clients are 11.4.4.
  • The 500s were reported by the clients when connecting to the server, and it occurred somewhere when we went from about 1200 clients to about 1400 clients.
  • The 500s were reported on all clients.

When it happened, we were a bit panicked and intuitively decided to reduce client load. Moving the splay time from 5 minutes to 30 minutes fixed the problem, but…

  • No processes jumped out at us as ‘hot’
  • Nothing jumped out at us in the logs, but we were not really sure what to look for; we know more now.
  • There were plenty of CPU and RAM resources available on the host.
  • We did not see any file/socket resource constraints via lsof.
  • WAN connectivity to the data centers all seemed fine.

The fact that there was plenty of unused CPU/RAM available on the server yet it seemed it could not keep up sent us in the direction of wondering about tuning.

Thanks,

  • Mike

On Sep 2, 2013, at 9:02 PM, Seth Falcon seth@opscode.com wrote:
