Chef setup has become unstable

Madhurranjan_Mohaan · June 16, 2012, 8:16am

Hi experts,

We’re running 140 Windows nodes and 20 Linux nodes against our Chef server,
with roughly 40 cookbooks. Over the last week or so, chef-server, CouchDB
and chef-solr have been dying at random. I can’t find much in the logs,
even after setting them to debug mode. We’ve set up monit to automatically
restart the processes, and we run CouchDB compaction at midnight every day.
We’ve also seen that the Chef server responds very slowly when we upload
cookbooks or perform any activity through knife. This definitely wasn’t the
case a couple of months back.

Can you please suggest some ways to at least improve the response times
considerably?

Ranjan

Matt_Ray_OPSCODE · June 16, 2012, 12:58pm

Perhaps you could explicitly spell out your operating system and the
versions of the software you're using? Hopefully someone running the
same setup can provide more hints.

Thanks,
Matt Ray
Senior Technical Evangelist | Opscode Inc.
matt@opscode.com | (512) 731-2218
Twitter, IRC, GitHub: mattray

Joshua_Timberman · June 16, 2012, 1:55pm

Are you running all the chef server services on one machine? What is the hardware spec of it? 160 nodes is quite a few. Sounds like you may need to start scaling out the server and run services on separate systems.

Madhurranjan_Mohaan · June 16, 2012, 7:41pm

Thanks for the responses. Versions are as follows:

The Chef server, which runs all the services, is CentOS 5.5 64-bit with
4 GB RAM. It's a VM in VMware.
The Chef server version is 0.10.6, since we faced some issues with the
0.10.8 Windows gem when we initially set it up.

Do you think we should scale out? If yes, which services do you think we
should run on different servers? Also, on my end, I am checking whether all
of the nodes are needed and trying to delete unnecessary ones, but that
would be 10-15 nodes at most.

thanks
Ranjan

KC_Braunschweig · June 16, 2012, 8:47pm

On Sat, Jun 16, 2012 at 12:41 PM, Madhurranjan Mohaan
maadhuuranjan.m@gmail.com wrote:

Do you think we should scale out ? If yes, what services do you think we
should run on different servers? Also, on my end, I am trying to see if all

Regarding the instability, I can tell you I had issues on RHEL 5.7
because the versions of couchdb and erlang were old. Newer packages
probably would have fixed it, but I upgraded to RHEL 6.1 which also
had newer versions and things were happier. Doesn't sound exactly like
your instability, but worth considering.

Regarding the performance issues, I hope that Josh was joking. 160
nodes is nothing. Are they converging every 30 minutes? Do you have a
reasonable splay? Are your recipes very search heavy? It could be a
lot of things, but I'd start with considering the concurrency on the
server API. Are you running a single Thin process for the API server?
If so, consider running multiple processes with proxy balancer or some
such in front of them. Alternatively switch the server to run in
unicorn with nginx in front of it. I've been happy with unicorn so
far.
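
A minimal sketch of the splay point above, not a measurement of any real server. It assumes the worst case where all 160 nodes start their run timers at the same moment, and it counts how many check-ins land in the same 10-second window with and without a random delay. The 300-second splay and the 10-second window are hypothetical values chosen for illustration. In client.rb the corresponding settings are `interval` and `splay`.

```ruby
# Sketch: how splay spreads chef-client check-ins over time.
# Assumption: every node starts its run timer at the same moment (worst case).

NODES    = 160   # node count from the thread
INTERVAL = 3600  # seconds between runs (hourly, as in the thread)
SPLAY    = 300   # hypothetical maximum random delay, in seconds
BUCKET   = 10    # hypothetical window width used to count "concurrent" check-ins

# Return the largest number of check-ins that land in one bucket-second window.
def peak_checkins(nodes, interval, splay, bucket, rng)
  counts = Hash.new(0)
  nodes.times do
    delay = splay > 0 ? rng.rand(splay) : 0 # random delay in 0...splay
    start = interval + delay
    counts[start / bucket] += 1
  end
  counts.values.max
end

rng = Random.new(42) # fixed seed so the run is repeatable
without_splay = peak_checkins(NODES, INTERVAL, 0, BUCKET, rng)
with_splay    = peak_checkins(NODES, INTERVAL, SPLAY, BUCKET, rng)

puts "peak check-ins per #{BUCKET}s without splay: #{without_splay}"
puts "peak check-ins per #{BUCKET}s with splay:    #{with_splay}"
```

Without splay every node falls into the same window, so a single API process has to absorb the whole burst at once. With splay the same load is spread across the splay period.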

I don't think you should be there yet, but 4gb is probably not gonna
be enough forever. Eventually solr will want more heap and you'll need
memory as you add api server workers and couch will take whatever's
left. Which leads back to either adding memory or Josh's point of
splitting components on different servers. That's eventually though,
I'd hope you could get at least a couple hundred nodes with your
current VM and 1000+ with 8gb without too much trouble.

To give you an example, I have a preprod server with about 1000 nodes:
RHEL 6.1 VM
8gb
4 virtual cores
unicorn - 8 api workers, 2 webui workers
solr - 2gb heap
chef 0.10.4
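
For reference, the worker count in the list above might translate into a unicorn configuration fragment like the one below. This is a sketch under assumptions: it presumes the API server is started through unicorn with a standard unicorn configuration file, the listen address and timeout are illustrative values, and how chef-server itself is wired to unicorn is not covered in the thread.

```ruby
# unicorn configuration fragment (read by unicorn, not a standalone script)
worker_processes 8        # matches the "8 api workers" in the list above
listen "127.0.0.1:4000"   # assumed address; nginx would proxy requests to it
timeout 60                # assumed value; workers busy longer than this are killed
```

The webui would need its own unicorn instance with its own worker count and port.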

KC

Sascha_Bates · June 16, 2012, 9:56pm

Could it be the number of Windows servers and the astonishing amount of
ohai data collected for Windows? My understanding is that Windows ohai
collects an awful lot of data. I haven't worked with it in a few months, so
my memory is fading a bit, and I was using chef-solo anyway. 140 Windows
nodes might produce a lot of data.
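
One way to check this is to measure how much each top-level ohai attribute contributes to a node's data. The sketch below is illustrative only: `attribute_sizes` is a hypothetical helper, and the sample hash is made-up data standing in for real ohai output.

```ruby
require "json"

# Return [key, bytes] pairs for each top-level attribute, largest first.
# bytes is the size of that attribute's value once serialized to JSON.
def attribute_sizes(node_data)
  node_data
    .map { |key, value| [key, value.to_json.bytesize] }
    .sort_by { |_key, bytes| -bytes }
end

# Made-up sample standing in for ohai output (not real Windows data).
sample = {
  "hostname" => "web01",
  "kernel"   => { "modules" => Array.new(50) { |i| "driver_#{i}" } },
  "platform" => "windows"
}

sizes = attribute_sizes(sample)
sizes.each { |key, bytes| puts "#{key}: #{bytes} bytes" }
```

To run it against a real node, replace `sample` with `JSON.parse($stdin.read)` and pipe the JSON printed by the `ohai` command into the script. The keys at the top of the list are the ones worth a closer look.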

Madhurranjan_Mohaan · June 17, 2012, 8:31am

Thanks yet again for the response!

@KC - Yes, we're running just one thread and the nodes are converging every
hour. I'll spike out the unicorn + nginx setup on a new box with CentOS 6.2,
see how that behaves, and then probably move the nodes over to that
setup. Thanks for the tip!

@Sascha - It's mostly a mix of Windows 2003 32-bit and Windows 2003 64-bit
servers. I'm not sure whether the sheer amount of ohai data is causing this.
Any other parameters I should consider?

Ranjan

Jeremiah_Snapp · June 17, 2012, 8:56am

I'm just adding my two cents to the great suggestions from KC and Sascha.

As KC suggested, consider preventing your nodes from converging at the same
time, to reduce the number of concurrent requests to the server.

Regarding the large amount of Windows ohai data, you may want to look at a
chef list thread from May 29 with the subject "Knife search note returning
a node". It mentions disabling a Windows ohai plugin to reduce the amount
of content.

Refer to http://wiki.opscode.com/display/chef/Disabling+Ohai+Plugins
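
As a rough illustration of what that page describes, disabling a plugin in this generation of Chef is a one-line setting in client.rb. The plugin name below is only a placeholder: the thread does not say which Windows plugin to disable, so check the referenced thread and wiki page for the actual name.

```ruby
# client.rb fragment (loaded by chef-client, not a standalone script).
# "passwd" is an illustrative plugin name, not the Windows plugin
# discussed in the thread.
Ohai::Config[:disabled_plugins] = ["passwd"]
```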

Sascha_Bates · June 17, 2012, 3:11pm

In regards to that thread, I (woohoo!) finally got around to submitting
a pull request that splits the Windows ohai kernel plugin into separate
plugins so it will be easier to cut down on unneeded ohai data for Windows.

Tim_Smith · June 18, 2012, 4:47pm

You are my hero. Thank you

Tim Smith

Jeremiah_Snapp · June 18, 2012, 5:10pm

Great work Sascha!

Jeremiah

AJ_Christensen · June 18, 2012, 9:50pm

I noticed this thread yesterday, but wanted to reiterate:

Great work, Sascha! Many bugs have been caused or exacerbated by the
exceedingly large amount of data that some of the win32 plugins for ohai
generate.

Thank you!

--AJ

On 19 June 2012 05:10, Jeremiah Snapp jeremiah.snapp@gmail.com wrote:

Great work Sascha!

Jeremiah

On Sun, Jun 17, 2012 at 11:11 AM, Sascha Bates sascha.bates@gmail.com
wrote:

In regards to that thread, I (woohoo!) finally got around to submitting a
pull request that splits the Windows ohai kernel plugin into separate
plugins so it will be easier to cut down on unneeded ohai data for Windows.

On 6/17/12 3:56 AM, Jeremiah Snapp wrote:

I'm just adding my two cents to the great suggestions from MC and Sascha.

As KC suggested you want to consider preventing your nodes from converging
at the same time to reduce the amount of concurrent requests to the server.

When considering the large amount of windows ohai data you may want to
look at a chef thread from may 29 with the subject "Knife search note
returning a node". It mentions disabling a Windows ohai plugin to reduce
the amount of content.

Refer to http://wiki.opscode.com/display/chef/Disabling+Ohai+Plugins

On Jun 17, 2012 4:31 AM, "Madhurranjan Mohaan" maadhuuranjan.m@gmail.com
wrote:

Thanks yet again for the response!

@KC- Yes, we're running just one thread and the nodes are converging
every hour. I'll spike out the unicorn + nginx setup on a new box with
Centos 6.2 and get see how that behaves and then probably move these out to
that setup. Thanks for the tip!

@Sascha - Its a mix of Windows 2003 32 bit server and WIn 2003 64 bit
mostly. I ain't sure if the sheer amount of ohai data is causing this. Any
other parameters I should consider?

Ranjan

On Sun, Jun 17, 2012 at 3:26 AM, Sascha Bates sascha.bates@gmail.com
wrote:

Could it be the number of Windows servers and the astonishing amount of
ohai data collected for Windows? My understanding is that Windows ohai has
an awful lot of data. I haven't worked with it in a few months so my memory
is fading a bit and I was chef-solo anyway. 120 Windows nodes might produce
a lot of data.

On 6/16/12 3:47 PM, KC Braunschweig wrote:

On Sat, Jun 16, 2012 at 12:41 PM, Madhurranjan Mohaan
maadhuuranjan.m@gmail.com wrote:

Do you think we should scale out ? If yes, what services do you think
we
should run on different servers? Also, on my end, I am trying to see
if all

Regarding the instability, I can tell you I had issues on RHEL 5.7
because the versions of couchdb and erlang were old. Newer packages
probably would have fixed it, but I upgraded to RHEL 6.1 which also
had newer versions and things were happier. Doesn't sound exactly like
your instability, but worth considering.

Regarding the performance issues, I hope that Josh was joking. 160
nodes is nothing. Are they converging every 30 minutes? Do you have a
reasonable splay? Are your recipes very search heavy? It could be a
lot of things, but I'd start with considering the concurrency on the
server API. Are you running a single Thin process for the API server?
If so, consider running multiple processes with proxy balancer or some
such in front of them. Alternatively switch the server to run in
unicorn with nginx in front of it. I've been happy with unicorn so
far.
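
A rough way to see why splay matters is the sketch below. It is an illustration under stated assumptions, not a measurement of any real server. The 160 nodes and the hourly interval come from this thread; the 10-minute splay, the worst case where every client starts at the same moment, and the function name `peak_starts_per_minute` are all assumptions made for the example. Each run is modeled as the interval plus a random delay of up to the splay value.

```python
import random
from collections import Counter


def peak_starts_per_minute(nodes, interval_min, splay_min, runs=24, seed=0):
    """Return the largest number of client runs that start in any one minute.

    Hypothetical model: every node's daemon starts at t=0 (worst case) and
    each run begins after `interval_min` plus a random delay in [0, splay_min].
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    starts = Counter()
    for _ in range(nodes):
        t = 0.0
        for _ in range(runs):
            delay = rng.uniform(0, splay_min) if splay_min > 0 else 0.0
            t += interval_min + delay
            starts[int(t)] += 1  # bucket run starts by whole minute
    return max(starts.values())


if __name__ == "__main__":
    # 160 nodes converging hourly, as described in this thread.
    print("no splay:", peak_starts_per_minute(160, 60, 0))
    # A 10-minute splay is an assumed value, not one taken from the thread.
    print("10 min splay:", peak_starts_per_minute(160, 60, 10))
```

With no splay, all 160 runs land in the same minute, which is the burst a single API server process has to absorb. Any non-zero splay spreads those starts over several minutes, so the peak drops well below the node count.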

I don't think you should be there yet, but 4gb is probably not gonna
be enough forever. Eventually solr will want more heap and you'll need
memory as you add api server workers and couch will take whatever's
left. Which leads back to either adding memory or Josh's point of
splitting components on different servers. That's eventually though,
I'd hope you could get at least a couple hundred nodes with your
current VM and 1000+ with 8gb without too much trouble.

To give you an example, I have a preprod server with about 1000 nodes:
RHEL 6.1 VM
8gb
4 virtual cores
unicorn - 8 api workers, 2 webui workers
solr - 2gb heap
chef 0.10.4

KC

On Sat, Jun 16, 2012 at 7:25 PM, Joshua Timberman joshua@opscode.com
wrote:

Are you running all the chef server services on one machine? What is the
hardware spec of it? 160 nodes is quite a few. Sounds like you may need to
start scaling out the server and run services on separate systems.

Sascha_Bates · June 18, 2012, 10:19pm

For real? Guys, I find this kind of hilarious. I didn't actually write
more than about 3 lines of original code and then rearranged everything.
I thought this had to be the lamest contribution ever in the history of
contributions.

I am so thrilled that this will make a difference for folks.

Sascha

On 6/18/12 4:50 PM, AJ Christensen wrote:

I noticed this thread yesterday, but wanted to reiterate:

Great work Sascha, Many bugs have been caused/exacerbated by the
exceedingly large amount of data some of the win32 plugins for ohai
generate!

Thank you!

--AJ

On 19 June 2012 05:10, Jeremiah Snapp jeremiah.snapp@gmail.com wrote:

Great work Sascha!

Jeremiah

On Sun, Jun 17, 2012 at 11:11 AM, Sascha Bates sascha.bates@gmail.com
wrote:

In regards to that thread, I (woohoo!) finally got around to submitting a
pull request that splits the Windows ohai kernel plugin into separate
plugins so it will be easier to cut down on unneeded ohai data for Windows.

On 6/17/12 3:56 AM, Jeremiah Snapp wrote:

I'm just adding my two cents to the great suggestions from KC and Sascha.

As KC suggested, you want to consider preventing your nodes from converging
at the same time to reduce the number of concurrent requests to the server.

Wes_Morgan · June 18, 2012, 10:57pm

Any amateur programmer can make something work by throwing lots of code at it.

It takes real genius to solve a problem in a simple and effective manner.

Wes
