Chef stability?

Allan_Carroll · November 17, 2010, 6:09pm

Hi,

I’ve been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I’m hoping that there’s
some Chef way of looking at the world I haven’t been exposed to that you can
all enlighten me on.

I’m running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can’t seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I’m using It makes everything feel
really flakey, but I’m not convinced that’s the only thing I’m running into.

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don’t.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I’ve looked at, it’s the most promising.

I’d really like to given the promise of such powerful ability when it works,
the time that I’ve put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that’s just the price for cutting-edge technology that will be solved as the
code matures?

Thanks,
Allan

Adam_Jacob · November 17, 2010, 6:16pm

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Allan_Carroll · November 17, 2010, 7:37pm

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here's what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from chef status ok}

solr.log also doesn't seem to have anything interesting, but here's the last set of output before it went away last time:

gist.github.com

https://gist.github.com/allanca/703913

gistfile1.txt

Nov 17, 2010 7:49:40 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
Nov 17, 2010 7:49:40 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
        commit{dir=/var/cache/chef/solr/data/index,segFN=segments_8d,version=1289922551847,generation=301,filenames=[_8x.fdx, _8w.frq, _8w.tii, _8x.fdt, _8w.tis, _8x.tii, segments_8d, _8x.fnm, _8w.fdt, _8w_1.del, _8x.nrm, _8x.tis, _8w.fdx, _8x.frq, _8w.fnm]
        commit{dir=/var/cache/chef/solr/data/index,segFN=segments_8e,version=1289922551848,generation=302,filenames=[_8x.fdx, _8y.tis, _8w.frq, _8w.tii, _8y.frq, _8x.
fdt, _8w.tis, _8y.fdx, _8x.tii, _8y.tii, _8x_1.del, _8y.fdt, segments_8e, _8x.fnm, _8w.fdt, _8y.fnm, _8w_1.del, _8x.nrm, _8x.tis, _8w.fdx, _8x.frq, _8w.fnm, _8y.nrm]
Nov 17, 2010 7:49:40 AM org.apache.solr.core.SolrDeletionPolicy updateCommitsINFO: newest commit = 1289922551848
Nov 17, 2010 7:49:40 AM org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@66a2420e main

This file has been truncated. show original

Here's a typical failure from the server log:

gist.github.com

https://gist.github.com/allanca/703912

gistfile1.txt

merb : chef-server (api) : worker (port 4000) ~ Started request handling: Wed Nov 17 19:26:35 +0000 2010
merb : chef-server (api) : worker (port 4000) ~ Params: {"format"=>nil, "action"=>"show", "id"=>"node", "q"=>"role:monitoring AND app_environment:production", "start"=>"0", "rows"=>"1000", "controller"=>"search", "sort"=>"X_CHEF_id_CHEF_X asc"}
merb : chef-server (api) : worker (port 4000) ~ Connection refused - connect(2) - (Errno::ECONNREFUSED)
/usr/lib/ruby/1.8/net/http.rb:560:in `initialize'
/usr/lib/ruby/1.8/net/http.rb:560:in `open'
/usr/lib/ruby/1.8/net/http.rb:560:in `connect'
/usr/lib/ruby/1.8/timeout.rb:53:in `timeout'
/usr/lib/ruby/1.8/timeout.rb:101:in `timeout'
/usr/lib/ruby/1.8/net/http.rb:560:in `connect'
/usr/lib/ruby/1.8/net/http.rb:553:in `do_start'

This file has been truncated. show original

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Blake_Barnett · November 17, 2010, 11:52pm

I found that solr would crash reliably if the machine had a shortage of memory. If I increased the RAM allocated to the VM to ~2GB, it behaved much more reliably.

-Blake

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here's what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from chef status ok}

solr.log also doesn't seem to have anything interesting, but here's the last set of output before it went away last time:

solr.log output · GitHub

Here's a typical failure from the server log:

Typical server.log failure · GitHub

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Allan_Carroll · November 18, 2010, 12:27am

That's likely the same problem I'm having. I've been trying to run my Chef server off of a machine with 700GB (EC2 micro instance).

This begs the larger question: what size of machine is recommended for running Chef? Seems like a pretty beefy system with all the parts running.

-Allan

On Nov 17, 2010, at 4:52 PM, Blake Barnett wrote:

I found that solr would crash reliably if the machine had a shortage of memory. If I increased the RAM allocated to the VM to ~2GB, it behaved much more reliably.

-Blake

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here's what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from chef status ok}

solr.log also doesn't seem to have anything interesting, but here's the last set of output before it went away last time:

solr.log output · GitHub

Here's a typical failure from the server log:

Typical server.log failure · GitHub

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Adam_Jacob · November 18, 2010, 12:55am

On Wed, Nov 17, 2010 at 4:27 PM, Allan Carroll allanca@gmail.com wrote:

That's likely the same problem I'm having. I've been trying to run my Chef
server off of a machine with 700GB (EC2 micro instance).
This begs the larger question: what size of machine is recommended for
running Chef? Seems like a pretty beefy system with all the parts running.

Much of this will depend on what you are doing with it. Solr is going
to want more ram as you add more indexed objects, and as the frequency
with which you search increases. CouchDB tends to run quite leanly,
and relies on OS file caching.

I've happily run a Chef server for nodes that check in every half on
hour for around a 100 systems on an EC2 small instance.

Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Leinartas_Michael · November 18, 2010, 1:03am

FWIW I was running chef-server 0.9.8 (and friends - rabbitmq, couchdb, solr) along with hosting a yum repo and an openvpn endpoint on a rackspace 512MB instance and having similar problems with chef-solr dying quite often once I reached 25 nodes or so. Updating to a 1GB instance solved it and I’m up to 50 nodes without trouble so far.

From: Allan Carroll allanca@gmail.com
Reply-To: "chef@lists.opscode.com" chef@lists.opscode.com
Date: Wed, 17 Nov 2010 18:27:35 -0600
To: "chef@lists.opscode.com" chef@lists.opscode.com
Subject: [chef] Re: Chef Server Hardware Reqs (was Re: Chef stability?)

That’s likely the same problem I’m having. I’ve been trying to run my Chef server off of a machine with 700GB (EC2 micro instance).

This begs the larger question: what size of machine is recommended for running Chef? Seems like a pretty beefy system with all the parts running.

-Allan

On Nov 17, 2010, at 4:52 PM, Blake Barnett wrote:

I found that solr would crash reliably if the machine had a shortage of memory. If I increased the RAM allocated to the VM to ~2GB, it behaved much more reliably.

-Blake

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here’s what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from chef status ok}

solr.log also doesn’t seem to have anything interesting, but here’s the last set of output before it went away last time:

gist.github.com

https://gist.github.com/allanca/703913

gistfile1.txt

Nov 17, 2010 7:49:40 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
Nov 17, 2010 7:49:40 AM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
        commit{dir=/var/cache/chef/solr/data/index,segFN=segments_8d,version=1289922551847,generation=301,filenames=[_8x.fdx, _8w.frq, _8w.tii, _8x.fdt, _8w.tis, _8x.tii, segments_8d, _8x.fnm, _8w.fdt, _8w_1.del, _8x.nrm, _8x.tis, _8w.fdx, _8x.frq, _8w.fnm]
        commit{dir=/var/cache/chef/solr/data/index,segFN=segments_8e,version=1289922551848,generation=302,filenames=[_8x.fdx, _8y.tis, _8w.frq, _8w.tii, _8y.frq, _8x.
fdt, _8w.tis, _8y.fdx, _8x.tii, _8y.tii, _8x_1.del, _8y.fdt, segments_8e, _8x.fnm, _8w.fdt, _8y.fnm, _8w_1.del, _8x.nrm, _8x.tis, _8w.fdx, _8x.frq, _8w.fnm, _8y.nrm]
Nov 17, 2010 7:49:40 AM org.apache.solr.core.SolrDeletionPolicy updateCommitsINFO: newest commit = 1289922551848
Nov 17, 2010 7:49:40 AM org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@66a2420e main

This file has been truncated. show original

Here’s a typical failure from the server log:

gist.github.com

https://gist.github.com/allanca/703912

gistfile1.txt

merb : chef-server (api) : worker (port 4000) ~ Started request handling: Wed Nov 17 19:26:35 +0000 2010
merb : chef-server (api) : worker (port 4000) ~ Params: {"format"=>nil, "action"=>"show", "id"=>"node", "q"=>"role:monitoring AND app_environment:production", "start"=>"0", "rows"=>"1000", "controller"=>"search", "sort"=>"X_CHEF_id_CHEF_X asc"}
merb : chef-server (api) : worker (port 4000) ~ Connection refused - connect(2) - (Errno::ECONNREFUSED)
/usr/lib/ruby/1.8/net/http.rb:560:in `initialize'
/usr/lib/ruby/1.8/net/http.rb:560:in `open'
/usr/lib/ruby/1.8/net/http.rb:560:in `connect'
/usr/lib/ruby/1.8/timeout.rb:53:in `timeout'
/usr/lib/ruby/1.8/timeout.rb:101:in `timeout'
/usr/lib/ruby/1.8/net/http.rb:560:in `connect'
/usr/lib/ruby/1.8/net/http.rb:553:in `do_start'

This file has been truncated. show original

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:
I’ve been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I’m hoping that there’s
some Chef way of looking at the world I haven’t been exposed to that you can
all enlighten me on.

I’m running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can’t seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I’m using It makes everything feel
really flakey, but I’m not convinced that’s the only thing I’m running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you’re using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don’t.

Again, if it’s data that comes from search, that’s your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I’ve looked at, it’s the most promising.

Sorry to hear that, but it’s been quite stable for us (and for lots of
other folks). We’ll get you fixed up.

I’d really like to given the promise of such powerful ability when it works,
the time that I’ve put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that’s just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

–
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Paul_Paradise · November 18, 2010, 6:13am

On Wed, Nov 17, 2010 at 10:16 AM, Adam Jacob adam@opscode.com wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

Running chef-client by hand on a machine causes a different result than
letting
the timer driven version work. Like it forces the client to reevaluate
all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

This is almost undoubtedly unrelated to Allan's issues, but one repro-able
example where chef runs differ when running in a daemon vs. single-shot is
if your cookbook is complex and starts touching objects that outlive the
lifespan of a single chef run -
http://tickets.opscode.com/browse/COOK-397for example.

-Paul

Sean_OMeara · November 18, 2010, 9:01am

thou shalt provide enough ram

-?

On Wed, Nov 17, 2010 at 8:03 PM, Leinartas, Michael
MICHAEL.LEINARTAS@orbitz.com wrote:

FWIW I was running chef-server 0.9.8 (and friends - rabbitmq, couchdb, solr)
along with hosting a yum repo and an openvpn endpoint on a rackspace 512MB
instance and having similar problems with chef-solr dying quite often once I
reached 25 nodes or so. Updating to a 1GB instance solved it and I'm up to
50 nodes without trouble so far.

From: Allan Carroll allanca@gmail.com
Reply-To: "chef@lists.opscode.com" chef@lists.opscode.com
Date: Wed, 17 Nov 2010 18:27:35 -0600
To: "chef@lists.opscode.com" chef@lists.opscode.com
Subject: [chef] Re: Chef Server Hardware Reqs (was Re: Chef stability?)

That's likely the same problem I'm having. I've been trying to run my Chef
server off of a machine with 700GB (EC2 micro instance).

This begs the larger question: what size of machine is recommended for
running Chef? Seems like a pretty beefy system with all the parts running.

-Allan

On Nov 17, 2010, at 4:52 PM, Blake Barnett wrote:

I found that solr would crash reliably if the machine had a shortage of
memory. If I increased the RAM allocated to the VM to ~2GB, it behaved much
more reliably.

-Blake

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here's what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status
error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from
chef status ok}

solr.log also doesn't seem to have anything interesting, but here's the last
set of output before it went away last time:

solr.log output · GitHub

Here's a typical failure from the server log:

Typical server.log failure · GitHub

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that
there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode
Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and
sometimes
it refuses saying that I have no nodes (even though the node list shows
there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than
letting
the timer driven version work. Like it forces the client to reevaluate all
the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine,
sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them,
Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and
then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really
bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using
Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Chris_Read · November 18, 2010, 9:30am

I was having solr crashes with 2GB RAM - turned out I needed to also
increase the heap size to 512MB RAM to get things stable.

Chris

On Thu, Nov 18, 2010 at 1:03 AM, Leinartas, Michael <
MICHAEL.LEINARTAS@orbitz.com> wrote:

FWIW I was running chef-server 0.9.8 (and friends - rabbitmq, couchdb,
solr) along with hosting a yum repo and an openvpn endpoint on a rackspace
512MB instance and having similar problems with chef-solr dying quite often
once I reached 25 nodes or so. Updating to a 1GB instance solved it and I'm
up to 50 nodes without trouble so far.

*From: *Allan Carroll allanca@gmail.com
*Reply-To: *"chef@lists.opscode.com" chef@lists.opscode.com
*Date: *Wed, 17 Nov 2010 18:27:35 -0600
*To: *"chef@lists.opscode.com" chef@lists.opscode.com
*Subject: *[chef] Re: Chef Server Hardware Reqs (was Re: Chef stability?)

That's likely the same problem I'm having. I've been trying to run my Chef
server off of a machine with 700GB (EC2 micro instance).

This begs the larger question: what size of machine is recommended for
running Chef? Seems like a pretty beefy system with all the parts running.

-Allan

On Nov 17, 2010, at 4:52 PM, Blake Barnett wrote:

I found that solr would crash reliably if the machine had a shortage of
memory. If I increased the RAM allocated to the VM to ~2GB, it behaved much
more reliably.

-Blake

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here's what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status
error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from
chef status ok}

solr.log also doesn't seem to have anything interesting, but here's the
last set of output before it went away last time:

solr.log output · GitHub

Here's a typical failure from the server log:

Typical server.log failure · GitHub

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:

I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about
trusting.
Chef looks like a great tool with a strong community. I'm hoping that
there's
some Chef way of looking at the world I haven't been exposed to that you
can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode
Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running
into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and
sometimes
it refuses saying that I have no nodes (even though the node list shows
there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than
letting
the timer driven version work. Like it forces the client to reevaluate all
the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine,
sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them,
Chef
just randomly stopped working. Running chef-client by hand finished
building
the box correctly. One of them built part of a configuration file using
data
from a node that I had deleted off the Chef server a few hours earlier and
then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really
bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it
works,
the time that I've put into it, and the time it will save. Is anyone using
Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as
the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Allan_Carroll · November 18, 2010, 4:01pm

I moved to a larger instance and increased the heap size and have had much better results with solr too. Looks like it just needs more space.

On Nov 18, 2010, at 2:30 AM, Chris Read wrote:

I was having solr crashes with 2GB RAM - turned out I needed to also increase the heap size to 512MB RAM to get things stable.

Chris

On Thu, Nov 18, 2010 at 1:03 AM, Leinartas, Michael MICHAEL.LEINARTAS@orbitz.com wrote:
FWIW I was running chef-server 0.9.8 (and friends - rabbitmq, couchdb, solr) along with hosting a yum repo and an openvpn endpoint on a rackspace 512MB instance and having similar problems with chef-solr dying quite often once I reached 25 nodes or so. Updating to a 1GB instance solved it and I'm up to 50 nodes without trouble so far.

From: Allan Carroll allanca@gmail.com
Reply-To: "chef@lists.opscode.com" chef@lists.opscode.com
Date: Wed, 17 Nov 2010 18:27:35 -0600
To: "chef@lists.opscode.com" chef@lists.opscode.com
Subject: [chef] Re: Chef Server Hardware Reqs (was Re: Chef stability?)

That's likely the same problem I'm having. I've been trying to run my Chef server off of a machine with 700GB (EC2 micro instance).

This begs the larger question: what size of machine is recommended for running Chef? Seems like a pretty beefy system with all the parts running.

-Allan

On Nov 17, 2010, at 4:52 PM, Blake Barnett wrote:

I found that solr would crash reliably if the machine had a shortage of memory. If I increased the RAM allocated to the VM to ~2GB, it behaved much more reliably.

-Blake

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.

Here's what I dug up:

solr-indexer.log has no real clues.

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from chef status ok}

solr.log also doesn't seem to have anything interesting, but here's the last set of output before it went away last time:

solr.log output · GitHub

Here's a typical failure from the server log:

Typical server.log failure · GitHub

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, allanca@gmail.com wrote:
I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, thats the source of these issues.

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way? The code paths here are identical for the most part. If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there.

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks). We'll get you fixed up.

I'd really like to given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Adam_Jacob · November 18, 2010, 7:43pm

On Thu, Nov 18, 2010 at 1:30 AM, Chris Read chris.read@gmail.com wrote:

I was having solr crashes with 2GB RAM - turned out I needed to also
increase the heap size to 512MB RAM to get things stable.

Right - just having the RAM isn't enough, you need to tune the JVM as well.

For our (very large in comparison to almost everyone else, and way
over-provisioned) setups config file:

solr_java_opts "-XX:MaxPermSize=1024m"
solr_heap_size "6144M"
solr_java_opts "-server"

That machine has 8GB of RAM.

Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Gilles_Devaux · November 18, 2010, 10:00pm

Maybe it's unrelated but if you don't install chef-server through the
recipe you have to compact the couchdb indexes yourself or you'll
waste tons of mem: http://wiki.apache.org/couchdb/Compaction

Here is the script I use in a cron: Compact chef's couchdb indexes · GitHub

--Gilles

On Thu, Nov 18, 2010 at 11:43 AM, Adam Jacob adam@opscode.com wrote:

On Thu, Nov 18, 2010 at 1:30 AM, Chris Read chris.read@gmail.com wrote:

I was having solr crashes with 2GB RAM - turned out I needed to also
increase the heap size to 512MB RAM to get things stable.

Right - just having the RAM isn't enough, you need to tune the JVM as well.

For our (very large in comparison to almost everyone else, and way
over-provisioned) setups config file:

solr_java_opts "-XX:MaxPermSize=1024m"
solr_heap_size "6144M"
solr_java_opts "-server"

That machine has 8GB of RAM.

Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449 E: adam@opscode.com

Topic		Replies	Views
Feelings on chef Chef Infra (archive)	16	407	May 7, 2010
A few questions Chef Infra (archive)	2	288	April 17, 2013
Chef Infra Client 15.10.12 Released! Chef Release Announcements	1	715	April 23, 2020
Chef Server Chef Infra (archive)	7	413	January 4, 2012
Re: Re: Re: Chef 11 in production Chef Infra (archive)	4	345	April 22, 2013

Chef stability?

Related topics