CouchDB crashing issues

Hoping someone has had a similar experience. CouchDB on my Chef
server has been crashing overnight. Something around the time of log
rotation seems to cause an Erlang heartbeat timeout, and couch doesn't
always respawn successfully, even with COUCHDB_RESPAWN_TIMEOUT=5 set
in sysconfig. When it fails, it usually looks like this in the stderr
log:

heart_beat_kill_pid = 31808
heart_beat_timeout = 11
heart: Mon Aug 15 16:54:21 2011: Erlang has closed.
/usr/bin/couchdb: line 277: echo: write error: Broken pipe
heart: Mon Aug 15 16:54:22 2011: Executed "/usr/bin/couchdb -k". Terminating.

When it succeeds in respawning, it seems to look like this:

heart: Tue Aug 16 04:18:25 2011: heart-beat time-out.
heart: Tue Aug 16 04:18:26 2011: Executed "/usr/bin/couchdb -k". Terminating.
heart_beat_kill_pid = 4708
heart_beat_timeout = 11

I'm not convinced log rotation has anything to do with it, though,
because (1) I couldn't reproduce the failure by triggering a manual
log rotation, and (2) CouchDB's log rotation does a copy/truncate, so
it shouldn't be apparent to the process at all.

I also saw some mention after a bit of googling that someone had a
respawning problem hours after db compaction. I have been running
compaction using chef-solo and the chef-server::default recipe from
Opscode but don’t have enough data yet to draw any correlation.
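
One way to start collecting that data is to pull the heart timestamps
out of the stderr log and see how far each one falls from a nightly
job. The following is only a rough sketch: the function names are made
up for illustration, the 04:02 job time is an assumed cron.daily
schedule (substitute whatever your host actually uses), and it assumes
the heart lines keep the format shown above.

```python
import re
from datetime import datetime

# Matches heart lines such as:
#   heart: Tue Aug 16 04:18:25 2011: heart-beat time-out.
HEART_LINE = re.compile(
    r"^heart: (\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): (.*)$"
)


def heart_events(lines):
    """Yield (timestamp, message) for every heart line in a stderr log."""
    for line in lines:
        match = HEART_LINE.match(line.strip())
        if match:
            # %a and %b expect English day/month names (C or en_* locale).
            stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            yield stamp, match.group(2)


def minutes_from(stamp, hour, minute=0):
    """Minutes between stamp's time of day and a daily HH:MM job, wrapping at midnight."""
    delta = abs((stamp.hour * 60 + stamp.minute) - (hour * 60 + minute))
    return min(delta, 24 * 60 - delta)


if __name__ == "__main__":
    sample = [
        "heart_beat_kill_pid = 31808",
        "heart: Mon Aug 15 16:54:21 2011: Erlang has closed.",
        "heart: Tue Aug 16 04:18:25 2011: heart-beat time-out.",
    ]
    for stamp, message in heart_events(sample):
        # 4, 2 is the ASSUMED nightly job time (04:02), not taken from this host.
        print(stamp.isoformat(), minutes_from(stamp, 4, 2), message)
```

Running this over a few weeks of logs, once against the rotation time
and once against the compaction time, should show whether either one
lines up with the timeouts.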

This is on RHEL 5.6 with couchdb-0.11.2-2.el5 from Fedora EPEL.

Anyone see anything like this?

Thanks,

KC Braunschweig

What version of erlang are you using, KC?

The relevant bug is COUCHDB-275, "couch crashes erlang vm under heavy
load":

https://issues.apache.org/jira/browse/COUCHDB-275

You should see this fixed by using Erlang R13B.

Similarly, you probably want to move to a more recent couchdb.

Adam



erlang-R12B-5.10.el5

Ah, yeah, that looks likely. I'll be moving to RHEL 6.1 soon, which will provide:

erlang-R14B-0.5
couchdb-1.0.2-8

Hopefully that'll resolve it. Wondering if the server requirements doc
(http://wiki.opscode.com/display/chef/Installation) shouldn't list
erlang >= R13B, and/or the couchdb administration page
(http://wiki.opscode.com/display/chef/CouchDB+Administration+for+Chef+Server)
shouldn't mention this bug?
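
For what it's worth, a requirement like that is easy to check
mechanically. Below is a minimal sketch, assuming package strings
shaped like the ones in this thread and the old "R<number><letter>"
release naming; the helper names are hypothetical and not part of Chef
or CouchDB.

```python
import re


def erlang_release(package):
    """Parse a release like 'R12B' out of a string such as 'erlang-R12B-5.10.el5'."""
    match = re.search(r"R(\d+)([AB])", package)
    if match is None:
        raise ValueError(f"no R<number><letter> release found in {package!r}")
    # Tuples compare number first, then letter ('A' sorts before 'B').
    return int(match.group(1)), match.group(2)


def meets_minimum(package, minimum="R13B"):
    """True if the package's Erlang release is at least the minimum release."""
    return erlang_release(package) >= erlang_release(minimum)


if __name__ == "__main__":
    for pkg in ("erlang-R12B-5.10.el5", "erlang-R14B-0.5"):
        print(pkg, "ok" if meets_minimum(pkg) else "too old")
```

This only compares the major release and letter, so patch levels such
as the "-5" in R12B-5 are ignored.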

Thanks, will report back after trial w/ newer versions.

KC
