`core/postgresql` stuck waiting for election

I think departure is not working. I wrote some additional shell functions to streamline querying the ring, and I see this after departing the old Postgres primary member_id and then rebuilding all 3 hosts that should run core/postgresql:

    service_group_leaders 'ssh ubuntu@staging-permanent-peer-0.domain.tld' 'postgresql.staging'
    Hostname:  ip-172-31-12-169
    Address:   172.31.12.169
    Alive:     false
    Leader:    true
    Departed:  true
    MemberID:  9df128caafd34ad3873c3e4c08596b7a
    ======

    service_group_members 'ssh ubuntu@staging-permanent-peer-0.domain.tld' 'postgresql.staging'
    Hostname:  ip-172-31-6-32
    Leader:    false
    Departed:  false
    MemberID:  572f4d4d34164be9a91d2b09f247ffb1
    ======
    Hostname:  ip-172-31-13-204
    Leader:    false
    Departed:  false
    MemberID:  68813a17f4044220b39685cf7a6c63f4
    ======
    Hostname:  ip-172-31-15-72
    Leader:    false
    Departed:  false
    MemberID:  f1b82aa4459e4cc4a16ea9bbef24af4e
    ======
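
A helper along these lines can reproduce that output by querying the Supervisor's HTTP gateway on port 9631 (a sketch rather than the exact functions used above; the /census endpoint's JSON field names are assumptions and may vary by Supervisor version):

    # Sketch only: query a remote Supervisor's census over SSH and print one
    # block per member of a service group. Assumes the HTTP gateway listens on
    # 127.0.0.1:9631 and that jq is available locally.
    service_group_members() {
      local ssh_cmd="$1"   # e.g. 'ssh ubuntu@staging-permanent-peer-0.domain.tld'
      local group="$2"     # e.g. 'postgresql.staging'
      ${ssh_cmd} 'curl -s http://127.0.0.1:9631/census' \
        | jq -r --arg g "$group" '
            .census_groups[$g].population[]
            | "Hostname:  \(.sys.hostname)",
              "Alive:     \(.alive)",
              "Leader:    \(.leader)",
              "Departed:  \(.departed)",
              "MemberID:  \(.member_id)",
              "======"'
    }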

Hello again! Thanks for your patience. I've done enough reading of the elections code now to have a decent general understanding of how it's supposed to work, and I see some places where it may be going wrong. However, that code also has woefully little logging, which makes it hard to tell exactly what's going wrong.

The thing that jumps out at me now is that (as you said) it's likely a problem with membership, which is consistent with the log message you posted in the first message:

    postgresql.staging(SR): Waiting to execute hooks; election in progress, and we have no quorum.

In order for an election to occur, a majority of the non-Departed nodes in the service group (that is, those that have added Service rumors and are Alive, Suspect, or Confirmed) must be Alive. I think the way we handle departing nodes needs to change, and I also think we need a mechanism for leaving a service group. These issues would explain why your election isn't proceeding due to lack of quorum.
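
Put as code, the rule reads roughly like this (a simplified sketch for illustration, not the Supervisor's actual implementation):

    // Sketch of the election quorum rule described above: among the members of
    // the service group that are not Departed, a strict majority must be Alive.
    #[derive(PartialEq)]
    enum Health {
        Alive,
        Suspect,
        Confirmed,
        Departed,
    }

    fn has_election_quorum(members: &[Health]) -> bool {
        let electorate = members.iter().filter(|h| **h != Health::Departed).count();
        let alive = members.iter().filter(|h| **h == Health::Alive).count();
        // A strict majority of the non-Departed electorate must be Alive.
        alive * 2 > electorate
    }

    fn main() {
        // Two Alive members plus one Confirmed-dead member: 2 of 3 are Alive,
        // so an election can proceed.
        assert!(has_election_quorum(&[Health::Alive, Health::Alive, Health::Confirmed]));
        // One Alive member plus one Suspect and one Confirmed member: only 1 of
        // 3 is Alive, so the Supervisor waits, as in the log message above.
        assert!(!has_election_quorum(&[Health::Alive, Health::Suspect, Health::Confirmed]));
        // Departing the dead members shrinks the electorate and restores quorum.
        assert!(has_election_quorum(&[Health::Alive, Health::Departed, Health::Departed]));
    }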

I also looked into whether we have an issue with the suitability hook. Originally, I thought the fact that it wasn't returning distinct values for different nodes was suspicious. And while that still seems wrong, it shouldn't cause an unbreakable tie, since we fall back to using the member ID. You pointed out this log message:

    postgresql.staging hook[init]:(HK): Waiting for leader to become available before initializing

If the suitability hook depends on the service itself, but the service can't run because it's waiting on elections, that could certainly deadlock things. However, it looks here like the init hook hasn't succeeded, so based on this code:

    pub fn suitability(&self) -> Option<u64> {
        if !self.initialized {
            return None;
        }
        self.hooks.suitability.as_ref().and_then(|hook| {
            hook.run(
                &self.service_group,
                &self.pkg,
                self.svc_encrypted_password.as_ref(),
            )
        })
    }

The service shouldn't be initialized yet, so I don't think the suitability hook should be getting called at all (again, more logging would help confirm this). Based on the postgres init hook code, I'd expect it to exit with a status of 1.

In that case, there should be a log from this line containing Initialization failed. Do you see that? If so, suitability is not the issue and we probably just need to address the membership/quorum problems that are preventing the election from getting started.

I do indeed see multiple occurrences of Initialization failed in the logs we had saved from the broken Postgres ring:

    null_resource.postgresql_services[1] (remote-exec): Oct 17 16:17:46 ip-172-31-5-152 hab[11107]: postgresql.staging(HK): Initialization failed! 'init' exited with status code 1

@bixu: have you tried running hab sup depart for the known-dead member IDs? We'll definitely work on fixing up our membership issues, but this might suffice as a workaround in the meantime.
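
For reference, that would look something like this, using the dead former leader's member ID from the census output above (run it against a Supervisor that is still a member of the ring):

    # Permanently depart the dead former leader from the ring by member ID.
    hab sup depart 9df128caafd34ad3873c3e4c08596b7a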

I'll add more logging to the elections code to make these kinds of issues easier to diagnose in the future.

We did indeed write some code to handle departures. However, that didn't seem to have the effect we wanted. Not that it had a bad effect, but the issues we are debugging here were present even after the departure code was added.

A couple of thoughts about the Postgres plan:

Regarding the suitability hook, there's definitely a clear bug in local_xlog_position: it should return an integer value even if the psql command is unsuccessful, probably 0.
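
For illustration, the fix could look something like this (a sketch, not the actual core/postgresql hook; the user, connection flags, and query are placeholders):

    # Sketch: always emit an integer, falling back to 0 when the query cannot
    # run (for example, when Postgres is down).
    local_xlog_position() {
      local pos
      pos=$(psql -U "$PG_USER" -h 127.0.0.1 -t -A -c \
              "SELECT CASE WHEN pg_is_in_recovery()
                      THEN pg_last_xlog_replay_location()
                      ELSE pg_current_xlog_location() END;" 2>/dev/null) || pos=""
      if [ -z "$pos" ]; then
        echo 0
      else
        # Convert an LSN like '0/3000060' into a single integer for comparison.
        echo $(( 16#${pos%/*} * 4294967296 + 16#${pos#*/} ))
      fi
    }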

I think suitability absolutely should still be based on the latest xlog position; this shouldn't be changed. The idea is that you don't accidentally elect a new leader that has older data, which can be disastrous. If two members have the same xlog position, they are arguably equally qualified to become the leader.

One thing we could do is add some number (1?) to the suitability value if the member is the current leader, making it less likely that you arbitrarily switch leaders in case of a topology change. Thoughts?
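
Concretely, something like this in the suitability hook (a sketch; it assumes the templated svc.me.leader flag is available to the hook):

    # Sketch: favor the sitting leader on exact ties by adding 1 to its score.
    suitability=$(local_xlog_position)
    {{#if svc.me.leader}}
    suitability=$(( suitability + 1 ))
    {{/if}}
    # The Supervisor parses the hook's last line of stdout as the suitability value.
    echo "$suitability"

Since the xlog position only ever grows, a bump of 1 can only break an exact tie in favor of the current leader; it never outweighs a genuinely newer WAL position on another member.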

Regarding the init hook bombing out if a leader isn't ready here, we may need to modify this behavior for an already established cluster - I'm not sure. While it makes sense during initial cluster setup (keep retrying the follower setup until the leader is ready), it's clearly impeding re-election. I'm open to ideas on what we can do here.
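
One possible direction, sketched below: only block on leader availability when this node has no initialized data directory yet (that is, during initial cluster setup), and let an already-initialized follower proceed so re-election isn't held up. The paths and template variables here are assumptions, not the actual hook:

    # Sketch: skip the "wait for leader" exit if this node already has an
    # initialized data directory, so an established cluster can re-elect even
    # while the old leader is down.
    {{#unless svc.me.leader}}
    if [ ! -f "{{pkg.svc_data_path}}/pgdata/PG_VERSION" ]; then
      # Fresh follower: we genuinely need the leader up before we can clone it.
      echo "Waiting for leader to become available before initializing"
      exit 1
    fi
    {{/unless}}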

pg_controldata could be used instead of a psql connection - it can report the WAL location even if the server is down.

There should also be checks for the system identifier (systemid) AND the timeline in there somewhere.
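
For example (a sketch; the field names are as printed by pg_controldata, and the data directory path is an assumption):

    # Sketch: read the WAL location, system identifier, and timeline without a
    # running server by parsing pg_controldata output.
    PGDATA="{{pkg.svc_data_path}}/pgdata"
    controldata=$(pg_controldata "$PGDATA")

    wal_location=$(echo "$controldata" | awk -F':[[:space:]]+' '/^Latest checkpoint location/ {print $2}')
    system_id=$(echo "$controldata" | awk -F':[[:space:]]+' '/^Database system identifier/ {print $2}')
    timeline=$(echo "$controldata" | grep -m1 "TimeLineID" | awk '{print $NF}')

    echo "wal=$wal_location systemid=$system_id timeline=$timeline"

The WAL location could feed the suitability calculation, while the system identifier and timeline would serve as a sanity check that a candidate actually shares the cluster history of the member it would replace.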


Great idea @jamessewell, pg_controldata is way better than depending on PG to be up!

Would you be interested in pairing up on implementing these checks? It seems like you have quite a bit of expertise on this topic!

I don't think you'd ever change leaders as long as the existing leader continues to be Alive, but I'll confirm.

That one I need to give some more thought to. I believe leader election should only require that the service is loaded (not that the init hook has completed). This may mean the suitability hook gives an error, but that can be dealt with.

I really don't agree with this approach - leader election should only use instances which are passing monitoring checks (although I know this is hard at the moment, as monitoring isn't first class).

Sure, happy to help.

This diagram from Patroni is obviously well beyond what we can currently do in Habitat - but it's a good reference for PostgreSQL cluster best practice.

[Patroni core loop diagram]

bodymindarts (he's not active in Habitat anymore) originally wrote the PostgreSQL core plan (very optimistically) as a drop-in replacement for Patroni.

Lack of application monitoring integration and the lack of a strongly consistent ring mean that we won't reach parity with Patroni any time soon - but we can get a lot closer!

There is definitely a bug here that https://github.com/habitat-sh/habitat/pull/5859 should address at least part of. I'll post again when the next release (planned for the week of Nov 26) comes out.

@baumanj, nice work on the fix in 0.69.0. I'm still testing, but it seems that the bad behavior we were seeing is now corrected and a dead leader will cause an election restart instead of a hang. Thanks!

Thanks for bringing the issue to our attention and documenting your experience so well, @bixu!