Hello again! Thanks for your patience. I've done enough reading of the elections code now to have a decent general understanding of how it's supposed to work, and I see some places where it may be going wrong. However, that code also has woefully little logging, which makes it hard to tell exactly what's going wrong.
The thing that jumps out at me now is that (as you said) it's likely a problem with membership, and that makes sense with the log message you posted in the first message:
postgresql.staging(SR): Waiting to execute hooks; election in progress, and we have no quorum.
In order for an election to occur, of all the nodes in the service group (that is, ones that have added Service rumors) which are not Departed (meaning Alive, Suspect or Confirmed), we must have a majority which are Alive. I think the way we handle departing nodes needs to be changed, and I also think we need a mechanism for leaving a service group. These issues would explain why your election isn't proceeding due to quorum issues.
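To make the quorum rule above concrete, here is a minimal sketch of how I read it. The `Health` enum and `has_quorum` function are hypothetical names for illustration, not the actual types in the elections code, and I'm assuming "majority" means a strict majority.

```rust
// Hypothetical types for illustration only; not the real Supervisor definitions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Health {
    Alive,
    Suspect,
    Confirmed,
    Departed,
}

/// Returns true when a strict majority of the non-Departed members are Alive.
/// Assumption: every entry in `members` has added a Service rumor for the group.
fn has_quorum(members: &[Health]) -> bool {
    // Everyone who is not Departed still counts toward the total.
    let electorate = members
        .iter()
        .filter(|h| **h != Health::Departed)
        .count();
    // Only Alive members count toward the majority.
    let alive = members.iter().filter(|h| **h == Health::Alive).count();
    electorate > 0 && alive * 2 > electorate
}
```

Under these assumptions, `has_quorum(&[Health::Alive, Health::Confirmed, Health::Confirmed])` is false, while the same group with the two dead nodes marked Departed (`[Health::Alive, Health::Departed, Health::Departed]`) is true. That is why nodes that linger as Confirmed instead of becoming Departed can block an election.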
I also looked into whether we have an issue with the suitability hook. Originally, I thought the fact that it wasn't returning distinct values for different nodes was suspicious. And while that still seems wrong, it shouldn't cause an unbreakable tie, since we fall back to using the member ID. You pointed out this log message:
postgresql.staging hook[init]:(HK): Waiting for leader to become available before initializing
If the suitability hook depends on the service itself, but the service can't run because it's waiting on elections, that could certainly deadlock things. However, it looks here like the init hook hasn't succeeded, so based on this code:
pub fn suitability(&self) -> Option<u64> {
    if !self.initialized {
        return None;
    }
    self.hooks.suitability.as_ref().and_then(|hook| {
        hook.run(
            &self.service_group,
            &self.pkg,
            self.svc_encrypted_password.as_ref(),
        )
    })
}
The service shouldn't be initialized and I don't think the suitability hook should be getting called at all (again, more logging could help to know), but based on the postgres init hook code, I'd expect it to exit with a status of 1. In that case, there should be a log from this line containing Initialization failed. Do you see that? If so, suitability is not the issue and we probably just need to address the membership/quorum problems that are preventing the election from getting started.