`core/postgresql` stuck waiting for election

I have a 3-node topology set up to exercise Postgresql leader election using Habitat. The initial setup looked good (a leader and two followers were automatically created), but upon rebuilding one of the followers I see this in the Supervisor logs and my services hang:

Oct 08 15:28:08 staging-fullstack-0 hab[13354]: postgresql.staging(SR): Waiting to execute hooks; election in progress, and we have no quorum.
Oct 08 15:28:09 staging-fullstack-0 hab[13354]: postgresql.staging(HK): Hooks compiled
Oct 08 15:28:10 staging-fullstack-0 hab[13354]: postgresql.staging(HK): Hooks compiled
Oct 08 15:28:11 staging-fullstack-0 hab[13354]: postgresql.staging(HK): Hooks compiled
Oct 08 15:28:28 staging-fullstack-0 hab[13354]: postgresql.staging(HK): Hooks compiled
Oct 08 15:28:29 staging-fullstack-0 hab[13354]: postgresql.staging(HK): Hooks compiled

How can I debug this, or what have I done wrong?

You should set RUST_LOG=debug when starting the supervisors

If you want a little less noise, you can also set

RUST_LOG=debug,tokio_core::reactor=info
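
For example, you can prefix the Supervisor invocation directly, or (since your logs look systemd-managed) add the variable via a drop-in override. The unit name hab-sup below is just an assumption - adjust to whatever your hosts actually use:

RUST_LOG=debug,tokio_core::reactor=info hab sup run

# or, for a systemd-managed Supervisor (unit name assumed):
sudo systemctl edit hab-sup
#   [Service]
#   Environment=RUST_LOG=debug,tokio_core::reactor=info
sudo systemctl restart hab-sup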

@elliott-davis / @baumanj, I see a large number of messages like:

Oct 09 12:39:32 staging-fullstack-0 hab[22784]: DEBUG 2018-10-09T12:39:32Z: habitat_butterfly::rumor::election: Equal: Election { from_id: "d19c9c9a31f84c68a2e081dbb58d2770", member_id: "d19c9c9a31f84c68a2e081dbb58d2770", service_group: ServiceGroup("postgresql.staging"), term: 1, suitability: 83887240, status: Running, votes: ["d19c9c9a31f84c68a2e081dbb58d2770", "a0a6869826124dc18a6d538f015160a9", "7b2f61063fbc4775bb0727540b86ad6b"] } Election { from_id: "d19c9c9a31f84c68a2e081dbb58d2770", member_id: "d19c9c9a31f84c68a2e081dbb58d2770", service_group: ServiceGroup("postgresql.staging"), term: 1, suitability: 83887240, status: Running, votes: ["d19c9c9a31f84c68a2e081dbb58d2770", "a0a6869826124dc18a6d538f015160a9", "7b2f61063fbc4775bb0727540b86ad6b"] }

That is incredibly weird, are you still seeing that?

Yes. In general it’s proving hard to debug leader-election behavior - are there any standard tricks the Habitat core folks use for this?

Actually, the Supervisor is still an area of the codebase I haven’t touched or done much debugging in. I think for this we need to summon the venerable @christophermaier and @baumanj!

Thanks for your patience, @bixu; I know it’s taken a bit to get back to you.

The message you’re seeing so many of comes from this line in Election's implementation of the Rumor trait’s merge operation. All that implies is that the election struct we’re seeing is equal to “self”; that is, we’re seeing our own vote gossiped back to us.

Are you seeing any other messages in the habitat_butterfly::rumor::election namespace? It would be especially helpful to be able to compare the logs of each of the 3 nodes.

One thing that you can interrogate directly is the value of the suitability hook. When you run /hab/svc/postgresql/hooks/suitability, it’s important that you get different values on the different nodes.

Hi @baumanj,
I think you hit on something important here - the values returned for “suitability” are all the same when the ring/mesh has just been built:

General-Contact-Unit:staging blake$ pssh -h "$HOME/.postgresql.staging.ips" -l ubuntu --inline 'sudo hab pkg exec core/postgresql bash /hab/svc/postgresql/hooks/suitability'
[1] 12:28:05 [SUCCESS] 172.31.12.169
67109184
[2] 12:28:05 [SUCCESS] 172.31.0.149
67109184
[3] 12:28:05 [SUCCESS] 172.31.8.202
67109184

After destroying/departing one member of the ring and building again, on that new member I can’t run the suitability hook at all because it uses psql, which can’t connect to the database process due to:

Oct 17 10:44:22 ip-172-31-13-47 hab[11095]: postgresql.staging hook[init]:(HK): Waiting for leader to become available before initializing

So maybe this is a simple race condition?

@eeyun, you and @bodymindarts were the authors of suitability - any thoughts?

@baumanj & @eeyun, looking at this code:

I think we may have a bug here. If the election depends on the value returned by suitability differing between nodes, we won’t get that in a ring that has quiesced before a host is rebuilt. I know that might be an unlikely scenario, but we need to account for it - xlog position can certainly be the same across all members of a topology in some cases.
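
To make that concrete, here is roughly the kind of check I mean. The connection flags are illustrative (they depend on how your core/postgresql instance is configured), and on 9.x the function is pg_current_xlog_location():

# run on each member; identical output means the xlog-based suitability input is identical
sudo hab pkg exec core/postgresql psql -h 127.0.0.1 -U admin -t -c 'SELECT pg_current_xlog_location();'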

Also I’m seeing no leaders after rebuilding different parts of the ring:

$ curl  --silent "<host>":9631/census | jq -r ".census_groups | .\"postgresql.staging\" | .population | .[] | select(.alive == true) | .leader"
false
false
false
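
A variant of the same query that prints one line per living member makes the tie easier to see (same census endpoint, just different jq):

curl --silent "<host>":9631/census \
  | jq -r '.census_groups."postgresql.staging".population[] | select(.alive == true) | "\(.member_id) alive=\(.alive) leader=\(.leader)"'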

The suitability hook for postgres looked suspicious to me too, @bixu. I think that may benefit from a change, but I also think we have an issue on the rust side of the house if we’re not breaking ties.

The thing that would help me the most here is more complete logs from all 3 nodes from the beginning of the election process so I can trace through how the algorithm is behaving.

@baumanj, I’m going to try to DM you this wad of logs - don’t really think they will be helpful or appropriate here. We can update this thread with any findings, if that’s okay?

Sounds good. The other thing I’d be curious about is if the behavior is fixed if you were to replace the suitability hook with something that would return distinct values on the different nodes.
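
Even something as crude as this would do for the experiment - a throwaway sketch that ignores xlog position entirely and just prints a distinct integer derived from the node’s IP (assuming hostname -I is available on your hosts):

#!/bin/sh
# throwaway test suitability hook: print a per-node-distinct integer
ip=$(hostname -I | awk '{print $1}')
echo "${ip##*.}"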

Just to clarify, you are saying that the Supervisor needs to pick a ring member (break the tie) when all suitability hooks return the same result? Because if this is true, then the behavior of suitability in core/postgresql is correct - it simply returns an integer value for log position, and whatever member service has the most recent log position should be elected leader - if all are at the same log position, any member can be safely chosen.

Based on the code, it looks like the supervisor should break ties based on member ID, so I’m still not sure why it would be getting stuck.

Nonetheless, it seems like the suitability hook is intended to return distinct values, otherwise why implement one at all? What I was hoping to see by ensuring different suitability values on the different nodes is a narrowing of the theory explaining the behavior. If the election no longer stalls, that implies there’s a bug (not immediately obvious to me) in the tie-breaking code. If it still stalls, that implies there’s likely an issue outside the core voting logic.

I’m going to cast my vote for a Supervisor bug then. The behavior of suitability makes sense to me given the expected behavior of the Supervisor. I’ll continue to try to reproduce as I have spare cycles.

It also looks like leader election for core/postgresql is just fundamentally broken. In local testing with Docker-backed Supervisors the ring can’t survive the departure of the Postgres leader - we drop into

postgresql.default hook[health_check]:(HK): Master pg is unavailable at 172.17.0.5 ! Health State for follower is critical

and remain there. So I’m now at a bit of a loss as to whether pursuing “habitized” Postgres makes sense…

The good news, @baumanj, is that I think I can now reproduce the behavior with local, Docker-backed Studios:

HAB_STUDIO_SUP=false hab studio run 'hab sup run' # this will be our Permanent Peer

# run the following command 3x or more to create a ring of Postgres services:
HAB_STUDIO_SUP=false hab studio run 'hab sup run core/postgresql \
  --topology=leader \
  --peer=172.17.0.2 \
  --peer=172.17.0.3 \
  --peer=172.17.0.4'

Wait for leader election to complete, then kill all containers running the Postgres service:

docker rm -f <postgres container 0 id> <postgres container 1 id> <postgres container 2 id>
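
If it’s not obvious which containers those are, here’s a quick way (not from anything above, just a convenience) to map container names to the 172.17.0.x peer addresses:

docker ps -q | xargs docker inspect \
  --format '{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'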

Replace the removed ring members:

# run the following command 3x or more to create a ring of Postgres services:
HAB_STUDIO_SUP=false hab studio run 'hab sup run core/postgresql \
  --topology=leader \
  --peer=172.17.0.2 \
  --peer=172.17.0.3 \
  --peer=172.17.0.4'

You should now see errors just like the one I reported initially:

postgresql.default hook[init]:(HK): Ensuring proper ownership: chgrp -RL 42 /hab/svc/postgresql/var /hab/svc/postgresql/data/pgdata /hab/svc/postgresql/data/archive
postgresql.default hook[init]:(HK): Making sure replication role exists on Master
postgresql.default hook[init]:(HK): Waiting for leader to become available before initializing
postgresql.default(HK): Initialization failed! 'init' exited with status code 1

I think what’s happening is that we don’t have a way for a new Postgres to take over the leader position. Is this a case where we need to wait 72 hours for the old leader’s member ID to depart the ring?

To further debug this in my Staging environment (where I see “Waiting for leader” messages like the one in my troubleshooting example above), I wrote the following shell function to help clarify the state of leadership in the ring:

# Print the leader record (if any) for a service group, as seen by the
# Supervisor HTTP gateway on a remote host.
service_group_leader() {
  if [ -z "$2" ]
  then
    echo ""
    echo "usage: ${FUNCNAME[0]} '<ssh user@host>' '<service.group>'"
    echo ""
    return 1
  fi

  ssh_command=$1
  service_group=$2

  # Fetch the full census from the remote Supervisor's HTTP gateway.
  census_data=$($ssh_command 'curl --silent localhost:9631/census')
  # Pull out whichever member claims leadership of the service group.
  leader_data=$(echo "$census_data" | jq -r ".census_groups | .\"$service_group\" | .population | .[] | select(.leader == true)")

  echo "Leader IP:        $(echo "$leader_data" | jq -r '.sys | .ip')"
  echo "Leader alive:     $(echo "$leader_data" | jq -r '.alive')"
  echo "Leader member_id: $(echo "$leader_data" | jq -r '.member_id')"
}

Querying different members of the ring/mesh (when are we going to start calling it a mesh instead of a ring? :wink: ) showed me that I do indeed have leadership records for a dead member:

service_group_leader 'ssh ubuntu@staging-permanent-peer-0.domain.tld' 'postgresql.staging'
Leader IP:        172.31.12.169
Leader alive:     false
Leader member_id: 9df128caafd34ad3873c3e4c08596b7a

service_group_leader 'ssh ubuntu@staging-postgresql-0.domain.tld' 'postgresql.staging'
Leader IP:        172.31.12.169
Leader alive:     false
Leader member_id: 9df128caafd34ad3873c3e4c08596b7a
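
One follow-up I’m considering (not something suggested above, and it assumes this Supervisor version supports hab sup depart) is to manually depart that dead member ID and see whether a fresh election starts:

hab sup depart 9df128caafd34ad3873c3e4c08596b7a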