Hello,
I think this might be unintended behaviour and deserving of an issue on github, but I wanted to talk through what I’m seeing as maybe I’m just thinking about it wrong…
Anyway, let’s assume we have two services: my/a and my/b. Each service runs in leader/follower topology, so we need at least three instances of my/a and three of my/b. my/b binds to my/a.
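(For anyone less familiar with binds, here’s roughly what that dependency looks like in my/b’s plan.sh. The `port` export is an assumption about what my/a publishes; the rest is standard pkg_binds syntax.)
pkg_name=b
pkg_origin=my
pkg_version="0.1.0"
# Require a bind named `super-important-bind`; it is only satisfied by a
# service group that exports a `port` value (assumed here).
pkg_binds=(
  [super-important-bind]="port"
)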
So in a sane world, hopefully you’d run, once on each of three Supervisors:
hab svc load my/a --topology leader
hab svc load my/a --topology leader
hab svc load my/a --topology leader
Then:
hab svc load my/b --topology leader --bind super-important-bind:a.default
hab svc load my/b --topology leader --bind super-important-bind:a.default
hab svc load my/b --topology leader --bind super-important-bind:a.default
As expected, you get:
a.default(SR): Waiting to execute hooks; election in progress, and we have no quorum.
until we have three instances of my/a running, and then:
a.default(HK): Hooks compiled
However, when we run an instance of my/b first:
hab svc load my/b --topology leader --bind super-important-bind:a.default
we get:
b.default(SR): Waiting for service binds...
b.default(SR): The specified service group 'a.default' for binding 'super-important-bind' is not (yet?) present in the census data.
This is expected… (sort of… actually, I’d expect electing a leader to take priority over service bindings… but let’s put a pin in that for a moment.)
So if we start three instances of my/b so that they have a quorum and elect a leader, we’re waiting on service bindings to be gossiped.
It appears that as soon as I launch one instance of my/a, that binding is gossiped, and my/b tries to start, even though my/a is not yet “up”.
What I think should happen is that my/b would continue to wait for my/a to reach consensus before compiling hooks and attempting to actually start my/b.
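(As a stopgap, the behaviour I want can be roughly approximated with a guard in my/b’s run hook that refuses to start the real process until something in the bound group is actually answering. This is only a sketch: it assumes my/a exports a `port` setting, that `nc` is available in the hook’s environment, and that the hyphenated bind name renders cleanly in the template path, which I haven’t verified. `b-daemon` stands in for whatever my/b actually runs.)
#!/bin/bash
# hooks/run for my/b -- sketch of a "wait for the bound group" guard.
exec 2>&1

bound_service_reachable() {
{{#each bind.super-important-bind.members}}
  # Probe each member of a.default; succeed as soon as one accepts connections.
  if nc -z {{sys.ip}} {{cfg.port}}; then
    return 0
  fi
{{/each}}
  return 1
}

until bound_service_reachable; do
  echo "super-important-bind not reachable yet; waiting..."
  sleep 5
done

# Hypothetical start command for my/b's actual process.
exec b-daemon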
While it isn’t actually causing a problem for me (yet), and seems to really only occur during the initial setup of services (and again, only when services are started “out of order”), I figured that if I was running into it, maybe someone else had run into it too?
You can replicate what I’m experiencing with the pd/tikv containers I’ve built (this assumes you have no more than one container running already; adjust the --peer addresses accordingly):
## Start TiKV first; normally we'd start this second
# 1x - Start TiKV which binds to `qbrd/pd`
docker run --rm --ulimit nofile=82920:82920 --memory-swappiness 0 --sysctl net.ipv4.tcp_syncookies=0 --sysctl net.core.somaxconn=32768 qbrd/tikv --peer 172.17.0.3 --peer 172.17.0.5 --bind pd:pd.default --topology leader
## Start PD
# 1x - Start PD which provides the binds for `qbrd/tikv`
docker run --rm qbrd/pd --peer 172.17.0.3 --peer 172.17.0.5 --topology leader
# 2x - Start the rest of TiKV in the background
docker run -d --rm --ulimit nofile=82920:82920 --memory-swappiness 0 --sysctl net.ipv4.tcp_syncookies=0 --sysctl net.core.somaxconn=32768 qbrd/tikv --peer 172.17.0.3 --peer 172.17.0.5 --bind pd:pd.default --topology leader
# 2x - Start the rest of the PD cluster in the background, everything should come up
docker run -d --rm qbrd/pd --peer 172.17.0.3 --peer 172.17.0.5 --topology leader
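(If it helps while reproducing, you can watch each Supervisor’s view of things from the host; the container name is whatever Docker assigned:)
# Follow a Supervisor's output (the election / bind messages quoted above)
docker logs -f <container-name>
# Ask a Supervisor what it has loaded and the current state of each service
docker exec <container-name> hab svc status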
Note that if we start one TiKV and then one PD, we get:
tikv.default(SR): Waiting to execute hooks; election in progress, and we have no quorum.
tikv.default(SR): The group 'pd.default' satisfies the `pd` bind
but if TiKV (my/b) has quorum before PD (my/a), then TiKV tries to start, but fails:
tikv.default(E): 2018/08/25 05:48:28.438 subchannel.c:687: [INFO] Connect failed: {"created":"@1535176108.438894026","description":"Failed to connect to remote host: OS Error","errno":111,"file":"/home/jenkins/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.2.3/grpc/src/core/lib/iomgr/tcp_client_posix.c","file_line":201,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:172.17.0.6:2379"}
which basically says “couldn’t connect to my service binding”
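(You can confirm from the host that nothing is listening on the bound address yet; the IP and port here are the ones from the error above, and this assumes the Docker bridge is reachable from the host:)
# Probe the address TiKV is trying to reach; "connection refused" confirms PD isn't up yet
nc -zv 172.17.0.6 2379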
So.
To summarize: I think that services that run in leader topology and bind to other services that also run in leader topology should wait for the bound service group to have quorum and elect a leader before initially compiling hooks and attempting to start. If the goal is to enable “eventual consistency”, then this is a minor but important improvement. However, I also think there’s a case to be made for “well, if service b depends on service a, make sure that service is up first”.
Thoughts?