Bind is gossiped before service is fully up?

Hello,

I think this might be unintended behaviour deserving of an issue on GitHub, but I wanted to talk through what I'm seeing first, as maybe I'm just thinking about it wrong…

Anyway, let's assume we have two services: my/a and my/b. Each runs in a leader/follower topology, so we need at least three instances of each, and my/b binds to my/a.

So in a sane world, hopefully you'd run:

hab svc load my/a --topology leader
hab svc load my/a --topology leader
hab svc load my/a --topology leader

Then:

hab svc load my/b --topology leader --bind super-important-bind:a.default
hab svc load my/b --topology leader --bind super-important-bind:a.default
hab svc load my/b --topology leader --bind super-important-bind:a.default

As expected, a.default reports:

a.default(SR): Waiting to execute hooks; election in progress, and we have no quorum.
a.default(HK): Hooks compiled

Until we have three instances of my/a running.
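
(As an aside: if you want to watch the group converge, `hab svc status` and the Supervisor's HTTP gateway should do the trick — assuming the gateway is on its default port, 9631:

hab svc status
curl -s localhost:9631/census

The census endpoint should show the members, and any elected leader, that the Supervisor currently knows about.)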

However, when we run an instance of my/b first:

hab svc load my/b --topology leader --bind super-important-bind:a.default                                                                                                                                          

we get:

b.default(SR): Waiting for service binds...                                                                                                                                                                        
b.default(SR): The specified service group 'a.default' for binding 'super-important-bind' is not (yet?) present in the census data.

This is expected… (sort of… actually, I'd expect electing a leader to take priority over service bindings, but let's put a pin in that for a moment).

So if we start three instances of my/b so that they have a quorum and elect a leader, we’re waiting on service bindings to be gossiped.

It appears that as soon as I launch one instance of my/a, that binding is gossiped and my/b tries to start, even though my/a is not yet "up".

What I think should happen is that my/b would continue to wait for my/a to reach consensus before compiling hooks and actually attempting to start my/b.
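
In the meantime, a rough workaround sketch (not a fix for the gossip ordering itself) would be an init hook in my/b that blocks until the bound leader actually answers. This assumes my/a exports a `port` value and that `nc` is available to the service; the bind fields come from the Supervisor's template data:

#!/bin/sh
# hooks/init -- sketch only; block until the bound service accepts connections
until nc -z {{bind.super-important-bind.first.sys.ip}} {{bind.super-important-bind.first.cfg.port}}; do
  echo "super-important-bind not answering yet, waiting..."
  sleep 2
done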

While it isn't actually causing a problem for me (yet), and seems to only occur during the initial setup of services (and again, only when services are brought up "out of order"), I figured if I was running into it, maybe someone else had too?

You can replicate what I'm experiencing with the PD/TiKV containers I've built (this assumes you have no more than one container running already; adjust the peers accordingly):

## Start TiKV first; normally we'd start this second
# 1x - Start TiKV, which binds to `qbrd/pd`
docker run --rm --ulimit nofile=82920:82920 --memory-swappiness 0   --sysctl net.ipv4.tcp_syncookies=0 --sysctl net.core.somaxconn=32768   qbrd/tikv --peer 172.17.0.3 --peer 172.17.0.5 --bind pd:pd.default --topology leader
## Start PD
# 1x - Start PD, which provides the binds for `qbrd/tikv`
docker run --rm qbrd/pd --peer 172.17.0.3 --peer 172.17.0.5 --topology leader

# 2x - Start the rest of TiKV in the background
docker run -d --rm --ulimit nofile=82920:82920 --memory-swappiness 0   --sysctl net.ipv4.tcp_syncookies=0 --sysctl net.core.somaxconn=32768   qbrd/tikv --peer 172.17.0.3 --peer 172.17.0.5 --bind pd:pd.default --topology leader

# 2x - Start the rest of the PD cluster in the background, everything should come up
docker run -d --rm qbrd/pd --peer 172.17.0.3 --peer 172.17.0.5 --topology leader

Note that if we start one TiKV, then one PD, we get:

tikv.default(SR): Waiting to execute hooks; election in progress, and we have no quorum.
tikv.default(SR): The group 'pd.default' satisfies the `pd` bind

but if TiKV (my/b) has quorum before PD (my/a), then TiKV tries to start, but fails:

tikv.default(E): 2018/08/25 05:48:28.438 subchannel.c:687: [INFO] Connect failed: {"created":"@1535176108.438894026","description":"Failed to connect to remote host: OS Error","errno":111,"file":"/home/jenkins/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.2.3/grpc/src/core/lib/iomgr/tcp_client_posix.c","file_line":201,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:172.17.0.6:2379"}

which basically says "couldn't connect to my service binding".

So.

To summarize: I think services that run in a leader topology and bind to other services that also run in a leader topology should wait for the bound service to have quorum and elect a leader before initially compiling hooks and attempting to start. If the goal is to enable "eventual consistency", then this is a minor but important improvement. However, I also think there's a case to be made for "well, if service b depends on service a, make sure service a is up first".

Thoughts?

I think you’re on to something. This seems like a variation on https://github.com/habitat-sh/habitat/issues/5327, which argues that we shouldn’t broadcast the fact that we’re running a given service until it’s actually running and healthy.

I haven’t reasoned through all the implications of the change you suggest, but it seems reasonable on its surface. Would you mind filing an issue and tagging me on it?

Thanks!

@christophermaier you know… when I start with a forum post, you always ask for an issue, when I start with an issue, there’s always a duplicate… :stuck_out_tongue: lol

Do you think this warrants a discrete issue, or can I just pile on #5327?

:smile:

I think capturing the leadership election angle of this warrants a distinct issue, particularly since you’ve gathered a fair bit of background information and reproduction steps here. Ultimately, I feel like they should be linked together, or otherwise grouped, though.

@qubitrenegade when you get an issue filed, would you mind linking it back here? I'll flag it as the solution for future humans.

@elliott-davis @christophermaier, ok, sorry it took a bit, real life, ya know... Here's the issue: Bind gossiped before dependent leader/follower service group is fully up (habitat-sh/habitat#5599): https://github.com/habitat-sh/habitat/issues/5599

I added these comments to the end of the issue, but perhaps this is a better forum (ha, let's say pun intended) for discussion:

  • Should the behavior I'm experiencing be expected when --binding-mode=relaxed?
  • Perhaps there should be a demarcation between pkg_binds and pkg_binds_optional?

Quoting from the documentation:

By setting --binding-mode=relaxed when loading a service, that service can start immediately, whether there are any members of a bound service group present or not.

…though relaxed will be the eventual default for Habitat 1.0.0.

Such a service should have configuration and lifecycle hook templates written in such a way that the service can remain operational (though perhaps with reduced functionality) when there are no live members of a bound service group present in the network census.
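
For reference, if I'm reading the docs right, loading my example my/b in relaxed mode would just be:

hab svc load my/b --topology leader --binding-mode=relaxed --bind super-important-bind:a.default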

To my thinking, if I'm defining a pkg_bind, that means I need my service to bind to this other service. I'm thinking of a website (Artifactory, say) and a database. A website is likely useless without a database, so not starting until that bind is "available" seems like the sane thing to do. (Are there really more applications in the world that can run in a diminished capacity when a service they depend on doesn't exist than ones that can't? It seems to me like --binding-mode=relaxed would be the exception and thus shouldn't be the default... but perhaps I'm thinking about things incorrectly? I guess as long as we know what to expect, we'll know how to work around it.)
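
In plan-speak, what I'm describing is something like this (rough sketch; the names are made up, and it assumes the database plan exports a `port`):

# plan.sh for the hypothetical website
pkg_origin=my
pkg_name=website
pkg_binds=(
  [database]="port"
)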

Again, I could be totally just thinking about this in the wrong way, and @christophermaier you'll probably come in and point out something glaringly obvious that I overlooked :slight_smile: , so this is really me just thinking "out loud"...

The thinking is that you want services that are somewhat resilient. While your entire web application might not function without a connection to a database, there's probably some functionality it can continue to provide. Similarly, can it deal well with a service it depends on going away once it has already started? Otherwise, if one service goes down, it would take down everything that depends on it, which would take down everything that depends on them, etc.
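
As a rough illustration (the names here are made up, and the exact fields depend on what the bound service actually exports), a config template can degrade gracefully instead of refusing to render:

# config/app.toml (sketch)
{{#if bind.database.first}}
database_host = "{{bind.database.first.sys.ip}}"
database_port = {{bind.database.first.cfg.port}}
degraded_mode = false
{{else}}
# no live database member in the census yet; keep serving what we can
degraded_mode = true
{{/if}}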

(Note that we aren't currently doing anything special with a "strict binding" service if a service it depends on goes away :grimacing: Whether there is something we could reasonably do is open for discussion, as is the question of whether there might be a meaningful distinction to be made between "starting in the absence of a dependency" vs. "continuing in the absence of a dependency". I'm not sure if there is, or how we might best model that, but maybe there's something interesting in there.)

I agree completely that we shouldn’t be advertising that we’ve got a service until it’s actually running, and once that’s in place, things should behave more sanely.

As for pkg_binds_optional, this is likely an instance of "naming is hard" :slight_smile: These are optional in the sense that you are providing more than one way that your service can run: maybe you have a service that can run against PostgreSQL or MySQL, but those have different binding contracts, or your configuration templates respond differently based on whether you're using the postgres bind or the mysql bind. You don't need to bind to a MySQL server if you're binding to a PostgreSQL server, and vice versa, so those binds would be optional. If you've bound to postgres, we're just not going to care about trying to satisfy the mysql bind. This example is maybe a bit contrived, but hopefully it's useful.
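
In a plan, that might look roughly like this (assuming both database plans export a `port`):

pkg_binds_optional=(
  [postgresql]="port"
  [mysql]="port"
)

At load time you'd then pass only the bind you actually want, e.g. --bind postgresql:postgresql.default.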