Yo,
I'm in the unfortunate position of having built many orchestrations
like this around Chef, many of them in private organizations, not to
be open sourced.
Many of them were scoffed at by underwhelmed CxOs who have spent too much
time reading the Wikipedia definition of Orchestration or having
"process" or "workflow" managers force-fed to them by VCs and Big
Enterprise.
WOT:
On 29 January 2011 09:26, Chris Walters <cw@opscode.com> wrote:
Ohai Chefs!
We're in the preliminary stages of designing possible solutions for
orchestration and would like to understand the community's
requirements.
I'm going to write down my thoughts and questions. Nothing is gospel,
so please feel free to comment on everything, including the framing.
Background:
Chef, as currently conceived, does a great job of exposing a model for
how to get a system from either an embryonic state or a slightly
misconfigured state to the desired state, mainly via the mechanism of
resource idempotence.
What I think is not yet well-modeled is how to go from one
well-configured state to a completely different well-configured
state. It also doesn't yet model synchronization of actions across
multiple boxes in that there isn't a first-class way to gate actions
that are dependent on the completion of steps on other servers. For
example, a complex migration or deployment might require bringing
boxes up or down, copying data, cleanly removing artifacts or services
installed by previous chef runs, not restarting load balancers until
some quorum of webservers have re-started, etc.
We'd like to collect the use cases, requirements, and thoughts that
best serve the community.
It would be great to have something built in for Chef, and that is the
road I had been walking with Pylon, a gem for chef that has a DCell
substrate running in the background; then you get actors and
messaging, and you can just build shit.
Obviously this approach doesn't work for most people because you have
to ship code, it's moderately complex, etc., but it's what I've been
wanting to build to solve this.
- What do you think the scope of orchestration is and is not?
I didn't read or write any books on this shit, so yeah, ymmv:
when I have built things to solve orchestration, the primary use case
has generally been a directory service: the ability for a recipe to
register a service (with all of the parameters required to connect to
the service) in the directory. The other half of that is the client
recipes that need to use those services. They should either error and
relaunch with a fresh state, or block [if you like].
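To make the "error or block" choice concrete, here is a minimal sketch in plain Ruby. It is not Chef code and not what Pylon or chef-discovery do: ServiceDirectory, discover! and discover_blocking are made-up names, and the directory is just an in-memory hash standing in for whatever shared store you would really use.

```ruby
# Hypothetical in-memory stand-in for the directory. A real one would live
# in some shared store that every node can reach.
class ServiceDirectory
  class ServiceNotFound < StandardError; end

  def initialize
    @services = {}
    @lock = Mutex.new
  end

  # Provider side: publish everything a client needs to connect.
  def register(name, params)
    @lock.synchronize { @services[name] = params.dup }
  end

  # Client side, option 1: fail fast so the run can error and relaunch.
  def discover!(name)
    found = @lock.synchronize { @services[name] }
    raise ServiceNotFound, "no #{name} registered" if found.nil?
    found
  end

  # Client side, option 2: block (by polling) until the service shows up
  # or the timeout expires.
  def discover_blocking(name, timeout: 5, interval: 0.05)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    loop do
      found = @lock.synchronize { @services[name] }
      return found if found
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        raise ServiceNotFound, "timed out waiting for #{name}"
      end
      sleep interval
    end
  end
end
```

The fail-fast variant fits a client run that is cheap to relaunch; the blocking variant fits a client that would rather wait for the provider to come up.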
- What are the use cases that you would like to see an orchestration
system/DSL accommodate? The more specific and granular the steps of
the orchestration, the better. (If you would not like your use case
made public but would nonetheless like it considered during design,
validation, and testing, please send it to me directly at
cw@opscode.com.)
2x loadbalancer
4x webserver all launched
requirement: webservers are added to the loadbalancer table only when
the deploy is complete, not just when the node has converged
jenkins (ci, deploy) -> publishes packages, deploy messages, from/to version
loadbalancer -> talks to all active webservers via substrate
loadbalancer
webserver
webserver
webserver
webserver
requirement: binary packaged asset published by jenkins system is
rolling deployed to webservers with 0 downtime at the loadbalancer
layer
Webservers 1-4 receive a "deploy" message and reach consensus; a leader
is allocated for the deploy slot, and the leader signals the other
workers, one by one, to perform the deploy, smoketest, and re-add to
the pool. No outage is visible to the loadbalancer layer, as the
connections are presented to webservers through a consensus protocol
FSM replicator (e.g. Paxos).
We could trigger an alert condition when one of the deploy slots fails,
or even aggressively destroy and rebuild it.
You could do an A/B-style cutover with this too; it would just be
another signalling strategy locked down by a leader.
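A rough sketch of the one-by-one part, in plain Ruby. Everything here is hypothetical (Pool, Webserver, rolling_deploy), and the hard bit, electing the leader via consensus, is assumed to have already happened and is not modelled at all. It only shows the loop the leader would drive: pull a node from the pool, deploy, smoketest, and put it back only if the smoketest passes.

```ruby
# Hypothetical loadbalancer pool: just the list of nodes taking traffic.
class Pool
  attr_reader :members

  def initialize(members)
    @members = members.dup
  end

  def remove(node)
    @members.delete(node)
  end

  def add(node)
    @members << node unless @members.include?(node)
  end
end

# Hypothetical webserver. `healthy` fakes the smoketest result.
Webserver = Struct.new(:name, :version, :healthy) do
  def deploy(to_version)
    self.version = to_version
  end

  def smoketest
    healthy
  end
end

# Run by the (already elected) leader. Nodes are handled strictly one at a
# time, so only one healthy node is out of the pool at any moment.
# Returns the nodes that failed their smoketest and were left out.
def rolling_deploy(pool, workers, to_version)
  failed = []
  workers.each do |node|
    pool.remove(node)
    node.deploy(to_version)
    if node.smoketest
      pool.add(node)
    else
      failed << node # alert on it, or destroy and rebuild it
    end
  end
  failed
end
```

A node that fails its smoketest simply stays out of the pool, which is where the alert or the destroy-and-rebuild would hook in.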
note: I'm currently trying to build this, I don't know what it will
look like or why I am trying to build it, but it's chock full of
science and shit: fujin/pylon on GitHub, feature/paxos branch.
The actor concurrency model has been great for prototyping
multi-decree paxos.
Here's the "search based" one we use for day to day, non crazy batman
shit: https://github.com/fujin/chef-discovery
- What generic primitives do you think would be useful in such a
system?
You probably want to have some hash values that the client, when
calling discover_service, can use to actually talk to it, right?
register_service :service, options = {}
Find the latest instantiation of this service? find the leader?
Restrict to environment? Get the ipaddress, get the options?
discover_service :service
How do you decide which copy of the service you want, if multiple
are available? Where is the conflict resolution handled?
Where is the state stored? What is the possibility that system
decisions will be made without consistent state?
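Here is one way those two primitives and the questions around them could hang together, as a plain Ruby sketch. The field names (environment, ipaddress, registered_at, leader) and the `prefer:` policy argument are assumptions made for illustration, not an existing Chef or chef-discovery API, and the registry is an in-process array, which conveniently dodges the consistent-state question entirely.

```ruby
# Hypothetical sketch: an in-process registry with an explicit policy for
# choosing between multiple copies of a service.
module Discovery
  Record = Struct.new(:service, :environment, :ipaddress,
                      :registered_at, :leader, :options,
                      keyword_init: true)

  @records = []

  class << self
    def reset!
      @records = []
    end

    # Well-known keys are lifted out of the options hash; whatever is left
    # is kept as the connection options the client will need.
    def register_service(service, options = {})
      opts = options.dup
      record = Record.new(
        service: service,
        environment: opts.delete(:environment),
        ipaddress: opts.delete(:ipaddress),
        registered_at: opts.delete(:registered_at) || Time.now,
        leader: opts.delete(:leader) || false,
        options: opts
      )
      @records << record
      record
    end

    # Conflict resolution lives here, on the client side: restrict by
    # environment, then take either the leader or the latest registration.
    def discover_service(service, environment: nil, prefer: :latest)
      candidates = @records.select { |r| r.service == service }
      candidates = candidates.select { |r| r.environment == environment } if environment
      return nil if candidates.empty?

      case prefer
      when :leader then candidates.find(&:leader)
      when :latest then candidates.max_by(&:registered_at)
      else raise ArgumentError, "unknown policy #{prefer.inspect}"
      end
    end
  end
end
```

Putting the policy in the discover call keeps the choice explicit per client, but it does nothing about two clients seeing different state; that still depends on where the records actually live.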
I am super excited about this and would love to help out with
anything, feel free to ping me at any time.
Robops mandates the creation of this software.
Cheers,
--AJ
Thanks!
Chris Walters