Request for input on orchestration

Ohai Chefs!

We’re in the preliminary stages of designing possible solutions for
orchestration and would like to understand the community’s
requirements.

I’m going to write down my thoughts and questions. Nothing is gospel,
so please feel free to comment on everything, including the framing.

Background:

Chef, as currently conceived, does a great job of exposing a model for
how to get a system from either an embryonic state or a slightly
misconfigured state to the desired state, mainly via the mechanism of
resource idempotence.

What I think is not yet well-modeled is how to go from one
well-configured state to a completely different well-configured
state. It also doesn’t yet model synchronization of actions across
multiple boxes in that there isn’t a first-class way to gate actions
that are dependent on the completion of steps on other servers. For
example, a complex migration or deployment might require bringing
boxes up or down, copying data, cleanly removing artifacts or services
installed by previous chef runs, not restarting load balancers until
some quorum of webservers have re-started, etc.
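
To make the last example concrete, here is a minimal Ruby sketch of a quorum gate. The method names (quorum_reached?, gate_on_quorum) and the status symbols are assumptions made up for illustration; this is not an existing Chef feature.

```ruby
# Hypothetical helper: true once at least `quorum` webservers report :restarted.
def quorum_reached?(statuses, quorum)
  statuses.values.count(:restarted) >= quorum
end

# Runs the block only when the gate is open; returns :deferred otherwise.
def gate_on_quorum(statuses, quorum)
  return :deferred unless quorum_reached?(statuses, quorum)
  yield
end

# Illustrative data: two of three webservers have finished restarting.
statuses = { "web1" => :restarted, "web2" => :restarting, "web3" => :restarted }

open_gate   = gate_on_quorum(statuses, 2) { :load_balancer_restarted }
closed_gate = gate_on_quorum(statuses, 3) { :load_balancer_restarted }
```

The hard part in practice is where `statuses` comes from and how fresh it is, which is exactly the cross-node synchronization question raised above.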

We’d like to collect the use cases, requirements, and thoughts that
best serve the community.

  1. What do you think the scope of orchestration is and is not?

  2. What are the use cases that you would like to see an orchestration
    system/DSL accommodate? The more specific and granular the steps of
    the orchestration, the better. (If you would not like your use case
    made public but would nonetheless like it considered during design,
    validation, and testing, please send it to me directly at
    cw@opscode.com.)

  3. What generic primitives do you think would be useful in such a
    system?

Thanks!
Chris Walters

Mandatory disclosure. We've been using Control Tier for over a year.

I don't know that I am interested in Chef being the orchestration engine. I
would like to see Chef more friendly with other existing orchestration
tools. Control Tier is great if you want to code much of your customized
commands into the tool. The spinoff RunDeck is great if you have your own
scripts but just need them orchestrated.

We envision a scenario where Chef is mainly configuration and Control Tier
is orchestration. We would go so far as having Chef configure Control Tier
which could then call back to Chef to initiate Chef actions.

Orchestration is a difficult problem and Control Tier/RunDeck is already
fairly mature. Now that they support arbitrary workflows in Jobcenter,
Control Tier isn't missing any feature we require for orchestrating our very
complex deployments.

I don't know that I have much to add to your questions above. We're fairly
new to Chef and it being Friday afternoon is making my head hurt thinking of
how to incorporate orchestration into Chef. All I can imagine is how we're
doing it today with Control Tier.

  1. You hit the high points in your overview of scope.

  2/3. All I can say is to read Control Tier's documentation. In Control Tier
    you define a "command" that performs a piece of work. You string a bunch of
    commands together into a workflow. Then you can also string different
    workflows into larger events. You add to that the error-handling and gating
    so that the next step only executes after the previous one has completed
    successfully. Threading concurrently across servers is mandatory to make it
    go fast.
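
The command/workflow/gating idea can be sketched in a few lines of plain Ruby. This is not Control Tier's API; Command, Workflow, and the result hash are hypothetical names chosen for illustration. Each command runs on all nodes concurrently, and the next command starts only if the current one succeeded everywhere.

```ruby
# Hypothetical names throughout; this is not Control Tier's API.
Command = Struct.new(:name, :work) do
  # Runs the unit of work against one node; truthy means success.
  def run(node)
    work.call(node)
  end
end

class Workflow
  def initialize(commands)
    @commands = commands
  end

  # Runs each command on all nodes in parallel threads. The gate: a command
  # must succeed on every node before the next command is started.
  def run(nodes)
    completed = []
    @commands.each do |command|
      results = nodes.map { |node| Thread.new { command.run(node) } }.map(&:value)
      return { completed: completed, failed_at: command.name } unless results.all?
      completed << command.name
    end
    { completed: completed, failed_at: nil }
  end
end

# Illustrative commands: "copy" fails on web2, so "start" never runs.
stop  = Command.new("stop",  ->(_node) { true })
copy  = Command.new("copy",  ->(node)  { node != "web2" })
start = Command.new("start", ->(_node) { true })

outcome = Workflow.new([stop, copy, start]).run(%w[web1 web2])
```

Stringing workflows into larger events would amount to treating a Workflow as a Command, which this sketch leaves out.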

Happy Friday,
Dan

On Fri, Jan 28, 2011 at 3:26 PM, Chris Walters cw@opscode.com wrote:

[snip]

On Sat, Jan 29, 2011 at 8:19 AM, Dan Nemec dan@nemecfamily.com wrote:

[snip]

Orchestration is something I have yet to deal with. It will loom, and
I had in mind to try and use Ruote[1].
RunDeck sounds like it might be a specialized (aka limited?) workflow engine.
Given ruote-kit (RESTful)[2], rufus-jig[3], decision tables[4],
persistence, AMQP, etc. etc. I'd be surprised if solving Chef
orchestration didn't involve adopting/employing a subset of ruote's
existing functionality.
As mentioned, any chef specifics could be placed in a ruote-chef project?
I think a ruote-chef project might give the biggest bang-for-the-buck.

[1] http://openwferu.rubyforge.org/
[2] GitHub - kennethkalmer/ruote-kit: RESTish wrapper for ruote workflow engine
[3] GitHub - jmettraux/rufus-jig: A HTTP client, greedy with JSON content, GETting conditionally.
[4] decision tables | Search Results | processi

HTH

--
πόλλ' οἶδ ἀλώπηξ, ἀλλ' ἐχῖνος ἓν μέγα
[The fox knows many things, but the hedgehog knows one big thing.]
Archilochus, Greek poet (c. 680 BC – c. 645 BC)
http://wiki.hedgehogshiatus.com

Going over people's cookbooks and this list, I've seen quite a few
scenarios where people built makeshift orchestration scripts. A few
examples would be my own ec2-chef bridge to drive actions based on data
exchange between ec2 api calls and chef data (a node went down, so update
an ec2_status attribute and force a chef run on other nodes; a chef run
failed, so mark that node unhealthy in the autoscaling group), the Chef
cluster cookbooks using "cluster[:service][:timestamp]" attributes to
orchestrate service discovery, complex application deployment scripts
using chef attributes, and so on.

So, I think many people need orchestration features so much they build
them from scratch, over and over again. Having a standard interface or
engine for building these solutions would definitely be a step forward,
and I would like to make some suggestions for chef features that may
promote such an engine and/or integration with existing orchestration tools:

* Chef server plugin/trigger framework - this would make it easier to
  write external attribute processors, tools using chef's node status
  database, and monitoring tools, and most importantly to respond to
  events like chef run success/failure, attribute change etc.
* Provider status - I share the view that a misconfigured node should
  be thrown away and built from scratch. However, as chef scripts
  become more complex and more platforms are supported, we are
  seeing more and more failure conditions that can be easily
  overcome by some generic brute error handler. Also, it's important
  to note that while Chef's philosophy tends towards idempotency,
  pure idempotency is seldom possible in practice, as the "not_if"
  conditions well prove. I'm already making a lot of assumptions
  about OS status when writing cookbooks (a package can be
  installed, a user can be created) that cannot yet be described by
  chef.
* Resource status - most providers already check the status of a
  resource, mainly to compare new_resource with current_resource.
  Why not expose the low level status of a resource to recipes?
* Resource status triggers - currently we can only trigger other
  resources if a resource has been updated. But what about
  triggering a resource on resource action failure? Or perhaps even
  on some other complex condition based on the provider status?
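
As a rough illustration of the last two suggestions, here is a plain Ruby sketch of a resource that exposes its status and fires handlers on failure as well as on update. SketchResource and its `on` method are hypothetical; they are not part of Chef, which (as noted above) only triggers on update.

```ruby
# Hypothetical mini-model, not Chef code: a resource whose status is
# readable by callers and whose handlers can be keyed on any status.
class SketchResource
  attr_reader :name, :status

  def initialize(name, &action)
    @name = name
    @action = action
    @handlers = Hash.new { |hash, key| hash[key] = [] }
    @status = :pending
  end

  # Registers a handler for :updated, :up_to_date, or :failed.
  def on(event, &handler)
    @handlers[event] << handler
    self
  end

  # The action block returns truthy if it changed something; an exception
  # marks the resource as failed instead of aborting the whole run.
  def run
    @status = @action.call ? :updated : :up_to_date
  rescue StandardError
    @status = :failed
  ensure
    @handlers[@status].each { |handler| handler.call(self) }
  end
end

events = []

# Illustrative resource whose action fails.
broken = SketchResource.new("package[nginx]") { raise "mirror unreachable" }
broken.on(:updated) { |r| events << "restart after #{r.name}" }
broken.on(:failed)  { |r| events << "cleanup after #{r.name}" }
broken.run

# Illustrative resource whose action succeeds and changes something.
fine = SketchResource.new("template[nginx.conf]") { true }
fine.on(:updated) { |r| events << "restart after #{r.name}" }
fine.run
```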

These are just my 2 cents, feel free to bash me for them.

Regards,
Avishai

On 01/30/2011 09:27 AM, Hedge Hog wrote:

[snip]

On Sat, Jan 29, 2011 at 7:26 AM, Chris Walters cw@opscode.com wrote:

[snip]

The following overview of Ruote might help people resolve some (all?)
of their orchestration needs:

http://www.engineyard.com/blog/2011/ruote-and-flow/

HTH


On Sun, Jan 30, 2011 at 6:27 PM, Hedge Hog hedgehogshiatus@gmail.com wrote:

[snip]

I know this is getting very long in the tooth...
In case anyone is still interested in this topic... RightScale has an
interesting blog post on Ruote plus Amazon's Simple Workflow:

http://blog.rightscale.com/2012/02/22/rightscale-server-orchestration-and-amazon-swf-launch/

HTH


On Fri, Jan 28, 2011 at 3:26 PM, Chris Walters cw@opscode.com wrote:

[snip]

  1. What do you think the scope of orchestration is and is not?

Orchestration is a pretty loaded term. In my VERY opinionated mind it
mainly revolves around the coordination of constituent parts of the
system as a whole. I think conductor when I think orchestration. The
brass section is my web farm, the woodwinds are the apis and the
percussion brings up data.

  2. What are the use cases that you would like to see an orchestration
    system/DSL accommodate? The more specific and granular the steps of
    the orchestration, the better. (If you would not like your use case
    made public but would nonetheless like it considered during design,
    validation, and testing, please send it to me directly at
    cw@opscode.com.)

Having taken my own swing at this problem space, there are a few key
things you need to be able to express. Some of these chef does already
in an abstract sense.

  • relationships/deps between nodes
  • relationships/deps between things running on the node
  • gating and phasing
  • unmanaged resources

As for a DSL it could be as simple as:

depends_external "foo" do
  type "role"
end

where foo is the result of a role search and the run will block until
that condition is met

or

triggers "haproxy" do
  role "loadbalancer"
  action :restart
end

to trigger a restart of the haproxy service on all nodes of type loadbalancer.
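
The blocking behaviour behind depends_external can be sketched in plain Ruby as a poll-until-found loop. Everything here is an assumption for illustration: wait_for_role is a made-up name, and the `search` callable stands in for a real Chef role search rather than calling one.

```ruby
# Hypothetical: `search` stands in for a role search and returns node names.
# Polls until the search returns at least one node, then returns the nodes.
def wait_for_role(role, search:, attempts: 5, interval: 0.01)
  attempts.times do
    nodes = search.call(role)
    return nodes unless nodes.empty?
    sleep interval
  end
  raise "no nodes with role #{role} after #{attempts} attempts"
end

# Illustrative fake search: the database node only shows up on the third poll.
calls = 0
fake_search = lambda do |_role|
  calls += 1
  calls < 3 ? [] : ["db1.example.com"]
end

nodes = wait_for_role("database", search: fake_search)
```

A bounded number of attempts keeps a run from blocking forever if the dependency never appears; whether to fail or keep waiting is a policy choice the DSL would have to expose.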

Obviously if you can express these in a more chef-appropriate
data-driven approach, all the better. I'll be rewriting the Noah
LWRP to do that before ChefConf.

Unmanaged resources, however, are the sticking point.

This is the reason I wrote Noah. The Chef API is dead simple but it
doesn't do callbacks (nor should it probably) and it's not as
friendly for application developers to interact with (you need a cert
and whatnot).

The Noah approach was to be a bridge between those two worlds -
unmanaged and managed - by providing a simple API with callbacks.

  3. What generic primitives do you think would be useful in such a
    system?

As a side note, if you want to look at the Noah LWRP in its current
state, feel free. I'm not implying that it has the answers but it's an
example of how I've taken to solve these problems:

I've successfully used this to do demos of most of the stack-based
chef cookbooks (wordpress, django) fully gated and phased (see the
BLOCKING section). I also was brainstorming with Seth Chisamore about
a pokeable chef-client daemon that would get callbacks from Noah. The
idea being that you push something to Noah from the cookbook at the
end of the run. There's a registered watcher in Noah that pokes the
chef-client on the node.
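
The push-then-poke flow can be sketched as a tiny in-memory watch registry. This is not Noah's actual API; WatchRegistry, the path string, and the payload shape are all hypothetical, and the "poke" is just a string appended to an array rather than a real chef-client run.

```ruby
# Hypothetical in-memory stand-in for a watch registry; not Noah's API.
class WatchRegistry
  def initialize
    @watchers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Registers a callback to fire whenever something is published to `path`.
  def watch(path, &callback)
    @watchers[path] << callback
  end

  # Fires every callback registered for `path`; stores nothing.
  def publish(path, payload)
    @watchers[path].map { |callback| callback.call(payload) }
  end
end

registry = WatchRegistry.new
poked = []

# The watcher that would poke the chef-client daemon on web1.
registry.watch("/roles/database/ready") do |payload|
  poked << "chef-client run on web1 (#{payload[:node]} ready)"
end

# What a cookbook might push at the end of a successful run on db1.
registry.publish("/roles/database/ready", { node: "db1" })
registry.publish("/roles/cache/ready", { node: "cache1" }) # no watchers, no effect
```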


Yo,

I'm in the unfortunate position of having built many orchestrations
like this, around Chef, many of them in private organizations, not to
be open sourced.

Many of them were scoffed at by underwhelmed CxOs who have spent too much
time reading the Wikipedia definition of Orchestration or having
"process" or "workflow" managers force-fed to them by VCs and Big
Enterprise.

WOT:

On 29 January 2011 09:26, Chris Walters cw@opscode.com wrote:

[snip]

It would be great to have something built in for Chef, and that is the
road I had been walking with Pylon, a gem for chef that has a DCell
substrate running in the background; then you get actors and
messaging, and you can just build shit.

Obviously this approach doesn't work for most people because you have
to ship code, moderately complex, etc.. but it's what I've been
wanting to build to solve this.

  1. What do you think the scope of orchestration is and is not?

I didn't read or write any books on this shit, so yeah, ymmv:

When I have built things to solve orchestration, the primary use case is
generally a directory service: the ability for a recipe to register a
service (with all of the parameters required to connect to the
service) in the directory. It's also the other half of that: client
recipes that need to use those components. They should either error and
relaunch with a fresh state, or block [if you like].

  2. What are the use cases that you would like to see an orchestration
    system/DSL accommodate? The more specific and granular the steps of
    the orchestration, the better. (If you would not like your use case
    made public but would nonetheless like it considered during design,
    validation, and testing, please send it to me directly at
    cw@opscode.com.)

2x loadbalancer
4x webserver all launched

requirement: webservers added to loadbalancer table only when the
deploy is complete, not just node convergent

jenkins (ci, deploy) -> publishes packages, deploy messages, from/to version

loadbalancer -> talks to all active webservers via substrate
loadbalancer

webserver
webserver
webserver
webserver

requirement: binary packaged asset published by jenkins system is
rolling deployed to webservers with 0 downtime at the loadbalancer
layer

Webservers 1-4 receive a "deploy" message and reach consensus; a leader is
allocated for the deploy slot. The leader signals the other workers, one-by-one,
to perform the deploy, smoketest, and re-add to the pool. No outage is visible
to the loadbalancer layer, as the connections are presented to
webservers through a consensus protocol FSM replicator (e.g. Paxos).
We could trigger an alert condition on one of the deploy slots failing
or even aggressively destroy and rebuild it.
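
Leaving consensus and leader election aside entirely, the one-by-one part of this can be sketched in plain Ruby. rolling_deploy and its `deploy`/`smoketest` callables are hypothetical names, and the pool is just an array standing in for the loadbalancer table.

```ruby
# Hypothetical sketch of the one-at-a-time loop only; a real system would
# need the consensus layer described above to decide who drives it.
def rolling_deploy(webservers, pool, version, deploy:, smoketest:)
  failed = []
  webservers.each do |server|
    pool.delete(server)          # drain: stop sending traffic to this server
    deploy.call(server, version)
    if smoketest.call(server)
      pool << server             # re-add only after a passing smoke test
    else
      failed << server           # stays out of the pool; alert or rebuild
    end
  end
  { failed: failed }
end

# Illustrative run: web3 fails its smoke test.
servers = %w[web1 web2 web3 web4]
pool = servers.dup
installed = {}
deploy = ->(server, version) { installed[server] = version }
smoketest = ->(server) { server != "web3" }

report = rolling_deploy(servers, pool, "1.2.0", deploy: deploy, smoketest: smoketest)
```

Because only one server is out of the pool at a time, capacity never drops by more than one healthy server plus any that have already failed.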

You could do A/B style cut over with this too, would be another
signalling strategy locked down by a leader.

note: I'm currently trying to build this, I don't know what it will
look like or why I am trying to build it, but it's chock full of
science and shit: GitHub - fujin/pylon at feature/paxos --
the actor concurrency model has been great for prototyping
multi-decree paxos.

Here's the "search based" one we use for day to day, non crazy batman
shit: https://github.com/fujin/chef-discovery

  3. What generic primitives do you think would be useful in such a
    system?

You probably want to have some hash values that the client, when
calling discover_service, can use to actually talk to it, right?

  register_service :service, options = {}

Find the latest instantiation of this service? Find the leader?
Restrict to environment? Get the ipaddress, get the options?

  discover_service :service

How do you quantify which copy of the service you want, if multiple
are available? Where is the conflict resolution handled?

Where is the state stored? What is the possibility that system
decisions will be made without consistent state?
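
One possible shape for these two primitives, as an in-memory Ruby sketch. register_service and discover_service follow the signatures written above; the ServiceDirectory class, the `environment:` filter, and the "latest registration wins" rule are assumptions made for illustration, and the state question is dodged by keeping everything in one process.

```ruby
# Hypothetical in-memory directory. A real one needs shared, consistent
# storage, which is exactly the open question raised above.
class ServiceDirectory
  Entry = Struct.new(:service, :options, :registered_at)

  def initialize
    @entries = []
    @clock = 0 # logical clock, so "latest" does not depend on wall time
  end

  def register_service(service, options = {})
    @clock += 1
    @entries << Entry.new(service, options, @clock)
  end

  # Conflict resolution here is "latest registration wins", optionally
  # restricted to one environment. Returns the options hash or nil.
  def discover_service(service, environment: nil)
    candidates = @entries.select { |entry| entry.service == service }
    if environment
      candidates = candidates.select { |entry| entry.options[:environment] == environment }
    end
    latest = candidates.max_by(&:registered_at)
    latest && latest.options
  end
end

directory = ServiceDirectory.new
directory.register_service(:mysql, { ipaddress: "10.0.0.5", environment: "prod" })
directory.register_service(:mysql, { ipaddress: "10.0.0.6", environment: "prod" })
directory.register_service(:mysql, { ipaddress: "10.0.0.9", environment: "staging" })

prod_mysql = directory.discover_service(:mysql, environment: "prod")
any_mysql  = directory.discover_service(:mysql)
missing    = directory.discover_service(:redis)
```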

I am super excited about this and would love to help out with
anything, feel free to ping at any time.

Robops mandates the creation of this software.

Cheers,

--AJ
