I’m currently thinking about abusing Chef’s start handlers to get more
stability in a compute cluster.
We are running a compute cluster for scientific analysis with a resource
management system where you have a master node responsible for scheduling
analysis jobs to execution nodes. For each of the execution nodes I would
like to do the following:
- When they start a chef run, they send a “Don’t send any new workloads to
me” message to the master node.
- When the chef run was successful, they send an “Ok. I’m up for some more
jobs now.” message to the master node.
My goal is to have nodes on which a chef run fails disabled in the
resource management system, because a failed chef run usually means that
the node is in a bad/undefined state, so chances are that new jobs
scheduled there will fail as well.
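The two messages could be sketched as a start handler plus a report handler along these lines. Everything about the transport is an assumption here: the `drain_node`/`resume_node` calls and the in-memory master are placeholders for however the resource manager is actually told, and in a real cookbook the classes would inherit from `Chef::Handler` (stubbed below so the sketch runs standalone):

```ruby
# Stand-in for Chef::Handler so the sketch runs outside a chef-client run.
class HandlerBase
  attr_reader :run_status
  def run_report_safely(run_status)
    @run_status = run_status
    report
  end
end

# Fired at the start of the run: "Don't send any new workloads to me."
class DrainHandler < HandlerBase
  def initialize(master)
    @master = master
  end

  def report
    @master.drain_node
  end
end

# Fired at the end of the run: re-enable the node only on success, so a
# failed run leaves it drained in the resource management system.
class ResumeHandler < HandlerBase
  def initialize(master)
    @master = master
  end

  def report
    @master.resume_node if run_status.success?
  end
end

# Tiny fakes standing in for the master node and Chef's run status object.
Master = Struct.new(:messages) do
  def drain_node;  messages << :drain;  end
  def resume_node; messages << :resume; end
end
RunStatus = Struct.new(:ok) do
  def success?; ok; end
end

master = Master.new([])
DrainHandler.new(master).run_report_safely(RunStatus.new(nil))
ResumeHandler.new(master).run_report_safely(RunStatus.new(true))
p master.messages   # => [:drain, :resume]
```

On a failed run the report handler simply never re-enables the node, which is the behaviour I want.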
I could obviously have the node disabled with an execute resource in the
very first recipe, but what if one of the ohai plugins gets stuck and chef
never makes it to that first recipe?
Therefore I would like to disable the execution node at the very beginning
of the chef run. Is this possible with the start handlers?
As I understand it, the start handlers are run after ohai. Wouldn’t it
be reasonable to run them before ohai starts, thus making the start
handlers the very first thing to run on a node?
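For concreteness, this is roughly how I imagine wiring such handlers up in client.rb (the file path, class names, and master hostname are made up for the sketch):

```ruby
# /etc/chef/client.rb (sketch; the handler classes would live in the
# required file and inherit from Chef::Handler)
require "/etc/chef/handlers/cluster_handlers"

start_handlers  << DrainHandler.new("master.example.org")
report_handlers << ResumeHandler.new("master.example.org")
# A failed run triggers exception handlers instead of report handlers,
# so the node simply stays drained in the resource management system.
```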
You have highlighted multiple things here. If you plan to orchestrate
across nodes, where the status of one chef run should decide whether
another chef run is triggered or not, you can try MCollective.
Also, for a custom solution: node.ohai_time stores the time at which the
last successful chef run occurred. If your chef runs are scheduled
periodically (or with a fixed splay time), you can exploit this attribute
in a Chef start handler to decide whether a node has converged
successfully or not. (You can also use the RabbitMQ instance that ships
with the Chef server to model pub/sub-like workflows.)
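The ohai_time idea can be sketched like this (the interval and splay numbers are illustrative, not Chef defaults):

```ruby
# node['ohai_time'] is updated when a successful run saves the node back
# to the server, so if much more than interval + splay has passed since
# then, the last scheduled run most likely failed to converge.
def converged_recently?(last_save, now, interval, splay)
  (now - last_save) <= (interval + splay)
end

interval = 30 * 60   # chef-client runs every 30 minutes
splay    = 5 * 60    # plus up to 5 minutes of splay
now      = Time.now

puts converged_recently?(now - 20 * 60, now, interval, splay)  # => true
puts converged_recently?(now - 70 * 60, now, interval, splay)  # => false
```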