I’m currently thinking about abusing the Chef startup handlers for more
stability in a compute cluster.
We are running a compute cluster for scientific analysis with a resource
management system where you have a master node responsible for scheduling
analysis jobs to execution nodes. For each of the execution nodes I would
like to do the following:
- When they start a Chef run, they send a “Don’t send any new workloads to
me” message to the master node.
- When the Chef run succeeds, they send an “Ok. I’m up for some more
jobs now.” message to the master node.
My goal is for nodes on which a Chef run fails to stay disabled in the
resource management system, because a failed Chef run usually means that
the node is in a bad or undefined state, so new jobs scheduled there are
likely to fail.
I could obviously have the node disabled with an execute resource in the
very first recipe, but what if one of the Ohai plugins gets stuck and Chef
never makes it to that first recipe?
Therefore I would like to disable the execution node at the very beginning
of the Chef run. Is this possible with the startup handlers?
As I understand it, the startup handlers run after Ohai. Wouldn’t it
be reasonable to run them before Ohai starts, thus making the startup
handlers the very first thing to run on a node?