At the Bay Area Chef Users Group meeting tonight (see http://www.meetup.com/The-Bay-Area-Chef-User-Group/events/82878822/), I had the pleasure of listening to Daniel talk about the new “whyrun” mode and some of the other new features of Chef. I’m always pleasantly surprised by the speakers that BAChef manages to bring in.
We also had a question from the audience as to why we can never seem to get a good monitoring system that works hand-in-hand with a good CM system (like Chef). Instead, we’re stuck trying to write or implement things like Nagios NRPE modules, and the monitoring system ends up being the heaviest thing we do: it takes the most work to manage, it generates the most noise, it takes the longest to converge, and it is generally very … unsatisfactory.
Of course, there are other monitoring solutions out there, depending on how much information you want to monitor about each node, and how you want to go about gathering and using that information. But Zenoss doesn’t seem to be measurably better in this particular area, nor does any other monitoring system that I am personally acquainted with.
Now, I happen to know that Alan Robertson has been working on a new project called the Assimilation Monitoring Project (see http://assimmon.org/), and I believe that the architecture of AssimMon will scale better than any other monitoring system I know of. Of course, it is very much a work-in-progress, and there is still a lot left to do. But I think Alan is pretty well suited to the task, based on his work on the Linux-HA project and based on what I’ve seen of the talk he gave at LinuxCon 2012 about AssimMon (we’re hoping to get the edited video for that posted very soon).
However, it occurs to me that one thing a good monitoring system could make use of is a relatively simple, standardized API for accessing Ohai-discovered data about the nodes, as well as Chef-managed data about them. There’s no sense re-inventing the wheels that Chef has already invented if you can fairly easily make use of what is already there.
So now I’m starting to wonder what such a standardized API might look like, and what kinds of information a monitoring system would find useful to have about the nodes it should be monitoring.
That leads me to post the question on this list, to see whether anyone else has ideas or thoughts.
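
To make the question a bit more concrete, here is a rough Python sketch of the kind of thing I have in mind. Everything in it is an assumption for the sake of discussion: the sample node document is made up (it only imitates the general shape of Ohai/Chef node data, with keys like fqdn, ipaddress, run_list and filesystem), and the functions monitoring_view and suggested_checks are hypothetical names, not part of Chef, Ohai, or AssimMon. The mapping from a recipe name to a service check is deliberately naive.

```python
import json

# Hypothetical node document. The keys imitate the general shape of
# Ohai/Chef node data, but the values are invented for illustration.
SAMPLE_NODE_JSON = """
{
  "fqdn": "web01.example.com",
  "ipaddress": "192.0.2.10",
  "platform": "ubuntu",
  "platform_version": "12.04",
  "run_list": ["role[base]", "recipe[apache2]"],
  "filesystem": {
    "/dev/sda1": {"mount": "/", "percent_used": "42%"}
  }
}
"""


def monitoring_view(node):
    """Reduce a full node document to the fields a monitor might want.

    This is an assumed, minimal view; a real API would need to agree on
    which fields are guaranteed to be present.
    """
    mounts = sorted(
        fs["mount"]
        for fs in node.get("filesystem", {}).values()
        if "mount" in fs
    )
    return {
        "name": node.get("fqdn"),
        "address": node.get("ipaddress"),
        "platform": node.get("platform"),
        "platform_version": node.get("platform_version"),
        "run_list": list(node.get("run_list", [])),
        "mounts": mounts,
    }


def suggested_checks(view):
    """Derive a list of check names from the reduced view.

    Naive assumption: every recipe[...] entry in the run list corresponds
    to a service of the same name. That will not hold in general.
    """
    checks = ["ping"]
    checks += ["disk:" + mount for mount in view["mounts"]]
    for item in view["run_list"]:
        if item.startswith("recipe[") and item.endswith("]"):
            checks.append("service:" + item[len("recipe["):-1])
    return checks


if __name__ == "__main__":
    view = monitoring_view(json.loads(SAMPLE_NODE_JSON))
    for check in suggested_checks(view):
        print(view["name"], check)
```

The point of splitting it in two is that the first function is the "API" part (what does the CM system expose about a node?), while the second is the monitoring system's own business (what does it decide to check, given that data?). The interesting discussion is probably about what belongs in the first part.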