CM/Chef API access for monitoring systems?


#1

Folks,

At the Bay Area Chef Users Group meeting tonight (see http://www.meetup.com/The-Bay-Area-Chef-User-Group/events/82878822/), I had the pleasure of listening to Daniel talk about the new “whyrun” mode and some of the other new features of Chef. I’m always pleasantly surprised by the speakers that BAChef manages to bring in.

We also had a question from the audience as to why we can’t ever get a good monitoring system that works hand-in-hand with a good CM system (like Chef). Instead, we’re stuck with things like writing or implementing Nagios NRPE modules, and the monitoring system ends up being the heaviest thing we do: it takes the most work to manage, it generates the most crap noise, it takes the longest to converge, and it is generally very … unsatisfactory.

Of course, there are other monitoring solutions out there, depending on how much information you want to monitor about each node, and how you want to go about gathering and using that information. But Zenoss doesn’t seem to be measurably better in this particular area, nor does any other monitoring system that I am personally acquainted with.

Now, I happen to know that Alan Robertson has been working on a new project called the Assimilation Monitoring Project (see http://assimmon.org/), and I believe that the architecture of AssimMon will scale better than any other monitoring system I know of. Of course, it is very much a work-in-progress, and there is still a lot left to do. But I think Alan is pretty well suited to the task, based on his work on the Linux-HA project and based on what I’ve seen of the talk he gave at LinuxCon 2012 about AssimMon (we’re hoping to get the edited video for that posted very soon).

However, it occurs to me that a good monitoring system could make use of a relatively simple, standardized API for accessing Ohai-discovered data about the nodes, as well as Chef-managed data about them. There’s no sense re-inventing the wheels that Chef has already invented if you can easily make use of what is already there.
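
To make the idea a little more concrete, here is a minimal sketch in Python of what a monitoring system might pull out of a node object. It assumes the node has already been fetched as JSON, with Ohai-discovered data under an `automatic` key next to a `run_list`, as in Chef's node JSON. The sample node, the `monitoring_view` helper, and the choice of fields are all hypothetical; this is not an existing Chef API.

```python
# Hypothetical sketch: reduce a Chef node document to the fields a
# monitoring system might care about. Not an existing Chef API.

SAMPLE_NODE = {  # made-up node, shaped like Chef node JSON
    "name": "web1.example.com",
    "chef_environment": "production",
    "run_list": ["role[webserver]", "recipe[ntp]"],
    "automatic": {  # Ohai-discovered attributes
        "fqdn": "web1.example.com",
        "ipaddress": "10.0.0.11",
        "platform": "ubuntu",
        "platform_version": "12.04",
    },
    "normal": {"nagios": {"contact": "ops"}},  # Chef-managed attributes
}


def monitoring_view(node):
    """Return a flat dict of monitoring-relevant facts for one node."""
    ohai = node.get("automatic", {})
    # Keep only the role entries of the run list, stripped of "role[...]".
    roles = [
        item[len("role["):-1]
        for item in node.get("run_list", [])
        if item.startswith("role[") and item.endswith("]")
    ]
    return {
        "name": node["name"],
        "environment": node.get("chef_environment", "_default"),
        "address": ohai.get("ipaddress"),
        "fqdn": ohai.get("fqdn"),
        "platform": ohai.get("platform"),
        "roles": roles,
    }


if __name__ == "__main__":
    print(monitoring_view(SAMPLE_NODE))
```

The point of the flat view is that the monitoring side never needs to know how Chef layers its attributes; it only sees an address, a platform, and a list of roles to hang checks on.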

So now I wonder what such a standardized API might look like, and what kinds of information a monitoring system would find useful to access about the nodes it should be monitoring.

Which leads me to post the question on this list, to see if anyone else has any ideas or thoughts.

Thanks!


Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu


#2

A monitoring system that utilizes Ohai data?

Stop it, you’re teasing me. :)

I’m keeping an eye on Assimilation.



#3

Hello!

Good to hear Bay Area Chef Users Group is doing well!

Heavy Water has had great success with Sensu, developed at Sonian and
now open source, which has tremendously good integration with
Chef/Puppet:


http://blog.sonian.com/technology-blog/bid/77977/Sensu-A-Monitoring-Framework

Heavy Water’s Echelon project has Sensu integration, albeit a little
dated now; Echelon has some other useful monitoring and metrics
integration: collectd/sensu <-> graphite (AMQP) w/ gdash frontend. We
generally rig Sensu up to Pagerduty, Campfire and IRC. One of the
founders of Heavy Water operations gave a talk regarding Echelon at
ChefConf 2012:




https://github.com/heavywater/chef-echelon_sensu

Our gdash cookbook features LWRPs that make it pretty easy to
dynamically define dashes/components based on search responses from
Chef. I know San Francisco DevOps (meetup) recently had a discussion
regarding Sensu with something like 60 attendees!
(http://www.meetup.com/San-Francisco-DevOps)

Cheers,

AJ

On 26 September 2012 20:01, John Martinez john@johnmartinez.com wrote:

A monitoring system that utilizes Ohai data?

Stop it, you’re teasing me. :)

I’m keeping an eye on Assimilation.



#4

On Wed, Sep 26, 2012 at 12:55 AM, Brad Knowles brad@shub-internet.org wrote:

So, now I start to wonder what such a standardized API might look like, and what kinds of information might be useful for a monitoring system to be able to access regarding the nodes it should be monitoring?

Which leads me to the idea of posting such a question on this list, to see if anyone else had any ideas or thoughts?

Let the fruit fly… But isn’t this more or less what SNMP is supposed
to do? (i.e. Simple Network Management Protocol…)

I think you have the right concept, though. One of the main reasons
"monitoring sucks" is that the systems are all vertically integrated.
There is no way to "mix and match" pieces easily. There are no standard
protocols for moving from one level to the next.

  • Booker C. Bense

#5

On Sep 26, 2012, at 7:14 AM, Booker Bense bbense@gmail.com wrote:

Let the fruit fly… But isn’t this more or less what SNMP is supposed
to do? (i.e. Simple Network Management Protocol…)

SNMP could address part of the problem, if there are SNMP agents on all the nodes to be monitored, and if the monitoring system itself supports SNMP. And of course, SNMP can be a pretty heavyweight protocol/service to support on either end, although there is much that it can do for you.

But with Chef we already have agents on each node to be monitored, and we already have a process of centralizing all the known information about a given node, and then putting that information into a central database that can be easily accessed and searched. Is there no way to leverage this existing infrastructure for the benefit of the monitoring system?
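
One way to picture leveraging that existing infrastructure is a sketch like the following, in Python, which turns node facts already held centrally into Nagios host definitions. The input shape (a list of dicts with `name` and `address`), the `render_nagios_hosts` helper, and the `generic-host` template name are assumptions for illustration; only the `define host { ... }` object syntax is Nagios's own.

```python
# Hypothetical sketch: render Nagios host definitions from node facts
# that a Chef search has already returned.

HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    address    {address}
}}
"""


def render_nagios_hosts(nodes, template="generic-host"):
    """Build Nagios object-config text, one host block per node."""
    blocks = []
    for node in sorted(nodes, key=lambda n: n["name"]):
        if not node.get("address"):
            continue  # skip nodes with no reported address yet
        blocks.append(
            HOST_TEMPLATE.format(
                template=template, name=node["name"], address=node["address"]
            )
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    found = [  # made-up search results
        {"name": "web2", "address": "10.0.0.12"},
        {"name": "web1", "address": "10.0.0.11"},
        {"name": "new-box", "address": None},
    ]
    print(render_nagios_hosts(found))
```

Sorting by name keeps the generated file stable between runs, so the monitoring configuration only changes when the set of nodes actually changes.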

Contrariwise, is there no way for Chef to be able to leverage the additional information that the monitoring system could provide, which can then also be centralized to be easily accessed and searched? Or perhaps even an API to access that information live in near real-time?

I think you have the right concept, though. One of the main reasons
"monitoring sucks" is that the systems are all vertically integrated.
There is no way to "mix and match" pieces easily. There are no standard
protocols for moving from one level to the next.

I don’t have any answers. But the discussion brought up certain questions in my mind, which I am interested in pursuing.


Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu


#6

The Chef agent is not really an on-demand tool; it’s a daemon that
periodically polls the Chef server. When we say agent, we generally mean
something that can cater to on-demand requests. An MCollective/Chef agent
can act like one, for example. In that sense, SNMP, with its traps,
on-demand request processing, and extensibility through custom
MIBs/scripts, can suffice. Even NRPE can be configured to do the exact
same thing, with only a single line in its configuration file. What we
really need is better APIs around them, and that is already captured in
this thread. In fact, I’ll also add CI servers to the mix. I find it
really difficult to plug CI systems into the rest of the tools. I want to
treat change as a common event: deployments, software upgrades,
migrations, and backups should all be treated as change. To model such a
generic notion of change, I need the Chef API to be available as
first-class domain objects in a CI server, or the reverse. This thread
has already highlighted the pain of not having any standard API around
monitoring systems. Nagios has still been easy to work with, as it
represents its entire configuration as raw text files, while Sensu is
even easier, as it can eat JSON.
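
As a rough illustration of the "eats JSON" point, the sketch below (Python) builds a Sensu-style check document from a mapping of roles to checks. The `{"checks": {name: {"command", "subscribers", "interval"}}}` layout follows Sensu's check definitions as I understand them; the `sensu_checks_for_roles` helper, the role names, and the command strings are placeholders, not anything taken from this thread.

```python
import json


def sensu_checks_for_roles(role_checks, interval=60):
    """Build a Sensu-style {"checks": {...}} document from a role -> checks map."""
    checks = {}
    for role, commands in sorted(role_checks.items()):
        for check_name, command in sorted(commands.items()):
            # The first command seen for a check name wins; later roles
            # that list the same check are only added as subscribers.
            entry = checks.setdefault(
                check_name,
                {"command": command, "subscribers": [], "interval": interval},
            )
            entry["subscribers"].append(role)
    return {"checks": checks}


if __name__ == "__main__":
    mapping = {  # made-up roles and placeholder commands
        "webserver": {"check_http": "placeholder-http-check"},
        "database": {"check_disk": "placeholder-disk-check"},
    }
    print(json.dumps(sensu_checks_for_roles(mapping), indent=2, sort_keys=True))
```

Because the result is plain data, the same mapping could just as easily come from Chef roles or a search as from a hand-written dict.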

just my 2 cents.



#7

On Sep 26, 2012, at 1:37 AM, AJ Christensen aj@junglist.gen.nz wrote:

I know San Francisco DevOps (meetup) recently had a discussion
regarding Sensu with something like 60 attendees!
(http://www.meetup.com/San-Francisco-DevOps)

I don’t know about previous meetings on the topic, but there is a meeting coming up tonight on this very topic – see http://www.meetup.com/San-Francisco-DevOps/events/81251892/. I’m hoping to be there, because Sensu sounds like the kind of thing I’d like to see where I am now. And I’m certainly going to check into the various other things that Heavy Water has been doing with it.

Thanks again!


Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu


#8

On Wed, Sep 26, 2012 at 10:38 AM, Ranjib Dey ranjibd@thoughtworks.com wrote:

The Chef agent is not really an on-demand tool; it’s a daemon that
periodically polls the Chef server. When we say agent, we generally mean
something that can cater to on-demand requests. An MCollective/Chef agent
can act like one, for example.

Monitoring is really two distinctly different problems.

  1. Record the state of the system

  2. Alert on “failure”

The real problem is that you want a “bubble up” architecture that
allows aggregation and pushes data from local systems (e.g. Ganglia et al.)
for 1, and a completely different “top down” architecture that pulls a
service request for 2.
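
The split between the two problems can be sketched in a few lines of Python. Everything here is hypothetical (the `Aggregator` class, the `poll_service` function, the metric names and thresholds); it only illustrates the shape of the argument: nodes push samples up into something that records and aggregates, while alerting pulls a fresh reading from the top when it wants an answer.

```python
from collections import defaultdict


class Aggregator:
    """Problem 1: nodes push samples upward; the aggregator only records them."""

    def __init__(self):
        self.samples = defaultdict(list)

    def push(self, node, metric, value):
        self.samples[(node, metric)].append(value)

    def average(self, metric):
        """Aggregate one metric across every node that reported it."""
        values = [
            v
            for (_, m), vs in self.samples.items()
            if m == metric
            for v in vs
        ]
        return sum(values) / len(values) if values else None


def poll_service(check, threshold):
    """Problem 2: the alerter pulls a fresh reading on demand and judges it."""
    reading = check()
    return "ALERT" if reading > threshold else "OK"


if __name__ == "__main__":
    agg = Aggregator()
    agg.push("web1", "load", 0.5)   # pushed from the node side
    agg.push("web2", "load", 1.5)
    print(agg.average("load"))
    print(poll_service(lambda: 3.2, threshold=2.0))  # pulled from the top
```

Note that the two halves share no code at all, which is roughly the point: recording and alerting have different data flows and can be built from different pieces.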

In fact, I’ll also add CI servers to the mix. I find it really difficult
to plug CI systems into the rest of the tools. I want to treat change as a
common event: deployments, software upgrades, migrations, and backups
should all be treated as change. To model such a generic notion of change,
I need the Chef API to be available as first-class domain objects in a CI
server, or the reverse.

I think doing “chef client API as first class object” would be
possible in Jenkins.

If we look at CI as fundamentally an action system based on “change of state”,
I think you could easily use it to replace much of the current #2 use case of
monitoring. Both CI and alerting have essentially the same internal engine.
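
A minimal sketch of that shared engine, in Python, might look like the following. The `StateChangeEngine` class and the item names are hypothetical and are not how Jenkins or any monitoring system is actually implemented; the sketch only shows that a build going from passing to failing and a service going from up to down can be handled by the same "act on change of state" loop.

```python
class StateChangeEngine:
    """Fire handlers only when an item's observed state differs from last time."""

    def __init__(self):
        self.last_state = {}
        self.handlers = []

    def on_change(self, handler):
        """Register a callable taking (item, old_state, new_state)."""
        self.handlers.append(handler)

    def observe(self, item, state):
        """Record a state; return True if it triggered the handlers."""
        known = item in self.last_state
        previous = self.last_state.get(item)
        self.last_state[item] = state
        if known and previous != state:
            for handler in self.handlers:
                handler(item, previous, state)
            return True
        return False  # first sighting or no change: stay quiet


if __name__ == "__main__":
    engine = StateChangeEngine()
    engine.on_change(lambda item, old, new: print(f"{item}: {old} -> {new}"))
    engine.observe("build:myapp", "passing")   # first sighting, no event
    engine.observe("build:myapp", "failing")   # CI-style event
    engine.observe("service:nginx", "up")
    engine.observe("service:nginx", "down")    # alerting-style event
```

Staying quiet on repeated identical states is what keeps such an engine from generating the kind of noise complained about earlier in the thread.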

The more I look at Jenkins, the more I see “Personalized Alerting Dashboard”.

  • Booker C. Bense