On Aug 31, 2011, at 7:47 PM, Matt Palmer wrote:
> Configuration management doesn’t imply reliability on a giant central
> server infrastructure that’s going to have to be scaled and managed
You don’t necessarily need a giant central server infrastructure to support a good infrastructure CM system, whether that’s Puppet, Chef, or any other such tool. If these tools are doing their job, and if they are being used appropriately, they should be able to manage large numbers of servers without themselves needing a great deal of horsepower to accomplish that job.
Of course, a lot depends on what you’re asking them to do and how you’re asking them to do it. But Chef uses RabbitMQ, a message broker written in Erlang, to handle some of its most timing-critical message passing and work-queue handling, and that is extremely efficient and very low-latency, in addition to being highly reliable. That kind of stuff can scale up about as big as anything on the Internet, and without a great deal of its own internal overhead.
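To make the work-queue idea concrete, here is a minimal sketch of the pattern in plain Ruby, using the built-in thread-safe Queue. This only illustrates the general shape — Chef’s server side at the time used RabbitMQ, an AMQP broker, not anything like this toy code, and the job names here are made up.

```ruby
jobs    = Queue.new
results = Queue.new

# A small pool of workers, each pulling jobs until it sees the :done sentinel.
workers = 4.times.map do
  Thread.new do
    while (job = jobs.pop) != :done
      results << "ran #{job}"   # stand-in for real work on a node
    end
  end
end

# The producer side just enqueues work; it never blocks on the consumers.
%w[node1 node2 node3 node4 node5].each { |n| jobs << n }
workers.size.times { jobs << :done }   # one sentinel per worker
workers.each(&:join)

puts results.size   # => 5
```

The point is the decoupling: producers and consumers only share the queue, so either side can be scaled independently — which is exactly what a broker like RabbitMQ gives you across machines rather than threads.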
> My system configuration is all data… “this is what I want to
> happen”. A list of packages is no more code or data than the fact
> that I want those packages installed, and not removed. I want to
> revision control it all.
You’re installing a large list of packages, each of which has its own major and minor version numbers that also need to be tracked. Do you really want to update a big, hairy list of code every single time one of those packages is updated, and then push that out to all your machines? And do you maintain different versions of this code for different platforms that might need slightly different sets of package versions?
If you want to do it that way, I guess that’s possible.
Personally, I would consider that quite painful compared to updating the information in a data bag and having the remote Chef clients figure out which systems are affected by a major or minor version update for one of the packages they might or might not be using. And I’d keep different lists of packages in the data bag for each set of production, development, and QA/test systems. My production list of packages would not change very often at all, but my development or QA/test package sets might change more frequently. I’d run the same recipe on all of these machines, with each machine knowing it needs to pull different data out of the data bag based on the role it fills.
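To sketch what I mean: below is a hypothetical, Chef-flavored illustration in plain Ruby of a per-environment package list living in a data-bag-style structure, with each node pulling only the slice that matches its environment. The package names, versions, and the `packages_for` helper are all invented for illustration — this is not real Chef API code.

```ruby
# Imagined data-bag contents: one package/version list per environment.
PACKAGE_BAG = {
  "production"  => { "openssl" => "1.0.0d", "nginx" => "1.0.5" },
  "development" => { "openssl" => "1.0.0e", "nginx" => "1.1.2" },
}

# Each node looks up only the list for the environment its role puts it in.
def packages_for(environment)
  PACKAGE_BAG.fetch(environment)
end

# The same "recipe" runs everywhere; only the data differs per node.
packages_for("production").each do |name, version|
  puts "install #{name}-#{version}"
end
```

The win is that bumping a version for development means editing one entry in the data, not touching the recipe code, and the production list stays untouched until you deliberately promote the change.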
But maybe that’s just me.
> The installation script is simple and easy to read, if the syntax is
If you’ve got a long, hairy list of packages embedded inside the code that is supposed to install them, I wouldn’t call that simple or easy to read. But again, maybe that’s just me.
> And this is where I feel like I should stop listening to you, because
> you assume I’ve never managed large scale systems. I’ve done 500+
> nodes with Puppet, and 2,500+ systems under management in bodgy
My large scale experience goes back to AOL in the mid-90s. At that time, tools like Chef and Puppet didn’t exist. The only thing we had was a very early version of cfengine, and that was seriously painful. Fortunately for me, I didn’t have to maintain it, and I was only personally responsible for maintaining 100+ systems that made minimal use of the CM system, versus the many thousands of other servers that we had throughout the service.
My more recent experience with infrastructure CM systems comes from using cobbled-together Kickstart/Jumpstart scripts front-ended with m4 pre-processing, when I was working at UT Austin a couple of years ago. We were replacing all that stuff with bcfg2, and again I was one of the early adopters for the projects I was working on, but again they were just going to be CM clients, and only a couple dozen at that. I was a few levels removed from having to support the 50K+ students and the 20K+ faculty & staff, and I didn’t have much in the way of responsibilities for helping to manage the other few hundred servers that we ran off those cobbled-together Jumpstart/Kickstart scripts.
My experience here is even smaller, at least to date. We’re starting with a couple of small VMs for the next-generation back-end server infrastructure, but we want to be able to easily scale our systems up to supporting one or more hardware appliances (or software equivalents) in every single household throughout the country and ultimately the world, so I think that puts us on the scale of at least hundreds of millions of appliance installs. Chef would not be used to manage the appliance installs directly, at least not initially. We want to get experience with using it to support our back-end server infrastructure before we start looking at the really big fish.
> If there are better ways to do it with Chef, I’m open to them, but I
> do have plenty of experience in this field, and so far my
> experiences are telling me that the way Chef does it is a monumental
> pain in the arse at scale. However, I’m willing to learn that I’m
> wrong, so point me at the documentation that explains clearly and
> simply why the Chef way works better.
And this is where I have to step back myself, because I do not yet know enough about Chef in this particular respect to be able to provide any further guidance. I will be very interested to see/hear what you find out.
Brad Knowles firstname.lastname@example.org