Request for Comment: Chef Profiling


#1

Not in the “Chef RFC” sense of the term, but I would like to get a thread going about what kind of data would be useful to people trying to improve the performance of Chef cookbooks. I’ve got a framework set up in https://github.com/poise/poise-profiler based on the output from the original chef-handler-profiler but I would like to get some more ideas from others. I know Chef recently added the Ruby profiler integration, so my goal is to focus more on Chef-level metrics that can be hard to tease out of a Ruby-level profile. Right now I’ve got a breakdown by (resource type, resource name) pairs and by resource class. I think a first step would be to break those down further into “number of hits, total time, time per hit” a la more traditional profiling suites. Any thoughts beyond that?
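To make the “number of hits, total time, time per hit” idea concrete, here is a minimal sketch in plain Ruby. It is not poise-profiler’s actual code; `Stat`, `summarize`, and `format_report` are hypothetical names, and the input is assumed to be a list of (key, elapsed seconds) pairs where the key could be a resource class or a “type[name]” string.

```ruby
# Hypothetical aggregation of per-resource timings (illustrative only,
# not taken from poise-profiler or chef-handler-profiler).
Stat = Struct.new(:hits, :total) do
  # Average seconds per hit; avoids dividing by zero for empty stats.
  def per_hit
    hits.zero? ? 0.0 : total / hits
  end
end

# samples: array of [key, elapsed_seconds] pairs.
# Returns [key, Stat] pairs sorted by total time, slowest first.
def summarize(samples)
  stats = Hash.new { |h, k| h[k] = Stat.new(0, 0.0) }
  samples.each do |key, elapsed|
    stat = stats[key]
    stat.hits += 1
    stat.total += elapsed
  end
  stats.sort_by { |_key, stat| -stat.total }
end

# Renders one line per key in a traditional profiler layout.
def format_report(samples)
  summarize(samples).map do |key, stat|
    format('%-20s hits=%-3d total=%.3fs per_hit=%.3fs',
           key, stat.hits, stat.total, stat.per_hit)
  end
end

puts format_report([['package', 0.5], ['package', 0.25], ['template', 1.0]])
```

Sorting by total time puts the biggest contributors first, while the per-hit column separates “slow resource” from “cheap resource used many times”.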


#2

In my experience, one of the most important profiling dimensions for Chef
is the number of external network requests a cookbook makes, including but
not limited to object requests to the Chef Server.

For example, I had a cookbook loading data bags in a loop. After analysis, I
realized that I could get a significant performance gain (and load
reduction on the Chef Server side) by refactoring those data bags (they all
had the same “schema”) into a single data bag and looping over the keys
instead.

This may be obvious to folks, but it was an “oh, oops” moment of the exact
nature I’d like a profiler to tell me about :)
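A rough illustration of why the refactor helps, in plain Ruby rather than the recipe DSL. `CountingStore` is a hypothetical stand-in for the server that only counts fetches, and the bag and item names are made up; the merged layout shown is one possible reading of “a single data bag”.

```ruby
# Hypothetical stand-in for the Chef Server: every fetch_item call
# models one network round trip and is counted.
class CountingStore
  attr_reader :requests

  def initialize(bags)
    @bags = bags
    @requests = 0
  end

  def fetch_item(bag, item)
    @requests += 1
    @bags.fetch(bag).fetch(item)
  end
end

# Before: one request per data bag, made inside a loop.
def load_per_bag(store, names)
  names.map { |name| [name, store.fetch_item(name, 'config')] }.to_h
end

# After: one request, then loop over the keys locally.
def load_merged(store)
  store.fetch_item('apps', 'all')
end

before = CountingStore.new({
  'app_a' => { 'config' => { 'port' => 8001 } },
  'app_b' => { 'config' => { 'port' => 8002 } },
  'app_c' => { 'config' => { 'port' => 8003 } }
})
load_per_bag(before, %w[app_a app_b app_c])

after = CountingStore.new({
  'apps' => { 'all' => { 'app_a' => { 'port' => 8001 },
                         'app_b' => { 'port' => 8002 },
                         'app_c' => { 'port' => 8003 } } }
})
load_merged(after).each_key { |name| name } # iterate keys without new requests

puts "per-bag requests: #{before.requests}, merged requests: #{after.requests}"
```

A profiler that reported a request count per recipe would surface the first pattern immediately.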


#3

I’m thinking about externalities as well - in this case, metrics on subprocesses. Number of times we shell out, time spent in subprocesses, etc.

Example use case: some (more obscure, less optimized) package providers shell out once per package to get package status, rather than querying them all in a single call. It would be good to detect that.
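As a sketch of what subprocess metrics could look like, the wrapper below counts shell-outs and accumulates wall-clock time per executable. `ShellOutTracker` and its `runner:` parameter are hypothetical; this does not hook into Chef itself, and a real integration would need to wrap wherever Chef actually shells out.

```ruby
require 'open3'

# Hypothetical tracker: counts invocations and time per executable.
class ShellOutTracker
  attr_reader :calls

  # runner is injectable so the tracker can be exercised without
  # spawning real processes; by default it uses Open3.capture2e.
  def initialize(runner: ->(cmd) { Open3.capture2e(*cmd) })
    @runner = runner
    @calls = Hash.new { |h, k| h[k] = { count: 0, seconds: 0.0 } }
  end

  # Runs the command (given as separate arguments) and records metrics
  # keyed by the executable name.
  def run(*cmd)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = @runner.call(cmd)
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    entry = @calls[cmd.first]
    entry[:count] += 1
    entry[:seconds] += elapsed
    result
  end

  # Executables invoked more than `threshold` times: candidates for batching.
  def repeated(threshold = 1)
    @calls.select { |_exe, entry| entry[:count] > threshold }.keys
  end
end
```

With something like this in place, a provider that queries package status one package at a time would show up as a single executable with a high count and a large cumulative time.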


#4

Cool stuff!

I think one of the most critical parts of the current Chef Client run is the “parts you don’t see” - things that occur during the compile phase and aren’t added to the overall timings (at least last time I checked).

So to add to the current sentiments - timings on “how many calls were made to chef-server, how long did they take”.

Tangentially, to @brianhatfield’s scenario of “many data bag items” - that seems to be a common pattern, and this is largely due to Chef Server not implementing a bulk_get_data_bag_items API call - there seems to be scaffolding for it in Chef Server already, but it doesn’t appear “complete”. Might be an interesting weekend project for someone wanting to get some Erlang in.