Latency on search indices


#1

Ohai,

What’s the expected latency between executing a command and the search
index being updated (we’re using hosted chef)?

I’ve seen several times now that it can take 15-20 seconds from moving a
node from one environment to another before it’s returned by search. Is
this normal?

/Jeppe


#2

That is probably right as solr needs to reindex the data. I’ve seen
similar when using hosted chef.

-Pete

On Mon, Jun 24, 2013 at 3:52 AM, Jeppe Nejsum Madsen jeppe@ingolfs.dk wrote:

Ohai,

What’s the expected latency between executing a command until the search
index is updated (we’re using hosted chef)?

I’ve seen several times now that it can take 15-20 seconds from moving a
node from one environment to another before it’s returned by search. Is
this normal?

/Jeppe


#3

Hi,

Just be warned that the lag can be much more significant. We run the
open source variant of chef for now and have noticed lags of anywhere
from <1s to ~120s between the data being inserted and it being
accessible via search. In most cases we rely on eventual consistency
to catch up, but for directed changes, particularly in our CD
pipeline, we insert data bag items and then poll until they are
available before progressing.
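
The insert-then-poll step can be sketched roughly as below. This is only an illustration in Python, not the code from the pipeline described here: `wait_for_search`, its parameters and the 180 second default timeout are hypothetical, and `run_search` stands in for whatever call queries the Chef server's search index.

```python
import time
from typing import Callable, Sequence


def wait_for_search(
    run_search: Callable[[], Sequence],
    is_ready: Callable[[Sequence], bool],
    timeout: float = 180.0,   # assumption: a bit above the ~120s worst case seen
    interval: float = 2.0,    # assumption: pause between polls, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence:
    """Poll run_search() until is_ready(results) is true.

    Raises TimeoutError if the index has not caught up within `timeout`
    seconds. `clock` and `sleep` are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        results = run_search()
        if is_ready(results):
            return results
        if clock() >= deadline:
            raise TimeoutError(f"search index not updated within {timeout}s")
        sleep(interval)
```

In a pipeline, `run_search` would wrap the real search query for the item that was just inserted, and `is_ready` would check that the new item (or the new value) is present in the results.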

We use search extensively and have a large number of Windows nodes,
which seems to place pressure on the indexing system. To reduce the
impact we have started to strip out lots of Windows ohai data and to use
the partial search cookbook where possible. That, combined with a bit more
memory for the chef box, has reduced the lag a little.
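
To give an idea of what stripping attribute data means, here is a small Python sketch that keeps only a whitelist of attribute paths from a nested node document. It is not how Chef or ohai does it; `prune_attributes` and the attribute names in the example are purely illustrative assumptions.

```python
import copy
from typing import Any, Dict, Iterable


def prune_attributes(attrs: Dict[str, Any], keep: Iterable[str]) -> Dict[str, Any]:
    """Return a new dict holding only the dotted paths listed in `keep`.

    Paths that do not exist in `attrs` are silently skipped. Kept values
    are deep-copied, so the original document is left untouched.
    """
    pruned: Dict[str, Any] = {}
    for path in keep:
        parts = path.split(".")
        src: Any = attrs
        # Walk down the source document; give up on this path if any
        # segment is missing or is not a nested mapping.
        for part in parts:
            if not isinstance(src, dict) or part not in src:
                break
            src = src[part]
        else:
            # Rebuild the same nesting in the pruned copy.
            dst = pruned
            for part in parts[:-1]:
                dst = dst.setdefault(part, {})
            dst[parts[-1]] = copy.deepcopy(src)
    return pruned


if __name__ == "__main__":
    # Hypothetical node data; real ohai output is far larger.
    node = {
        "ipaddress": "10.0.0.5",
        "kernel": {"name": "nt", "drivers": {"a": 1, "b": 2}},
    }
    print(prune_attributes(node, ["ipaddress", "kernel.name"]))
```

The smaller the document that gets saved, the less there is to index on every run, which is the effect we were after.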

Not sure if that helps …


Cheers,

Peter Donald


#4

On Jun 24, 2013, at 3:20 PM, Peter Donald peter@realityforge.org wrote:

Just be warned that the lag can be much more significant. We run the
open source variant of chef for now and have noticed lags of anywhere
between <1s to ~120s between the data being inserted and it being
accessible via search.

What version of chef-server are you running? Is it Chef 11.x, 10.x, or something older?

We use search extensively and have a large number of windows nodes
which seems to place pressure on the indexing system.

Windows nodes put a heavier load on the Chef server, but I would think that chef-server 11.x should be a lot more capable of handling large numbers of clients (even large numbers of Windows nodes) than the older 10.x-based versions.

After all, my understanding is that Hosted Chef is basically the world’s largest instance of Private Chef 11.x set up in a multi-tiered structure, and I believe that Private Chef has been proven by partners like Facebook, Etsy, Netflix, etc… to scale to at least tens of thousands of nodes on a single Private Chef 11.x cluster.

To reduce the
impact we have started to strip out lots of windows ohai data and use
partial search cookbook where possible. That combined with a bit more
memory for the chef box has reduced the lag a little.

I’ve always wondered why ohai generates such massive amounts of information per node (regardless of platform), and why all of this information is usually considered “important enough” to be saved and indexed after every single run. Windows nodes might be worse in this respect, but the problem isn’t all that much better on most *nix nodes.

It seems to me that Ohai data that is going to be saved should be minimized to start with, and then if there are extra bits of information you want/need to be available via search then you should be able to handle those appropriately. Or, at the very least, maybe give us levels of index priority, and some Ohai data should be considered “high priority” and available via search at very low latency, but 90-99% of the rest of the Ohai data should be considered “low priority”.


Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu


#5

Hi,

On Tue, Jun 25, 2013 at 10:46 AM, Brad Knowles brad@shub-internet.org wrote:

Just be warned that the lag can be much more significant. We run the
open source variant of chef for now and have noticed lags of anywhere
between <1s to ~120s between the data being inserted and it being
accessible via search.

What version of chef-server are you running? Is it Chef 11.x, 10.x, or something older?

We actually use Chef 11, but to be honest most of the strategies we
used to address this were developed when we were on 10.x. There is
still a delay between when a data bag item is inserted/updated and
when it appears in the search results, so I assume the lag is still
present for node data, but I have not tested that.

Windows nodes put a heavier load on the Chef server, but I would think that chef-server 11.x should be a lot more capable of handling large numbers of clients (even large numbers of Windows nodes) much better than the older 10.x-based versions.

After all, my understanding is that Hosted Chef is basically the world’s largest instance of Private Chef 11.x set up in a multi-tiered structure, and I believe that Private Chef has been proven by partners like Facebook, Etsy, Netflix, etc… to scale to at least tens of thousands of nodes on a single Private Chef 11.x cluster.

To be fair the chef 11 server does not seem to be particularly loaded
even when there is lag.

Also, most of those other sites can probably afford to rely on eventual
consistency. However, it would be nice to have a parameter on the
REST endpoints to block until indexing of the changed data is
complete.


Cheers,

Peter Donald


#6

On Jun 24, 2013, at 7:42 PM, Peter Donald peter@realityforge.org wrote:

We actually use Chef 11 but to be honest most of the strategies we
used to address this were developed when we were on 10.x. There is
still a delay between when a data bag item is inserted/updated and
when it appears in the search results so I assume the lag is still
present for node data but have not tested.

Ahh, data bags. I always thought that they were kind of a hack, and maybe not as well implemented as a lot of other parts of Chef.

I would be very interested to find out if you have the same latency with node data as you do with data bags.

Also most of those other sites can probably afford to rely on eventual
consistency. However it would be nice to have the parameter to the
rest endpoints to block until indexing of the changed data is
complete.

Indeed, that sounds like a knob that would be very nice to be able to tweak.


Brad Knowles brad@shub-internet.org
LinkedIn Profile: http://tinyurl.com/y8kpxu


#7

Peter Donald peter@realityforge.org writes:

Hi,

On Tue, Jun 25, 2013 at 10:46 AM, Brad Knowles brad@shub-internet.org wrote:

Just be warned that the lag can be much more significant. We run the
open source variant of chef for now and have noticed lags of anywhere
between <1s to ~120s between the data being inserted and it being
accessible via search.

What version of chef-server are you running? Is it Chef 11.x, 10.x, or something older?

We actually use Chef 11 but to be honest most of the strategies we
used to address this were developed when we were on 10.x.

We’re using windows as well. Can you outline what strategies you used?

[…]

Also most of those other sites can probably afford to rely on eventual
consistency.

Probably depends on the workflow…

When we add nodes to e.g. an environment and kick off chef-client, it’s
unfortunate that only some of the nodes get hit.

However it would be nice to have the parameter to the rest endpoints
to block until indexing of the changed data is complete.

Indeed. Otherwise we have to implement this explicitly as you mentioned
by waiting to make sure indices are updated.

/Jeppe


#8

On Tue, Jun 25, 2013 at 6:00 PM, Jeppe Nejsum Madsen jeppe@ingolfs.dk wrote:

Just be warned that the lag can be much more significant. We run the
open source variant of chef for now and have noticed lags of anywhere
between <1s to ~120s between the data being inserted and it being
accessible via search.

What version of chef-server are you running? Is it Chef 11.x, 10.x, or something older?

We actually use Chef 11 but to be honest most of the strategies we
used to address this were developed when we were on 10.x.

We’re using windows as well. Can you outline what strategies you used?

Mostly it involved reducing the load generated via search, reducing the
amount of data that went into the search indexes, and supporting explicit
"wait until" constructs that block until search is updated.

In no particular order we did the following;

We probably did a few other things as well, but the most significant
change was the introduction of the last strategy. We are now driving
more and more of our config via data bags to try and limit the idea of
"exported" resources interacting across nodes.

HTH,

Peter Donald


#9

On Tuesday, June 25, 2013 at 4:45 PM, Peter Donald wrote:

On Tue, Jun 25, 2013 at 6:00 PM, Jeppe Nejsum Madsen jeppe@ingolfs.dk wrote:

Just be warned that the lag can be much more significant. We run the
open source variant of chef for now and have noticed lags of anywhere
between <1s to ~120s between the data being inserted and it being
accessible via search.

Search is not immediately consistent. The exact latency you experience will depend on the commit frequency configured for Solr.

Newer versions of Solr have a soft commit feature which allows for much more frequent commits without adding disk IO load. Upgrading to Solr 4 is something that’s on our radar, but not being actively worked on at the moment.
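
As a rough illustration of why the observed lag varies so much, here is a deliberately simplified Python model. It assumes commits happen on a fixed period and that a write becomes searchable at the first commit after it; it ignores queueing and indexing time, and the 60 second interval in the example is a made-up number, not the actual configuration of any Chef server.

```python
import math


def visible_at(write_time: float, commit_interval: float) -> float:
    """Time of the first periodic commit at or after write_time.

    Simplified model: commits fire at 0, T, 2T, ... where T is
    commit_interval (seconds).
    """
    if commit_interval <= 0:
        raise ValueError("commit_interval must be positive")
    return math.ceil(write_time / commit_interval) * commit_interval


def search_lag(write_time: float, commit_interval: float) -> float:
    """Seconds between a write and the commit that makes it searchable."""
    return visible_at(write_time, commit_interval) - write_time


if __name__ == "__main__":
    # Hypothetical 60 second commit interval.
    for t in (1.0, 30.0, 59.0):
        print(f"write at {t:>4}s -> lag {search_lag(t, 60.0)}s")
```

Under this model a write that lands just after a commit waits almost the full interval, while one that lands just before a commit shows up nearly immediately, so the lag can be anywhere from close to zero up to the commit interval.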


Daniel DeLeo