I think I'm doing it wrong: DNS as an example


#1

Let’s take DNS (with route53) as an example:

Each node uses an LWRP (based on HW’s route53 cookbook) to check
route53 and add itself to DNS if needed. This seems like a common
pattern and is all good.

However, what about when you have, say, 5000 nodes? It just seems
absolutely silly to have each node do this every hour. While it does
make sure that new nodes get added to DNS right away - it just seems
unnecessary to do this every chef run.

Now, imagine the above but with 3 or 4 services - API calls for
monitoring, load balancing, etc. The “LWRP every chef run” is easy
and makes sense when you have a relatively small number of nodes.

How are other large installs handling this?

I was thinking that a script that once every x minutes scraped route53
and chef and just applied the “diff” would be more suitable for
"large" installs.

Or am I just fretting over nothing?

–Brian
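
A minimal sketch of the "scrape both sides and apply the diff" idea, in plain Ruby. It assumes both sides have already been fetched into hashes of name to IP; the Chef search and the route53 listing are left out, and `dns_diff` is a hypothetical helper, not part of any cookbook.

```ruby
# Hypothetical reconciliation pass: compare what Chef says should exist
# with what the DNS zone currently contains, and emit only the changes.
# Both inputs are plain hashes of { fqdn => ip }.
def dns_diff(desired, actual)
  changes = []
  desired.each do |name, ip|
    if !actual.key?(name)
      changes << [:create, name, ip]   # in Chef, missing from DNS
    elsif actual[name] != ip
      changes << [:update, name, ip]   # present, but pointing elsewhere
    end
  end
  (actual.keys - desired.keys).each do |name|
    changes << [:delete, name, actual[name]]  # in DNS, no longer in Chef
  end
  changes
end

desired = { "web1.example.com" => "10.0.0.1", "web2.example.com" => "10.0.0.2" }
actual  = { "web1.example.com" => "10.0.0.1", "old.example.com"  => "10.0.0.9" }

dns_diff(desired, actual).each { |op, name, ip| puts "#{op} #{name} #{ip}" }
```

With this shape, one process makes a handful of API calls per interval instead of every node making one per run. The `:delete` entries deserve care in practice, since a zone may hold records that Chef never managed.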


#2

You could make your recipe more intelligent.

Have an attribute default[:dns][:set] = false, then node.set[:dns][:set] = true in the first run.

Check that in the recipe and avoid hitting the API if it is true.

- cassiano
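
A rough sketch of the flag approach, written as plain Ruby so it runs outside Chef. The hash stands in for node attributes and the block stands in for the route53 call; `register_dns_once` is a hypothetical name. In a real recipe the flag would live in node[:dns][:set] and be persisted with node.set, as described above.

```ruby
# Stand-in for Chef node attributes: a plain Hash, so the sketch runs
# without Chef. The block passed in represents the expensive API call.
def register_dns_once(attrs, &api_call)
  attrs[:dns] ||= { set: false }         # mimics the attribute default
  return :skipped if attrs[:dns][:set]   # already registered on an earlier run

  api_call.call                          # hit the DNS API exactly once
  attrs[:dns][:set] = true               # remember it for later runs
  :registered
end

node_attrs = {}
calls = 0
first  = register_dns_once(node_attrs) { calls += 1 }
second = register_dns_once(node_attrs) { calls += 1 }
puts "#{first}, #{second}, API calls: #{calls}"
```

The trade-off is that the flag only records that registration happened once. If the record is later deleted or the node's address changes, nothing notices until the flag is reset.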



#3

In our environment, the ‘dns server’ node searches Chef and generates DNS
configs based on the search (no LWRP, everything goes in there).
We use the same node for managing our load balancers: one data bag item
for each pool in the load balancer defines the pool configuration
and which role and environment to search for; the node then
searches Chef for matching nodes and adds them.

It's a different model from having each node update itself, but with 500
nodes we found it was much more efficient (that, and the load balancer
didn’t like hundreds of nodes checking in every few minutes).

-jesse
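
The central-node model above could look roughly like this. It is only a sketch: the node list is hard-coded where a recipe would use a Chef search (something like `search(:node, "role:web AND chef_environment:prod")`), and the data bag item layout and the helper names are assumptions, not the actual setup described.

```ruby
# Stand-in for Chef search results, hard-coded so the sketch runs alone.
NODES = [
  { name: "web1", role: "web", env: "prod",    ip: "10.0.0.1" },
  { name: "web2", role: "web", env: "prod",    ip: "10.0.0.2" },
  { name: "web3", role: "web", env: "staging", ip: "10.0.1.1" },
  { name: "db1",  role: "db",  env: "prod",    ip: "10.0.0.5" },
]

# Hypothetical data bag item describing one load balancer pool.
POOL = { "id" => "www", "port" => 80, "role" => "web", "environment" => "prod" }

# Select the nodes a pool item asks for and format them as pool members.
def pool_members(pool, nodes)
  nodes
    .select { |n| n[:role] == pool["role"] && n[:env] == pool["environment"] }
    .sort_by { |n| n[:name] }
    .map { |n| "#{n[:ip]}:#{pool['port']}" }
end

# Render one A record line per node, zone-file style.
def zone_lines(nodes, domain)
  nodes.sort_by { |n| n[:name] }.map { |n| "#{n[:name]}.#{domain}. IN A #{n[:ip]}" }
end

puts pool_members(POOL, NODES).inspect
puts zone_lines(NODES, "example.com")
```

Sorting before rendering keeps the generated output stable between runs, so the config file only changes when the set of nodes does.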



#4

At this kind of scale I think you need to stop thinking of Chef as the
source of truth for the current state of your infrastructure, and start
looking at tools like Zookeeper.

A recent Food Fight Show
(http://foodfightshow.org/2013/03/episode-46-zookeeper1.html) had
a great discussion about Zookeeper.


#5

I agree with Mathieu. Chef for setup / maintenance, ZK for “what does the
system look like right now”. The tools for this are young in the Ruby
community, but I deeply believe that a Paxos-style consensus system like ZK
is the way forward when you’re talking about large-scale devops.

-Kevin

On Mon, Mar 25, 2013 at 11:42 AM, Mathieu Martin <webmat@gmail.com> wrote:

At this kind of scale I think you need to stop thinking of Chef as the
source of truth for the current state of your infrastructure, and start
looking at tools like Zookeeper.

A recent Food Fight Show (http://foodfightshow.org/2013/03/episode-46-zookeeper1.html) had
a great discussion about Zookeeper.


#6

At Opscode, we use Dyn DNS. At first we used a naive approach, setting the DNS record on every run, but eventually we ran into API throttling problems. In this case it was simple enough to verify that the host already had the CNAME it wanted and skip the API call.

Re: suggestions to use ZooKeeper: go for it if you get enough value to justify the management overhead.


Daniel DeLeo
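
A sketch of the "check the CNAME first, skip the API call if it is already right" approach, not the actual Opscode code. The lookup uses Ruby's standard `resolv` library; `ensure_cname` and the block standing in for the provider API are hypothetical, and the comparison assumes both names are written without a trailing dot.

```ruby
require "resolv"

# Default lookup: ask DNS for the CNAME targets of a name (standard library).
DNS_LOOKUP = lambda do |name|
  Resolv::DNS.open do |dns|
    dns.getresources(name, Resolv::DNS::Resource::IN::CNAME).map { |r| r.name.to_s }
  end
end

# Call the (hypothetical) provider API only when the record is missing or
# points somewhere else. Returns true if an update was sent.
def ensure_cname(name, target, lookup: DNS_LOOKUP)
  return false if lookup.call(name).include?(target)
  yield name, target
  true
end

# Usage with a stubbed lookup so the example does not touch the network.
stub = ->(_name) { ["lb.example.com"] }
sent = ensure_cname("www.example.com", "lb.example.com", lookup: stub) do |n, t|
  puts "would call the DNS API: #{n} -> #{t}"
end
puts "update sent: #{sent}"
```

A plain DNS query is cheap and is not subject to the provider's API throttling, which is the point of checking it first. Unlike a stored flag, this check also notices when a record has gone missing or changed.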


#7

Thanks for all the feedback. We’ve been poking at using Zookeeper +
chef for a while, but haven’t had the nerve to just do it. I suppose
it’s time we looked more closely at it.


#8

I'm experimenting with something similar. Interfacing with Zookeeper directly
is a bit of a pain. I am mostly playing with DCell, which can use Zookeeper or
Redis for maintaining the node registry.

One thing to note is that we have to keep very dynamic data out of
Zookeeper, as it performs best in read-heavy, write-light scenarios. Also, in
certain cases I want to get the information from the other node directly
instead of from a central registry.

On Mon, Mar 25, 2013 at 11:45 AM, Brian Akins <brian@akins.org> wrote:

Thanks for all the feedback. We’ve been poking at using Zookeeper +
chef for a while, but haven’t had the nerve to just do it. I suppose
it’s time we looked more closely at it.