Availability, redundancy, replication, etc.

Hi folks,

We’ve been doing some testing and evaluation and will soon be implementing Chef Server in anger, in production, and I had a few quick questions about ensuring availability to figure out where to focus attention first.

We have two datacentres, and our primary requirement is redundancy to cover planned failover and DR situations. High availability is not really a requirement at this stage.

I am not really worried about performance or needing to scale tiers independently, so I’m happy to use a standalone server (or servers).

I mainly wanted to check that replication will work as I think it will. If I set up a Chef master server, then replicas for each site, and register the nodes with the replicas, will the replicas continue to function “normally” from a client perspective if the master server fails, until the master is restored (or rebuilt)?

Is it possible to set up Chef replicas with a Hosted Chef master server?

Cheers,
CS

drsmithy:

I’ve previously set up Chef using Chef replication. I do not believe it is possible with Hosted Chef (since you don’t get access to the underlying OS to configure that side of the replication).

My setup consisted of a primary Chef server in one data center, and four others spread across the world for DR. Each data center was fairly self-contained, so it worked OK. Not great, but OK.
Even if the replication link is broken, the Chef server will continue to operate and deliver cookbooks to its nodes.

(Lists incoming, sorry, I like lists).

The benefits:

  • Your cookbooks are automatically synced with one upload to the master server
  • Data bags are synced as well (but they must be identical across all servers)
  • Your environments likewise (they must also be identical, including override attributes).

The downsides:

  • It sometimes loses sync (I think it was latency, but I’m not sure) and requires a manual restart to get it back in sync
  • Changes end up on every replicated server, period. Version pinned? Doesn’t matter, because your environments are duplicated too (see the sketch after this list). Be careful with your change planning and controls.
  • Users (and their keys) are not replicated. You have to address each server as a stand-alone and maintain your user key on each (unless you manually sync the private keys… not the most trivial thing to do)
  • Node data is not shared. See the point above about addressing each as a stand-alone server.
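
To make the environment point concrete: environments carry cookbook version pins, and because the environment object itself is replicated, any pin change lands on every server. A minimal sketch of such an environment file (the cookbook names and versions are made up for illustration):

    # environments/production.rb -- hypothetical example
    name 'production'
    description 'Production systems'

    # These pins travel with the environment to every replicated server,
    # so a pin bump intended for one site takes effect everywhere.
    cookbook_versions(
      'base'   => '= 1.4.2',
      'apache' => '~> 2.0'
    )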

For my situation, I overcame the downsides by:

  • Syncing user keys. Once you figure out how to do this, it can be automated with a little work.
  • Avoiding node attributes in roles and environments. I personally consider node attributes in these locations to be an anti-pattern, and go with role cookbooks that if/case/unless off node.chef_environment where needed (a rough sketch follows this list). It’s easier to version control, you have one place to look when making changes, and you can make those changes programmatically as needed.
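
As a rough sketch of that role-cookbook approach (the cookbook, recipe, and attribute names here are hypothetical):

    # role-webserver/recipes/default.rb -- hypothetical role cookbook
    # Branch on the environment in cookbook code instead of stashing
    # attributes in roles/environments, so the logic is version controlled.
    case node.chef_environment
    when 'production'
      node.default['myapp']['worker_count'] = 8
    when 'staging'
      node.default['myapp']['worker_count'] = 2
    else
      node.default['myapp']['worker_count'] = 1
    end

    include_recipe 'myapp::server'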

If you can live with the downsides, it can work for you. That being said, I would offer a better solution: set up stand-alone Chef servers and use a CI/CD pipeline (Bamboo, Jenkins, Chef Delivery) to deliver your cookbooks and other data to the servers.

This will allow you to keep things in sync for disaster recovery purposes, and will help improve your process (TDD is a wonderful thing). Start with a simple pipeline that handles cookbook upload and version pinning for you, then add simple tests (Foodcritic, RuboCop, ChefSpec, maybe even have it run Test Kitchen automatically) and it will help you out in the long run.
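
A minimal sketch of that kind of pipeline, written here as a Rakefile a CI job could invoke (the cookbook name, server nicknames, and per-server knife config files are assumptions; foodcritic, rubocop, rspec/ChefSpec, and knife are the standard CLI entry points):

    # Rakefile -- hypothetical CI steps run on every commit
    task :lint do
      sh 'foodcritic .'   # Chef correctness lints
      sh 'rubocop'        # Ruby style checks
    end

    task :unit do
      sh 'rspec'          # ChefSpec unit tests under spec/
    end

    # Upload to each stand-alone Chef server so they stay in sync for DR.
    CHEF_SERVERS = %w[chef-dc1 chef-dc2]  # hypothetical knife config names

    task :publish do
      CHEF_SERVERS.each do |server|
        sh "knife cookbook upload mycookbook --config .chef/#{server}.rb"
      end
    end

    task default: [:lint, :unit, :publish]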

Trust me, I’ve converted to this method and it helps greatly.

–Jp Robinson