Rebuilding Builder workers (live)

This should not be needed often. However, the worker VMs can occasionally end up in an unrecoverable or otherwise undesired state, and the best solution is then to rebuild them.

This document targets the Live environment; the steps for Acceptance are similar.

Pre-requisites

Set up the Terraform environment following the instructions here: https://github.com/habitat-sh/cloud-environments

This doc assumes the setup instructions above are up to date and that you have been able to run make init successfully.

Steps

  1. Set up a maintenance window in status.io, since active builds may be impacted.
  2. Change directory to the cloud-environments/builder-live folder.
  3. In the default.tf file, change the jobsrv_worker_count value to 0:
       jobsrv_worker_count    = 0
  4. Run terraform plan. The plan should show that the workers (and related networks) will be deleted. Double and triple check that no other services are being deleted or changed (one exception is the aws_s3_bucket, which seems to always want to update itself).
  5. If all looks good, run terraform apply.
  6. After all the instances are deleted, change jobsrv_worker_count back to 50 (or whatever the original value was).
  7. Repeat the terraform plan and terraform apply steps.
  8. Once all the worker instances are re-created, ssh into the builder-datastore node and re-run the apply_config.sh script (this ensures that all the key files are properly sent over to the workers).
  9. Update the maintenance window to complete.
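
The Terraform part of the steps above can be sketched as a small shell script. This is a minimal sketch, not an official tool: the function names set_worker_count, confirm and rebuild_workers are hypothetical, and it assumes default.tf contains a single line of the form "jobsrv_worker_count = <number>". The status.io and apply_config.sh steps are left manual.

```shell
# Hypothetical helper: rewrite the jobsrv_worker_count value in a Terraform file.
# Usage: set_worker_count <file> <count>
set_worker_count() {
  case "$2" in
    ''|*[!0-9]*) echo "count must be a non-negative integer" >&2; return 1 ;;
  esac
  swc_tmp="$(mktemp)" || return 1
  # Keep the original spacing around "=" and replace only the number.
  sed -E "s/^([[:space:]]*jobsrv_worker_count[[:space:]]*=[[:space:]]*)[0-9]+/\1$2/" "$1" > "$swc_tmp" || return 1
  # Write back through the original file so its permissions are preserved.
  cat "$swc_tmp" > "$1" && rm -f "$swc_tmp"
}

# Hypothetical helper: stop unless the operator explicitly answers "y".
confirm() {
  printf '%s [y/N] ' "$1"
  read -r answer
  [ "$answer" = "y" ]
}

# Hypothetical wrapper: scale the workers to 0, then back to the original count.
# Runs in a subshell so the caller's working directory is left unchanged.
# Usage: rebuild_workers <original_count>
rebuild_workers() (
  cd cloud-environments/builder-live || exit 1
  for count in 0 "$1"; do
    set_worker_count default.tf "$count" || exit 1
    terraform plan || exit 1
    # Review the plan by hand: only workers and related networks should change.
    confirm "Plan reviewed for jobsrv_worker_count = $count. Apply?" || exit 1
    terraform apply || exit 1
  done
)
```

For example, rebuild_workers 50 would walk through both plan/apply rounds, pausing for a manual review of each plan before anything is applied.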

NOTE:

If you see the following error during worker creation, you can ignore it. It comes from trying to clean up networks that don't actually exist.

module.builder_environment.module.builder.null_resource.worker_studio_network (remote-exec): error: Invalid value for '--ns-dir <NS_DIR>': directory '/hab/svc/builder-worker/data/network/airlock-ns' cannot be found, must exist
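
If the terraform apply output has been saved to a file, a filter along these lines can help confirm that the message above is the only error reported. This is a sketch under assumptions: the function name unexpected_errors and the log file name apply.log are hypothetical, and matching on the word "error" is a rough heuristic rather than a complete check.

```shell
# Hypothetical helper: print lines mentioning "error" from stdin, skipping the
# known harmless airlock-ns message. Prints nothing if no other errors appear.
unexpected_errors() {
  grep -i 'error' \
    | grep -vF "directory '/hab/svc/builder-worker/data/network/airlock-ns' cannot be found" \
    || true
}
```

One possible usage is to capture the output with terraform apply 2>&1 | tee apply.log and then run unexpected_errors < apply.log; any line it prints deserves a closer look.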