An On Call Engineer’s guide to Habitat in production or “doing it live”
The Habitat team owns and operates our own services. We are responsible for their uptime and availability. To that end, we have a rotating 72-hour on-call period in which each Chef-employed team member participates. If you cannot cover your assigned rotation, you are responsible for getting coverage.
PagerDuty alert? Stay calm, check the Known Issues section at the end of this doc.
What you need before going on call
- Access to the Habitat team 1Password Shared vault
- Access to PagerDuty - ask #helpdesk if you need an account
- Access to Datadog - ask any team member to add you
- Access to Sumologic - ask team members or account owner (currently @benr) to add you
- Access to StatusPage - ask any team member to add you
- Access to Pingdom (creds in Chef LastPass). Pingdom currently monitors https://willem.habitat.sh/v1/depot/origins/core
- Access to Fastly - open a ticket with #helpdesk to grant access to Fastly for Habitat
- Access to the Habitat AWS console (via Okta - open a ticket with #helpdesk to grant access to Habitat AWS)
- Access to the Chef Jump host (see below)
- Ability to ssh to our production environment (see below)
- Basic familiarity with the services we run and how they are running (detailed below)
You may occasionally need to access these:
- Access to the Builder GitHub Application: https://github.com/organizations/habitat-sh/settings/apps/habitat-builder
- Access to Web Hooks: https://github.com/habitat-sh/habitat/settings/hooks
While on call, you are also expected to:
- Be available to respond to PagerDuty alerts (if you are going to be away from a computer for an extended period, you are responsible for getting someone to take on call).
- Incident response does not mean you need to solve the problem yourself.
- You are expected to diagnose, to the best of your ability, and bring in help if needed.
- Communication while troubleshooting is important.
- Triage incoming GitHub issues and PRs (best effort - at least tag issues if possible and bring critical issues to the forefront).
- Monitor #general and #operations as time permits.
- Monitor Builder dashboards as time permits
More about Chef’s incident and learning review (post-mortems) can be found at https://chefio.atlassian.net/wiki/pages/viewpage.action?spaceKey=ENG&title=Incidents+and+Post+Mortems
During your on-call rotation, it is expected that you won’t have the same focus on your assigned cards and that you might be more interrupt-driven. This is a good thing because your teammates can stay focused knowing that their backs are being watched. You benefit likewise when you are off rotation.
Handing off coverage when you’re on call
If you’ll be unavailable during a portion of your on-call duty, hand off coverage to another member of the team by setting up a PagerDuty override. See the PagerDuty docs for instructions on how to do that.
Pulling in other team members
If you need to pull in other team members to help and it's after hours, you can use PagerDuty to add "Responders" to an incident; PagerDuty will take care of notifying them for you. This works better than calling them directly, since not every team member is necessarily on your phone's emergency notification list, but PagerDuty likely is.
See PagerDuty’s documentation on Adding Responders for further details.
The Builder dashboards contain most of the key metrics that indicate the health of the system. The dashboards should be used as the top level view to monitor the Builder API, Builder Jobs, as well as internals such as Memcache status.
Access to the VPN and Chef jump host
In order to SSH or RDP into the production or acceptance Builder environments, you will need access to the Chef jump host and the VPN. You must be logged in to the VPN in order to connect to any Builder instance.
Details on accessing the VPN can be found in the Chef Wiki. If you run into issues, please reach out to the team or the IT folks for help.
Access to the jump host is provided by the helpdesk; details on requesting access can be found in the Chef Wiki.
SSH access to prod/acceptance nodes
You will need access to the 1Password shared vault (if you do not have this, ask a core team member) and to the Habitat AWS account (should be through an icon in Okta; if you don’t have this, ask the Chef internal help desk).
Copy the “habitat-srv-admin” key from the shared vault. I always put mine in my ~/.aws directory on my workstation.
There is a set of scripts in the ssh_helpers directory in the habitat repo. You can use these to automatically populate all the production (and acceptance) nodes in your ssh config; you can then refer to a node by its friendly name.
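Once the helper scripts have run, the generated entries behave roughly like the sketch below (the host name, address, and settings here are illustrative, not the actual generated values):

```
# Hypothetical entry of the kind the ssh_helpers scripts generate;
# real host names and values come from the scripts themselves.
Host live-builder-api-0
  HostName 10.0.0.10          # internal address (illustrative)
  User ubuntu
  ProxyJump jump.chef.co
```

With an entry like this in place, ssh live-builder-api-0 connects through the jump host automatically.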
It is strongly recommended that you use these scripts to manage your ssh configuration, as they will automatically set up the jump host connection for you. Please read the documentation on the scripts to understand how to use them.
NOTE: If your local workstation username is different from your @chef.io username, you will also need the following config in your ~/.ssh/config. This config is unique per user.

Host jump.chef.co
  User your_chef_username
SSH access to prod/acceptance windows-workers
Be aware that when accessing Windows nodes via SSH, the initial command interpreter is cmd.exe. Simply run powershell to enter a PowerShell prompt inside the SSH session.
Also, if you have a ~/.ssh/config file generated prior to August 2020, either regenerate the entries as shown in the section above or make sure to replace the ubuntu user in the Windows worker entries with the appropriate Windows user.
RDP access to prod/acceptance windows-workers
In order to access the windows worker nodes via RDP, you will need to create an ssh tunnel to point your RDP client of choice at.
On Linux, MacOS, and WSL, you can run:
ssh -L 33389:<windows worker dns>:3389 jump.chef.co
And then point your RDP client at localhost:33389. You will need to leave the ssh session open while you are connected to the Windows host.
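If you tunnel to a particular worker regularly, the same port forward can live in your ssh config instead (the host alias is a placeholder name, and the worker DNS name must be filled in):

```
# Illustrative ssh config equivalent of the ssh -L command above
Host rdp-worker-tunnel
  HostName jump.chef.co
  LocalForward 33389 <windows worker dns>:3389
```

Then ssh rdp-worker-tunnel opens the tunnel, and the RDP client still points at localhost:33389.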
Current state of Production
Each of the builder services runs in its own AWS instance. You can find these instances by logging in to the Habitat AWS portal. If you do not already have access, ask #helpdesk to add the Habitat AWS Portal app to your Okta dashboard. Make sure to add the X-Environment column to the search results and search for instances in the live environment.
Current state of Acceptance
The acceptance environment closely mirrors the live environment with the exception that it runs newer (the latest ‘unstable’ channel) service releases. Use the AWS portal as described above to locate the acceptance environment builder service instances. The acceptance environment also runs fewer build workers than the production environment.
Historically, trouble in production could often be resolved by restarting services. However, restarting should not be a first resort: at least in production, services should rarely need to be restarted manually, and ideally only when there is evidence in the logs that a restart is called for. Here are some generic pointers for exploring the status of production.
Sumologic currently aggregates logs from the builder API and build worker nodes. The Sumologic logs (and particularly its Live Tail ability) are invaluable for troubleshooting, and should generally be one of the first places to start the troubleshooting session.
The key Sumologic queries can be found in the Builder Searches folder:
- Live API Errors - 15 min
- Live API Access
- Live Workers
- Live Syslog
- Acceptance API Errors - 15 min
- Acceptance API Access
- Acceptance Workers
- Acceptance Syslog
For getting the general lay of the land, the “Live Syslog” can be useful, as it is an aggregated set of the journalctl logs from all the services. For build issues, the “Live Workers” logs can provide useful hints of potential build-specific issues.
Make sure you can run these queries, and also be familiar with the ability to look at a Live Tail session for these queries. You can also install the livetail CLI, create an access key under preferences, and then tail directly from your terminal like so:
➤ livetail _sourceCategory=live/syslog
Supervisor logs (syslog)
You can read the Supervisor output on any of the service instances by ssh-ing into a service node and running journalctl -fu hab-sup. If you find yourself needing to read production logs often, the -fu should roll quite naturally off the fingertips.
If there is a specific timeframe when a problem occurred, it is sometimes useful to get the logs from that specific time (UTC). For example, journalctl --since '2017-11-13 10:00:00' -u hab-sup | more will show logs from the specified time onward.
Most instances run just a single service, but a couple run two (or more). Running systemctl restart hab-sup will restart the Supervisor itself, and therefore all the services it runs. You may of course run sudo hab svc stop <IDENT> and sudo hab svc start <IDENT> to restart individual services. Run sudo hab sup status to determine which services are loaded and what state each is in.
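Putting the commands above together, a cautious single-service restart might look like this (run it only when the logs suggest a restart is warranted; <IDENT> is a placeholder for the service identifier):

```shell
# See which services are loaded and what state each is in
sudo hab sup status

# Restart one service rather than the whole Supervisor
sudo hab svc stop <IDENT>
sudo hab svc start <IDENT>

# Heavier hammer: restart the Supervisor, and therefore every service it runs
sudo systemctl restart hab-sup
```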
Here is a brief synopsis of the builder services:
- api - acts as the REST API gateway. Note that the builder-api service depends on bindings from the memcache service (which runs on the api node), as well as the datastore and the job server. When restarting services on the builder-api node, make sure that both the datastore and jobsrv have started successfully first; otherwise the bind may not be available.
- api-proxy - the NGINX front end for the REST API. It currently runs on the builder-api nodes.
- jobsrv - handles build jobs. If clicking the Request a Build button does nothing, or you get a popup message that the build was accepted but the build output is never displayed, you may need to restart this service. If package uploads fail, this service may also need to be restarted (it runs a pre-check on uploaded packages to make sure no circular dependencies are being introduced).
- worker - performs the build jobs themselves. There are currently three different worker types: Linux, Linux2, and Windows.
Querying the database
The Builder database is currently an RDS instance. There are separate instances for Acceptance and Live.
IMPORTANT: DOUBLE CHECK THE INSTANCE YOU ARE CONNECTING TO
In order to connect, you should first SSH into a service instance that supports connecting to the DB - currently, builder-api or builder-jobsrv. It is recommended that you use builder-api-0 as the node from which to connect.
- SSH to the builder-api-0 node.
- Find the RDS endpoint and password in the /hab/svc/builder-api/config/config.toml file.
- Export the PGPASSWORD environment variable:
export PGPASSWORD=<RDS password>
- Run the following command:
hab pkg exec core/postgresql psql -U hab -h <RDS endpoint> builder
Canceling all dispatching jobs
If you find you need to cancel all dispatching jobs, please follow the instructions in this article.
If you are in a position where you need to deploy a fix, the builder services (assuming they are up) make this easy. Once your fix is merged to master, a build hook will kick in and automatically rebuild the packages that need to be updated. Those unstable packages will be picked up and installed automatically in the acceptance environment. Once the fix is validated in acceptance, the packages can be promoted using either the Promote to stable button in the Builder UI or via the CLI with the hab pkg promote or hab bldr job promote commands. (Note: hab bldr job promote can promote multiple packages in one go; there is currently no UI equivalent.)
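As a sketch, the CLI promotion commands look like this (the package ident and job group id are placeholders):

```shell
# Promote a single validated package to the stable channel
hab pkg promote <origin>/<name>/<version>/<release> stable

# Or promote an entire build job group (multiple packages) in one go
hab bldr job promote <group-id> stable
```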
For Builder issues, please see Troubleshooting Builder Services.