Postmortem for DNS issues on 2016-07-22

2016-07-22 - Habitat DNS issue

Start every PM stating the following

  1. This is a blameless Post Mortem.
  2. We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
  3. All follow-up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow-up item.

Incident Leader: Dave Parfitt

Description: DNS issues


All times in UTC

Searching for core/hab-pkg-dockerize in remote
» Installing core/hab-pkg-dockerize
✗✗✗ failed to lookup address information: Try again
  • 6:48 PM: Dave Parfitt checks DNS; lookups are returning SERVFAIL from around the world.
  • 6:49 PM: Jamie Winsor asks if DNS entries were entered manually since the last incident.
    • Route53 DNS entries WERE entered manually; at the previous postmortem it was determined that no action was needed to update Terraform.
  • 6:51 PM: Dave Parfitt declares the incident, starts a zoom session.
  • 6:53 PM: Jamie Winsor updates the Terraform DNS info via
  • 6:54 PM: the PR has been applied with Terraform
  • 7:09 PM: Route53 NS records are correct
  • 7:13 PM: TTL is 172800 seconds (2 days)
  • 7:28 PM: periodically checking DNS via
  • 7:28 PM: contacted Chef ops, including Ben Rockwood, Josh Brand, Mark Harrison
  • 7:30 PM: Mark Harrison suggests clicking the “Refresh Cache” button on the OpenDNS check page.
  • 7:32 PM: Josh Brand flushes the Google DNS cache
  • 7:33 PM: all OpenDNS checks return success
  • 7:36 PM: incident closed
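
The periodic DNS checks in the timeline could be approximated with a small script. This is a minimal sketch, not the tooling used during the incident: `check_resolution` is a hypothetical helper, and the mapping of the "Try again" message in the install log to `EAI_AGAIN` is an assumption about how the resolver library reports an upstream failure such as SERVFAIL.

```python
import socket
from typing import Callable


def check_resolution(hostname: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Attempt one lookup and classify the outcome.

    Returns "ok", "temporary-failure", or "failure".
    The resolver argument is injectable so the logic can be exercised offline.
    """
    try:
        resolver(hostname, 443)
    except socket.gaierror as err:
        # EAI_AGAIN means the resolver could not get an answer right now.
        # Assumption: this is the condition behind the "Try again" message.
        if err.errno == socket.EAI_AGAIN:
            return "temporary-failure"
        return "failure"
    return "ok"
```

Calling `check_resolution("example.com")` (a placeholder hostname) in a loop with a sleep between attempts gives a rough view of when lookups recover. It only reflects the resolver configured on the machine running it, so it does not replace checks against public resolvers around the world.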

Contributing Factor(s)

Changes applied manually during the ChefConf Habitat DNS issue were not committed to the Terraform repo.

Stabilization Steps

Apply the correct DNS settings to the Habitat Terraform repo.


Impact

  • Some users couldn’t access the site. Terraform apply had been run 3 days prior, with a 2-day TTL.
  • hooks from GitHub couldn’t hit the site
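
The gap between the Terraform apply and the visible failure is easier to reason about with the TTL arithmetic written out. The sketch below is illustrative only: `cache_expiry` is a hypothetical helper and the timestamp is invented, not taken from the incident. It shows that a resolver which cached a record shortly before a change may keep serving the old answer for up to the full TTL.

```python
from datetime import datetime, timedelta

# TTL value recorded in the timeline, in seconds.
TTL_SECONDS = 172800


def cache_expiry(cached_at: datetime, ttl_seconds: int = TTL_SECONDS) -> datetime:
    """Return the latest time a resolver may keep serving a cached record."""
    return cached_at + timedelta(seconds=ttl_seconds)


# Hypothetical moment at which a resolver cached the record.
cached_at = datetime(2016, 7, 19, 18, 0)

print(timedelta(seconds=TTL_SECONDS).days)  # 2
print(cache_expiry(cached_at))              # 2016-07-21 18:00:00
```

Because each resolver caches at a different moment, expiry times are spread out, which is consistent with only some users being affected at any given time.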

Corrective Actions

  • process for updating Terraform pinned to the Habichat #operations room
  • clarify what should and shouldn’t be applied through Terraform for Habitat:
    • everything to do with AWS is applied through TF
    • Fastly’s TF provider doesn’t suit our needs
  • if there are changes that need to be made manually, escalate to the owner of the project.
    • there shouldn’t be manual changes.
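
One way to catch manual changes that never reached the Terraform repo is to run a scheduled plan and look at its exit code. This is a minimal sketch under stated assumptions, not part of the agreed corrective actions: `check_drift` and `interpret_plan_exit` are hypothetical helpers, and the working directory is whatever checkout holds the Habitat Terraform configuration. It relies on `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when the plan contains changes.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift"}


def interpret_plan_exit(code: int) -> str:
    """Map a plan exit code to a short status string."""
    return PLAN_STATUS.get(code, "unknown")


def check_drift(workdir: str) -> str:
    """Run a plan in workdir and report whether live state matches the repo."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit(result.returncode)
```

A "drift" result means the live infrastructure and the committed configuration disagree. That can be a manual change, as in this incident, or simply an unapplied commit, so the result is a prompt to investigate rather than a verdict.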

Link to meeting recording

Note: videos will soon move to YouTube; the following link will contain the latest:

A permanent YouTube link for the postmortem on 2016-07-22 is attached.