Postmortem for habitat.sh DNS issues on 2016-07-11

2016-07-11 - habitat.sh DNS issues

Meeting

  1. This is a blameless Post Mortem.
  2. We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
  3. All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow up item.

Incident Leader: Dave Parfitt (DP)

Description

habitat.sh DNS resolution issues, partial outage

Timeline

All times in UTC

  • 3:02 PM - sporadic reports that users can’t reach app.habitat.sh.

    • possibly ChefConf hotel wifi related
    • possibly Chef VPN related
  • 3:49 PM - incident declared

  • 3:52 PM - Joshua Timberman investigated Route53

  • 3:53 PM - The issue seems to affect people using resolvers other than Google DNS, such as FreeDNS

  • 3:58 PM - team decision: we’d like to resolve the issue as quickly as possible for ChefConf demos, possibly doing things manually for now. We’ll circle back and automate what we need via Terraform, etc.

  • 4:07 PM - DP updating @opscode_status + Tumblr

  • 4:08 PM - Josh Brand, Steven Danna, Nathan Smith discuss removing the DEPRECATED hosted zone in Route53

  • 4:34 PM - (Josh Brand) the nameserver records for Gandi are actually pointed at the Chef Secure zone, not the Habitat zones

  • 4:34 PM - (Josh Brand) no, Gandi doesn’t point to Chef Secure either

  • 4:40 PM - DEPRECATED hosted zone has been removed (Joshua)

  • 4:43 PM - deleting habitat.sh zone from chef-secure account (Josh Brand)

  • 4:46 PM - ad hoc Pingdom DNS test still fails

    • we later remove this test as Pingdom doesn’t seem to cover our failure case
  • 4:48 PM - team runs https://cachecheck.opendns.com/ to see failures from around the world.

  • 4:50 PM - Ben Rockwood: Gandi looks good, but now the NS records on the zone don’t match the real DNS servers

      NS should be:
      ns-580.awsdns-08.net
      ns-233.awsdns-29.com
      ns-1057.awsdns-04.org
      ns-1793.awsdns-32.co.uk
    
  • 4:52 PM - Josh Brand - NS records updated; however, the record has a 172800s (48-hour) TTL

  • 4:55 PM - OpenDNS check seems happy

  • 5:15 PM - we think the root DNS issue has been resolved, but it may take a while for the fix to propagate.

  • 5:15 PM - paging folks at ChefConf to check site availability

    • Ben Rockwood checking wifi outside of ChefConf hotel (Starbucks)
  • 5:33 PM - updating @opscode_status to declare the issue as resolved

  • 5:36 PM - incident closed
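
The 172800s TTL noted at 4:52 PM explains why the fix could take a while to propagate. As a rough worked example, the sketch below converts that TTL to hours and computes the latest time a resolver could keep serving a copy it cached just before the update. It assumes resolvers honor the full TTL and that the update landed at exactly 4:52 PM UTC; real resolvers may cap or shorten TTLs, so treat the result as an upper bound, not a measurement. The function name is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# TTL reported in the timeline entry at 4:52 PM.
NS_TTL_SECONDS = 172800


def worst_case_expiry(updated_at, ttl_seconds):
    """Latest time a resolver that cached the old record just before
    the update could still be serving it, assuming it honors the TTL."""
    return updated_at + timedelta(seconds=ttl_seconds)


# Assumption: the NS update took effect at 4:52 PM UTC on 2016-07-11.
updated_at = datetime(2016, 7, 11, 16, 52, tzinfo=timezone.utc)

ttl_hours = NS_TTL_SECONDS / 3600
expiry = worst_case_expiry(updated_at, NS_TTL_SECONDS)

print(f"TTL is {ttl_hours:g} hours")
print(f"Old records may linger until {expiry.isoformat()}")
```

Under these assumptions, stale answers could persist for up to two days after the fix, which is consistent with checking availability from several networks before closing the incident.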

Contributing Factor(s)

  • NS records in Route53 zone didn’t match those of the real DNS servers
  • There were multiple hosted zones in Route53, one of which was named DEPRECATED. While removing it may not have resolved the issue, it did help clarify the situation.
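
The first factor above comes down to a set comparison between the NS records stored in the zone and the name servers that actually serve it. A minimal sketch of that comparison follows. The expected name servers are the four listed in the timeline; the "zone" list is a made-up example of a drifted record set (the `example.org` names are hypothetical, not the values seen during the incident), and the helper names are illustrative. It does no DNS lookups; the inputs would have to come from the Route53 console or a DNS query.

```python
def normalize(name):
    """Lowercase a host name and drop the trailing dot so that
    'NS-580.awsdns-08.net.' and 'ns-580.awsdns-08.net' compare equal."""
    return name.strip().rstrip(".").lower()


def compare_ns(zone_ns, real_ns):
    """Return the name servers missing from the zone's NS record
    and the stale entries that should not be there."""
    zone = {normalize(n) for n in zone_ns}
    real = {normalize(n) for n in real_ns}
    return {"missing": sorted(real - zone), "stale": sorted(zone - real)}


# Name servers listed in the 4:50 PM timeline entry.
REAL_NS = [
    "ns-580.awsdns-08.net",
    "ns-233.awsdns-29.com",
    "ns-1057.awsdns-04.org",
    "ns-1793.awsdns-32.co.uk",
]

# Hypothetical drifted NS record, for illustration only.
zone_ns = [
    "NS-580.awsdns-08.net.",
    "ns-233.awsdns-29.com",
    "ns-old-1.example.org",
    "ns-old-2.example.org",
]

result = compare_ns(zone_ns, REAL_NS)
print("missing from zone:", result["missing"])
print("stale in zone:", result["stale"])
```

An empty result for both keys would mean the zone's NS record agrees with the real DNS servers.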

Stabilization Steps

  • removal of DEPRECATED Route53 hosted zone.
  • update Habitat Route53 zone to match real DNS servers

Impact

  • Unsure of the impact. DNS hasn’t been touched since the initial release of Habitat. We had a few mentions of DNS issues since the launch, but nothing that affected more than one person.

Corrective Actions

  • None at this time. The team has decided that the cost of enabling monitoring for a situation like the one described in this postmortem would exceed the benefit gained.

Link to meeting recording

Link to #incident discussion

https://habitat-sh.slack.com/archives/incident/p1468252193000138

Hello -

I’ve attached a permanent YouTube link to the postmortem video for 2016-07-11.