2016-07-11 - habitat.sh DNS issues
Meeting
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow up item.
Incident Leader: Dave Parfitt (DP)
Description
habitat.sh DNS resolution issues, partial outage
Timeline
All times in UTC
- 3:02 PM - sporadic reports that users can’t reach app.habitat.sh
  - possibly ChefConf hotel wifi related
  - possibly Chef VPN related
- 3:49 PM - incident declared
- 3:52 PM - Joshua Timberman investigated Route53
- 3:53 PM - The issue seems to affect people who use a resolver other than Google DNS, such as FreeDNS
- 3:58 PM - team decision: we’d like to resolve the issue as quickly as possible for ChefConf demos, possibly doing things manually for now. We’ll circle back and automate what we need via Terraform, etc.
- 4:07 PM - DP updating @opscode_status + Tumblr
- 4:08 PM - Josh Brand, Steven Danna, Nathan Smith discuss removing the DEPRECATED hosted zone in Route53
- 4:34 PM - (Josh Brand) the nameserver records for Gandi are actually pointed at the Chef Secure zone, not the Habitat zones
- 4:34 PM - (Josh Brand) no, Gandi doesn’t point to Chef Secure either
- 4:40 PM - DEPRECATED hosted zone has been removed (Joshua)
- 4:43 PM - deleting habitat.sh zone from chef-secure account (Josh Brand)
- 4:46 PM - ad hoc Pingdom DNS test still fails
  - we later remove this test as Pingdom doesn’t seem to cover our failure case
- 4:48 PM - team runs https://cachecheck.opendns.com/ to see failures from around the world.
- 4:50 PM - Ben Rockwood: Gandi looks good, but now the NS records on the zone don’t match the real DNS servers
  - NS should be: ns-580.awsdns-08.net, ns-233.awsdns-29.com, ns-1057.awsdns-04.org, ns-1793.awsdns-32.co.uk
- 4:52 PM - Josh Brand - NS records updated, however the record set has a 172800s TTL
- 4:55 PM - OpenDNS check seems happy
- 5:15 PM - we think the root DNS issue has been resolved, but it may take a while for the fix to propagate.
- 5:15 PM - paging folks at ChefConf to check site availability
  - Ben Rockwood checking wifi outside of ChefConf hotel (Starbucks)
- 5:33 PM - updating @opscode_status to declare the issue as resolved
- 5:36 PM - incident closed
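For a sense of what "may take a while to propagate" means with the 172800s TTL noted at 4:52 PM, here is a rough worked calculation. It assumes the worst case: a resolver that cached the old NS set just before the update and honors the full TTL. The date comes from the title of this PM; real resolvers may expire entries sooner.

```python
from datetime import datetime, timedelta, timezone

# TTL on the NS record set, as noted in the 4:52 PM timeline entry.
NS_TTL_SECONDS = 172800

# Time the NS records were updated (2016-07-11, 4:52 PM UTC per the timeline).
updated_at = datetime(2016, 7, 11, 16, 52, tzinfo=timezone.utc)

ttl = timedelta(seconds=NS_TTL_SECONDS)

# Assumption: worst case, the old NS set was cached right before the update
# and the resolver keeps it for the full TTL.
worst_case_expiry = updated_at + ttl

ttl_hours = ttl.total_seconds() / 3600
print(ttl_hours)                      # 48.0
print(worst_case_expiry.isoformat())  # 2016-07-13T16:52:00+00:00
```

In other words, some users could in principle have kept seeing the old records for up to two days after the fix, even though the checks we ran at the time looked healthy.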
Contributing Factor(s)
- NS records in the Route53 zone didn’t match the real DNS servers
- There were multiple hosted zones in Route53, one of which was named DEPRECATED. While removing it may not have resolved the issue, it did help clarify the situation.
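The first factor is a set comparison: the NS names the zone publishes versus the servers that actually host it. A minimal sketch of that comparison is below. The "real" servers are the four from the 4:50 PM timeline entry. The stale values are hypothetical placeholders, since the old records were not captured in this PM, and the function names are illustrative only.

```python
def normalize(name):
    """Lowercase a DNS name and drop any trailing root dot."""
    return name.strip().rstrip(".").lower()


def compare_ns(real_servers, zone_ns_records):
    """Return (missing_from_zone, unexpected_in_zone) as sorted lists."""
    real = {normalize(n) for n in real_servers}
    zone = {normalize(n) for n in zone_ns_records}
    return sorted(real - zone), sorted(zone - real)


# The servers the zone should list, from the 4:50 PM timeline entry.
REAL_SERVERS = [
    "ns-580.awsdns-08.net",
    "ns-233.awsdns-29.com",
    "ns-1057.awsdns-04.org",
    "ns-1793.awsdns-32.co.uk",
]

# Hypothetical stale records, for illustration only (not the actual old values).
STALE_ZONE_RECORDS = ["ns-old-1.example.", "ns-old-2.example."]

missing, unexpected = compare_ns(REAL_SERVERS, STALE_ZONE_RECORDS)
print("missing from zone:", missing)
print("unexpected in zone:", unexpected)
```

When both lists come back empty, the zone's NS record set agrees with the servers that host it, which is the state we reached after the 4:52 PM update.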
Stabilization Steps
- removal of DEPRECATED Route53 hosted zone
- update Habitat Route53 zone to match real DNS servers
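Because the failures depended on which resolver a user was on, we confirmed these steps by checking from several vantage points (the OpenDNS cache check, ChefConf wifi, Starbucks). The sketch below shows one way to ask a specific resolver directly using only the Python standard library. It is an ad hoc check, not monitoring. It reads only the response code and answer count from the header and does not parse the records themselves. The function names and defaults are assumptions made for illustration.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Encode a minimal DNS query (RFC 1035). qtype 1 = A, 2 = NS."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # Query type and class (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_header(response):
    """Return (rcode, answer_count) from the 12-byte DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def ask(resolver_ip, name, qtype=1, timeout=3.0):
    """Send one UDP query to resolver_ip and return (rcode, answer_count)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (resolver_ip, 53))
        response, _addr = sock.recvfrom(512)
    return parse_header(response)
```

For example, `ask("8.8.8.8", "app.habitat.sh")` queries Google DNS and `ask("208.67.222.222", "app.habitat.sh")` queries OpenDNS. An rcode of 0 with at least one answer indicates that resolver can resolve the name; a nonzero rcode or a timeout would match the kind of failure users reported.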
Impact
- The impact is unclear. DNS hasn’t been touched since the initial release of Habitat. We had a few mentions of DNS issues since the launch, but nothing that affected more than one person.
Corrective Actions
- None at this time. The team has decided that the cost to enable monitoring for a situation as described in this PM would exceed the benefit gained.
Link to meeting recording
Link to #incident discussion
https://habitat-sh.slack.com/archives/incident/p1468252193000138