2016-06-23 - Habitat app.habitat.sh depot upload/download errors
Meeting
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow up item.
Incident Leader: Dave Parfitt
Description
The Habitat team dealt with two issues during this incident:
- the Habitat depot was returning HTTP 503 and 504 errors on package download.
- the Habitat depot was returning an HTTP 503 after a large package upload.
Timeline
All times UTC.
- 5:02 PM: Adam Jacob declares the incident, Dave Parfitt is incident commander
  new incident: disk space full on the depot

```
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  7.2G  175M  98% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
```
- 5:06 PM: Adam Jacob creates a new 1.5 TB EBS volume
- 5:09 PM: Adam Jacob attaches the new volume to the depot server
- 5:12 PM: brief outage announced in the Habitat Slack #general channel:
  - nginx and director stopped
  - files copied from `/hab` to `/mnt/hab`
  - removed all files from `/hab`
  - updated fstab
  - unmounted `/mnt/hab`
  - mounted `/hab`
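The volume-swap steps above can be sketched as a shell sequence. This is a reconstruction, not the exact commands run: the device name `/dev/xvdf` matches the later df output, but the filesystem type, flags, and service commands are assumptions.

```shell
# Assumed reconstruction of the volume swap; verify device names before running.
mkfs -t ext4 /dev/xvdf        # format the new 1.5 TB EBS volume
mkdir -p /mnt/hab
mount /dev/xvdf /mnt/hab      # temporary mount point
# stop services that write to /hab (nginx, director)
cp -a /hab/. /mnt/hab/        # copy existing packages onto the new volume
rm -rf /hab/*                 # reclaim space on the root volume
echo '/dev/xvdf /hab ext4 defaults,nofail 0 2' >> /etc/fstab
umount /mnt/hab
mount /hab                    # remount via the new fstab entry
# restart nginx and the director
```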
- 5:15 PM: from an internal discussion with Jamie Winsor: "the aws instance resource in Terraform for the monolith doesn't have the ebs_block_device stanza that the original gateway has"
- 5:18 PM: successful login to app.habitat.sh
- 5:21 PM: successful package install via `hab pkg install core/ruby`
- 5:21 PM: disk space incident resolved:
```
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  3.0G  4.4G  41% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
/dev/xvdf       1.5T  4.9G  1.4T   1% /hab
```
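The Terraform gap mentioned at 5:15 PM could be closed with an `ebs_block_device` stanza on the monolith's `aws_instance` resource. This is a sketch only: the device name and size follow the df output above, while the volume type, `delete_on_termination` choice, and resource name are assumptions.

```hcl
resource "aws_instance" "monolith" {
  # ... existing instance configuration ...

  # Attach the depot's data volume so a rebuilt instance keeps its 1.5 TB disk.
  ebs_block_device {
    device_name           = "/dev/xvdf"
    volume_size           = 1500   # GB
    volume_type           = "gp2"
    delete_on_termination = false  # keep package data if the instance is replaced
  }
}
```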
- 5:22 PM: reports of upload errors on larger artifacts, unrelated to the disk space issue:

```
root@3f47518d40f6:/src/plans/results# hab pkg upload core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
» Uploading core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
→ Exists core/bash/4.3.42/20160612075613
→ Exists core/gcc-libs/5.2.0/20160612075020
→ Exists core/glibc/2.22/20160612063629
→ Exists core/jdk8/8u92/20160620143238
→ Exists core/linux-headers/4.3/20160612063537
→ Exists core/ncurses/6.0/20160612075116
→ Exists core/readline/6.3.8/20160612075601
↑ Uploading core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
83.00 MB / 83.00 MB \ [==========================================================================================================] 100.00 % 5.70 MB/s
Unexpected response from remote
✗✗✗
✗✗✗ 503 Service Unavailable
✗✗✗
```
- 5:30 PM: nginx/ELB seems healthy
- 5:50 PM: inspecting request/response data in Wireshark
- 5:53 PM: uploaded files appear on disk even though a 503 was returned
- 5:55 PM: nginx `keepalive_timeout` is set to `20s`; bumped to `60s`
- 5:58 PM: ELB `Idle timeout` is 60 seconds; set to 300 seconds
- 5:58 PM: changed nginx `keepalive_timeout` to `300s` to match the ELB
- 6:00 PM: nginx restarted; upload of a new jdk8 package to the `metadave` origin still failed
- 6:08 PM: confirmed that our HTTP upload responses from Hyper are correct
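The timeout changes at 5:55-5:58 PM amount to raising nginx's keepalive past the ELB idle timeout, then aligning both at 300 seconds. A sketch under assumptions (the config file location and load balancer name are placeholders):

```
# /etc/nginx/nginx.conf (http block) -- keepalive aligned with the ELB idle timeout
keepalive_timeout 300s;
```

The matching classic-ELB change can be made with the AWS CLI:

```shell
# Load balancer name is a placeholder.
aws elb modify-load-balancer-attributes \
  --load-balancer-name builder-api \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":300}}'
```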
- 6:12 PM: uploading directly to the ELB instead of through Fastly works:

```
[13][default:/src:0]# export HAB_DEPOT_URL="https://builder-api-690653005.us-west-2.elb.amazonaws.com/v1/depot"
[14][default:/src:0]# hab pkg upload ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
143.06 MB / 143.06 MB \ [========================================================================================================] 100.00 % 8.61 MB/s
✓ Uploaded metadave/jdk8/8u92/20160622180115
★ Upload of metadave/jdk8/8u92/20160622180115 complete.
```
- 6:45 PM: Fastly "between bytes" setting set to 5 minutes
- 7:10 PM: pinged Fastly support in IRC
- 7:13 PM: changed Fastly "connection time" to `300000` and "first byte" to `300000`
- 7:15 PM: successful upload with the new Fastly settings:

```
[28][default:/src:0]# time hab pkg upload ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
143.07 MB / 143.07 MB | [========================================================================================================================] 100.00 % 6.03 MB/s
✓ Uploaded metadave/jdk8/8u92/20160622190929
★ Upload of metadave/jdk8/8u92/20160622190929 complete.
```

- 7:17 PM: uploaded multiple large (140 MB) artifacts successfully
- 7:24 PM: uploaded several packages to try to determine exactly which setting fixed the issue
- 7:29 PM: upload success; the tweak to Fastly's "time to first byte" resolved the issue
- 7:41 PM: second incident resolved, incident closed
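The Fastly changes at 6:45 and 7:13 PM correspond to backend timeout settings, which can also be set through Fastly's API. A hedged sketch: the service ID, version, and backend name are placeholders, and the timeouts are expressed in milliseconds.

```shell
# Placeholders: $FASTLY_SERVICE, $VERSION (an unlocked version), and "origin" (backend name).
curl -s -X PUT \
  -H "Fastly-Key: $FASTLY_API_TOKEN" \
  "https://api.fastly.com/service/$FASTLY_SERVICE/version/$VERSION/backend/origin" \
  -d connect_timeout=300000 \
  -d first_byte_timeout=300000 \
  -d between_bytes_timeout=300000
```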
Contributing Factor(s)
- The depot filesystem had only 175 MB of free disk space, which prevented large file uploads and caused miscellaneous other errors in the builder-api service.
- Fastly wasn't configured for large file uploads; tweaking "time to first byte" in Fastly resolved the large-file upload issues.
- The depot server doesn't have 5xx monitoring.
Stabilization Steps
- Added 1.5TB of disk space to the depot.
- Set “time to first byte” in Fastly to 300000 milliseconds.
Impact
- Some uploads/downloads returned 503/504 errors over ~1 hour.
- Replacing the disk with a new volume caused a depot outage of ~6 minutes.
- Uploading large artifacts would result in a 503, though retrying the upload would eventually succeed. This has been an issue since Habitat was released.
Corrective Actions
- Add 5xx monitoring to Fastly, or something that tests the route through Fastly -> ELB -> EC2 (Dave Parfitt)
- Update Terraform: the monolith doesn't have the `ebs_block_device` stanza that the original gateway has (Joshua Timberman)
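As a starting point for the 5xx monitoring action item, a minimal probe could fetch a known depot route end-to-end through Fastly and alert on any 5xx status. This is a hypothetical sketch, not an existing check; the example URL and the alerting mechanism are placeholders.

```shell
#!/bin/sh
# Hypothetical end-to-end probe for the Fastly -> ELB -> EC2 route.

# True when the HTTP status code is in the 500-599 range.
is_5xx() {
  [ "$1" -ge 500 ] && [ "$1" -lt 600 ]
}

# Fetch a URL and report on its status; a 5xx is treated as a failure.
check() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1")
  if is_5xx "$code"; then
    echo "ALERT: $1 returned $code"
    return 1
  fi
  echo "OK: $1 returned $code"
}

# Example (assumed depot route, run from cron or a monitoring agent):
# check "https://app.habitat.sh/v1/depot/pkgs/core/ruby"
```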