2016-06-23 - Habitat app.habitat.sh depot upload/download errors
Meeting
- This is a blameless Post Mortem.
- We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
- All follow up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow up item.
Incident Leader: Dave Parfitt
Description
The Habitat team dealt with two issues during this incident:
- the Habitat depot was returning HTTP 503 and 504 errors on package download.
- the Habitat depot was returning an HTTP 503 after a large package upload.
Timeline
All times UTC.
- 5:02 PM: Adam Jacob declares the incident, Dave Parfitt is incident commander
  new incident: disk space full on the depot

```
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  7.2G  175M  98% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
```
- 5:06 PM: Adam Jacob creates a new 1.5 TB EBS volume
- 5:09 PM: Adam Jacob attaches the new volume to the depot server
- 5:12 PM: brief outage announced in the Habitat Slack #general channel:
  - nginx and director stopped
  - files copied from `/hab` to `/mnt/hab`
  - removed all files from `/hab`
  - updated fstab
  - unmounted `/mnt/hab`
  - mounted `/hab`
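The volume-swap steps above can be sketched as a shell sequence. This is a reconstruction, not the exact commands run: the device name `/dev/xvdf` matches the later df output, but the filesystem type, flags, and service commands are assumptions.

```shell
# Assumed reconstruction of the volume swap; verify device names before running.
mkfs -t ext4 /dev/xvdf        # format the new 1.5 TB EBS volume
mkdir -p /mnt/hab
mount /dev/xvdf /mnt/hab      # temporary mount point
# stop services that write to /hab (nginx, director)
cp -a /hab/. /mnt/hab/        # copy existing packages onto the new volume
rm -rf /hab/*                 # reclaim space on the root volume
echo '/dev/xvdf /hab ext4 defaults,nofail 0 2' >> /etc/fstab
umount /mnt/hab
mount /hab                    # remount via the new fstab entry
# restart nginx and the director
```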
- 5:15 PM: from an internal discussion with Jamie Winsor: "the aws instance resource in Terraform for the monolith doesn't have the ebs_block_device stanza that the original gateway has"
- 5:18 PM: successful login to app.habitat.sh
- 5:21 PM: successful package install via `hab pkg install core/ruby`
- 5:21 PM: disk space incident resolved:
```
ubuntu@ip-10-0-0-190:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/xvda1      7.8G  3.0G  4.4G  41% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           396M     0  396M   0% /run/user/1000
/dev/xvdf       1.5T  4.9G  1.4T   1% /hab
```
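The Terraform gap mentioned at 5:15 PM could be closed with an `ebs_block_device` stanza on the monolith's `aws_instance` resource. This is a sketch only: the device name and size follow the df output above, while the volume type, `delete_on_termination` choice, and resource name are assumptions.

```hcl
resource "aws_instance" "monolith" {
  # ... existing instance configuration ...

  # Attach the depot's data volume so a rebuilt instance keeps its 1.5 TB disk.
  ebs_block_device {
    device_name           = "/dev/xvdf"
    volume_size           = 1500   # GB
    volume_type           = "gp2"
    delete_on_termination = false  # keep package data if the instance is replaced
  }
}
```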
- 5:22 PM: reports of upload errors on larger artifacts, unrelated to the disk space issue:

```
root@3f47518d40f6:/src/plans/results# hab pkg upload core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
» Uploading core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
→ Exists core/bash/4.3.42/20160612075613
→ Exists core/gcc-libs/5.2.0/20160612075020
→ Exists core/glibc/2.22/20160612063629
→ Exists core/jdk8/8u92/20160620143238
→ Exists core/linux-headers/4.3/20160612063537
→ Exists core/ncurses/6.0/20160612075116
→ Exists core/readline/6.3.8/20160612075601
↑ Uploading core-jruby-9.1.2.0-20160622160900-x86_64-linux.hart
83.00 MB / 83.00 MB \ [==========================================================================================================] 100.00 % 5.70 MB/s
Unexpected response from remote
✗✗✗
✗✗✗ 503 Service Unavailable
✗✗✗
```
- 5:30 PM: nginx/ELB seems healthy
- 5:50 PM: inspecting request/response data in Wireshark
- 5:53 PM: uploaded files appear on disk even though a 503 was returned
- 5:55 PM: nginx `keepalive_timeout` is set to `20s`; bumped to `60s`
- 5:58 PM: ELB `Idle timeout` is 60 seconds; set to 300 seconds
- 5:58 PM: changed nginx `keepalive_timeout` to `300s` to match the ELB
- 6:00 PM: nginx restarted; upload of a new jdk8 package to the `metadave` origin still failed
- 6:08 PM: confirmed that our HTTP upload responses from Hyper are correct
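The timeout changes at 5:55-5:58 PM amount to raising nginx's keepalive past the ELB idle timeout, then aligning both at 300 seconds. A sketch under assumptions (the config file location and load balancer name are placeholders):

```
# /etc/nginx/nginx.conf (http block) -- keepalive aligned with the ELB idle timeout
keepalive_timeout 300s;
```

The matching classic-ELB change can be made with the AWS CLI:

```shell
# Load balancer name is a placeholder.
aws elb modify-load-balancer-attributes \
  --load-balancer-name builder-api \
  --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":300}}'
```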
- 6:12 PM: uploading directly to the ELB instead of through Fastly works:

```
[13][default:/src:0]# export HAB_DEPOT_URL="https://builder-api-690653005.us-west-2.elb.amazonaws.com/v1/depot"
[14][default:/src:0]# hab pkg upload ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622180115-x86_64-linux.hart
143.06 MB / 143.06 MB \ [========================================================================================================] 100.00 % 8.61 MB/s
✓ Uploaded metadave/jdk8/8u92/20160622180115
★ Upload of metadave/jdk8/8u92/20160622180115 complete.
```
- 6:45 PM: Fastly "between bytes" setting set to 5 minutes
- 7:10 PM: pinged Fastly support in IRC
- 7:13 PM: changed Fastly "connection time" to `300000` and "first byte" to `300000`
- 7:15 PM: successful upload with the new Fastly settings:

```
[28][default:/src:0]# time hab pkg upload ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
» Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
→ Exists core/glibc/2.22/20160612063629
→ Exists core/linux-headers/4.3/20160612063537
↑ Uploading ./results/metadave-jdk8-8u92-20160622190929-x86_64-linux.hart
143.07 MB / 143.07 MB | [========================================================================================================================] 100.00 % 6.03 MB/s
✓ Uploaded metadave/jdk8/8u92/20160622190929
★ Upload of metadave/jdk8/8u92/20160622190929 complete.
```

- 7:17 PM: uploaded multiple large (140 MB) artifacts successfully
- 7:24 PM: uploaded several packages to try to determine exactly which setting fixed the issue
- 7:29 PM: upload success; the tweak to Fastly's "time to first byte" resolved the issue
- 7:41 PM: second incident resolved, incident closed
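The Fastly changes at 6:45 and 7:13 PM correspond to backend timeout settings, which can also be set through Fastly's API. A hedged sketch: the service ID, version, and backend name are placeholders, and the timeouts are expressed in milliseconds.

```shell
# Placeholders: $FASTLY_SERVICE, $VERSION (an unlocked version), and "origin" (backend name).
curl -s -X PUT \
  -H "Fastly-Key: $FASTLY_API_TOKEN" \
  "https://api.fastly.com/service/$FASTLY_SERVICE/version/$VERSION/backend/origin" \
  -d connect_timeout=300000 \
  -d first_byte_timeout=300000 \
  -d between_bytes_timeout=300000
```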
Contributing Factor(s)
- The depot filesystem had only 175 MB of free disk space, which prevented large file uploads and caused miscellaneous other errors in the builder-api service.
- Fastly wasn't configured for large file uploads; tweaking "time to first byte" in Fastly resolved the large-file upload issues.
- The depot server doesn't have 5xx monitoring.
Stabilization Steps
- Added 1.5TB of disk space to the depot.
- Set “time to first byte” in Fastly to 300000 milliseconds.
Impact
- Some uploads/downloads returned 503/504 errors over ~1 hour.
- Replacing the disk with a new volume caused a depot outage of ~6 minutes.
- Uploading large artifacts would result in a 503, though retrying the upload would eventually succeed. This has been an issue since Habitat was released.
Corrective Actions
- Add 5xx monitoring to Fastly, or something that tests the route through Fastly -> ELB -> EC2 (Dave Parfitt)
- Update Terraform: the monolith doesn't have the `ebs_block_device` stanza that the original gateway has (Joshua Timberman)
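As a starting point for the 5xx monitoring action item, a minimal probe could fetch a known depot route end-to-end through Fastly and alert on any 5xx status. This is a hypothetical sketch, not an existing check; the example URL and the alerting mechanism are placeholders.

```shell
#!/bin/sh
# Hypothetical end-to-end probe for the Fastly -> ELB -> EC2 route.

# True when the HTTP status code is in the 500-599 range.
is_5xx() {
  [ "$1" -ge 500 ] && [ "$1" -lt 600 ]
}

# Fetch a URL and report on its status; a 5xx is treated as a failure.
check() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$1")
  if is_5xx "$code"; then
    echo "ALERT: $1 returned $code"
    return 1
  fi
  echo "OK: $1 returned $code"
}

# Example (assumed depot route, run from cron or a monitoring agent):
# check "https://app.habitat.sh/v1/depot/pkgs/core/ruby"
```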