Post Mortem of Berkshelf/Faraday issues on Oct 5


#1

Hello all,

On Oct 5, users of the Berkshelf gem saw an error with Faraday. We are conducting a live, public post mortem now, and I will include a link to the recording as soon as we are finished. If you’d like to watch it live, go to http://youtu.be/NcD7BPV4yUU


#2

The post mortem is complete. You can watch the video here: https://www.youtube.com/watch?v=NcD7BPV4yUU&feature=youtu.be

You can also see the document we were working on here (unfortunately, the recording did not capture my screen share for some reason…): https://gist.github.com/nellshamrell/967c162503efd2fdc9c4


#3

Hi Nell,
Thanks for having an open postmortem.
Is there a written set of notes one might be able to read?
Thanks!


#4

Mike,

The incident is written up in the gist that Nell linked: https://gist.github.com/nellshamrell/967c162503efd2fdc9c4

Is that what you’re looking for?

Thanks,
Nathen


#5

That is indeed what I was looking for. An edit to a post on Discourse doesn’t update the prior email in my inbox. :smile:


#6

Thanks for providing the link. The edit to the post doesn’t make it out to email.

Might I ask that the file be renamed to FILENAME.md? GitHub’s Gist renderer would then prettify it, and I wouldn’t have to side-scroll in the browser as much.


#7

One of the biggest concerns and questions I have coming out of this is that the reporting mechanism used relied on the reporting user (Noah) having a lot of context and personal connections that aren’t always going to be available during an incident. (https://twitter.com/kantrn/status/651187991688286208)

Is reaching out to an individual the best approach? Where / how else would this issue have been reported or raised if Noah hadn’t reached out directly to Nell?


#8

Given that this wasn’t in a Chef packaged version of Berks and is in more of the “best effort” class of support from the Berks team, I think it would be filed as a bug against berks or whatever repo. The reason it went to Nell was that Noah initially believed it was a configuration change on the public supermarket instance.


#9

Agreed, but that’s a statement filled with post-incident context.

Knowing only what was known at the time of the incident, would a tweet to @opscode_status have been more appropriate to get the issue raised? A support request?


#10

I think the primary route to reporting (suspected) issues with supermarket is to email support.