Blameless Postmortem Guideline

Intro

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
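To make the definition concrete, the parts of that record could be captured in a small data structure. This is only a sketch: the class names, the field names, the `owner` field and the `open_follow_ups` helper are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    """One follow-up action intended to prevent recurrence."""
    description: str
    owner: str          # hypothetical field: the team responsible for the follow-up
    done: bool = False


@dataclass
class Postmortem:
    """The elements named in the definition above, one field each."""
    incident: str                 # what happened
    impact: str                   # who or what was affected
    actions_taken: List[str]      # steps taken to mitigate or resolve
    root_causes: List[str]        # one or more root causes
    follow_ups: List[ActionItem] = field(default_factory=list)

    def open_follow_ups(self) -> List[ActionItem]:
        """Return the follow-up actions that are not yet completed."""
        return [item for item in self.follow_ups if not item.done]
```

Keeping follow-up actions as structured items rather than free text makes it easier to check later whether they were actually completed.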

Why Have Them

“Only by analyzing our shortcomings can we learn to do better”

With our large-scale, complex, distributed systems, it's inevitable that incidents and outages will occur. Left unchecked, incidents can multiply in complexity and overwhelm a system and its operators. Performing a postmortem shows a commitment to reducing technical debt in your solution and a will to improve.

They help with the following:

When to Have Them

Having a postmortem is not a punishment; it is a learning opportunity for the entire organization. The postmortem process does carry an inherent cost in time and effort, so be deliberate in choosing when to write one. However, certain triggers can be used to determine, at a minimum, when one should occur. Define your postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.
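One way to make the criteria explicit ahead of time is to write them down as a small, reviewable check. The sketch below is illustrative only: the trigger names loosely follow the examples in the Google SRE book chapter listed under References, while the `Incident` fields, both threshold values and the function names are hypothetical and should be replaced with whatever your team agrees on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    user_visible_minutes: int     # minutes of user-visible downtime or degradation
    data_lost: bool               # any data loss at all
    oncall_intervened: bool       # e.g. a rollback or traffic rerouting was needed
    minutes_to_resolve: int
    detected_by_monitoring: bool  # False means a person noticed it first


# Hypothetical thresholds; agree on real values before an incident occurs.
MAX_USER_VISIBLE_MINUTES = 15
MAX_MINUTES_TO_RESOLVE = 60


def postmortem_triggers(incident: Incident) -> List[str]:
    """Return the name of every trigger that the incident meets."""
    checks = {
        "user_visible_impact": incident.user_visible_minutes > MAX_USER_VISIBLE_MINUTES,
        "data_loss": incident.data_lost,
        "oncall_intervention": incident.oncall_intervened,
        "slow_resolution": incident.minutes_to_resolve > MAX_MINUTES_TO_RESOLVE,
        "monitoring_failure": not incident.detected_by_monitoring,
    }
    return [name for name, met in checks.items() if met]


def needs_postmortem(incident: Incident) -> bool:
    """A postmortem is required as soon as any single trigger is met."""
    return bool(postmortem_triggers(incident))
```

Returning the list of triggers that were met, rather than only a yes or no, lets the postmortem state why it was opened. Teams remain free to write one for incidents that meet no trigger.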

Components of a Postmortem

Planning

Meeting

Documenting

Documenting the postmortem contributes to the knowledge base and allows us to share the lessons learned. Key contents include:

Review

Publication

Postmortem Templates and Samples

Google’s Postmortem Example

Do’s

Don’ts

Templates & Tools

Google’s Postmortem Example

Etsy Morgue

References

https://en.wikipedia.org/wiki/Postmortem_documentation

https://sre.google/sre-book/postmortem-culture/

https://sre.google/workbook/postmortem-analysis/

https://www.freecodecamp.org/news/what-is-a-software-post-mortem/