Anya Petrova

Vancouver, BC · she/her

A short biography

I was born in 1990 in Toronto, to a mother who taught high-school physics and a father who fixed photocopiers for a living. The photocopier matters more than the physics, although I would not have said that at twenty.

I went to UBC for computer science between 2008 and 2012, mostly because my high-school guidance counsellor told me that people like me should consider it, and because I was, then, easily flattered. I left with a degree, an enormous amount of debt, and almost no idea what a real engineering job actually looked like.

My first job was at a streaming company in 2013, on what was then called operations and is now called platform. My second was at a payments company between 2016 and 2019, where I learned, expensively, what PCI compliance meant and why my mother stopped asking me what I did at work. Since 2020 I have been at a fintech you have probably used, in some form of site reliability or platform role. I am not going to name it here, because I would like to keep being able to tell the truth in these essays.

In total: about eleven years of carrying a pager, four companies, three on-call rotations I would describe as brutal, and one that I would describe as humane and rare. The humane one is the one I am still on.

What this publication is

This is a personal essay column about reliability engineering, written for other practitioners and the people who manage them. New essays land on most Tuesdays. Sometimes I miss a Tuesday. Sometimes a piece takes three weeks because I am still arguing with it.

The pieces are longer than is fashionable. I do not write listicles, I do not produce checklists you could have found elsewhere, and I try very hard not to write anything I could have generated by stringing the right LinkedIn nouns together.

If you want a TL;DR, the publication has, broadly, three concerns.

The first is the cost of on-call as an unmeasured externality. Our industry has decided that the cost of a 3 a.m. page lives somewhere other than on a budget sheet, and I think we are quietly paying for it in burnout, in resignations, in marriages, and in a generation of engineers who have learned that the right answer to "how are you?" is always "fine, just tired."

The second is the gap between the document and the system. The runbook that no longer matches reality. The post-mortem that says all the right words and fires someone anyway. The architecture diagram that is six years out of date and is still, somehow, the one in the onboarding deck. I am interested in this gap because it is where most of our actual incidents live.

The third is the slow craft. Reliability is not, despite the industry's affection for war stories, a series of dramatic saves. It is a slow, quiet, mostly invisible practice of writing things down, deleting things that should not exist, and giving the next on-call engineer a slightly easier shift than you had. I am writing this publication because I think we do not talk about that part enough.

What I actually use

People ask me this often enough that I will save us all some time.

Thing | What I use, as of May 2026
Observability | Honeycomb for tracing, Grafana + Mimir for metrics, Loki for logs. We left Datadog in 2024 over the bill.
On-call | PagerDuty. I have tried the competitors. PagerDuty is fine.
Incident management | Incident.io for the workflow, a private Slack channel for the bridge.
Infrastructure | A Kubernetes cluster I did not pick, and would not pick again. AWS. Terraform.
Notes | Obsidian, with a sync vault that holds eleven years of incident notes.
Writing | iA Writer, on a 2021 MacBook Pro, on a desk that faces a window.

A few things I believe, with some confidence

  • A page that does not require a human action is not a page; it is an apology to a graph.
  • The runbook is for the version of you that is barely conscious. Write accordingly.
  • Most blameless post-mortems are blameless in language and brutal in consequence. Watch what gets done, not what gets said.
  • Resilience is a verb. Robustness is a noun. The industry uses them interchangeably and shouldn't.
  • Boring infrastructure is a moral good. The best incident is one you do not have to write about.
  • The two-day rest after a brutal rotation is not generosity. It is engineering.

Outside of work

I run ultramarathons, badly. I keep a pager that has, between 2017 and now, gone off during a wedding (mine), a funeral (my grandmother's), and a llama trek in the Sacred Valley near Cusco (a story for another essay). I read more fiction than is reasonable. I do not drink coffee after 2 p.m., not because I am virtuous, but because at thirty-six the body keeps a more honest set of books than it used to.

I live with my partner J., who is a librarian, and an elderly dog named Modem who was named, against my better judgement, after a piece of equipment I was actively replacing in 2019. The name has stuck. He answers to it. We are all making do.

Get in touch

I read every email that lands at anya@logic-loom.dev. I will not always reply quickly, but I will read it. The best disagreements I have had on this publication have come in over email, and a few have become essays of their own (with permission, and a small payment of dignity to the person who pushed me to think harder).

If you want to subscribe, the form lives on the homepage. If you want to support the work, there is a tip jar on every essay. If you want to argue with something I have written, please do — I would rather be sharpened than agreed with.

Anya