Solution · Incidents
Incidents that close themselves.
An incident opens when probes cross the failure threshold and resolves when a fast recovery probe catches the first healthy response. Acknowledgement, notes, and deploy correlation live alongside it. No pasting from chat.
“The graph spiked at 02:11. The deploy landed at 02:09. Engager said so before we did.”
From a postmortem template you'll never have to write again
The lifecycle
Open. Acknowledge. Resolve. Replay.
Every incident moves through four states. Each state is recorded with a timestamp, an actor, and an audit row. You can move backwards if you got it wrong. The system never traps you in a misclick.
Acknowledgement does not close the incident. Resolution does. The two are independent because they answer two different questions.
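One way to picture that trail is as an append-only log. This is a minimal sketch, not Engager's actual data model: the names `Step`, `AuditRow`, and `Incident` are hypothetical, and it assumes that "moving backwards" means recording an earlier step again rather than deleting a row.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Step(Enum):
    # The four lifecycle steps named above.
    OPEN = "open"
    ACKNOWLEDGE = "acknowledge"
    RESOLVE = "resolve"
    REPLAY = "replay"


@dataclass(frozen=True)
class AuditRow:
    # One row per transition: what happened, who did it, and when.
    step: Step
    actor: str
    at: datetime


@dataclass
class Incident:
    host: str
    trail: List[AuditRow] = field(default_factory=list)

    def record(self, step: Step, actor: str, at: Optional[datetime] = None) -> AuditRow:
        # Append-only: a misclick is corrected by recording another step,
        # never by editing or removing an earlier row (an assumption here).
        row = AuditRow(step, actor, at or datetime.now(timezone.utc))
        self.trail.append(row)
        return row

    @property
    def is_acknowledged(self) -> bool:
        # Acknowledgement answers "has someone seen this?".
        return any(row.step is Step.ACKNOWLEDGE for row in self.trail)

    @property
    def is_closed(self) -> bool:
        # Resolution answers "is it over?". Only the latest OPEN or RESOLVE
        # row decides; ACKNOWLEDGE and REPLAY rows never close anything.
        for row in reversed(self.trail):
            if row.step in (Step.OPEN, Step.RESOLVE):
                return row.step is Step.RESOLVE
        return False
```

In this sketch, acknowledging an open incident leaves `is_closed` false, and recording `Step.OPEN` after a mistaken `Step.RESOLVE` reopens the incident while keeping both rows in the trail.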
Engager <reports@team.realmrook.com>
api.rookhq.com is unavailable
Detected 02:11 IST. No response.
api.rookhq.com stopped responding at 02:11 IST. Response: connection refused. Observed on 3 consecutive checks.
Possibly related
Deploy 4f2c1a8b by AravAVR · 2 minutes earlier.
Tighten CSP for marketing pages
Deploy correlation
The deploy that probably broke it, surfaced before you ask.
If a GitHub push from a tracked repository lands within fifteen minutes of an incident opening, Engager pins the commit, the author, and the message to the incident.
You never have to ask “what changed”. The answer is in the email body and the dashboard row.
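The matching rule can be sketched in a few lines. This is an illustration under stated assumptions, not Engager's implementation: `Deploy`, `correlate`, and `minutes_elapsed` are hypothetical names, the window is read as "up to fifteen minutes before the incident opened", and when several pushes qualify the most recent one is chosen.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Fifteen minutes, as stated above.
WINDOW = timedelta(minutes=15)


@dataclass(frozen=True)
class Deploy:
    sha: str
    author: str
    message: str
    pushed_at: datetime


def correlate(opened_at: datetime,
              deploys: Iterable[Deploy],
              window: timedelta = WINDOW) -> Optional[Deploy]:
    """Return the latest deploy pushed no more than `window` before the incident."""
    candidates = [
        d for d in deploys
        # Assumption: only pushes at or before the incident opening count.
        if timedelta(0) <= opened_at - d.pushed_at <= window
    ]
    # Assumption: the most recent qualifying push is the likeliest cause.
    return max(candidates, key=lambda d: d.pushed_at, default=None)


def minutes_elapsed(opened_at: datetime, deploy: Deploy) -> int:
    """Whole minutes between the push and the incident opening."""
    return int((opened_at - deploy.pushed_at).total_seconds() // 60)
```

With the sample email above, a push at 02:09 and an incident at 02:11 fall inside the window and `minutes_elapsed` gives 2. A push more than fifteen minutes earlier would return `None`.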
Deploys vs incidents
Overlap in the last 24 hours
Inside an incident
Everything you need on one row.
Severity
Critical, warning, info. Defaults are sensible. Override per incident if your context changes.
Opened, acknowledged, resolved
Three timestamps, three actors, one durable trail. Replay any of them at any time.
Notes
Markdown body for the on-call writeup. Visible in the next status digest. Compounding documentation, not a Slack copy.
Deploy correlation
Linked GitHub commit, author, minutes elapsed. Confirm or dismiss with one click.
Sample probes
Last twenty failing requests with status, latency, and cause. The forensics you would otherwise paste from a terminal.
Replay
Trigger the alert again to a single channel without resending the rest. Useful when a bot dropped the message.
What never happens
No flapping. No twin tickets. No silent recoveries. No buried context.
No flapping
The threshold counts N consecutive failures, not single-ping noise. A cold-start guard waits through the recovery window.
No twin tickets
A second outage on the same host within the cooldown extends the existing incident instead of opening a duplicate.
No silent recoveries
A healthy probe always emits the recovery event. The incident closes, the email lands, the audit row lands.
No buried context
The acknowledgement note shows up inline on the next status report. The forensics travel with the incident.
Threshold · cooldown · recovery
Three guards, zero false alarms
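How the three guards interact can be sketched as a small per-host state machine. Everything here is an assumption for illustration: the `HostGuard` class, the event names, the default of 3 consecutive failures, and the 300-second cooldown are hypothetical, and the cold-start guard is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostGuard:
    threshold: int = 3        # assumed N consecutive failures before opening
    cooldown: float = 300.0   # assumed seconds during which a relapse extends
    failures: int = 0
    is_open: bool = False
    last_resolved: Optional[float] = None
    incident_id: int = 0      # counts distinct incidents, not relapses

    def observe(self, healthy: bool, now: float) -> Optional[str]:
        """Feed one probe result; return an event name or None."""
        if healthy:
            self.failures = 0
            if self.is_open:
                # Recovery guard: the first healthy probe always emits an event.
                self.is_open = False
                self.last_resolved = now
                return "recovered"
            return None

        self.failures += 1
        # Threshold guard: a lone failed ping never opens anything.
        if self.is_open or self.failures < self.threshold:
            return None

        self.is_open = True
        recently_resolved = (
            self.last_resolved is not None
            and now - self.last_resolved <= self.cooldown
        )
        if recently_resolved:
            # Cooldown guard: a relapse extends the existing incident.
            return "extended"
        self.incident_id += 1
        return "opened"
```

Fed three failed probes in a row, the sketch emits `"opened"` on the third. A healthy probe then emits `"recovered"`, and three more failures inside the cooldown emit `"extended"` while `incident_id` stays the same.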
Stop pasting outages into Slack.
Engager owns the lifecycle. You own the postmortem.