Solution · Incidents

Incidents that close themselves.

An incident opens when probes cross the failure threshold and resolves when a fast recovery probe catches the first healthy response. Acknowledgement, notes, and deploy correlation live alongside it. No pasting from chat.

“The graph spiked at 02:11. The deploy landed at 02:09. Engager said so before we did.”

From a postmortem template you'll never have to write again

The lifecycle

Open. Acknowledge. Resolve. Replay.

Every incident moves through four states. Each state is recorded with a timestamp, an actor, and an audit row. You can move backwards if you got it wrong. The system never traps you in a misclick.

Acknowledgement does not close the incident. Resolution does. The two are independent because they answer two different questions.
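
For readers who think in code, here is a minimal sketch of that lifecycle. It is an illustration under stated assumptions, not Engager's implementation: the names `Step`, `AuditRow`, `Incident`, and `record` are hypothetical. It shows that every step writes an audit row with a timestamp and an actor, that acknowledgement and resolution are tracked independently, and that a step can be undone.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Step(Enum):
    # The four lifecycle steps named on this page.
    OPEN = "open"
    ACKNOWLEDGE = "acknowledge"
    RESOLVE = "resolve"
    REPLAY = "replay"


@dataclass(frozen=True)
class AuditRow:
    step: Step
    actor: str
    at: datetime
    undo: bool = False  # True when this row reverses an earlier step


@dataclass
class Incident:
    host: str
    acknowledged: bool = False  # "has someone seen it?"
    resolved: bool = False      # "is it over?"
    audit: List[AuditRow] = field(default_factory=list)

    def record(self, step: Step, actor: str,
               at: Optional[datetime] = None, undo: bool = False) -> None:
        """Apply a step (or undo it) and always append an audit row."""
        at = at or datetime.now(timezone.utc)
        if step is Step.ACKNOWLEDGE:
            self.acknowledged = not undo
        elif step is Step.RESOLVE:
            self.resolved = not undo
        # OPEN and REPLAY change no flags here; they are only logged.
        self.audit.append(AuditRow(step, actor, at, undo))
```

Keeping `acknowledged` and `resolved` as separate flags, rather than one linear status, is the design choice that lets an incident be acknowledged without being closed, and lets a mistaken resolve be reversed without losing the trail.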

Engager <reports@team.realmrook.com>

Engager · Outage

api.rookhq.com is unavailable

Detected 02:11 IST. No response.

api.rookhq.com stopped responding at 02:11 IST. Response: connection refused. Observed on 3 consecutive checks.

Possibly related

Deploy 4f2c1a8b by AravAVR · 2 minutes earlier.

Tighten CSP for marketing pages

Open api.rookhq.com

Deploy correlation

The deploy that probably broke it, surfaced before you ask.

If a GitHub push from a tracked repository lands within fifteen minutes of an incident opening, Engager pins the commit, the author, and the message to the incident.

You never have to ask “what changed”. The answer is in the email body and the dashboard row.
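
The fifteen-minute rule can be sketched in a few lines. This is a rough illustration, not Engager's code: `Deploy`, `correlate`, and `minutes_elapsed` are hypothetical names, and the sketch assumes only pushes at or before the incident opening count and that the most recent one wins. The dates in the usage lines are arbitrary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

WINDOW = timedelta(minutes=15)  # the window stated on this page


@dataclass(frozen=True)
class Deploy:
    sha: str
    author: str
    message: str
    pushed_at: datetime


def correlate(opened_at: datetime, deploys: Iterable[Deploy],
              window: timedelta = WINDOW) -> Optional[Deploy]:
    """Return the latest deploy pushed within `window` before the incident opened."""
    candidates = [
        d for d in deploys
        if timedelta(0) <= opened_at - d.pushed_at <= window
    ]
    return max(candidates, key=lambda d: d.pushed_at, default=None)


def minutes_elapsed(opened_at: datetime, deploy: Deploy) -> int:
    """Whole minutes between the push and the incident opening."""
    return int((opened_at - deploy.pushed_at).total_seconds() // 60)


# Usage, mirroring the sample email above (arbitrary date).
opened = datetime(2024, 1, 1, 2, 11)
pushes = [
    Deploy("4f2c1a8b", "AravAVR", "Tighten CSP for marketing pages",
           datetime(2024, 1, 1, 2, 9)),
]
suspect = correlate(opened, pushes)
```

The result is a candidate, not a verdict, which is why the incident row still asks you to confirm or dismiss it.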

Deploys vs incidents

Overlap over the last 24 hours

Inside an incident

Everything you need on one row.

Severity

Critical, warning, info. Defaults are sensible. Override per incident if your context changes.

Opened, acknowledged, resolved

Three timestamps, three actors, one durable trail. Replay any of them at any time.

Notes

Markdown body for the on-call writeup. Visible in the next status digest. Compounding documentation, not a Slack copy.

Deploy correlation

Linked GitHub commit, author, minutes elapsed. Confirm or dismiss with one click.

Sample probes

Last twenty failing requests with status, latency, and cause. The forensics you would otherwise paste from a terminal.

Replay

Trigger the alert again to a single channel without resending the rest. Useful when a bot dropped the message.

What never happens

No flapping. No twin tickets. No silent recoveries. No buried context.

  • No flapping

    The threshold counts N consecutive failures, not single-ping noise. A cold-start guard waits through the recovery window.

  • No twin tickets

    A second outage on the same host within the cooldown extends the existing incident and never duplicates it.

  • No silent recoveries

    A healthy probe always emits the recovery event. The incident closes, the email lands, the audit row lands.

  • No buried context

    The acknowledgement note shows up inline on the next status report. The forensics travel with the incident.
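
The first three guarantees can be sketched as one small per-host state machine. This is a simplified illustration under assumptions, not Engager's implementation: `HostMonitor`, its fields, and the 300-second cooldown are hypothetical, the default threshold of 3 mirrors the "3 consecutive checks" in the sample email, and the cold-start guard is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HostMonitor:
    threshold: int = 3        # N consecutive failures before an incident opens
    cooldown: float = 300.0   # seconds; hypothetical value
    failures: int = 0
    incident_open: bool = False
    last_resolved_at: Optional[float] = None
    incident_count: int = 0
    events: List[str] = field(default_factory=list)

    def observe(self, healthy: bool, now: float) -> None:
        """Feed one probe result; `now` is a timestamp in seconds."""
        if healthy:
            self.failures = 0
            if self.incident_open:
                # No silent recoveries: a healthy probe always emits the event.
                self.incident_open = False
                self.last_resolved_at = now
                self.events.append("recovered")
            return

        self.failures += 1
        # No flapping: ignore failures below the threshold or during an open incident.
        if self.incident_open or self.failures < self.threshold:
            return

        self.incident_open = True
        in_cooldown = (
            self.last_resolved_at is not None
            and now - self.last_resolved_at <= self.cooldown
        )
        if in_cooldown:
            # No twin tickets: reopen the existing incident instead of creating one.
            self.events.append("extended")
        else:
            self.incident_count += 1
            self.events.append("opened")
```

Feeding it three failures, one healthy probe, and three more failures inside the cooldown yields the events `opened`, `recovered`, `extended`, with `incident_count` still at 1.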

Threshold · cooldown · recovery

Three guards, zero false alarms

Stop pasting outages into Slack.

Engager owns the lifecycle. You own the postmortem.