Solution · Performance
P99 you can trust enough to alert on.
Latency is lognormal. A spike on top of a stable baseline matters. A spike during a known cold boot does not. Engager understands the difference, so the people on call do not have to.
24h · Rolling baseline · Per host, per region
1.5× · Anomaly threshold · Configurable per host
3 · Sustained samples · Single spikes never page
200ms · Floor · Below floor, never anomalous
The math
Percentile, not mean.
On a lognormal distribution the long tail pulls the mean upward, so the average turns misleading the moment one outlier lands. Engager keeps the full distribution and reads off P95 and P99 directly.
Reports show both the mean and the percentiles. Alerts evaluate against P99. The mean is there for context, not for paging.
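A minimal sketch of the difference, assuming a nearest-rank percentile over raw samples. The text does not say which percentile method Engager uses, and the function name and sample values below are hypothetical.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that has at least
    p percent of all samples at or below it. (Assumed method.)"""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in ms: 99 ordinary requests, one slow outlier.
latencies_ms = [100.0] * 99 + [5000.0]

mean_ms = statistics.fmean(latencies_ms)  # 149.0, dragged up by one sample
p95_ms = percentile(latencies_ms, 95)     # 100.0
p99_ms = percentile(latencies_ms, 99)     # 100.0
```

One outlier moves the mean by almost half. The percentiles only move when enough of the tail moves with them.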
api.rookhq.com
Response time · last 24h
P95 · 412 ms
P99 · 520 ms
Baseline P95 · 220 ms
Solid line · live P95. Dashed baseline · 24h rolling P95. Filled red · anomalous samples.
Anomaly detection
A spike on a stable baseline means more than a spike on noise.
Engager remembers the rolling P99 baseline for the last twenty-four hours, per host, per region. A current ping qualifies as anomalous when it exceeds both 1.5× the baseline and the 200 ms absolute floor.
And it has to happen three times in a row. Single bumps are noise; sustained drift is signal.
Three samples, one anomaly
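The rule can be sketched in a few lines. This is an illustration under assumptions, not Engager's implementation: the names are hypothetical, "exceeds" is read as strictly greater than, and the defaults are the 1.5×, 200 ms, and three-sample values stated on this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnomalyRule:
    threshold: float = 1.5   # multiple of the rolling baseline
    floor_ms: float = 200.0  # below the floor, never anomalous
    sustained: int = 3       # consecutive anomalous samples required

def sample_is_anomalous(latency_ms, baseline_ms, rule):
    """A sample must clear both the relative threshold and the floor."""
    return (latency_ms > rule.threshold * baseline_ms
            and latency_ms > rule.floor_ms)

def should_alert(recent_ms, baseline_ms, rule=AnomalyRule()):
    """True only when the last `rule.sustained` samples are all anomalous."""
    if len(recent_ms) < rule.sustained:
        return False
    tail = recent_ms[-rule.sustained:]
    return all(sample_is_anomalous(x, baseline_ms, rule) for x in tail)

# Hypothetical samples in ms.
single_spike = should_alert([210, 900, 215], baseline_ms=220)  # one bump
sustained = should_alert([400, 450, 500], baseline_ms=220)     # three in a row
under_floor = should_alert([180, 180, 180], baseline_ms=100)   # 1.8x, but < 200 ms
```

Only `sustained` comes back true. The single spike fails the three-in-a-row test, and the fast host never clears the floor.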
Cold start aware
A free-tier wake-up is not an outage.
Render free-tier instances take fifteen to thirty seconds to wake. Engager stages re-verification at ten, twenty-five, and fifty-five seconds before declaring a failure. The same staging applies to the latency budget.
The result: an instance that is still booting is never paged as slow.
Stage 1 · 10s · Stage 2 · 25s · Stage 3 · 55s
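A rough sketch of the staging, assuming the three offsets are measured from the first failed check and that any successful re-check cancels the failure. `probe` and `confirm_failure` are hypothetical names; the sleep function is injectable so the sketch can be exercised without waiting.

```python
import time
from typing import Callable, Sequence

# Seconds after the first failed check, taken from the stages above.
STAGE_OFFSETS_S = (10, 25, 55)

def confirm_failure(
    probe: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    offsets: Sequence[float] = STAGE_OFFSETS_S,
) -> bool:
    """Re-check the host at each stage.

    Returns True only if the probe still fails at every stage.
    Returns False as soon as one re-check succeeds (the host woke up).
    """
    elapsed = 0.0
    for offset in offsets:
        sleep(offset - elapsed)  # wait until this stage's offset
        elapsed = offset
        if probe():
            return False
    return True
```

With a host that wakes in fifteen to thirty seconds, the first re-check at 10s may still fail, and the one at 25s or 55s clears it.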
What lands in the report
Every section is a number you can defend.
Average
For context only. Reports show it next to the percentile pair so the difference is obvious.
P95 and P99
Computed from the live distribution, not from bucketed approximations.
Anomaly count
How many sustained anomalous windows fired in the last twenty-four hours.
Min and max
The bookends. Useful when a sample run is an obvious outlier.
Protocol
HTTP/2 by default, HTTP/3 when offered. Downgrades surface as a security event.
Deploy correlation
When latency drifts on a host within fifteen minutes of a tracked GitHub push, the commit is tagged.
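Deploy correlation comes down to a time-window match. A minimal sketch, assuming the drift has to follow the push and that pushes arrive as (commit SHA, timestamp) pairs. The function name, the SHAs, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)  # from the text above

def commits_near_drift(drift_at, pushes, window=WINDOW):
    """Return the SHAs of pushes that landed at most `window` before the drift."""
    return [
        sha
        for sha, pushed_at in pushes
        if timedelta(0) <= drift_at - pushed_at <= window
    ]

# Hypothetical data: drift detected at noon UTC.
drift_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
pushes = [
    ("a1f3", drift_at - timedelta(minutes=5)),   # inside the window
    ("b2c4", drift_at - timedelta(minutes=40)),  # too old
    ("c3d5", drift_at + timedelta(minutes=2)),   # pushed after the drift
]
tagged = commits_near_drift(drift_at, pushes)  # ["a1f3"]
```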