
Policies

Policies turn events into alert conditions.

  • Policies API: See the exact policy routes, allowed enum values, request bodies, and lifecycle behavior.
  • Alert Runtime: Follow the path from a policy definition to triggered alerts, deliveries, and operator actions.

Core inputs

| Field | Values | Description |
| --- | --- | --- |
| event_id | UUID | Which event definition to monitor |
| title | string | Human-readable label for the policy |
| channel | string | Routing namespace (e.g., default, ops, market) |
| period | minute, hour, day | Rollup window to evaluate |
| aggregate | sum, count, average, min, max, p95_est, p99_est | How to reduce the window |
| condition | gt, lt, eq, gte, lte | Comparison operator |
| threshold | number | Value to compare against |
| severity | info, warning, error, critical | Alert severity |
| enabled | boolean | Whether the policy is active |
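As an illustration of how these fields fit together, here is a minimal sketch of a policy body with a local check of the enum fields. The field names and allowed values come from the table above; the `validate_policy` helper, the placeholder UUID, and the example title are hypothetical and not part of the product.

```python
import json

# Allowed values copied from the Core inputs table.
ALLOWED = {
    "period": {"minute", "hour", "day"},
    "aggregate": {"sum", "count", "average", "min", "max", "p95_est", "p99_est"},
    "condition": {"gt", "lt", "eq", "gte", "lte"},
    "severity": {"info", "warning", "error", "critical"},
}


def validate_policy(policy: dict) -> list:
    """Hypothetical local check: return a list of problems (empty means OK)."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = policy.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    return problems


policy = {
    "event_id": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
    "title": "Tool errors above 50 per hour",
    "channel": "ops",
    "period": "hour",
    "aggregate": "sum",
    "condition": "gt",
    "threshold": 50,
    "severity": "error",
    "enabled": True,
}

print(validate_policy(policy))
print(json.dumps(policy, indent=2))
```

The Policies API remains the authority on which fields are required and how they are validated; this check only catches misspelled enum values before a request is sent.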

Choosing the right aggregate

The correct aggregate depends on what the raw event value represents. Picking the wrong aggregate leads to alerts that never fire or fire on noise.

Count and occurrence events

These are events such as tool.errors.count, api.requests.count, and job.retry.count, where each log has value: 1 (or a small integer meaning “this happened N times”).

| Aggregate | When to use | Example policy |
| --- | --- | --- |
| sum | Total occurrences in the period | “Alert if error count exceeds 50 per hour” |
| count | Number of logs (not value-weighted) | “Alert if more than 100 error logs per hour” |

For most count events, sum is the right choice.
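The difference between the two aggregates shows up as soon as a single log reports more than one occurrence. A small sketch with made-up values for one hourly window:

```python
# Three hypothetical logs for a count-style event in one hourly window.
# Each value means "this happened N times".
values = [1, 1, 5]

total_occurrences = sum(values)  # what a `sum` aggregate reduces to
number_of_logs = len(values)     # what a `count` aggregate reduces to

print(total_occurrences)  # 7
print(number_of_logs)     # 3
```

With a threshold of 5, a sum policy would see 7 and a count policy would see 3, so only the sum policy reflects the batched log.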

Latency and duration events

These are events such as task.duration_ms and api.response.duration_ms, where each log carries a timing measurement.

| Aggregate | When to use | Example policy |
| --- | --- | --- |
| average | Typical performance | “Alert if average response time exceeds 500ms per hour” |
| p95_est | Tail latency affecting many users | “Alert if p95 latency exceeds 2s per day” |
| p99_est | Worst-case tail latency | “Alert if p99 latency exceeds 5s per day” |
| max | Absolute worst case | “Alert if any request exceeds 10s per hour” |

For latency, start with average or p95_est. Use p99_est or max for strict SLA monitoring.
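To see how these aggregates react to a single slow request, consider the sketch below. The sample durations are made up, and the `nearest_rank` helper is a hypothetical stand-in: it computes an exact nearest-rank percentile, whereas p95_est and p99_est are estimates whose method is not described on this page.

```python
def nearest_rank(values, pct):
    """Exact nearest-rank percentile (a stand-in for the *_est aggregates)."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# 19 fast requests (120..138 ms) and one 4-second outlier.
durations_ms = [120 + i for i in range(19)] + [4000]

average = sum(durations_ms) / len(durations_ms)
p95 = nearest_rank(durations_ms, 95)
p99 = nearest_rank(durations_ms, 99)
worst = max(durations_ms)

print(average)  # 322.55
print(p95)      # 138
print(p99)      # 4000
print(worst)    # 4000
```

In this sample the single outlier more than doubles the average, leaves the 95th percentile untouched, and fully determines the 99th percentile and the maximum, which is why the stricter aggregates suit SLA-style monitoring.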

Additive and gauge events

These are events such as tokens.used, queue.backlog.gauge, and market.bitcoin.price_usd, where the value is a measurement or a running total.

| Aggregate | When to use | Example policy |
| --- | --- | --- |
| sum | Cumulative spend or volume | “Alert if token spend exceeds 10,000 per day” |
| average | Typical level | “Alert if average queue depth exceeds 500 per minute” |
| min | Floor monitoring | “Alert if price drops below 60,000” |
| max | Ceiling monitoring | “Alert if backlog exceeds 1,000 per minute” |

Binary and heartbeat events

These are events such as heartbeat.missed.count, where any non-zero value is a problem.

| Aggregate | When to use | Example policy |
| --- | --- | --- |
| sum | Any occurrence is bad | “Alert if missed heartbeats > 0 per day” with condition: gt, threshold: 0 |
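The condition and threshold fields can be read as a comparison applied to the aggregated value. The sketch below assumes the comparison is `aggregated_value <condition> threshold`, which matches the heartbeat example above; the `breaches` helper is hypothetical and only illustrates the five operators.

```python
import operator

# Map each documented condition value to a Python comparison.
CONDITIONS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "gte": operator.ge,
    "lte": operator.le,
}


def breaches(aggregated_value, condition, threshold):
    """Return True if the aggregated value satisfies the policy condition."""
    return CONDITIONS[condition](aggregated_value, threshold)


# Heartbeat policy: aggregate sum, condition gt, threshold 0.
print(breaches(0, "gt", 0))  # False: no missed heartbeats in the window
print(breaches(2, "gt", 0))  # True: any missed heartbeat breaches
```

Note that gte with threshold 0 would be satisfied by an empty window as well, so gt is the safer operator for "any occurrence" policies.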

Period selection

The period determines how often the policy is evaluated:

  • minute — fastest alerting, fires within ~1-2 minutes. Use for critical operational signals.
  • hour — fires after the hour boundary. Good for rate-based alerts (errors per hour, requests per hour).
  • day — fires after the day boundary (in the event’s timezone). Use for daily budgets, SLA reporting, and trend monitoring.
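The boundaries described above can be pictured as truncating a timestamp to the start of its window. The sketch below assumes calendar-aligned windows and uses a hypothetical fixed UTC-5 offset as the event's timezone; the service's actual alignment logic is not shown on this page.

```python
from datetime import datetime, timedelta, timezone


def window_start(ts, period):
    """Truncate a timezone-aware timestamp to the start of its rollup window."""
    if period == "minute":
        return ts.replace(second=0, microsecond=0)
    if period == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if period == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown period: {period!r}")


# Hypothetical event timezone: a fixed UTC-5 offset.
event_tz = timezone(timedelta(hours=-5))

# 03:30:45 UTC on 1 March is 22:30:45 on 29 February in UTC-5.
ts = datetime(2024, 3, 1, 3, 30, 45, tzinfo=timezone.utc).astimezone(event_tz)

print(window_start(ts, "minute"))  # 2024-02-29 22:30:00-05:00
print(window_start(ts, "hour"))    # 2024-02-29 22:00:00-05:00
print(window_start(ts, "day"))     # 2024-02-29 00:00:00-05:00
```

The day case shows why the event's timezone matters: the same log falls on 1 March in UTC but counts toward the 29 February window in the event's local time.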

Channel routing

The policy channel field is a routing namespace for alerts. Typical values are:

  • default
  • ops
  • market
  • sentiment

Do not confuse the policy channel with the destination provider type. Destinations have their own channel field, which names the transport with values like slack, discord, telegram, webhook, or email.

Lifecycle

Policies are created, read, and updated through:

  • POST /v1.0/policies
  • GET /v1.0/policies
  • GET /v1.0/policies/:policyID
  • PUT /v1.0/policies/:policyID

Triggered output appears in Alerts.
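As a rough sketch of calling the routes listed above, the snippet below builds (but does not send) a create request and an update request with Python's standard library. The base URL, the bearer-token header, the placeholder IDs, and the choice to send the full policy body on PUT are all assumptions; the Policies API reference defines the real host, authentication, and request bodies.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host
API_TOKEN = "YOUR_TOKEN"              # hypothetical; auth scheme is an assumption


def build_request(method, path, body=None):
    """Build a JSON request for a policy route without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {API_TOKEN}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


policy = {
    "event_id": "00000000-0000-0000-0000-000000000000",  # placeholder UUID
    "title": "Tool errors above 50 per hour",
    "channel": "ops",
    "period": "hour",
    "aggregate": "sum",
    "condition": "gt",
    "threshold": 50,
    "severity": "error",
    "enabled": True,
}

create = build_request("POST", "/v1.0/policies", policy)

policy_id = "11111111-1111-1111-1111-111111111111"  # placeholder policy ID
update = build_request("PUT", f"/v1.0/policies/{policy_id}", dict(policy, enabled=False))

print(create.get_method(), create.full_url)
print(update.get_method(), update.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out here because the host and credentials are placeholders.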
