Documentation
Frequently asked questions
Honest answers to the questions SRE teams and enterprises ask before adopting Burnless.
Safety & reliability
Can Burnless make incidents worse?
This is the most important question, and we designed around it. Burnless has three modes: dry-run (detects, never acts), semi-auto (posts proposed steps to Slack for human approval before executing), and auto (executes immediately). Enterprise teams start with semi-auto. Auto mode is opt-in per runbook, and you define blast radius limits in sre.yaml: maximum replicas, allowed operations, off-limits services. Burnless cannot exceed what you declare.
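As an illustration of how declared limits can bound an action, here is a minimal Python sketch. It is not Burnless's implementation: the class names, field names, and example values are hypothetical and stand in for whatever you declare in sre.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    """Hypothetical limits, standing in for what a team declares in sre.yaml."""
    max_replicas: int
    allowed_operations: frozenset
    off_limits_services: frozenset


@dataclass(frozen=True)
class Action:
    """One proposed runbook action (illustrative shape only)."""
    service: str
    operation: str
    replicas: int = 0


def violations(action: Action, limits: BlastRadius) -> list:
    """Return the reasons an action exceeds the limits; an empty list means allowed."""
    reasons = []
    if action.service in limits.off_limits_services:
        reasons.append(f"service {action.service!r} is off-limits")
    if action.operation not in limits.allowed_operations:
        reasons.append(f"operation {action.operation!r} is not allowed")
    if action.replicas > limits.max_replicas:
        reasons.append(
            f"{action.replicas} replicas exceeds the maximum of {limits.max_replicas}"
        )
    return reasons


# Example limits; every value here is made up for the sketch.
limits = BlastRadius(
    max_replicas=10,
    allowed_operations=frozenset({"scale-up", "restart"}),
    off_limits_services=frozenset({"payments-db"}),
)
```

Under these example limits, scaling a service named checkout to 8 replicas yields no violations, while any action against payments-db is rejected before anything runs.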
What stops it from triggering the wrong runbook during a complex failure?
Each burn_rate_alert maps to one specific named runbook. Burnless does not infer intent; it only runs what you declared. If the runbook's assert step fails (e.g. availability is still below target after scaling), it stops and escalates to PagerDuty instead of retrying blindly. We prefer doing nothing over doing the wrong thing.
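The stop-and-escalate behaviour can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the agent's code: run_runbook, its parameters, and the return values are hypothetical names, and the escalate callback stands in for the PagerDuty notification.

```python
from typing import Callable, List


def run_runbook(
    steps: List[Callable[[], None]],
    assertion: Callable[[], bool],
    escalate: Callable[[str], None],
) -> str:
    """Run each declared step once, then check the assert step. Never retry."""
    for step in steps:
        step()
    if not assertion():
        # The assert step failed: hand over to a human instead of looping.
        escalate("assert step failed; stopping without retry")
        return "escalated"
    return "resolved"
```

The point of the design is that there is no retry loop: a failed assertion ends the run and pages a person.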
Is Burnless rule-based or does it understand system context?
Currently rule-based via sre.yaml. This is intentional: SRE teams should not trust black-box automation they cannot inspect. Every decision Burnless makes is traceable to a specific rule in your sre.yaml. In Phase 2 we are adding context-aware suggestions (not autonomous actions) that flag when a runbook may not address the root cause.
Control & approval flow
Who approves actions before they run?
In semi-auto mode, Burnless posts the proposed runbook steps to a Slack channel with Approve and Reject buttons. The on-call engineer reviews and approves. Auto mode requires explicit opt-in in sre.yaml per runbook. No runbook ever runs automatically unless you have written mode: auto in that specific runbook definition.
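The approval rule can be expressed as a small gate. This is a hedged Python sketch: Mode and may_execute are hypothetical names, and only the three mode strings (dry-run, semi-auto, auto) come from this FAQ.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    DRY_RUN = "dry-run"
    SEMI_AUTO = "semi-auto"
    AUTO = "auto"


def may_execute(mode: Mode, approved: Optional[bool] = None) -> bool:
    """Decide whether a runbook may run, given its mode and any human decision."""
    if mode is Mode.AUTO:
        # Explicit per-runbook opt-in: no approval needed.
        return True
    if mode is Mode.SEMI_AUTO:
        # Runs only after an explicit Approve; Reject or no answer means no.
        return approved is True
    # dry-run detects but never acts, whatever the approval state.
    return False
```

For example, may_execute(Mode("semi-auto")) is False until someone approves, and a dry-run runbook never executes even if approved.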
Can we enforce human-in-the-loop for all production changes?
Yes. Set mode: semi-auto on every runbook. Burnless will never touch production without human approval. You can also restrict which operations are allowed: for example, allow scale-up but never allow scale-down or database operations.
How does RBAC work?
In the open source CLI, RBAC is managed by your existing Git access controls: sre.yaml changes require a PR and reviewer approval, same as code. In the enterprise SaaS tier, role-based access control (Admin, Editor, Viewer) is enforced at the API level. SAML/SSO integration means Burnless inherits your organisation's existing access policies.
Data flow & security
What data does Burnless actually access?
Burnless only reads metric numbers from Prometheus (e.g. 99.85% uptime) and executes the shell commands you define in sre.yaml. It never accesses your database, application code, customer records, or any business data. The only credentials it holds are your Prometheus URL, Slack webhook, and PagerDuty routing key, all stored in environment variables, never in sre.yaml.
How does data flow through the system?
Your service emits metrics → Prometheus scrapes them → the Burnless agent queries Prometheus every 60 seconds → compares the results against the SLO targets in sre.yaml → if a burn rate threshold is crossed, runs the declared runbook → logs every action with timestamp, rule triggered, and outcome. Nothing is stored outside your infrastructure in the open source tier.
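The comparison step in that flow rests on burn rate arithmetic. The sketch below uses the common SRE definition (observed error rate divided by the error budget the SLO allows); this FAQ does not spell out the exact formula Burnless uses, so treat the function names and the formula as assumptions.

```python
def burn_rate(observed_availability: float, slo_target: float) -> float:
    """Burn rate = observed error rate / error budget allowed by the SLO.

    Both arguments are fractions, e.g. 0.9985 for 99.85%.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - observed_availability) / error_budget


def threshold_crossed(
    observed_availability: float, slo_target: float, threshold: float
) -> bool:
    """True when the budget is being spent at or above the given rate."""
    return burn_rate(observed_availability, slo_target) >= threshold
```

With a 99.9% target and 99.85% observed availability, the error rate is 0.15% against a 0.1% budget, so the burn rate is about 1.5: the budget is being consumed roughly one and a half times faster than the SLO allows.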
Where are secrets stored?
Never in sre.yaml and never in Git. Credentials (Prometheus URL, Slack webhook, PagerDuty key) are passed as environment variables. In the enterprise SaaS tier they are stored encrypted at rest (AES-256) in an isolated vault per customer.
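A minimal Python sketch of reading credentials from the environment and failing fast when one is missing. The variable names below are hypothetical; this FAQ does not list the names Burnless actually reads.

```python
import os

# Hypothetical variable names, chosen for this sketch only.
REQUIRED = ("PROMETHEUS_URL", "SLACK_WEBHOOK_URL", "PAGERDUTY_ROUTING_KEY")


def load_credentials(env=os.environ) -> dict:
    """Read every required credential from the environment, or fail loudly."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```

Because nothing is read from sre.yaml, the config file can be committed and reviewed in Git without exposing secrets.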
Integrations
We already use Grafana, Datadog, and PagerDuty. Does Burnless replace them?
No. Burnless sits on top of your existing tools and connects them. Prometheus stays as your metrics source. Grafana stays as your dashboard. PagerDuty stays as your on-call platform. Burnless adds the coordination layer: it reads from Prometheus, auto-generates the Grafana dashboard from sre.yaml, and configures PagerDuty escalation policies. You keep your existing tools; Burnless makes them talk to each other.
How hard is migration from our existing setup?
You do not migrate; you add. Run burnless validate on a new sre.yaml alongside your existing setup. Start in dry-run mode so Burnless detects but never acts. When you trust the detections, enable semi-auto for one service. Migration is incremental and reversible: delete sre.yaml and Burnless has zero footprint.
Can we use Datadog instead of Prometheus?
Prometheus is the supported metrics source in v0.1.0. Datadog support is on the roadmap for Phase 2. The metrics layer is behind a Go interface, so adding a new source means writing one new file without touching core logic.
Auditability & compliance
Is every action traceable for compliance?
Yes. Every action Burnless takes is logged with: timestamp, which SLO triggered it, which burn rate threshold was crossed, which runbook ran, each step executed, whether it succeeded or failed, and who approved it (in semi-auto mode). Logs are append-only. In the enterprise tier they are exportable for SOC 2 and audit purposes.
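To make the logged fields concrete, here is a hedged Python sketch of one audit entry written as a JSON line. The field names, function names, and file layout are assumptions for illustration; the FAQ specifies only which facts are recorded, not the log format.

```python
import json
from datetime import datetime, timezone


def audit_record(slo, threshold, runbook, steps, succeeded, approved_by=None) -> str:
    """Serialise one action as a single JSON line (hypothetical field names)."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "slo": slo,
        "burn_rate_threshold": threshold,
        "runbook": runbook,
        "steps": steps,
        "succeeded": succeeded,
        # Only set in semi-auto mode, where a human approved the run.
        "approved_by": approved_by,
    })


def append_record(path: str, line: str) -> None:
    """Add one entry to the end of the log file."""
    # Mode "a" writes at the end of the file; existing entries are not rewritten.
    with open(path, "a", encoding="utf-8") as log:
        log.write(line + "\n")
```

Opening the file in append mode only illustrates the writer side; real append-only guarantees would also depend on file permissions or storage controls that this sketch does not cover.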
We are in fintech/healthcare. Do you have compliance certifications?
GDPR compliance is built in by design: no PII is processed. SOC 2 Type II is planned for Q4 2026. HIPAA support is available on request for the enterprise tier. ISO 27001 is planned for Q1 2027.
Can we see exactly why a decision was made?
Yes. Every runbook execution includes a decision record: the exact sre.yaml rule that triggered it, the metric value that crossed the threshold, the burn rate calculated, and the window of data used. There is no hidden logic: if you want to understand a decision, you read the sre.yaml rule that produced it.
Long-term trust
What happens if the Burnless project stops being maintained?
Burnless is Apache 2.0 open source. If the project stops, you keep working software you can fork, maintain, and modify. Your sre.yaml files are plain YAML, so there is no vendor lock-in. The runbooks are shell commands, so they run without Burnless. We have also committed in the BSL that all SaaS code converts to Apache 2.0 after 4 years.
Is Burnless stable enough for production?
v0.1.0 is an initial release. We recommend starting in dry-run or semi-auto mode, on non-critical services first, to build confidence before enabling auto mode on production. The SLO math engine (internal/slo) and config parser (internal/config) are fully tested. The agent main loop is in active development.
We already have PagerDuty + Runbooks + custom scripts. Why change?
You do not have to change anything. Burnless's value is not in replacing those tools; it is in making the configuration of all of them reviewable in Git. Right now your runbooks are in Confluence, your alerts are in the Grafana UI, and your PagerDuty policies are configured manually. Burnless puts all of that in one sre.yaml that is versioned, reviewed in PRs, and deployed from CI/CD. The question is not 'should we replace our tools' but 'should our reliability config be in Git like our application code?'