Engineering · Governance

Most AI agent frameworks have the safety model backwards

How I built a governed agent runtime: risk-gated execution and a hash-chained audit log.

July 1, 2026 · ~7 min read

The trade I refused to make

I've tried most of the popular agent frameworks. They're genuinely impressive at turning a sentence into a sequence of tool calls. But the first time I watched one decide, unprompted, to run a shell command that touched files outside the directory I'd pointed it at, I closed the tab.

Not because it did damage — it didn't. Because I realized I'd been given a choice I never wanted to make: I could have an agent that was useful (real filesystem, real shell, real git) or one that was safe (sandboxed to uselessness), but not both. Safety, in these tools, is a flag you remember to set — or a wrapper you bolt on afterward. The default is "trust me."

For a demo, "trust me" is fine. For anything you'd run against a real repository, on a real machine, it's a non-starter. So I built MachinaOS around the opposite default: an agent may do real work, but every risky thing it does is gated and recorded. This post is about how that actually works, because the interesting part isn't the idea — it's the mechanics that make it not annoying.

Risk as a first-class property of every tool

The foundation is boring on purpose: every tool in the registry declares a risk level.

LOW — read-only or trivially reversible. List a directory, read a file, get system info.
MEDIUM — writes that stay inside the workspace. Edit a file, stage a git change.
HIGH — escapes the easy-undo boundary. The real ones in the registry: shell.run, git.push / git.merge, filesystem.delete, process.start / process.stop.
CRITICAL — a reserved hard-block tier the policy engine refuses outright. Nothing in the shipped registry is tagged CRITICAL today; it's the "never, not even with an approval" backstop.

The policy that sits on top of those levels is deliberately simple:

LOW + MEDIUM  -> run
HIGH          -> pause for human approval
CRITICAL      -> blocked outright (hard stop)

The reason this matters more than it looks: the policy lives at the runtime boundary, not in the agent's reasoning. The model doesn't get to talk its way past a gate, because the gate isn't part of the prompt — it's enforced after the plan is produced and before the tool executes. A jailbroken or confused model can produce a HIGH-risk step all it wants; it still hits the same wall.
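To make the "runtime boundary" idea concrete, here is a minimal Python sketch of a gate that sits between the plan and the tool call. It is an illustration under stated assumptions, not MachinaOS's actual code: the names Risk, Decision, Tool, decide, execute_step, and the two exception classes are all hypothetical, and the registry entries are stand-ins that only return strings (filesystem.read is an invented example of a LOW tool).

```python
from dataclasses import dataclass
from enum import Enum, IntEnum
from typing import Any, Callable


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


class Decision(Enum):
    RUN = "run"
    NEEDS_APPROVAL = "needs_approval"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class Tool:
    name: str
    risk: Risk               # declared by the tool, not inferred by the model
    fn: Callable[..., Any]


class ApprovalRequired(Exception):
    """Raised when a HIGH-risk step has no recorded approval yet."""


class Blocked(Exception):
    """Raised for CRITICAL steps; no approval can override it."""


def decide(risk: Risk) -> Decision:
    # Mirrors the policy table: LOW + MEDIUM run, HIGH pauses, CRITICAL stops.
    if risk is Risk.CRITICAL:
        return Decision.BLOCKED
    if risk is Risk.HIGH:
        return Decision.NEEDS_APPROVAL
    return Decision.RUN


def execute_step(registry: dict, tool_name: str, args: dict, approved: bool = False) -> Any:
    # The gate lives here: after the plan exists, before the tool function runs.
    # Nothing in the prompt or the plan text can change the outcome.
    tool = registry[tool_name]
    decision = decide(tool.risk)
    if decision is Decision.BLOCKED:
        raise Blocked(tool_name)
    if decision is Decision.NEEDS_APPROVAL and not approved:
        raise ApprovalRequired(tool_name)
    return tool.fn(**args)


# Stand-in registry: the lambdas only describe what a real tool would do.
REGISTRY = {
    "filesystem.read": Tool("filesystem.read", Risk.LOW, lambda path: f"read {path}"),
    "shell.run": Tool("shell.run", Risk.HIGH, lambda command: f"ran {command}"),
}
```

In this sketch, calling execute_step(REGISTRY, "shell.run", {"command": "ls"}) raises ApprovalRequired no matter how the step was generated, while the same call with approved=True goes through. A CRITICAL tool raises Blocked even when approved=True is passed.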

It also means the policy binds the caller, not just the model, and it is enforced at whichever layer each caller enters through. The UI and the REST API both submit work into the same runtime, so a HIGH-risk step hits the same policy engine either way: it pauses for approval, and a CRITICAL step is blocked outright. An external MCP client driving the tool registry is held to the same taxonomy from the other direction: by default it can't even see HIGH-risk tools (exposure is opt-in via MACHINA_MCP_ALLOW_HIGH), and CRITICAL tools are never exposed to it at all. No caller gets a privileged bypass.
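The MCP exposure rule can be sketched as a filter over the registry. This is a guess at the shape, not the shipped implementation: the function name mcp_visible_tools is hypothetical, filesystem.list and reserved.example are invented tool names, and the set of values that count as "on" for MACHINA_MCP_ALLOW_HIGH is an assumption, since the post only says the flag exists.

```python
import os
from enum import IntEnum
from typing import Mapping, Optional


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def mcp_visible_tools(
    tool_risks: Mapping[str, Risk],
    env: Optional[Mapping[str, str]] = None,
) -> list:
    """Return the tool names an external MCP client is allowed to see."""
    env = os.environ if env is None else env
    # Assumption: "1", "true", or "yes" enable the opt-in; anything else does not.
    flag = env.get("MACHINA_MCP_ALLOW_HIGH", "").strip().lower()
    allow_high = flag in {"1", "true", "yes"}
    # CRITICAL is above every possible ceiling, so it is never exposed.
    ceiling = Risk.HIGH if allow_high else Risk.MEDIUM
    return sorted(name for name, risk in tool_risks.items() if risk <= ceiling)


TOOLS = {
    "filesystem.list": Risk.LOW,         # hypothetical LOW tool
    "filesystem.delete": Risk.HIGH,
    "git.push": Risk.HIGH,
    "reserved.example": Risk.CRITICAL,   # hypothetical; nothing ships as CRITICAL
}
```

Filtering at listing time is only half of the story in a design like this. A tool that is hidden from the client should still go through the same policy check if it is somehow invoked.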

Approvals that a human can actually act on

A gate is only as good as the decision it forces. When a HIGH-risk step needs approval, the runtime pauses the task and surfaces exactly what's about to happen: the tool, the resolved arguments, and the risk level. Not "the agent wants to do something" — the literal command, post-resolution, before it runs.

Two design choices made this usable rather than tedious:

Plan first, execute second. MachinaOS splits planning from execution. The agent produces a full, inspectable plan; you see the whole shape before any step fires. Approvals then happen against a concrete plan, not a stream of surprises.
Approvals carry identity. When someone approves a step, who approved it is recorded (resolved_by) — along with the source IP. An approval isn't an anonymous "yes"; it's an attributable decision.
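One way to picture an attributable approval is as a small record that cannot be resolved without naming who resolved it. The sketch below is illustrative only: resolved_by comes from the post, while ApprovalRequest, resolved_args, source_ip, resolved_at, status, and the method names are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ApprovalRequest:
    tool: str
    resolved_args: dict          # arguments after resolution, exactly as they would run
    risk: str
    status: str = "pending"      # pending | approved | denied
    resolved_by: Optional[str] = None
    source_ip: Optional[str] = None
    resolved_at: Optional[datetime] = None

    def summary(self) -> str:
        # What the human sees: the literal tool call, not a paraphrase of it.
        return f"[{self.risk}] {self.tool} {self.resolved_args}"

    def resolve(self, approve: bool, resolved_by: str, source_ip: str) -> None:
        if self.status != "pending":
            raise ValueError("request already resolved")
        if not resolved_by:
            raise ValueError("a decision must be attributable")
        self.status = "approved" if approve else "denied"
        self.resolved_by = resolved_by
        self.source_ip = source_ip
        self.resolved_at = datetime.now(timezone.utc)


request = ApprovalRequest(
    tool="shell.run",
    resolved_args={"command": "git push origin main"},
    risk="HIGH",
)
request.resolve(approve=True, resolved_by="alice", source_ip="127.0.0.1")
```

Refusing an empty resolved_by and refusing a second resolve call are the two checks that keep the record meaningful in this sketch: every decision has exactly one named owner.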

That second point is what turns a convenience feature into a governance one, which brings us to the part I'm most happy with.

The audit log is hash-chained

Every HIGH/CRITICAL tool invocation and every approval decision is written to an append-only audit log. The twist: each entry commits to the hash of the previous one.

entry_n.hash = H(entry_n.payload + entry_{n-1}.hash)

This is the same trick a blockchain uses, minus the consensus theater. The property it buys you is tamper-evidence: you can't quietly alter or delete an entry in the middle of the log without breaking every hash after it. If someone edits history, verification fails at the exact point of tampering.
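The chaining rule above fits in a few lines of Python. This is a sketch under assumptions rather than the real log format: SHA-256 as H, canonical JSON as the payload encoding, an all-zero genesis hash, and the function names entry_hash, append, and verify are all choices made for the example.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed stand-in for "the hash before the first entry"


def entry_hash(payload: dict, prev_hash: str) -> str:
    # H(payload + previous hash); sorted keys keep the encoding deterministic.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((body + prev_hash).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> dict:
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": entry_hash(payload, prev_hash),
    }
    log.append(entry)
    return entry


def verify(log: list) -> Optional[int]:
    """Return the index of the first entry that fails to verify, or None."""
    prev_hash = GENESIS
    for index, entry in enumerate(log):
        expected = entry_hash(entry["payload"], prev_hash)
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return index
        prev_hash = entry["hash"]
    return None


log = []
append(log, {"event": "tool_call", "tool": "shell.run"})
append(log, {"event": "approval", "resolved_by": "alice"})
append(log, {"event": "tool_call", "tool": "git.push"})
```

Editing the payload of the second entry in place, or deleting it, makes verify(log) return 1, the position where the chain first breaks. A chain on its own does not catch someone who rewrites an entry and recomputes every later hash, or who truncates the tail of the log. Detecting that requires keeping a copy of the latest hash somewhere the log's writer cannot modify.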

For a tool that runs real commands on real machines, this is the difference between "we have logs" (which can be edited) and "we have a record" (which can't be edited without the change showing up). It's the bones of a SOC 2-style control, built in from the start rather than retrofitted when a customer's security team asks for it.

Why local-first is part of the safety story

MachinaOS runs entirely on your machine and works fully offline. The UI and all its dependencies are vendored, there are no CDN calls, and you can point it at a local model and pull the network cable.

I used to think of "local-first" and "governed" as two unrelated features. They aren't. Where the execution happens determines who can audit it. If your agent's risky steps run on someone else's infrastructure, your audit log is their audit log, and your governance model is a request you file with their support team. Running locally means the gate, the approval, and the tamper-evident record are all yours, on hardware you control. Governance you don't own isn't governance.

What I'm still unsure about

I don't think this design is obviously correct, and I'd rather be honest about the open questions than pretend they're solved:

Is risk-gating the right abstraction? Four levels is a guess. Maybe risk is better modeled as capabilities (filesystem, network, exec) than a linear scale.
Approval fatigue is real. Gate too much and people click "approve" without reading — which is worse than no gate, because it launders recklessness as oversight. Calibrating what deserves a gate is the hard part, and I'm not sure I've got it right yet.
Audit logs are only as good as what you log. A tamper-evident record of the wrong events is just confident noise.

Try it / tear it apart

There's a live demo where you can watch the full loop — request, plan, gated execution, audit timeline — on a seeded repo without installing anything.

If you build agent tooling, I'd genuinely like to know where you think this model breaks. The safety conversation around agents is mostly still at the "should we let it run code" stage. I think the more useful question is "when it runs code, can you prove what it did?" — and that's the question MachinaOS is built to answer.
