Operational economics

We treat tokens as engineering capital

Our value does not come from making more model calls. It comes from an operating loop that has learned to spend expensive reasoning only where it changes an engineering outcome: task selection, bounded patches, tests, CI failures, and real reviewer feedback. Routine PR coverage is pushed into cheap GitHub API checks, JSON ledgers, dedupe gates, and stop-loss rules.

16.3M tokens visible in local Codex token markers across 92 marked events.
6.16M tokens in the largest visible bucket: daily macro doctor Codex runs.
55 maintenance comments posted from 45,276 candidate records.
3.09 average files touched per open RustChain PR in the current 34-PR queue.

Audit source: local Druid/Codex logs under ~/.codex/bounty-radar, checked on 2026-06-11. These are operational signals, not a billing statement.

The strategy we evolved

We did not design this policy up front as a slide-deck ideal. We evolved it from our own operating history: too much broad scanning, too many zombie-PR loops, repeated local maintenance candidates, and expensive quota probes. Those failures became rules that make us more economical over time.

1 / Reserve reasoning

Model calls go to decisions

We spend tokens on work that needs judgment: choosing high-value issues, proving a bug, writing the smallest safe patch, adding regression tests, or answering real maintainer feedback.

2 / Automate coverage

Cheap checks hold the queue

Open PRs stay visible through API snapshots, branch cleanliness checks, CI state, review-decision scans, and local ledgers before any expensive agent session is launched.

3 / Stop losing trades

Stale work gets downgraded

Dirty zombie PRs, closed bounty-scope issues, low-signal repos, and duplicate maintenance drafts are downgraded or stopped instead of repeatedly consuming model attention.

Parallel PR maintenance without runaway token burn

Our maintenance layer is designed so a large review queue does not require one full reasoning session per PR per cycle. We can watch many PRs in parallel at the metadata layer, then escalate only the few that show a real reason to spend tokens.

Watch all, reason selectively

GitHub API reads track open/closed state, mergeability, failing checks, pending checks, review decisions, and maintainer comments across the whole queue.

Escalate only on signal

Full agent maintenance is reserved for concrete triggers: failing CI, formal changes requested, merge conflicts, stale PR body metadata, or owner feedback that changes the risk boundary.

Keep zombie PRs cheap

Older dirty PRs remain watch-only unless new maintainer signal appears. They do not consume the same budget as clean high-value PRs waiting in review.

Reduce unattended risk

Because routine queue state is maintained by cheap checks, we are less likely to spend quota on background analysis while real PR maintenance waits unseen. Low-touch does not mean zero oversight; high-risk decisions still stay human-gated.
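The escalation rules in this section can be sketched as a small triage function. This is a minimal illustration, not our actual maintenance code: the `PRSnapshot` record, its field names, the `Action` values, and the PR numbers in the usage lines are all hypothetical, and treating a closed PR as a stop is an assumption added for completeness.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ESCALATE = "escalate"  # launch a full agent maintenance session
    WATCH = "watch"        # keep tracking through cheap metadata reads
    STOP = "stop"          # spend no further budget (assumed for closed PRs)


@dataclass
class PRSnapshot:
    """Hypothetical metadata record assembled from cheap API reads."""
    number: int
    is_open: bool = True
    ci_failing: bool = False
    changes_requested: bool = False
    merge_conflict: bool = False
    body_metadata_stale: bool = False
    new_maintainer_signal: bool = False
    is_zombie: bool = False  # older dirty PR with no recent reviewer activity


def triage(pr: PRSnapshot) -> Action:
    """Decide whether a PR deserves model attention this cycle."""
    if not pr.is_open:
        return Action.STOP
    # Zombie PRs stay watch-only unless a maintainer has said something new.
    if pr.is_zombie and not pr.new_maintainer_signal:
        return Action.WATCH
    triggers = (
        pr.ci_failing,
        pr.changes_requested,
        pr.merge_conflict,
        pr.body_metadata_stale,
        pr.new_maintainer_signal,
    )
    return Action.ESCALATE if any(triggers) else Action.WATCH


# Made-up queue: only the clean PR with failing CI is escalated.
queue = [
    PRSnapshot(number=101),
    PRSnapshot(number=102, ci_failing=True),
    PRSnapshot(number=103, is_zombie=True, merge_conflict=True),
]
escalated = [pr.number for pr in queue if triage(pr) is Action.ESCALATE]
```

The point of the design is that `triage` reads only metadata, so it can run over the whole queue every cycle at negligible cost, and model sessions start only for the PRs it returns as `ESCALATE`.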

Where the tokens were going

Doctor loops

Observed: daily macro doctor and hourly doctor logs account for the largest share of visible tokens: about 9.86M combined.

Policy: keep doctor work for real failures, weekly summaries, and strategy audits; avoid high-frequency broad self-analysis.

PR generation and maintenance

Observed: Codex PR-generation and PR-maintenance logs account for about 4.01M visible tokens.

Policy: spend here when the output is a bounded patch, regression test, CI fix, or useful reviewer response.

Diplomat / communication loops

Observed: diplomat and diplomat-learning Codex logs account for about 2.44M visible tokens.

Policy: use templated owner pings and summary rules; reserve model calls for nuanced maintainer replies.

Quota probes

Observed: one quota probe that only printed CODEX_QUOTA_OK still consumed 15,806 tokens.

Policy: replace repeated model-session probes with cheaper state checks whenever possible.
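The bucket figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers reported above; the variable and function names are illustrative, and the "probes per million tokens" figure assumes every probe costs the same as the one observed, which may not hold.

```python
# Visible token totals per bucket, in millions, copied from the figures above.
BUCKETS_M = {
    "doctor loops": 9.86,
    "PR generation and maintenance": 4.01,
    "diplomat / communication": 2.44,
}

# Cost of the single observed quota probe, in tokens.
QUOTA_PROBE_TOKENS = 15_806


def bucket_shares(buckets: dict) -> dict:
    """Return each bucket's percentage of the combined total, to one decimal."""
    total = sum(buckets.values())
    return {name: round(100 * value / total, 1) for name, value in buckets.items()}


total_m = sum(BUCKETS_M.values())
shares = bucket_shares(BUCKETS_M)

# Assumption: every probe costs as much as the one that was observed.
probes_per_million = 1_000_000 / QUOTA_PROBE_TOKENS

for name, pct in shares.items():
    print(f"{name}: {pct}% of {total_m:.2f}M visible tokens")
print(f"about {probes_per_million:.0f} probes of this size per 1M tokens")
```

With the reported figures, the three buckets sum to about 16.31M, matching the 16.3M headline. Doctor loops come out at roughly 60% of visible tokens, PR work at about a quarter, and communication loops at about 15%.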

What gets pruned

Whole-workspace scans

Broad recursive searches can wander into unrelated cloned repositories and documentation trees. We need repo allowlists and ignorelists before spending reasoning on search results.
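A repo allowlist plus a directory ignorelist can be applied as a cheap path filter before any search result reaches a model. The sketch below is an assumption-heavy illustration: the `/workspace` root, the `rustchain` directory name, and the ignored directory names are hypothetical placeholders, not our real layout.

```python
from pathlib import PurePosixPath

# Hypothetical configuration: repo directories we are willing to scan,
# and subdirectories that should never be searched inside them.
ALLOWED_REPOS = {"rustchain"}
IGNORED_DIRS = {".git", "docs", "node_modules", "target"}


def should_scan(path: str, workspace: str = "/workspace") -> bool:
    """Return True only for files inside an allowlisted repo and outside ignored dirs."""
    try:
        parts = PurePosixPath(path).relative_to(workspace).parts
    except ValueError:
        # The path is not under the workspace at all.
        return False
    if not parts or parts[0] not in ALLOWED_REPOS:
        return False
    return not any(part in IGNORED_DIRS for part in parts[1:])


candidates = [
    "/workspace/rustchain/src/lib.rs",
    "/workspace/rustchain/docs/guide.md",
    "/workspace/some-other-clone/src/main.rs",
]
to_scan = [p for p in candidates if should_scan(p)]
```

Running the filter first means unrelated clones and documentation trees are dropped by a string comparison instead of being read into a reasoning session.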

Zombie PR maintenance

Old PRs such as #6182 and #6823 generated a disproportionate share of maintenance-ledger churn. They should remain low-touch unless new reviewer signal appears.

Duplicate public comments

The maintenance ledger recorded 45,276 candidate records but only 55 posted comments, or about 0.12% of candidates. That gating is good; the next step is to stop generating the same local candidates repeatedly.
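One way to cut repeated candidate generation is to check a ledger of already-seen candidates before drafting anything. The following is a minimal sketch under stated assumptions: the `DedupeLedger` class, the key format, and the whitespace and case normalization are hypothetical choices for illustration, not a description of our existing ledger.

```python
import hashlib
import json


def candidate_key(repo: str, pr_number: int, body: str) -> str:
    """Stable fingerprint for a maintenance candidate, ignoring case and spacing."""
    normalized = " ".join(body.lower().split())
    raw = json.dumps([repo, pr_number, normalized])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupeLedger:
    """In-memory ledger that admits each distinct candidate only once."""

    def __init__(self) -> None:
        self.seen = set()
        self.suppressed = 0

    def admit(self, repo: str, pr_number: int, body: str) -> bool:
        key = candidate_key(repo, pr_number, body)
        if key in self.seen:
            self.suppressed += 1  # duplicate: skip drafting and posting
            return False
        self.seen.add(key)
        return True

    def to_json(self) -> str:
        """Serialize the ledger so it can persist between cycles."""
        return json.dumps({"seen": sorted(self.seen), "suppressed": self.suppressed})


ledger = DedupeLedger()
first = ledger.admit("example-org/example-repo", 1, "CI is failing on main")
repeat = ledger.admit("example-org/example-repo", 1, "ci is  failing on main")
```

Here `first` is admitted and `repeat` is suppressed, because both bodies normalize to the same fingerprint. Calling `admit` before the drafting step, instead of only before posting, is what would reduce local candidate churn.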

Low-return repo exploration

Repos without reliable bounty, review, merge, or reward signals should not receive expensive agent attention after the initial cheap check.

Closed-scope PR loops

When a bounty issue is closed as out-of-scope, we separate PR engineering value from payout eligibility. We keep useful patches as ordinary maintainer-review work, but close or downgrade branches tied to undeployed/demo-only paths.

Self-closed low-signal PRs

We self-close branches when the evidence no longer justifies maintenance. A current example is RustChain #7378: after Scott verified the tip bot was undeployed reference scaffolding, in-memory only, had no live /wallet/transfer path wired, and moved no real funds, we closed the PR instead of spending more review or token budget on it.

Repeated delete or permission attempts

We check external repo permissions once. If GitHub denies an action such as issue deletion, we record the boundary and stop retrying.
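The check-once rule can be sketched as a small boundary record that remembers denials. This is an illustrative sketch only: the `PermissionBoundaries` class is hypothetical, and the use of Python's built-in `PermissionError` to signal a denial is an assumption standing in for whatever error a real API client would raise.

```python
from typing import Callable, Dict, Tuple


class PermissionBoundaries:
    """Remember denied (repo, action) pairs so they are never retried."""

    def __init__(self) -> None:
        self.denied: Dict[Tuple[str, str], str] = {}

    def attempt(self, repo: str, action: str, do_action: Callable[[], None]) -> bool:
        key = (repo, action)
        if key in self.denied:
            return False  # boundary already recorded: do not retry
        try:
            do_action()
            return True
        except PermissionError as exc:
            # Assumption: the caller maps an API denial to PermissionError.
            self.denied[key] = str(exc)
            return False


calls = []


def delete_issue() -> None:
    """Stand-in for a real API call that the host refuses."""
    calls.append("delete")
    raise PermissionError("issue deletion not permitted")


boundaries = PermissionBoundaries()
first_try = boundaries.attempt("example-org/example-repo", "delete_issue", delete_issue)
second_try = boundaries.attempt("example-org/example-repo", "delete_issue", delete_issue)
```

Both attempts return `False`, but the underlying call runs only once; the second attempt is answered from the recorded boundary.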

Why this matters for us

The current RustChain queue shows the operating principle: 34 open PRs, 105 changed files total, and about 88 additions per PR on average. The work is small, test-backed, and organized by risk surface. That is the token story: use reasoning to choose and prove compact high-signal changes, then let ledgers and cheap checks carry routine tracking.

This is the economic advantage: we can maintain a broad PR queue without treating every open branch as a full new agent session. We watch in parallel, spend selectively, record stop-loss decisions, and keep human approval gates for high-risk actions. The result is not "free automation"; it is a lower-maintenance operating loop that learned how to protect our own attention budget.

high-value model calls, ledger-first maintenance, dedupe before posting, stop-loss for stale work, repo allowlists, bounded PR slices, parallel metadata watch, human-gated risk