Fleet Routing Matrix & Cheap-Model Offload

Recommended model routing for preserving Claude Max headroom while using Codex, Spark, and free APIs aggressively where they fit.

Updated 2026-06-18 · Owner: Analyst · Refreshes weekly (cheap-model-sweep, Wednesdays) · Source: 107,338 real Claude Code requests over 30 days

Bottom line: the primary offload lane is Codex + Spark, not paid APIs.

Zach's v2 backbone is right: Sonnet stays the orchestration/review workhorse, Opus is gated high-effort only, and Codex absorbs bounded execution. The win is Claude Max headroom. Opus and Sonnet share the same Max buckets; moving eligible Sonnet execution to Codex removes it from those buckets and uses currently idle Codex/Spark capacity.

Recommended backbone

Default coordinator · Sonnet · interactive bus/tool orchestration and review
Gated escalation · Opus · high-effort only: hard reasoning, brand, troubleshooting
Primary offload · Codex · bounded implementation, code, mechanical repo work
Small packets · Spark · stateless extraction, cleanup, classifiers, summaries

The earlier Opus-low-effort-default idea is rejected. Lower effort can reduce thinking/output tokens, but the dominant Max pressure is context/cache-read tokens; effort does not remove that context. The James/Skool signal also points to Sonnet-default, Opus behind gates, and Codex for bounded execution.

Headroom lever

Codex capacity remaining · 72% · recent 5h bucket remaining
Codex weekly remaining · 83% · recent weekly bucket remaining
Target offload proxy · ~25% · ~7.1B full-input-equivalent tokens/month moved off Max if boundaries hold

The biggest lever is Sonnet->Codex, not Opus->Sonnet. Opus->Sonnet makes Claude turns lighter. Sonnet->Codex removes high-volume execution turns from the Claude Max bucket entirely.

GLM 5.2 monthly cost if we offloaded normal work

Based on 30-day run-rate · GLM priced at $1.40/M input, $4.40/M output

Low (9% offload) · $3,593 per month
Expected (18% offload) · $7,185 per month
High (27% offload) · $10,679 per month

On a hotter week the run-rate projects higher: expected ~$10,831/mo, up to ~$16,080/mo. "Offload %" is the share of total fleet input tokens routed to GLM; the rest stays on Claude.

Why the naive estimate is a mirage

$58/mo — the trap

If you only count fresh input tokens (20.5M/mo), GLM looks almost free. This is the number that makes cheap-model offload sound like an obvious win.

$7,185/mo — the reality

Every multi-turn request re-sends the full conversation context. That is 27,899M input tokens/mo, of which 27,173M (97%) is cache-read. Claude Max bills that re-sent context at near zero. GLM bills it at the full $1.40/M on every turn, and any per-token API without a cache discount does the same at its own input rate.

The real cost advantage of Claude Max is its caching, not the model. Our work is extremely context-heavy: the median request carries 134k tokens of context, and the 90th percentile carries 726k. Re-billing that context on every turn is what drives the cost on any external API.
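The gap between the two figures comes down to which token count gets multiplied by the input price. The sketch below redoes that arithmetic for input tokens only, using the volumes and the GLM input price quoted above. The function name and the decision to ignore output tokens are assumptions made for illustration, so the result lands near the page's expected figure but does not reproduce it exactly.

```python
# Figures quoted on this page (millions of tokens per month).
FRESH_INPUT_M = 20.5        # fresh, never-cached input
FULL_INPUT_M = 27_899       # full input including re-sent context
CACHE_READ_M = 27_173       # portion of full input that is cache-read
GLM_INPUT_PRICE_PER_M = 1.40  # USD per million input tokens


def input_cost_usd(tokens_m, offload_share, price_per_m=GLM_INPUT_PRICE_PER_M):
    """Input-only cost of routing `offload_share` of `tokens_m` to a per-token API.

    Assumption: no cache discount, so every re-sent token is billed at full price.
    Output tokens are deliberately left out of this sketch.
    """
    return tokens_m * offload_share * price_per_m


cache_read_share = CACHE_READ_M / FULL_INPUT_M      # about 0.97
full_vs_fresh = FULL_INPUT_M / FRESH_INPUT_M        # full input is ~1,360x fresh input
expected_input_only = input_cost_usd(FULL_INPUT_M, 0.18)  # about $7,030

print(f"cache-read share of input: {cache_read_share:.1%}")
print(f"full input vs fresh input: {full_vs_fresh:,.0f}x")
print(f"18% offload, input only:   ${expected_input_only:,.0f}/mo")
```

Counting only fresh input understates the billable volume by roughly three orders of magnitude, which is why the naive estimate looks almost free.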

The corrected offload strategy

Keep on Sonnet / gated Opus

Interactive multi-turn coordination, approvals, prioritization, final judgment, brand-critical voice, and ambiguous troubleshooting. Codex should receive clear briefs from this layer, not replace the live coordinator.

Route to Codex, Spark, and free APIs

Bounded implementation, mechanical file work, tests, lint fixes, small extraction, JSON cleanup, transcript formatting, and public-source summaries. Each packet should have explicit inputs, constraints, output schema, and verification.

The center line: Codex is strong at bounded, well-specified execution and code. It is not a drop-in replacement for Sonnet as the live multi-turn bus/tool coordinator.
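One way to enforce the packet requirements above is to refuse dispatch until every field is filled in. This is a minimal sketch, not the fleet's actual brief format: the `Packet` class, its field names, and the example file path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    """Hypothetical brief handed from the Claude coordinator to Codex or Spark."""
    task: str
    inputs: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    output_schema: dict[str, str] = field(default_factory=dict)
    verification: str = ""

    def missing_fields(self) -> list[str]:
        """Names of the required sections that are still empty."""
        required = {
            "inputs": self.inputs,
            "constraints": self.constraints,
            "output_schema": self.output_schema,
            "verification": self.verification.strip(),
        }
        return [name for name, value in required.items() if not value]

    def is_dispatchable(self) -> bool:
        """A packet is ready only when the task and all four sections are present."""
        return bool(self.task.strip()) and not self.missing_fields()


lint_fix = Packet(
    task="Fix lint errors in the metrics script",
    inputs=["scripts/metrics.py"],            # hypothetical path
    constraints=["no behavior changes"],
    output_schema={"patch": "unified diff"},
    verification="linter exits 0",
)
vague = Packet(task="Clean up the repo")

print(lint_fix.is_dispatchable())  # True
print(vague.missing_fields())      # every required section is missing
```

A brief that fails this check stays with the Sonnet coordinator until it is tightened, which keeps ambiguous work out of the offload lane.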

Routing matrix

Agent × task type × driver × effort

Agent	Task type	Driver	Effort	Now -> target	Rationale
Sage	Live orchestration, approvals, external-action judgment	Sonnet; Opus-gated	Medium; Opus high	Claude-heavy -> Sonnet default	Interactive coordinator stays Claude; Codex gets packets.
Sage	Research prework, table drafts, static artifact edits	Codex-5.5 / Spark-5.3	Low/medium	Sonnet execution -> Codex/Spark	Clear briefs can exit after patch/verify.
Apollo	Queue orchestration, worker coordination, triage decisions	Sonnet; Opus-gated for hard gameplanning	Medium/high	Sonnet -> Sonnet	Stateful multi-agent judgment is not Codex's lane.
Apollo	Worker packet execution, QA, report formatting	Codex-5.5	Medium	Sonnet workers -> Codex workers	Bounded execution moves off Max.
Scribe	Final Zach-facing voice, brand-critical content	Sonnet; Opus-gated final pass	Medium/high	Claude -> Sonnet + rare Opus	Final voice and reputation stay Claude.
Scribe	Outline expansion, transcript cleanup, nonfinal drafts	Spark-5.3 / free API; Codex for files	Low	Sonnet drafting -> Spark/free/Codex	Stateless drafts are cheap and reviewable.
Forge	Implementation, tests, lint fixes, mechanical debugging	Codex-5.5	Medium/high	Claude Code -> Codex majority	Prime offload lane: local code execution with verification.
Forge	Architecture/security review, unclear failures	Sonnet; Opus-gated rare	Medium/high	Mixed Claude -> Sonnet default	High-blast-radius judgment stays Claude.
Mercury	Workflow/config edits, diff application, integration checks	Codex-5.5	Medium	Sonnet -> Codex patches	Bounded repo/config work fits Codex.
Mercury	External automation planning, deployment judgment	Sonnet	Medium	Sonnet -> Sonnet	External actions and coordination stay gated.
Analyst	Audits, cost models, metric scripts, static HTML	Codex-5.5	Medium	Sonnet-heavy -> Codex execution	Evidence/artifact work can be packetized and verified.
Analyst	Fleet recommendations, anomaly interpretation	Sonnet; Opus-gated rare	Medium/high	Claude -> Sonnet + Codex evidence	Final recommendation needs context and judgment.
Librarian	KB mining, corpus extraction, indexing, summaries	Spark-5.3 / free API / Codex-5.5	Low/medium	Sonnet -> Spark/free/Codex	Extraction is often stateless or file-bounded.
Librarian	Source adjudication, routing playbook synthesis	Sonnet	Medium	Sonnet -> Sonnet	Freshness/provenance conflicts need Claude review.
Hippocrates	Non-PHI scripts, freshness checks, docs/reports	Codex-5.5	Medium	Sonnet -> Codex checks	Local bounded checks are safe when privacy guardrails hold.
Hippocrates	PHI-adjacent or clinical/business risk review	Sonnet; Opus-gated only if high-stakes	Medium/high	Claude -> Claude local	Privacy and medical/business stakes override offload.
Felix	Calendar/task hygiene, simple status transforms	Spark-5.3 / Codex-5.5	Low	Sonnet -> Spark/Codex where bounded	Routine transforms should not burn Max context.
Felix	Human-facing prioritization, ambiguous personal planning	Sonnet	Medium	Sonnet -> Sonnet	Preference handling and nuance stay Claude.

Where each agent's Codex/Spark target sits

30-day full-input volume · conservative target share moved off Claude Max

Agent	Requests	Full input (30d)	Cache-read share	Target to Codex/Spark
Sage	24,982	11,001M	99%	8%
Forge	28,653	4,471M	97%	55%
Scribe	9,704	3,179M	94%	20%
Apollo	6,549	2,548M	97%	25%
Mercury	10,334	1,920M	97%	45%
Analyst	6,707	1,802M	98%	50%
Librarian	6,850	924M	96%	45%
Hippocrates	5,149	701M	95%	20%
Felix	3,237	480M	95%	25%

These are not API-cost shares; they are a conservative headroom proxy. The largest practical gain is Forge/Analyst/Mercury/Librarian bounded execution moving from Sonnet to Codex/Spark while Sage/Apollo/Scribe retain interactive judgment.
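The ~25% / ~7.1B headline can be cross-checked by weighting each agent's target share by its 30-day full-input volume. The sketch below uses only the table values above; the variable names are illustrative. The nine listed agents sum to 27,026M tokens, slightly under the 27,899M fleet total, and the remainder is not broken out on this page.

```python
# (full input over 30 days in millions of tokens, target share to Codex/Spark)
AGENTS = {
    "Sage": (11_001, 0.08),
    "Forge": (4_471, 0.55),
    "Scribe": (3_179, 0.20),
    "Apollo": (2_548, 0.25),
    "Mercury": (1_920, 0.45),
    "Analyst": (1_802, 0.50),
    "Librarian": (924, 0.45),
    "Hippocrates": (701, 0.20),
    "Felix": (480, 0.25),
}

# Volume-weighted tokens moved off Claude Max if every target is met.
moved_m = sum(volume * share for volume, share in AGENTS.values())
listed_total_m = sum(volume for volume, _ in AGENTS.values())
blended_share = moved_m / listed_total_m

print(f"moved off Max:  {moved_m:,.0f}M tokens/30d")   # about 7,053M
print(f"listed volume:  {listed_total_m:,}M tokens/30d")
print(f"blended share:  {blended_share:.1%}")          # about 26% of listed volume
```

Because the shares are volume-weighted, Sage's 8% target still moves a meaningful amount of context despite being the lowest share in the table.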

Free-API catalog: single-shot only

Use for public, stateless packets. Never PHI, secrets, credentials, or sensitive strategy.

Provider	Best fleet lane	Use when	Avoid when
NVIDIA NIM	Higher-quality public reasoning/code pre-pass	Need serious bounded synthesis or bug-hunt draft	Quota-critical scheduling or private data
OpenRouter free	Model diversity, code/test drafts, classifiers	Retry-safe public work under daily free caps	Reliability, privacy, or deterministic workflows
Groq free	Fast small packets and speech-to-text	Classification, JSON cleanup, tiny code snippets, Whisper	Large context or large output
Cerebras free trial	Fast medium bug-hunt or reasoning draft	Output is tightly capped	Persistent loops or long generations
Gemini free	Commodity extraction, summaries, multimodal, embeddings	Public/high-volume cleanup	Sensitive data or workflows needing fixed public quota
Cloudflare Workers AI	Edge microtasks, embeddings, classifiers	Task already lives near Cloudflare and needs hard cost caps	Heavy reasoning or unpredictable neuron accounting
Hugging Face providers	Experiments and model availability checks	Occasional probes	Meaningful production volume
Mistral free mode	Mistral-specific coding/doc extraction	Provider diversity or EU lane matters	Quota-critical automation without account-limit checks

Operating rule: one-shot packet, explicit schema, small output cap, one retry, provider quota logging, then fail over or queue.
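The operating rule can be sketched as a small dispatch loop. This is one reading of the rule, assuming the single retry applies per provider before failing over. The `run_packet` function, the provider callables, and the log format are hypothetical stand-ins, not any provider's real client API.

```python
import json


def run_packet(prompt, required_keys, providers, max_output_tokens=512, quota_log=None):
    """Try each provider with one retry, validate the schema, then fail over or queue.

    `providers` is a list of (name, callable) pairs; each callable is a hypothetical
    wrapper taking (prompt, max_output_tokens) and returning raw text.
    """
    quota_log = quota_log if quota_log is not None else []
    for name, call in providers:
        for attempt in (1, 2):  # first try plus exactly one retry
            try:
                data = json.loads(call(prompt, max_output_tokens))
                ok = isinstance(data, dict) and set(required_keys).issubset(data)
            except Exception:  # provider error or unparseable output
                data, ok = None, False
            # Log every call so free-tier quota use stays visible.
            quota_log.append({"provider": name, "attempt": attempt, "ok": ok})
            if ok:
                return {"status": "done", "provider": name, "data": data}
    # Every provider exhausted: queue for later instead of looping.
    return {"status": "queued", "provider": None, "data": None}


# Stand-in providers for illustration only.
def flaky(prompt, max_output_tokens):
    return "not json"


def steady(prompt, max_output_tokens):
    return json.dumps({"label": "invoice", "confidence": 0.9})


log = []
result = run_packet(
    "Classify this public document.",
    {"label"},
    [("flaky", flaky), ("steady", steady)],
    quota_log=log,
)
print(result["status"], result["provider"], len(log))
```

Keeping the retry count fixed and the output cap small bounds the worst-case cost of a bad packet to two calls per provider.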

Model ranking — quality first

Cost/availability are badges, not the sort key

#	Model	Best fleet route	Cost
1	z-ai/glm-5.2	Serious reasoning, long-context synthesis, code-review pre-pass	$1.40/$4.40 per M
2	qwen/qwen3.7-plus	Agentic execution drafts, queue triage, multi-step planning	$0.32/$1.28 per M
3	moonshotai/kimi-k2.7-code	Coding agents, implementation sketches, patch-plan drafts	$0.74/$3.50 per M
4	nvidia/nemotron-3-ultra (free)	Long-context public-doc summarization, plan expansion	free
5	google/gemma-4-31b (free)	Bounded code/test drafts, public doc extraction, rewrites	free
6	cohere/north-mini-code (free)	Small code edits, test stubs, lint-fix suggestions	free
7	google/gemini-3.1-flash-lite	High-volume summaries, transcript cleanup, extraction	$0.25/$1.50 per M
8	anthropic/claude-haiku-4.5	Claude-compatible low-risk execution	$1/$5 per M

For single-shot offload, prefer the free tier first (Nemotron, Gemma). Reach for cheap-paid only when free-model retries cost more time than they save. Reserve Claude Opus/Sonnet for frontier orchestration.

Execution routing — what goes where

Work type	Default route	Escalate to Claude when
Public web/source triage	gemini-flash-lite or nemotron (free)	medical/legal/financial stakes, final recommendation
First-pass code	Codex-5.5; optional Spark/free model for small draft snippets	unclear architecture, security-sensitive code
Code review pre-pass	Codex-5.5 first; compressed GLM/free bug-hunt packet only as auxiliary	auth/session logic, data deletion, prod deploy
Test generation	Codex-5.5; Spark/free API for trivial stubs	deep domain fixtures, brittle async behavior
Transcript cleanup	gemini-flash-lite or deepseek-v3	voice nuance or Zach-facing brand copy
Classify / JSON extract	gpt-oss-120b (free), gemini-flash-lite	schema errors persist after one retry
Frontier orchestration	Claude Opus/Sonnet only	always — cheap models draft, never own

Hard rule: never send secrets, PHI, credentials, or private patient data through free providers (some log prompts). And every cheap-model output gets a Claude/Codex verify step before it counts.
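The execution routing table and the hard rule can be combined into a single lookup. The sketch below is illustrative only: the work-type keys, the `route` function, and the choice to send anything flagged sensitive straight to Claude are assumptions (the last is a deliberately conservative reading of the hard rule), not an existing fleet API.

```python
CLAUDE = "Claude Opus/Sonnet"

# Default routes copied from the table above; the keys are hypothetical identifiers.
DEFAULT_ROUTES = {
    "public_source_triage": "gemini-flash-lite or nemotron (free)",
    "first_pass_code": "Codex-5.5",
    "code_review_prepass": "Codex-5.5",
    "test_generation": "Codex-5.5",
    "transcript_cleanup": "gemini-flash-lite or deepseek-v3",
    "classify_json_extract": "gpt-oss-120b (free) or gemini-flash-lite",
    "frontier_orchestration": CLAUDE,
}

# Work types whose default route is a cheap or free model and so needs a verify step.
CHEAP_MODEL_WORK = {"public_source_triage", "transcript_cleanup", "classify_json_extract"}


def route(work_type, escalate=False, sensitive=False):
    """Pick a route; `escalate` means an 'Escalate to Claude when' condition applies."""
    if work_type not in DEFAULT_ROUTES:
        raise ValueError(f"unknown work type: {work_type}")
    if escalate or sensitive:
        # Assumption: secrets, PHI, and credentials never leave the Claude lane.
        return {"route": CLAUDE, "verify": False}
    return {"route": DEFAULT_ROUTES[work_type], "verify": work_type in CHEAP_MODEL_WORK}


print(route("classify_json_extract"))
print(route("classify_json_extract", sensitive=True))
print(route("first_pass_code", escalate=True))
```

Checking the sensitivity flag before the table lookup means a misclassified work type cannot leak private data to a free provider.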

Cost model computed by Analyst from ~/.claude/projects transcripts (107,338 unique 30-day requests, deduped by request id). The GLM estimate treats cache-read tokens as full-price re-sent input, since OpenRouter has no Anthropic-style cache discount. Living source: orgs/riles-fleet/shared/cheap-model-offload-ranking.md. This page refreshes with the weekly cheap-model-sweep.