Fleet Routing Matrix & Cheap-Model Offload

Recommended model routing for preserving Claude Max headroom while using Codex, Spark, and free APIs aggressively where they fit.
Updated 2026-06-18 · Owner: Analyst · Refreshes weekly (cheap-model-sweep, Wednesdays) · Source: 107,338 real Claude Code requests over 30 days

Bottom line: the primary offload lane is Codex + Spark, not paid APIs.

Zach's v2 backbone is right: Sonnet stays the orchestration/review workhorse, Opus is gated high-effort only, and Codex absorbs bounded execution. The win is Claude Max headroom. Opus and Sonnet share the same Max buckets; moving eligible Sonnet execution to Codex removes it from those buckets and uses currently idle Codex/Spark capacity.

Recommended backbone

Default coordinator: Sonnet. Interactive bus/tool orchestration and review.
Gated escalation: Opus. High-effort only: hard reasoning, brand, troubleshooting.
Primary offload: Codex. Bounded implementation, code, mechanical repo work.
Small packets: Spark. Stateless extraction, cleanup, classifiers, summaries.

The earlier Opus-low-effort-default idea is rejected. Lower effort can reduce thinking/output tokens, but the dominant Max pressure is context/cache-read tokens; effort does not remove that context. The James/Skool signal also points to Sonnet-default, Opus behind gates, and Codex for bounded execution.

Headroom lever

Codex capacity remaining: 72% (recent 5h bucket remaining)
Codex weekly remaining: 83% (recent weekly bucket remaining)
Target offload proxy: ~25% (~7.1B full-input-equivalent tokens/month moved off Max if boundaries hold)

The biggest lever is Sonnet->Codex, not Opus->Sonnet. Opus->Sonnet makes Claude turns lighter. Sonnet->Codex removes high-volume execution turns from the Claude Max bucket entirely.

GLM 5.2 monthly cost if we offloaded normal work

Based on 30-day run-rate · GLM priced at $1.40/M input, $4.40/M output

Low (9% offload): $3,593 per month
Expected (18% offload): $7,185 per month
High (27% offload): $10,679 per month

On a hotter week the run-rate projects higher: expected ~$10,831/mo, up to ~$16,080/mo. "Offload %" is the share of total fleet input tokens routed to GLM; the rest stays on Claude.
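The scenario figures above can be roughly reproduced from two numbers on this page: the $1.40/M GLM input price and the 27,899M monthly full-input volume reported in the next section. The sketch below is input-only, so it lands slightly under the quoted totals; the gap is presumably output tokens billed at $4.40/M, whose volume is not stated here (that reading is an assumption, not something the page confirms). The function name is illustrative.

```python
# Input-only approximation of the GLM offload scenarios.
GLM_INPUT_USD_PER_M = 1.40       # from the pricing line above
FULL_INPUT_M_PER_MONTH = 27_899  # monthly input incl. cache-read, in millions of tokens


def input_only_cost(offload_share: float) -> float:
    """Monthly USD for re-billing the offloaded share of input at GLM's input price."""
    return FULL_INPUT_M_PER_MONTH * offload_share * GLM_INPUT_USD_PER_M


for label, share in [("low", 0.09), ("expected", 0.18), ("high", 0.27)]:
    print(f"{label:>8} ({share:.0%} offload): ${input_only_cost(share):,.0f}/mo, input only")
```

Input alone accounts for roughly $3,515, $7,031, and $10,546 per month across the three scenarios, which is why the offload share of input tokens dominates the estimate.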

Why the naive estimate is a mirage

$58/mo — the trap

If you only count fresh input tokens (20.5M/mo), GLM looks almost free. This is the number that makes cheap-model offload sound like an obvious win.

$7,185/mo — the reality

Every multi-turn request re-sends the full conversation context. That is 27,899M input tokens/mo, of which 27,173M (97%) is cache-read. Claude Max bills that re-sent context at near zero. GLM bills it at the full $1.40/M on every turn, and any per-token API without a comparable cache discount re-bills it the same way at its own input rate.

The real cost advantage of Claude Max is its caching, not the model. Our work is extremely context-heavy: the median request carries 134k tokens of context, and the 90th percentile carries 726k. Re-billing that per turn on any external API is what drives the cost.
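To make the per-turn re-billing concrete, here is a minimal sketch that prices a single turn's re-sent context at the GLM input rate. The 20-turn session is a hypothetical illustration, and it assumes context size stays constant across turns, which understates real sessions where context grows.

```python
# Cost of re-sending conversation context on a per-token API, one turn at a time.
GLM_INPUT_USD_PER_M = 1.40  # USD per million input tokens, from this page


def resend_cost_usd(context_tokens: int, price_per_m: float = GLM_INPUT_USD_PER_M) -> float:
    """USD billed for one turn that re-sends `context_tokens` of input."""
    return context_tokens / 1_000_000 * price_per_m


median_turn = resend_cost_usd(134_000)  # median request context
p90_turn = resend_cost_usd(726_000)     # 90th-percentile request context

# Hypothetical 20-turn session at a constant median context size.
session_20_turns = 20 * median_turn

print(f"median turn: ${median_turn:.4f}")
print(f"p90 turn:    ${p90_turn:.4f}")
print(f"20 turns at median context: ${session_20_turns:.2f}")
```

A median turn costs about $0.19 and a 90th-percentile turn about $1.02 in input alone, before any output tokens are generated.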

The corrected offload strategy

Keep on Sonnet / gated Opus

Interactive multi-turn coordination, approvals, prioritization, final judgment, brand-critical voice, and ambiguous troubleshooting. Codex should receive clear briefs from this layer, not replace the live coordinator.

Route to Codex, Spark, and free APIs

Bounded implementation, mechanical file work, tests, lint fixes, small extraction, JSON cleanup, transcript formatting, and public-source summaries. Each packet should have explicit inputs, constraints, output schema, and verification.
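One way to enforce the packet requirement is to refuse dispatch when any of the four parts is empty. The sketch below is illustrative only: the Packet class, its field names, and missing_fields are hypothetical and not part of any existing fleet tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical offload packet: everything a Codex/Spark worker needs, nothing implicit."""
    task: str
    inputs: tuple[str, ...]       # explicit files, snippets, or URLs
    constraints: tuple[str, ...]  # e.g. "touch only src/", "no network"
    output_schema: dict           # expected shape of the result
    verification: str             # how the result is checked, e.g. a test command

    def missing_fields(self) -> list[str]:
        """Names of required parts that are empty; an empty list means dispatchable."""
        required = {
            "inputs": self.inputs,
            "constraints": self.constraints,
            "output_schema": self.output_schema,
            "verification": self.verification.strip(),
        }
        return [name for name, value in required.items() if not value]


draft = Packet(
    task="fix lint errors",
    inputs=("src/app.py",),
    constraints=("no behavior changes",),
    output_schema={"type": "object", "required": ["patch"]},
    verification="",
)
print(draft.missing_fields())
```

A packet that reports missing fields would go back to the Sonnet coordinator for a clearer brief rather than out to a worker.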

The center line: Codex is strong at bounded, well-specified execution and code. It is not a drop-in replacement for Sonnet as the live multi-turn bus/tool coordinator.

Routing matrix

Agent x task-type x driver x effort

Agent | Task type | Driver | Effort | Now -> target | Rationale
Sage | Live orchestration, approvals, external-action judgment | Sonnet; Opus-gated | Medium; Opus high | Claude-heavy -> Sonnet default | Interactive coordinator stays Claude; Codex gets packets.
Sage | Research prework, table drafts, static artifact edits | Codex-5.5 / Spark-5.3 | Low/medium | Sonnet execution -> Codex/Spark | Clear briefs can exit after patch/verify.
Apollo | Queue orchestration, worker coordination, triage decisions | Sonnet; Opus-gated for hard gameplanning | Medium/high | Sonnet -> Sonnet | Stateful multi-agent judgment is not Codex's lane.
Apollo | Worker packet execution, QA, report formatting | Codex-5.5 | Medium | Sonnet workers -> Codex workers | Bounded execution moves off Max.
Scribe | Final Zach-facing voice, brand-critical content | Sonnet; Opus-gated final pass | Medium/high | Claude -> Sonnet + rare Opus | Final voice and reputation stay Claude.
Scribe | Outline expansion, transcript cleanup, nonfinal drafts | Spark-5.3 / free API; Codex for files | Low | Sonnet drafting -> Spark/free/Codex | Stateless drafts are cheap and reviewable.
Forge | Implementation, tests, lint fixes, mechanical debugging | Codex-5.5 | Medium/high | Claude Code -> Codex majority | Prime offload lane: local code execution with verification.
Forge | Architecture/security review, unclear failures | Sonnet; Opus-gated rare | Medium/high | Mixed Claude -> Sonnet default | High-blast-radius judgment stays Claude.
Mercury | Workflow/config edits, diff application, integration checks | Codex-5.5 | Medium | Sonnet -> Codex patches | Bounded repo/config work fits Codex.
Mercury | External automation planning, deployment judgment | Sonnet | Medium | Sonnet -> Sonnet | External actions and coordination stay gated.
Analyst | Audits, cost models, metric scripts, static HTML | Codex-5.5 | Medium | Sonnet-heavy -> Codex execution | Evidence/artifact work can be packetized and verified.
Analyst | Fleet recommendations, anomaly interpretation | Sonnet; Opus-gated rare | Medium/high | Claude -> Sonnet + Codex evidence | Final recommendation needs context and judgment.
Librarian | KB mining, corpus extraction, indexing, summaries | Spark-5.3 / free API / Codex-5.5 | Low/medium | Sonnet -> Spark/free/Codex | Extraction is often stateless or file-bounded.
Librarian | Source adjudication, routing playbook synthesis | Sonnet | Medium | Sonnet -> Sonnet | Freshness/provenance conflicts need Claude review.
Hippocrates | Non-PHI scripts, freshness checks, docs/reports | Codex-5.5 | Medium | Sonnet -> Codex checks | Local bounded checks are safe when privacy guardrails hold.
Hippocrates | PHI-adjacent or clinical/business risk review | Sonnet; Opus-gated only if high-stakes | Medium/high | Claude -> Claude local | Privacy and medical/business stakes override offload.
Felix | Calendar/task hygiene, simple status transforms | Spark-5.3 / Codex-5.5 | Low | Sonnet -> Spark/Codex where bounded | Routine transforms should not burn Max context.
Felix | Human-facing prioritization, ambiguous personal planning | Sonnet | Medium | Sonnet -> Sonnet | Preference handling and nuance stay Claude.

Where each agent's Codex/Spark target sits

30-day full-input volume · conservative target share moved off Claude Max

Agent | Requests | Full input (30d) | Cache-read share | Target to Codex/Spark
Sage | 24,982 | 11,001M | 99% | 8%
Forge | 28,653 | 4,471M | 97% | 55%
Scribe | 9,704 | 3,179M | 94% | 20%
Apollo | 6,549 | 2,548M | 97% | 25%
Mercury | 10,334 | 1,920M | 97% | 45%
Analyst | 6,707 | 1,802M | 98% | 50%
Librarian | 6,850 | 924M | 96% | 45%
Hippocrates | 5,149 | 701M | 95% | 20%
Felix | 3,237 | 480M | 95% | 25%

These are not API-cost shares; they are a conservative headroom proxy. The largest practical gain is Forge/Analyst/Mercury/Librarian bounded execution moving from Sonnet to Codex/Spark while Sage/Apollo/Scribe retain interactive judgment.
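The ~25% / ~7.1B headline proxy can be cross-checked by weighting each agent's 30-day full input by its target share. This is a minimal sketch using only the numbers in the table above plus the 27,899M fleet total quoted earlier; the variable names are illustrative. The nine listed agents sum to slightly less than that fleet total, so the share is computed against the fleet figure.

```python
# Weighted offload proxy: sum(full_input * target_share) across agents.
# Values are (full input over 30d in millions of tokens, target share to Codex/Spark).
AGENTS = {
    "Sage": (11_001, 0.08),
    "Forge": (4_471, 0.55),
    "Scribe": (3_179, 0.20),
    "Apollo": (2_548, 0.25),
    "Mercury": (1_920, 0.45),
    "Analyst": (1_802, 0.50),
    "Librarian": (924, 0.45),
    "Hippocrates": (701, 0.20),
    "Felix": (480, 0.25),
}
FLEET_FULL_INPUT_M = 27_899  # fleet-wide monthly full input quoted on this page

moved_m = sum(full_input * share for full_input, share in AGENTS.values())
fleet_share = moved_m / FLEET_FULL_INPUT_M

print(f"moved off Max: {moved_m:,.0f}M tokens (~{moved_m / 1000:.1f}B)")
print(f"share of fleet full input: {fleet_share:.1%}")

# Largest contributors first, to show where the headroom actually comes from.
for name, (full_input, share) in sorted(
    AGENTS.items(), key=lambda kv: kv[1][0] * kv[1][1], reverse=True
):
    print(f"{name:<12} {full_input * share:>8,.0f}M")
```

The total comes to about 7,053M tokens, or roughly 25% of fleet input. Forge alone contributes about 2,459M of it, more than a third of the total.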

Free-API catalog: single-shot only

Use for public, stateless packets. Never PHI, secrets, credentials, or sensitive strategy.

Provider | Best fleet lane | Use when | Avoid when
NVIDIA NIM | Higher-quality public reasoning/code pre-pass | Need serious bounded synthesis or bug-hunt draft | Quota-critical scheduling or private data
OpenRouter free | Model diversity, code/test drafts, classifiers | Retry-safe public work under daily free caps | Reliability, privacy, or deterministic workflows
Groq free | Fast small packets and speech-to-text | Classification, JSON cleanup, tiny code snippets, Whisper | Large context or large output
Cerebras free trial | Fast medium bug-hunt or reasoning draft | Output is tightly capped | Persistent loops or long generations
Gemini free | Commodity extraction, summaries, multimodal, embeddings | Public/high-volume cleanup | Sensitive data or workflows needing fixed public quota
Cloudflare Workers AI | Edge microtasks, embeddings, classifiers | Task already lives near Cloudflare and needs hard cost caps | Heavy reasoning or unpredictable neuron accounting
Hugging Face providers | Experiments and model availability checks | Occasional probes | Meaningful production volume
Mistral free mode | Mistral-specific coding/doc extraction | Provider diversity or EU lane matters | Quota-critical automation without account-limit checks

Operating rule: one-shot packet, explicit schema, small output cap, one retry, provider quota logging, then fail over or queue.
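The operating rule can be sketched as a small dispatch loop. This is an assumption-laden illustration, not the fleet's actual dispatcher: the provider callables, run_packet, and the logger name are hypothetical, each provider is modeled as a plain function taking a prompt and an output cap, and the per-attempt log lines stand in for real quota logging.

```python
import json
import logging
from typing import Callable, Optional

log = logging.getLogger("free_api_offload")  # hypothetical logger name

# Hypothetical provider signature: (prompt, max_output_tokens) -> raw response text.
Provider = Callable[[str, int], str]


def run_packet(
    prompt: str,
    required_keys: set,
    providers: list,
    max_output_tokens: int = 512,  # small output cap
    retries: int = 1,              # one retry per provider
) -> Optional[dict]:
    """Try each (name, provider) in order; return parsed JSON, or None so the caller can queue."""
    for name, call in providers:
        for attempt in range(1, retries + 2):
            try:
                data = json.loads(call(prompt, max_output_tokens))
            except Exception as exc:  # network errors, quota errors, invalid JSON
                log.warning("provider=%s attempt=%d failed: %s", name, attempt, exc)
                continue
            # Explicit schema check: here just "is an object with the required keys".
            if isinstance(data, dict) and required_keys.issubset(data):
                log.info("provider=%s attempt=%d ok", name, attempt)
                return data
            log.warning("provider=%s attempt=%d schema mismatch", name, attempt)
        log.warning("provider=%s exhausted, failing over", name)
    return None
```

A caller would pass providers in preference order and push the packet onto a queue when the function returns None, so that no single free-tier outage blocks the workflow.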

Model ranking — quality first

Cost/availability are badges, not the sort key

# | Model | Best fleet route | Cost (input/output, USD per M tokens)
1 | z-ai/glm-5.2 | Serious reasoning, long-context synthesis, code-review pre-pass | $1.40/$4.40
2 | qwen/qwen3.7-plus | Agentic execution drafts, queue triage, multi-step planning | $0.32/$1.28
3 | moonshotai/kimi-k2.7-code | Coding agents, implementation sketches, patch-plan drafts | $0.74/$3.50
4 | nvidia/nemotron-3-ultra (free) | Long-context public-doc summarization, plan expansion | free
5 | google/gemma-4-31b (free) | Bounded code/test drafts, public doc extraction, rewrites | free
6 | cohere/north-mini-code (free) | Small code edits, test stubs, lint-fix suggestions | free
7 | google/gemini-3.1-flash-lite | High-volume summaries, transcript cleanup, extraction | $0.25/$1.50
8 | anthropic/claude-haiku-4.5 | Claude-compatible low-risk execution |

For single-shot offload, prefer the free tier first (Nemotron, Gemma). Reach for cheap-paid only when free-model retries cost more time than they save. Reserve Claude Opus/Sonnet for frontier orchestration.

Execution routing — what goes where

Work type | Default route | Escalate to Claude when
Public web/source triage | gemini-flash-lite or nemotron (free) | medical/legal/financial stakes, final recommendation
First-pass code | Codex-5.5; optional Spark/free model for small draft snippets | unclear architecture, security-sensitive code
Code review pre-pass | Codex-5.5 first; compressed GLM/free bug-hunt packet only as auxiliary | auth/session logic, data deletion, prod deploy
Test generation | Codex-5.5; Spark/free API for trivial stubs | deep domain fixtures, brittle async behavior
Transcript cleanup | gemini-flash-lite or deepseek-v3 | voice nuance or Zach-facing brand copy
Classify / JSON extract | gpt-oss-120b (free), gemini-flash-lite | schema errors persist after one retry
Frontier orchestration | Claude Opus/Sonnet only | always: cheap models draft, never own

Hard rule: never send secrets, PHI, credentials, or private patient data through free providers (some log prompts). And every cheap-model output gets a Claude/Codex verify step before it counts.
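The two halves of the hard rule can be expressed as a pair of gates: one before a packet leaves for a free provider, one before its output is accepted. The sketch assumes packets are already tagged upstream for sensitivity; the tag names, CheapOutput, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sensitivity tags, mirroring the categories named in the hard rule.
SENSITIVE_TAGS = {"secrets", "phi", "credentials", "patient_data"}
VERIFIERS = {"claude", "codex"}


@dataclass
class CheapOutput:
    """Result from a cheap or free model, with who (if anyone) verified it."""
    text: str
    verified_by: Optional[str] = None


def may_use_free_provider(packet_tags: set) -> bool:
    """Gate 1: a packet carrying any sensitive tag never goes to a free provider."""
    return SENSITIVE_TAGS.isdisjoint(packet_tags)


def counts(output: CheapOutput) -> bool:
    """Gate 2: cheap-model output only counts after a Claude or Codex verify step."""
    return output.verified_by in VERIFIERS


print(may_use_free_provider({"public", "transcript"}))
print(may_use_free_provider({"phi"}))
print(counts(CheapOutput("draft summary")))
print(counts(CheapOutput("draft summary", verified_by="codex")))
```

Tag-based gating is only as good as the tagging step, so untagged packets are safest treated as sensitive by default.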

Cost model computed by Analyst from ~/.claude/projects transcripts (107,338 unique 30-day requests, deduped by request id). The GLM estimate bills cache-read tokens as fully re-sent input, since OpenRouter has no Anthropic-style cache discount. Living source: orgs/riles-fleet/shared/cheap-model-offload-ranking.md. This page refreshes with the weekly cheap-model-sweep.