Bottom line: the primary offload lane is Codex + Spark, not paid APIs.
Zach's v2 backbone is right: Sonnet stays the orchestration/review workhorse, Opus is gated to high-effort work only, and Codex absorbs bounded execution. The win is Claude Max headroom: Opus and Sonnet share the same Max buckets, so moving eligible Sonnet execution to Codex removes it from those buckets and uses currently idle Codex/Spark capacity.
The earlier Opus-low-effort-default idea is rejected. Lower effort can reduce thinking/output tokens, but the dominant Max pressure is context/cache-read tokens; effort does not remove that context. The James/Skool signal also points to Sonnet-default, Opus behind gates, and Codex for bounded execution.
The biggest lever is Sonnet->Codex, not Opus->Sonnet. Opus->Sonnet makes Claude turns lighter. Sonnet->Codex removes high-volume execution turns from the Claude Max bucket entirely.
On a hotter week the run-rate projects higher: expected ~$10,831/mo, up to ~$16,080/mo. "Offload %" is the share of total fleet input tokens routed to GLM; the rest stays on Claude.
$58/mo — the trap
If you count only fresh input tokens (20.5M/mo), GLM looks almost free. This is the number that makes cheap-model offload sound like an obvious win.
$7,185/mo — the reality
Every multi-turn request re-sends the full conversation context. That is 27,899M input tokens/mo, of which 27,173M (97%) is cache-read. Claude Max bills that re-sent context at near zero. GLM (and every per-token API) bills it at the full $1.40/M input rate on every turn.
The real cost advantage of Claude Max is its caching, not the model. Our work is extremely context-heavy: the median request carries 134k tokens of context, and the 90th percentile carries 726k. Re-billing that context on every turn is what drives the cost on any external API.
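The gap between the two views can be checked with a few lines of arithmetic. This is a minimal sketch using only the token counts and the GLM input price quoted above; the function name and the `offload_share` parameter are illustrative assumptions. It covers input tokens only and does not try to reproduce the $58 or $7,185 headline figures, which depend on an offload share and output-token volumes not stated in this section.

```python
# Figures quoted in the text (monthly, in millions of tokens).
FRESH_INPUT_M = 20.5         # fresh input tokens only
FULL_INPUT_M = 27_899        # all input, including re-sent context
CACHE_READ_M = 27_173        # portion of full input that is cache-read
GLM_INPUT_USD_PER_M = 1.40   # GLM input price per million tokens


def input_cost_usd(tokens_m: float, offload_share: float = 1.0) -> float:
    """Input-only cost of routing `offload_share` of the tokens to a per-token API.

    `offload_share` is a hypothetical knob (0.0 to 1.0) for illustration.
    """
    if not 0.0 <= offload_share <= 1.0:
        raise ValueError("offload_share must be between 0 and 1")
    return tokens_m * offload_share * GLM_INPUT_USD_PER_M


cache_read_share = CACHE_READ_M / FULL_INPUT_M
fresh_only = input_cost_usd(FRESH_INPUT_M)    # the "trap" view
full_context = input_cost_usd(FULL_INPUT_M)   # every re-sent token billed

print(f"cache-read share of input: {cache_read_share:.1%}")
print(f"fresh-only input cost, 100% offload:   ${fresh_only:,.2f}/mo")
print(f"full-context input cost, 100% offload: ${full_context:,.2f}/mo")
print(f"full-context / fresh-only: {full_context / fresh_only:,.0f}x")
```

The ratio is the point: once cache-read tokens are billed as ordinary input, the fresh-token estimate understates input cost by roughly three orders of magnitude.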
Keep on Sonnet / gated Opus
Interactive multi-turn coordination, approvals, prioritization, final judgment, brand-critical voice, and ambiguous troubleshooting. Codex should receive clear briefs from this layer, not replace the live coordinator.
Route to Codex, Spark, and free APIs
Bounded implementation, mechanical file work, tests, lint fixes, small extraction, JSON cleanup, transcript formatting, and public-source summaries. Each packet should have explicit inputs, constraints, output schema, and verification.
The center line: Codex is strong at bounded, well-specified execution and code. It is not a drop-in replacement for Sonnet as the live multi-turn bus/tool coordinator.
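A packet with explicit inputs, constraints, output schema, and verification could look something like the sketch below. The `Packet` class, its field names, and the example file path are all hypothetical; nothing here reflects an existing fleet schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical shape for a bounded work packet handed to Codex/Spark."""

    task: str
    inputs: tuple[str, ...]        # explicit files or sources the worker may read
    constraints: tuple[str, ...]   # what the worker must not change or do
    output_schema: dict[str, str]  # field name -> expected content
    verification: tuple[str, ...]  # checks that prove the work is done

    def missing_fields(self) -> list[str]:
        """Names of required sections that were left empty."""
        required = {
            "inputs": self.inputs,
            "constraints": self.constraints,
            "output_schema": self.output_schema,
            "verification": self.verification,
        }
        return [name for name, value in required.items() if not value]

    def is_bounded(self) -> bool:
        """A packet is ready to route only when every section is filled in."""
        return not self.missing_fields()


# Example: a brief that forgot to say how the result will be verified.
packet = Packet(
    task="Fix lint failures in the metrics script",
    inputs=("scripts/metrics.py",),  # hypothetical path
    constraints=("Do not change public function signatures",),
    output_schema={"patch": "unified diff", "summary": "one paragraph"},
    verification=(),
)
print(packet.missing_fields())
```

A coordinator could refuse to dispatch any packet where `is_bounded()` is false, which keeps vague briefs on the Sonnet side until they are sharpened.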
| Agent | Task type | Driver | Effort | Now -> target | Rationale |
|---|---|---|---|---|---|
| Sage | Live orchestration, approvals, external-action judgment | Sonnet; Opus-gated | Medium; Opus high | Claude-heavy -> Sonnet default | Interactive coordinator stays Claude; Codex gets packets. |
| Sage | Research prework, table drafts, static artifact edits | Codex-5.5 / Spark-5.3 | Low/medium | Sonnet execution -> Codex/Spark | Clear briefs can exit after patch/verify. |
| Apollo | Queue orchestration, worker coordination, triage decisions | Sonnet; Opus-gated for hard gameplanning | Medium/high | Sonnet -> Sonnet | Stateful multi-agent judgment is not Codex's lane. |
| Apollo | Worker packet execution, QA, report formatting | Codex-5.5 | Medium | Sonnet workers -> Codex workers | Bounded execution moves off Max. |
| Scribe | Final Zach-facing voice, brand-critical content | Sonnet; Opus-gated final pass | Medium/high | Claude -> Sonnet + rare Opus | Final voice and reputation stay Claude. |
| Scribe | Outline expansion, transcript cleanup, nonfinal drafts | Spark-5.3 / free API; Codex for files | Low | Sonnet drafting -> Spark/free/Codex | Stateless drafts are cheap and reviewable. |
| Forge | Implementation, tests, lint fixes, mechanical debugging | Codex-5.5 | Medium/high | Claude Code -> Codex majority | Prime offload lane: local code execution with verification. |
| Forge | Architecture/security review, unclear failures | Sonnet; Opus-gated rare | Medium/high | Mixed Claude -> Sonnet default | High-blast-radius judgment stays Claude. |
| Mercury | Workflow/config edits, diff application, integration checks | Codex-5.5 | Medium | Sonnet -> Codex patches | Bounded repo/config work fits Codex. |
| Mercury | External automation planning, deployment judgment | Sonnet | Medium | Sonnet -> Sonnet | External actions and coordination stay gated. |
| Analyst | Audits, cost models, metric scripts, static HTML | Codex-5.5 | Medium | Sonnet-heavy -> Codex execution | Evidence/artifact work can be packetized and verified. |
| Analyst | Fleet recommendations, anomaly interpretation | Sonnet; Opus-gated rare | Medium/high | Claude -> Sonnet + Codex evidence | Final recommendation needs context and judgment. |
| Librarian | KB mining, corpus extraction, indexing, summaries | Spark-5.3 / free API / Codex-5.5 | Low/medium | Sonnet -> Spark/free/Codex | Extraction is often stateless or file-bounded. |
| Librarian | Source adjudication, routing playbook synthesis | Sonnet | Medium | Sonnet -> Sonnet | Freshness/provenance conflicts need Claude review. |
| Hippocrates | Non-PHI scripts, freshness checks, docs/reports | Codex-5.5 | Medium | Sonnet -> Codex checks | Local bounded checks are safe when privacy guardrails hold. |
| Hippocrates | PHI-adjacent or clinical/business risk review | Sonnet; Opus-gated only if high-stakes | Medium/high | Claude -> Claude local | Privacy and medical/business stakes override offload. |
| Felix | Calendar/task hygiene, simple status transforms | Spark-5.3 / Codex-5.5 | Low | Sonnet -> Spark/Codex where bounded | Routine transforms should not burn Max context. |
| Felix | Human-facing prioritization, ambiguous personal planning | Sonnet | Medium | Sonnet -> Sonnet | Preference handling and nuance stay Claude. |
| Agent | Requests (30d) | Full input tokens (30d) | Cache-read share | Target share to Codex/Spark |
|---|---|---|---|---|
| Sage | 24,982 | 11,001M | 99% | 8% |
| Forge | 28,653 | 4,471M | 97% | 55% |
| Scribe | 9,704 | 3,179M | 94% | 20% |
| Apollo | 6,549 | 2,548M | 97% | 25% |
| Mercury | 10,334 | 1,920M | 97% | 45% |
| Analyst | 6,707 | 1,802M | 98% | 50% |
| Librarian | 6,850 | 924M | 96% | 45% |
| Hippocrates | 5,149 | 701M | 95% | 20% |
| Felix | 3,237 | 480M | 95% | 25% |
The target percentages are not API-cost shares; they are a conservative headroom proxy. The largest practical gain comes from moving Forge/Analyst/Mercury/Librarian bounded execution from Sonnet to Codex/Spark while Sage/Apollo/Scribe retain interactive judgment.
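One way to read the per-agent targets as a single fleet number is to weight each target by that agent's full input volume. Weighting by input tokens is an assumption made here for illustration (weighting by request count would give a different blend); the rows are copied from the table above.

```python
# (agent, full 30-day input in millions of tokens, target share to Codex/Spark)
ROWS = [
    ("Sage", 11_001, 0.08),
    ("Forge", 4_471, 0.55),
    ("Scribe", 3_179, 0.20),
    ("Apollo", 2_548, 0.25),
    ("Mercury", 1_920, 0.45),
    ("Analyst", 1_802, 0.50),
    ("Librarian", 924, 0.45),
    ("Hippocrates", 701, 0.20),
    ("Felix", 480, 0.25),
]

total_input_m = sum(tokens for _, tokens, _ in ROWS)
moved_by_agent = {agent: tokens * share for agent, tokens, share in ROWS}
moved_m = sum(moved_by_agent.values())
blended_share = moved_m / total_input_m

print(f"total input across listed agents: {total_input_m:,}M")
print(f"input moved at target shares:     {moved_m:,.0f}M")
print(f"input-weighted blended target:    {blended_share:.1%}")

# Rank agents by how much input volume their target would move.
for agent, moved in sorted(moved_by_agent.items(), key=lambda kv: -kv[1]):
    print(f"{agent:<12} {moved:>8,.0f}M  ({moved / moved_m:.0%} of moved volume)")
```

Under this weighting the blended target comes to roughly 26% of the listed agents' input, with Forge alone accounting for the largest slice of moved volume.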
| Provider | Best fleet lane | Use when | Avoid when |
|---|---|---|---|
| NVIDIA NIM | Higher-quality public reasoning/code pre-pass | Need serious bounded synthesis or bug-hunt draft | Quota-critical scheduling or private data |
| OpenRouter free | Model diversity, code/test drafts, classifiers | Retry-safe public work under daily free caps | Reliability, privacy, or deterministic workflows |
| Groq free | Fast small packets and speech-to-text | Classification, JSON cleanup, tiny code snippets, Whisper | Large context or large output |
| Cerebras free trial | Fast medium bug-hunt or reasoning draft | Output is tightly capped | Persistent loops or long generations |
| Gemini free | Commodity extraction, summaries, multimodal, embeddings | Public/high-volume cleanup | Sensitive data or workflows needing fixed public quota |
| Cloudflare Workers AI | Edge microtasks, embeddings, classifiers | Task already lives near Cloudflare and needs hard cost caps | Heavy reasoning or unpredictable neuron accounting |
| Hugging Face providers | Experiments and model availability checks | Occasional probes | Meaningful production volume |
| Mistral free mode | Mistral-specific coding/doc extraction | Provider diversity or EU lane matters | Quota-critical automation without account-limit checks |
Operating rule: one-shot packet, explicit schema, small output cap, one retry, provider quota logging, then fail over or queue.
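The operating rule can be sketched as a small dispatcher. Providers are modelled as plain callables with an assumed `(prompt, max_output_tokens) -> str` signature, which is not any vendor's real API; `run_packet`, `parses_with_keys`, and the two stand-in providers are hypothetical names. Real quota logging would read provider-specific counters, which this sketch replaces with a log line per attempt.

```python
import json
import logging
from typing import Callable, Optional, Sequence, Tuple

logger = logging.getLogger("offload")

# Assumed provider shape for illustration: (prompt, max_output_tokens) -> raw text.
Provider = Callable[[str, int], str]


def parses_with_keys(raw: str, required_keys: Sequence[str]) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with all required keys."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(data, dict) and all(key in data for key in required_keys):
        return data
    return None


def run_packet(
    prompt: str,
    required_keys: Sequence[str],
    providers: Sequence[Tuple[str, Provider]],
    max_output_tokens: int = 512,
    retries: int = 1,
) -> Optional[dict]:
    """Try each provider in order: one shot plus `retries`, then fail over.

    Returns None when every provider is exhausted, so the caller can queue the packet.
    """
    for name, call in providers:
        for attempt in range(1, retries + 2):
            try:
                raw = call(prompt, max_output_tokens)
            except Exception as exc:  # a provider outage should not crash the dispatcher
                logger.warning("provider=%s attempt=%d error=%s", name, attempt, exc)
                continue
            logger.info("provider=%s attempt=%d chars=%d", name, attempt, len(raw))
            result = parses_with_keys(raw, required_keys)
            if result is not None:
                return result
    return None


# Stand-in providers for the example.
def flaky(prompt: str, cap: int) -> str:
    return "not json"


def steady(prompt: str, cap: int) -> str:
    return '{"label": "bug", "confidence": 0.9}'


result = run_packet(
    "Classify this ticket as bug or feature. Reply as JSON with a 'label' key.",
    required_keys=["label"],
    providers=[("flaky", flaky), ("steady", steady)],
)
print(result)
```

Here the first provider fails its schema check twice (the one shot plus one retry), and the dispatcher fails over to the second. A `None` return is the signal to queue the packet instead of retrying further.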
| # | Model | Best fleet route | Cost (input/output) |
|---|---|---|---|
| 1 | z-ai/glm-5.2 | Serious reasoning, long-context synthesis, code-review pre-pass | $1.40/$4.40 per M |
| 2 | qwen/qwen3.7-plus | Agentic execution drafts, queue triage, multi-step planning | $0.32/$1.28 per M |
| 3 | moonshotai/kimi-k2.7-code | Coding agents, implementation sketches, patch-plan drafts | $0.74/$3.50 per M |
| 4 | nvidia/nemotron-3-ultra (free) | Long-context public-doc summarization, plan expansion | free |
| 5 | google/gemma-4-31b (free) | Bounded code/test drafts, public doc extraction, rewrites | free |
| 6 | cohere/north-mini-code (free) | Small code edits, test stubs, lint-fix suggestions | free |
| 7 | google/gemini-3.1-flash-lite | High-volume summaries, transcript cleanup, extraction | $0.25/$1.50 per M |
| 8 | anthropic/claude-haiku-4.5 | Claude-compatible low-risk execution | $1/$5 per M |
For single-shot offload, prefer the free tier first (Nemotron, Gemma). Reach for cheap-paid only when free-model retries cost more time than they save. Reserve Claude Opus/Sonnet for frontier orchestration.
| Work type | Default route | Escalate to Claude when |
|---|---|---|
| Public web/source triage | gemini-flash-lite or nemotron (free) | medical/legal/financial stakes, final recommendation |
| First-pass code | Codex-5.5; optional Spark/free model for small draft snippets | unclear architecture, security-sensitive code |
| Code review pre-pass | Codex-5.5 first; compressed GLM/free bug-hunt packet only as auxiliary | auth/session logic, data deletion, prod deploy |
| Test generation | Codex-5.5; Spark/free API for trivial stubs | deep domain fixtures, brittle async behavior |
| Transcript cleanup | gemini-flash-lite or deepseek-v3 | voice nuance or Zach-facing brand copy |
| Classify / JSON extract | gpt-oss-120b (free), gemini-flash-lite | schema errors persist after one retry |
| Frontier orchestration | Claude Opus/Sonnet only | always — cheap models draft, never own |
Hard rule: never send secrets, PHI, credentials, or private patient data through free providers (some log prompts). And every cheap-model output gets a Claude/Codex verify step before it counts.
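Both halves of the hard rule can be expressed as simple gates. This sketch assumes packets are tagged upstream with data-class labels; the tag names, the `Route` class, and both function names are hypothetical, and the gate is only as good as that tagging.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical labels; the fleet's real tagging scheme is not described here.
BLOCKED_ON_FREE = frozenset({"secret", "credential", "phi", "patient_data"})


@dataclass(frozen=True)
class Route:
    name: str
    free_tier: bool


def may_send(data_tags: Iterable[str], route: Route) -> bool:
    """Free providers never receive a packet carrying a blocked data class."""
    carries_blocked = bool(set(data_tags) & BLOCKED_ON_FREE)
    return not (route.free_tier and carries_blocked)


def output_counts(from_cheap_model: bool, verified: bool) -> bool:
    """Cheap-model output counts only after a Claude/Codex verify step."""
    return verified or not from_cheap_model


free_route = Route("free-provider", free_tier=True)
claude_route = Route("claude-local", free_tier=False)

print(may_send({"phi", "report"}, free_route))    # blocked data on a free route
print(may_send({"phi", "report"}, claude_route))  # same data kept on Claude
print(may_send({"public_doc"}, free_route))       # public data on a free route
print(output_counts(from_cheap_model=True, verified=False))
```

Untagged packets pass this gate, so a stricter variant might treat a missing tag set as blocked by default.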
Data source: ~/.claude/projects transcripts (107,338 unique 30-day requests, deduplicated by request id). The GLM cost estimate bills cache-read tokens as fully re-sent input, since OpenRouter has no Anthropic-style cache discount. Living source: orgs/riles-fleet/shared/cheap-model-offload-ranking.md. This page refreshes with the weekly cheap-model-sweep.