Weekly Review: The Stack Around the Model
The week's curated tutorials, tools, and news on memory layers, harness frameworks, and the production plumbing around AI agents
Welcome back to Altered Craft’s weekly AI review for developers. Grateful you’re sharing your week with AC. The throughline this edition is plumbing: production-grade agent work has moved into the stack around the model. Memory turns into a lifecycle, an SRE agent earns its place at the back of a deterministic funnel, and harness frameworks turn last week’s editorial argument into shipped code. The model is the cheapest part of the system.
TUTORIALS & CASE STUDIES
Here we map the production stack around the agent, from memory architecture down through temporal retrieval, deterministic pipelines, alignment training, and the economics behind it all.
Agent Memory, From Context Window to Production System
Estimated read time: 14 min
Cobanov’s interactive essay walks through agent memory from naive FIFO context windows to hybrid retrieval, multi-agent permissions, and production latency budgets. The argument: agent memory is a retrieval product, not a feature flag. Live demos make every tradeoff concrete.
The takeaway: Treat memory as a lifecycle problem with write, age, supersede, and forget semantics, not a vector DB you bolt on at the end. The interactive demos let you feel each tradeoff before committing to one.
Building an AI SRE Agent That Doesn’t Cry Wolf
Estimated read time: 9 min
Extending production architecture to ops, Sam details an open-source AI SRE agent that watches production logs without flooding Slack. The core principle: never put the AI at the front of the funnel. Cheap deterministic filters handle 99% of logs before the LLM sees anything.
Key point: When wiring AI into production systems, treat the LLM as the last resort in a layered pipeline, not the first responder. Deterministic filters do the heavy lifting; the model only sees the cases that earned its attention.
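The layered funnel reads naturally as a chain of cheap predicates with the model at the end. A minimal sketch under assumed log formats and noise patterns (not the project's actual filters):

```python
import re

def severity_filter(line: str) -> bool:
    # Layer 1: drop anything below ERROR. Cheap and deterministic.
    return any(lvl in line for lvl in ("ERROR", "FATAL", "PANIC"))

KNOWN_NOISE = [
    re.compile(r"connection reset by peer"),   # flaky client, not actionable
    re.compile(r"context deadline exceeded"),  # retried upstream
]

def noise_filter(line: str) -> bool:
    # Layer 2: drop known-benign error patterns.
    return not any(p.search(line) for p in KNOWN_NOISE)

seen: set[str] = set()

def dedupe_filter(line: str) -> bool:
    # Layer 3: only the first occurrence of a fingerprint passes.
    fingerprint = re.sub(r"\d+", "N", line)  # normalize IDs and counts
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True

def triage(lines: list[str], llm_classify) -> list[str]:
    """Run deterministic layers first; the LLM only sees survivors."""
    survivors = [
        l for l in lines
        if severity_filter(l) and noise_filter(l) and dedupe_filter(l)
    ]
    # llm_classify is the expensive last-resort call, stubbed here.
    return [l for l in survivors if llm_classify(l)]
```

With this shape, adding a new noise pattern is a one-line change that never touches the model, and the LLM's token bill scales with genuine novelty rather than log volume.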
Teaching Claude Why: Anthropic’s Lessons in Alignment Training
Estimated read time: 11 min
Shifting from runtime architecture to training-time architecture, Anthropic cut Claude’s blackmail rate on agentic misalignment evals from 96% to zero. The key insight: teaching the reasoning behind aligned behavior beats training on aligned actions alone. Constitutional documents and diverse environments generalized far better than training directly against evaluations.
What this enables: When fine-tuning models for behavior, train on the principles behind good decisions, not just examples of correct outputs. The same lesson scales down to any domain-specific fine-tune you might run.
The Real Cost Per Token: Coding Plan Pricing, Decoded
Estimated read time: 5 min
Zooming out from architecture to economics, a proxy-logged comparison of six coding subscriptions reveals Claude Pro costs roughly 185x more per token than MiniMax on identical Claude Code workloads. Opus 4.7 still leads on speed and intent-following, complicating any “pick the cheapest” instinct.
Worth noting: Subscription headline prices hide huge differences in delivered tokens, so measure your actual workload before picking a plan. Speed and intent-following can still justify the premium for daily-driver work.
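The underlying arithmetic is simple enough to run on your own proxy logs. The plan names and token counts below are placeholders chosen only to reproduce a ~185x ratio, not the article's measurements:

```python
def effective_cost_per_mtok(monthly_price_usd: float, tokens_delivered: int) -> float:
    """Effective dollars per million tokens actually delivered to your workload."""
    return monthly_price_usd / (tokens_delivered / 1_000_000)

# Hypothetical plans: (monthly price, tokens observed via a logging proxy).
plans = {
    "plan_a": (20.00, 40_000_000),     # $20/mo, 40M tokens delivered
    "plan_b": (10.00, 3_700_000_000),  # $10/mo, 3.7B tokens delivered
}
for name, (price, tokens) in plans.items():
    print(f"{name}: ${effective_cost_per_mtok(price, tokens):.4f} per Mtok")
```

The point of the exercise: the headline price tells you almost nothing until you divide by the tokens your own workload actually pulls through the subscription.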
TOOLS
The tooling layer hardens around the same idea this week: frameworks, loops, and infrastructure that wrap the model rather than replace it.
Flue: A Framework for Building Agents Like Claude Code
Estimated read time: 2 min
Building on our coverage of Chris Parsons’ harness-over-prompts argument[1] from last week, Flue is a programmable agent framework built on the principle that Agent = Model + Harness, the architecture behind Claude Code and Codex. It composes models, harnesses, sandboxes, and filesystem tools, then ships agents as HTTP servers or CLI commands.
The opportunity: If off-the-shelf AI tools don’t fit your workflows, owning the harness layer is what unlocks agents that actually match your product and data. Flue gives you a runtime for that ownership.
[1] Coding With AI in 2026: From Approver to Trainer
Codex CLI Adds /goal: A Built-In Ralph Loop
Estimated read time: 2 min
Taking the harness idea further at the loop level, OpenAI’s Codex CLI 0.128.0 introduces a /goal command that loops until completion or token exhaustion, baking the Ralph loop pattern into the agent. The implementation lives in two injected prompts, goals/continuation.md and goals/budget_limit.md, appended at each turn.
Why now: Persistent goal loops are moving from prompt patterns into shipped agent features, so set token budgets deliberately before letting one run unattended overnight.
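The pattern itself fits in a few lines. A sketch of a Ralph-style goal loop under an assumed single-turn interface; this is the general shape, not Codex CLI's code:

```python
def run_goal(agent_step, goal: str, token_budget: int) -> str:
    """Loop the agent until it reports completion or the budget runs out.

    agent_step(goal, history) -> (reply_text, tokens_used, done_flag)
    is a hypothetical stand-in for one turn of an agent.
    """
    spent, history = 0, []
    while spent < token_budget:
        reply, used, done = agent_step(goal, history)
        spent += used
        history.append(reply)
        if done:
            return "completed"
        # Re-inject the goal each turn, like an appended continuation prompt.
        history.append(f"Goal not yet met: {goal}. Continue.")
    return "budget_exhausted"
```

The budget check is the part worth copying: without the `spent < token_budget` guard, an agent that never declares completion burns tokens until something external stops it.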
DeepSeek TUI: A Terminal Coding Agent with Auto-Routed Reasoning
Estimated read time: 9 min
Also in the terminal-agent space, DeepSeek TUI is a keyboard-driven coding agent built around DeepSeek V4, with 1M-token context, streaming reasoning, and Plan/Agent/YOLO modes. Its standout feature: auto mode picks both model and thinking level per turn via a cheap routing call.
What’s interesting: Per-turn model routing, workspace rollback, and LSP diagnostics in one Rust binary give you a Claude Code-style agent loop without leaving the terminal or the open-weights ecosystem.
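Per-turn routing reduces to a cheap classification followed by a lookup. A sketch with hypothetical model names and difficulty labels, not DeepSeek TUI's actual router:

```python
def route_turn(user_msg: str, cheap_classify) -> tuple[str, str]:
    """Pick model and thinking level for this turn via a cheap routing call.

    cheap_classify(prompt) -> "trivial" | "standard" | "hard" stands in
    for a small, fast model call; all names here are placeholders.
    """
    difficulty = cheap_classify(user_msg)
    table = {
        "trivial":  ("small-model", "none"),   # rename a variable, etc.
        "standard": ("main-model",  "low"),
        "hard":     ("main-model",  "high"),   # cross-file refactor, debugging
    }
    return table.get(difficulty, ("main-model", "low"))
```

The routing call costs a fraction of a turn but saves the heavyweight model and deep reasoning for the turns that need them.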
ds4.c: A Narrow Bet on One Model, Done End-to-End
Estimated read time: 9 min
Zooming in on local inference, Antirez releases ds4.c, a Metal-only inference engine built specifically for DeepSeek V4 Flash. The project makes a deliberately narrow bet on one model at a time, with 1M-token context, a disk-resident KV cache, and 2-bit quantization that holds up under coding-agent workloads on 128GB MacBooks.
The context: When local inference is treated as engine plus model plus agent validation working together, “runnable” finally starts to feel like “finished” on a personal machine.
Agents Can Now Provision Their Own Cloud Infrastructure
Estimated read time: 7 min
Pushing the harness outward to the cloud, Cloudflare and Stripe co-designed a protocol where coding agents handle account creation, domains, payment, and deployment with minimal human input. Through Stripe Projects acting as identity and payment orchestrator, agents discover services via catalog APIs within a default $100/month spending cap.
Where to invest: If you’re building a coding agent or developer platform, this protocol offers a standard way to let agents ship to production without signup mazes or credit card handoffs.
NEWS & EDITORIALS
This week’s editorials sketch the architecture from above, then close on a sharp dissent about how much we should hand off in the first place.
A Mental Model for Agentic Work: Five Components, One Architecture
Estimated read time: 6 min
Basti argues every agentic system follows the same five-component architecture: LLM, host, agentic loop, context, and shared workspace. Using OpenClaw, Cursor, and Notion as examples, he shows host choice and context depth drive real leverage, while models grow commoditized.
The principle: Treat your host and context layer as strategic decisions; the model underneath is increasingly interchangeable, so spend design effort where switching costs actually live.
Your Codebase Isn’t a Factory, It’s a Company
Estimated read time: 12 min
Extending the architecture metaphor up to the organizational level, Noah Brier argues software is Warhol’s factory, not Ford’s. Borrowing Stewart Brand’s pace layers, he offers a framework (standards, architecture, specs, plans, code) for keeping humans and agents aligned around shared vision rather than optimizing throughput.
Practical tip: Treat your AI agents like new hires who need onboarding documents and enforced standards, not like machines on an assembly line. Pace layers give you a vocabulary for what changes slowly versus quickly.
Subquadratic’s 12M-Token Context Window Takes Aim at Attention’s Quadratic Wall
Estimated read time: 9 min
Shifting from architecture to infrastructure, Miami startup Subquadratic launched a model claiming linear scaling with context length. Benchmarks show 92.1% needle-in-a-haystack at 12M tokens and 83 on MRCR v2, beating OpenAI by nine points. Caveats: single-run evals, in a category with a history of unfulfilled promises.
Heads up: If the benchmarks hold up in production, the workarounds developers rely on today (RAG, agentic decomposition) may start looking like scaffolding around a problem solved at the architecture level.
Claude Raises Usage Limits as Compute Capacity Expands
Estimated read time: 2 min
Continuing the infrastructure beat, Anthropic is doubling Claude Code’s five-hour rate limits for Pro, Max, Team, and Enterprise plans, removing peak-hours throttling, and raising Claude Opus API limits. A new SpaceX partnership adds 300+ megawatts and 220,000 GPUs this month.
When this fits: If you’ve been hitting Claude Code rate ceilings during long coding sessions, the headroom just doubled and the peak-hours penalty is gone, so plan longer agent runs without throttling anxiety.
Agentic Coding Is a Trap
Estimated read time: 15 min
Closing on a dissenting note, and continuing our coverage of Koshy John’s case against outsourcing reasoning[2] from last week, Lars Faye challenges the Spec Driven Development hype by naming a paradox of supervision: the skills needed to review AI output atrophy with overuse. He proposes inverting the workflow: use LLMs for planning while staying hands-on with implementation.
The counterpoint: Treat coding agents as the Ship’s Computer, not Data. Delegate selectively while staying hands-on to preserve the critical thinking that makes you a capable reviewer.